Retrieving text without dtext option

I have a program which uses HTML::TokeParser to split apart web pages.
Is there a way of making $stream->get_text() return the text without the
entities decoded? I can see options in get_token but is there an
equivalent method with get_text or get_trimmed_text?

Thanks

Kevin.
kevinphilp [ Wed, 14 February 2007 23:10 ] [ ID #1629472 ]

Re: Retrieving text without dtext option

kevin <kevinphilp [at] cybercolloids.net> writes:

> I have a program which uses HTML::TokeParser to split apart web
> pages. Is there a way of making $stream->get_text() return the text
> without the entities decoded?

No. Why do you want that?

It is trivial to reimplement a version of get_text that does what you
want, based on get_token(). You can even use the existing get_text as
a starting point and just insert your version into the
HTML::TokeParser namespace.

sub HTML::TokeParser::get_undecoded_text {
...
}
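
For illustration, here is a minimal sketch of how the stub above might be
filled in. It is a simplification and not the module's own code: it assumes
you only want the text tokens up to the next start or end tag (or up to one
of the tags passed as arguments), and it leaves out the alt-text substitution
that the real get_text performs for tags such as img. The method name
get_undecoded_text is the hypothetical one from the stub; the sample HTML
string is made up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical method (name taken from the stub above); it is not part of
# HTML::TokeParser. It mirrors get_text but never calls decode_entities.
sub HTML::TokeParser::get_undecoded_text {
    my ($self, @end_tags) = @_;
    my @text;
    while (my $token = $self->get_token) {
        my $type = $token->[0];
        if ($type eq 'T') {
            # Text token: [ 'T', $text, $is_data ]; keep $text as delivered.
            push @text, $token->[1];
        }
        elsif ($type eq 'S' or $type eq 'E') {
            my $tag = $token->[1];
            $tag = "/$tag" if $type eq 'E';
            # Stop at any tag, or only at the tags the caller asked for.
            if (!@end_tags or grep { $_ eq $tag } @end_tags) {
                $self->unget_token($token);
                last;
            }
        }
        # Comments, declarations and processing instructions are skipped.
    }
    return join '', @text;
}

# Made-up sample document for the demonstration.
my $html   = '<p>Fish &amp; chips &lt;fresh&gt;</p>';
my $stream = HTML::TokeParser->new(\$html) or die "Cannot create parser";
$stream->get_tag('p');
my $raw = $stream->get_undecoded_text;
print "$raw\n";    # prints: Fish &amp; chips &lt;fresh&gt;
```

Like get_text, the sketch pushes the terminating tag back with unget_token,
so a following get_tag or get_token call still sees it.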

> I can see options in get_token but is there an equivalent method
> with get_text or get_trimmed_text?

get_token always returns the raw undecoded text. There isn't an
option to make it do otherwise.

--Gisle
gisle [ Thu, 15 February 2007 09:39 ] [ ID #1630671 ]
Perl » perl.libwww » Retrieving text without dtext option
