HTML::TreeBuilder ignore_ignorable problems

on 13.06.2005 19:38:35 by Das

Hello,

I'm using TreeBuilder and am finding it useful.

I have a few questions.

One is that if I turn off ignore_ignorable_whitespace as shown here, I get
errors when using element methods.

Here is an example:

sub get_content {
    my $string = shift;
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->no_space_compacting(1);
    $tree->ignore_ignorable_whitespace(0);
    $tree->parse($string);
    $tree->eof;
    #$tree->elementify;
    my $content = '';
    $tree = delete_unwanted_nodes($tree);
    my $node = $tree->find_by_tag_name('body');
    #$node = $node->nativize_pre_newlines();
    my @nodes = $node->content_list();
    foreach my $node (@nodes){
        my $cont = $node->as_text(skip_dels => 1);
        if ($cont){
            $content .= $cont;
        }
    }
    $tree = $tree->delete;
    return $content;
}

I get the error: Can't call method "as_text" without a package or object
reference at ./test.pl line 152.

This of course goes away if I comment out the ignore_ignorable_whitespace line.
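
A likely explanation, offered as a sketch rather than a confirmed diagnosis: content_list returns text segments as plain strings and only elements as objects. With ignore_ignorable_whitespace(0), whitespace-only text between tags is kept as string children of body, and calling as_text on such a string fails with exactly this kind of message. The version below guards with ref(). It leaves out delete_unwanted_nodes (that helper is not shown in the post), and the sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

sub get_content {
    my ($string) = @_;
    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->no_space_compacting(1);
    $tree->ignore_ignorable_whitespace(0);
    $tree->parse($string);
    $tree->eof;

    my $content = '';
    my $body = $tree->find_by_tag_name('body');
    for my $node ($body ? $body->content_list : ()) {
        # Elements are objects; text segments are plain strings.
        $content .= ref $node ? $node->as_text(skip_dels => 1) : $node;
    }
    $tree->delete;
    return $content;
}

# Made-up input: loose text and whitespace directly under body.
my $content = get_content(
    '<html><body>loose text<p>para</p> <p>more</p></body></html>');
print "$content\n";
```

The ref() test is the only real change; everything else follows the original subroutine.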

Also, the method nativize_pre_newlines is not implemented, though it is
in the docs of HTML::Element. I've written my own simple nativizer; I just
wanted to point that out.
I've also written my own as_text_with_newlines to get around this,
but wanted to comment on it.

Thanks for a great set of modules to Gisle and Sean!



Possible HTML::Element patch was: ignore_ignorable problems

on 23.06.2005 20:33:43 by Das

I discovered that what I really wanted was not ignore_ignorable_whitespace,
but for HTML::Element's as_text to leave a space between child content
segments, adding one only if there is no whitespace at the end of the
preceding child text bit.

Current behavior is as follows, given this pseudo-HTML:


Joe PerlCamel role model for kids


Hi, my name is and I'm a
good role model for kids



my $string = $node->as_text();
print qq{$string\n};

gives: Joe PerlCamel role model for kidsHi, my name is Joe PerlCamel and
I'm a good role model for kids.


I would like to submit a patch to HTML::Element.

The proposed method name is:

as_text_w_space

It simply looks like this:


sub as_text_w_space {
    # Yet another iteratively implemented traverser
    my($this,%options) = @_;
    my $skip_dels = $options{'skip_dels'} || 0;
    #print "Skip dels: $skip_dels\n";
    my(@pile) = ($this);
    my $tag;
    my $text = '';
    while(@pile) {
        if(!defined($pile[0])) { # undef!
            shift @pile; # discard it, otherwise the loop never ends
        } elsif(!ref($pile[0])) { # text bit! save it!
            my $val = shift @pile;
            # add a space after each text bit unless already there
            unless ($val =~ /\s$/){ $val .= " ";}
            $text .= $val;
        } else { # it's a ref -- traverse under it
            # $nillio is the empty array ref kept in a file-scoped lexical
            # in HTML/Element.pm, so this sub has to live in that file
            unshift @pile, @{$this->{'_content'} || $nillio}
              unless
                ($tag = ($this = shift @pile)->{'_tag'}) eq 'style'
                or $tag eq 'script'
                or ($skip_dels and $tag eq 'del');
        }
    }
    return $text;
}


Let me know what you think.

Is Sean around?

Cheers!



deborah sciales wrote:

> One is that if I turn off ignore_ignorable_whitespace as shown here, I get
> errors when using element methods.

Re: Possible HTML::Element patch was: ignore_ignorable problems

on 23.06.2005 21:49:56 by Das

Oops, I hope that makes sense.

as_text_w_space adds a space between text bits if there is no whitespace
at the end of an included child text segment. The current behavior of
as_text is to run those child text segments together, as in the example.
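
For anyone who wants this behavior without patching HTML::Element, here is a minimal sketch that does the same traversal through the public tag and content_list methods only. The name text_with_spaces is hypothetical and not part of any module, and the sample markup is an assumption, since the tags in the original example did not survive.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper, not part of HTML::Element: works like as_text but
# appends one space after every text segment that does not already end
# in whitespace.
sub text_with_spaces {
    my ($element, %options) = @_;
    my $skip_dels = $options{skip_dels} || 0;
    my $text = '';
    my @pile = ($element);
    while (@pile) {
        my $item = shift @pile;
        next unless defined $item;
        if (!ref $item) {               # plain text segment
            $text .= $item;
            $text .= ' ' unless $item =~ /\s\z/;
            next;
        }
        my $tag = $item->tag;
        next if $tag eq 'style' or $tag eq 'script';
        next if $skip_dels and $tag eq 'del';
        # Put the children at the front so document order is preserved.
        unshift @pile, $item->content_list;
    }
    return $text;
}

# Assumed markup, for illustration only.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Joe Perl</h1>'
           . '<p>Hi, my name is <b>Joe</b></p></body></html>');
$tree->eof;
my $joined = text_with_spaces($tree->find_by_tag_name('body'));
print "$joined\n";
$tree->delete;
```

Like the proposed method, this also leaves a trailing space after the last text segment, so callers may want to trim the result.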