
Unintended behavior: Range operator inside a while loop continues to pattern match on the subsequent
Hi,
I was parsing a collection of HTML files where I wanted to extract a certain
block from each file, like this:
> ./script.pl *.html
    my $accumulator;
    my $capture_counter;
    while ( <> ) {
        if ( /<h1>/.../labelsub/ ) {
            $accumulator .= $_ unless /labelsub/;
            if ( /labelsub/ && !$capture_counter ) {
                print $accumulator;
                $capture_counter = 1;
            }
            else {
                next;
            }
        }
        else {
            next;
        }
    }
    continue { # flush out the variables and clean up
        if ( eof ) {
            close ARGV;
            $accumulator = '';
            $capture_counter = '';
        }
    }
The bit about the $capture_counter is because some of the files have
multiple blocks of text that could be accumulated, and I only want the first
block in the file.
This usually works fine, but then I hit an input file that did not
contain the string 'labelsub' after the first '<h1>' regex pattern match.
The conditional if test then kept searching through the incoming lines of
the next file (because I am processing a whole batch using the while (<>)
operator). It eventually found 'labelsub' there, and then printed nothing,
because at the end-of-file of the previous file the script had flushed the
contents of the accumulator.
One solution is to just run the same script individually on each file, but I
was wondering if there was a way to reset the 'state' of the range operator
pattern match at the end of the physical file (or at any other time for that
matter)?
Thanks,
--Marc
Re: Unintended behavior: Range operator inside a while loop continues to pattern match on the subsequ
http://www.effectiveperlprogramming.com/blog/314
Brian.
On Thu, Apr 21, 2011 at 2:42 PM, Marc Perry <marcperryster [at] gmail.com> wrote:
> [...]
Re: Unintended behavior: Range operator inside a while loop continues to pattern match on the subsequ
On Thu, Apr 21, 2011 at 01:42:42PM -0400, Marc Perry wrote:
> Hi,
>
> I was parsing a collection of HTML files where I wanted to extract a certain
> block from each file, like this:
This is where everyone will tell you to use some dedicated HTML parsing
module.
> One solution is to just run the same script individually on each file, but I
> was wondering if there was a way to reset the 'state' of the range operator
> pattern match at the end of the physical file (or at any other time for that
> matter)?
No, there isn't (unless you want to get fancy and use a closure or
something) and so you'll need to find some other way to "end" the range.
The obvious other end point is the end of file, and so you can have your
range operator as:
    if ( /<h1>/ ... /labelsub/ || eof ) {
This will ensure that the range operator "ends" by the end of each file,
but you'd need to do extra work because of the logic of the rest of your
program. So let's see if we can do something about that.
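To see what the extra "|| eof" changes, here is a minimal sketch. The two
sample files and the helper lines_in_range are made up for illustration:
a.html opens a block that is never closed, and b.html starts with a line
that should not be captured.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up fixtures: a.html opens a block that is never closed,
# b.html has some preamble followed by one complete block.
my $dir   = tempdir( CLEANUP => 1 );
my @files = (
    [ 'a.html', "intro\n<h1>A</h1>\nbody of a\n" ],
    [ 'b.html', "preamble\n<h1>B</h1>\nbody of b\nlabelsub\n" ],
);
for my $file (@files) {
    my ( $name, $text ) = @$file;
    open my $fh, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $dir/$name: $!";
}

# Hypothetical helper: return the lines for which the range is true,
# with or without "|| eof" on the right-hand side.
sub lines_in_range {
    my ($stop_at_eof) = @_;
    local @ARGV = map { "$dir/$_->[0]" } @files;
    my @hits;
    while (<>) {
        if ($stop_at_eof) {
            # eof (no parentheses) is true on the last line of each file
            push @hits, $_ if /<h1>/ ... /labelsub/ || eof;
        }
        else {
            push @hits, $_ if /<h1>/ ... /labelsub/;
        }
    }
    return @hits;
}

print "without eof:\n", lines_in_range(0);
print "with eof:\n",    lines_in_range(1);
```

Without eof, the range opened in a.html should stay open and pick up the
"preamble" line from b.html. With eof, the range is closed on the last
line of a.html, so b.html starts with a clean slate.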
Whilst it doesn't make a difference to the logic, I prefer to jump out
of a loop early if I find it doesn't satisfy the conditions I'm looking
for. So I think that:
    next unless /<h1>/ .. /labelsub/ || eof;
looks tidier than the if else conditional.
Then there's your logic to ensure you only count the first block in each
file. Perl has the little-known ?? counterpart to //, which will only
match once between calls to reset. So changing that line to:
    next unless ?<h1>? .. /labelsub/ || eof;
allows you to get rid of the $capture_counter variable. But you'll need
to add a reset to the continue block, to reset the ?? at the end of each
file, ready for the next one.
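As a minimal sketch of the match-once behaviour on its own: the sample
lines below are made up, and the pattern is written as m?? because the
bare ?? form is a syntax error from Perl 5.22 onwards.

```perl
use strict;
use warnings;

# Made-up input: every line would match an ordinary /<h1>/.
my @lines = ( '<h1>first</h1>', '<h1>second</h1>', '<h1>third</h1>' );

my @matched;
for my $pass ( 1 .. 2 ) {
    for my $line (@lines) {
        # m?? succeeds only once until reset is called
        push @matched, "pass $pass: $line" if $line =~ m?<h1>?;
    }
    reset;    # with no argument, re-arms the m?? searches in this package
}

print "$_\n" for @matched;
```

Only the first line of each pass should be recorded, since the reset
between passes is what lets m?? match again.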
Finally, with this change you may as well just print $accumulator in the
continue block too. So we end up with
    my $accumulator;
    while ( <> ) {
        next unless ?<h1>? .. /labelsub/ || eof;
        $accumulator .= $_ unless /labelsub/;
    }
    continue { # flush out the variables and clean up
        if ( eof ) {
            print $accumulator;
            $accumulator = '';
            reset;
        }
    }
which, I think, does what you are after.
The docs mention that ?? is vaguely deprecated:
    This usage is vaguely deprecated, which means it just might possibly
    be removed in some distant future version of Perl, perhaps somewhere
    around the year 2168.
That doesn't sound too bad, but there was some talk of an earlier
deprecation of the bare ?? syntax, so it might be safer to use m??
instead.
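For anyone who wants to try this, here is a self-contained sketch of the
same loop using the m?? spelling. The sample files and the wrapper
first_blocks are assumptions added for illustration: the wrapper returns
one string per file instead of printing, so the result is easy to check.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up fixtures: a.html never closes its block,
# b.html contains two blocks, of which only the first is wanted.
my $dir   = tempdir( CLEANUP => 1 );
my @files = (
    [ 'a.html', "intro\n<h1>A</h1>\nbody of a\n" ],
    [ 'b.html', "pre\n<h1>B</h1>\nbody of b\nlabelsub\n<h1>B2</h1>\nmore\nlabelsub\n" ],
);
for my $file (@files) {
    my ( $name, $text ) = @$file;
    open my $fh, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $dir/$name: $!";
}

# Hypothetical wrapper around the loop: one accumulated string per file.
sub first_blocks {
    local @ARGV = @_;
    my $accumulator = '';
    my @blocks;
    while (<>) {
        next unless m?<h1>? .. /labelsub/ || eof;
        $accumulator .= $_ unless /labelsub/;
    }
    continue {
        if (eof) {    # last line of the current file
            push @blocks, $accumulator;
            $accumulator = '';
            reset;    # re-arm m?? for the next file
        }
    }
    return @blocks;
}

print for first_blocks( map { "$dir/$_->[0]" } @files );
```

Note that a file with an '<h1>' but no 'labelsub' (a.html here) yields
everything from the '<h1>' to the end of the file, because eof is what
closes the range. If such files should produce no output at all, that
case would need an extra check.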
Interestingly (for me), this is the first time in over 20 years that I
have found a legitimate use for ??, and the associated reset.
--
Paul Johnson - paul [at] pjcj.net
http://www.pjcj.net
Re: Unintended behavior: Range operator inside a while loop continues to pattern match on the subsequ
Beautiful; I should've known that brian d foy would have come up with a
solution--I even have a copy of that book!
Thanks,
--Marc
On Thu, Apr 21, 2011 at 3:10 PM, Brian Fraser <fraserbn [at] gmail.com> wrote:
> http://www.effectiveperlprogramming.com/blog/314
>
> Brian.
Re: Unintended behavior: Range operator inside a while loop continues to pattern match on the subsequ
Thanks, Paul. A very thoughtful response--I will try this out (I don't
recall ever encountering the ?? operator, but if it works as advertised I
will likely use it a lot).
--Marc
On Thu, Apr 21, 2011 at 3:29 PM, Paul Johnson <paul [at] pjcj.net> wrote:
> [...]