regex for matching Google URLs
I'm trying to come up with a regex that will match any Google Images
URL such as these:
www.google.com/imgres
www.google.com/images
google.com/imgres
www.google.co.uk/imgres
www.google.nl/imgres
and a second regex for Google Products URLs of which this is one example:
www.google.com/url?sa=t&source=productsearch
but my pathetic regex "skills" aren't cutting it. Can anyone help me
out? Is there a perl module for this?
- Grant
--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Re: regex for matching Google URLs
> I'm trying to come up with a regex that will match any Google Images
> URL such as these:
>
> www.google.com/imgres
> www.google.com/images
> google.com/imgres
> www.google.co.uk/imgres
> www.google.nl/imgres
>
> and a second regex for Google Products URLs of which this is one example:
>
> www.google.com/url?sa=t&source=productsearch
>
> but my pathetic regex "skills" aren't cutting it. Can anyone help me
> out? Is there a perl module for this?
>
> - Grant
I came up with these but they don't seem to work reliably:
/\.google\..*\/imgres\?/
/\.google\..*\/images\?/
/\.google\..*\/products\?/
Can anyone point out my mistake?
- Grant
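
A minimal sketch of why these three patterns can look unreliable, assuming they are tested against bare host/path strings like the samples in the question. Each pattern requires a literal dot before "google" and a literal "?" after the path word, so the sample URLs without a query string do not match. The helper name matches_any and the query string "?q=perl" are made up for illustration.

```perl
use strict;
use warnings;

# The three patterns from the message above, collected in one list.
my @patterns = (
    qr/\.google\..*\/imgres\?/,
    qr/\.google\..*\/images\?/,
    qr/\.google\..*\/products\?/,
);

# Hypothetical helper: true if any of the three patterns matches.
sub matches_any {
    my ($url) = @_;
    for my $re (@patterns) {
        return 1 if $url =~ $re;
    }
    return 0;
}

# Samples from the question, plus two with an invented query string.
my @urls = (
    'www.google.com/imgres',
    'google.com/imgres',
    'www.google.co.uk/imgres',
    'www.google.com/imgres?q=perl',
    'google.com/imgres?q=perl',
);

for my $url (@urls) {
    printf "%-30s %s\n", $url, matches_any($url) ? 'match' : 'no match';
}
```

Only www.google.com/imgres?q=perl matches: the others lack either the "?" after the path word or the dot in front of "google".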
Re: regex for matching Google URLs
1/18/2011, "Grant" <emailgrant [at] gmail.com> wrote:
>> I'm trying to come up with a regex that will match any Google Images
>> URL such as these:
>>
>> www.google.com/imgres
>> www.google.com/images
>> google.com/imgres
>> www.google.co.uk/imgres
>> www.google.nl/imgres
>>
>> and a second regex for Google Products URLs of which this is one example:
>>
>> www.google.com/url?sa=t&source=productsearch
>>
>> but my pathetic regex "skills" aren't cutting it. Can anyone help me
>> out? Is there a perl module for this?
>>
>> - Grant
>
>I came up with these but they don't seem to work reliably:
>
>/\.google\..*\/imgres\?/
>/\.google\..*\/images\?/
>/\.google\..*\/products\?/
/(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
--
Regards,
Alex
Re: regex for matching Google URLs
1/18/2011, "Alexey Mishustin" <shumkar [at] shumkar.ru> wrote:
>
>1/18/2011, "Grant" <emailgrant [at] gmail.com> wrote:
>
>>> I'm trying to come up with a regex that will match any Google Images
>>> URL such as these:
>>>
>>> www.google.com/imgres
>>> www.google.com/images
>>> google.com/imgres
>>> www.google.co.uk/imgres
>>> www.google.nl/imgres
>>>
>>> and a second regex for Google Products URLs of which this is one example:
>>>
>>> www.google.com/url?sa=t&source=productsearch
>>>
>>> but my pathetic regex "skills" aren't cutting it. Can anyone help me
>>> out? Is there a perl module for this?
>>>
>>> - Grant
>>
>>I came up with these but they don't seem to work reliably:
>>
>>/\.google\..*\/imgres\?/
>>/\.google\..*\/images\?/
>>/\.google\..*\/products\?/
>
>/(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
Sorry, I forgot one backslash:
/(www\.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
The first regexp works too, but it matches more than necessary.
--
Regards,
Alex
Re: regex for matching Google URLs
>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
AM> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
{0,1} is just ? by itself.
you don't need to grab things that are not used later on. also why grab
each trailing word separately which means it will be hard to tell what
word was there.
the . after www needs to be escaped (it is unlikely ever to be other
than a real dot, but it is good practice and correct to escape it).
using alternate delimiters means you don't need to escape / which makes
it easier to read.
finally, when the regex gets this complex, use the /x modifier and
comment the parts (untested):
m{
    (www\.)?                 # optional leading www
    google\.                 # must have google.
    .*?                      # skip some text MINIMALLY
    /                        # required slash
    (imgres|images|products) # grab the following token (is
                             # it needed?)
    \??                      # optional url arg separator ?
}x
uri
--
Uri Guttman ------ uri [at] stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
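
A sketch of how a commented /x pattern along these lines could be run against the sample URLs from the original question. The ^ anchor, the [^/]* host class, the check for "?" or end of string after the path word, and the helper name image_kind are assumptions added here for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;

my $google_images = qr{
    ^                        # start of string (assumes a bare host and path, no scheme)
    (?:www\.)?               # optional leading www.
    google\.                 # must have google.
    [^/]*                    # rest of the host name, up to the first slash
    /                        # required slash
    (imgres|images|products) # grab the path word
    (?: \? | \z )            # then a query separator or the end of the string
}x;

# Hypothetical helper: returns the captured path word, or undef if no match.
sub image_kind {
    my ($url) = @_;
    return $url =~ $google_images ? $1 : undef;
}

for my $url (
    'www.google.com/imgres',
    'www.google.com/images',
    'google.com/imgres',
    'www.google.co.uk/imgres',
    'www.google.nl/imgres',
    'www.google.com/url?sa=t&source=productsearch',
) {
    my $kind = image_kind($url);
    printf "%-45s %s\n", $url, defined $kind ? $kind : 'no match';
}
```

The five image URLs report their path word, while the Products example falls through because its path is "url"; that one would need a separate pattern that looks at the query string.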
Re: regex for matching Google URLs
1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>
> AM> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
>
>{0,1} is just ? by itself.
Yes, I know. But I like the {a,b} syntax more :) It's more uniform than
?,+,* etc.
>you don't need to grab things that are not used later on. also why grab
>each trailing word separately which means it will be hard to tell what
>word was there.
Where did I grab things that are not used later? What do you mean by
trailing word?
>the . after www needs to be escaped (it is unlikely ever to be other
>than a real dot, but it is good practice and correct to escape it).
Yes, it was a mistake.
>using alternate delimiters means you don't need to escape / which makes
>it easier to read.
Sure.
--
Regards,
Alex
Re: regex for matching Google URLs
Alexey Mishustin wrote:
>
> 1/18/2011, "Grant"<emailgrant [at] gmail.com> wrote:
>
>> I came up with these but they don't seem to work reliably:
>>
>> /\.google\..*\/imgres\?/
>> /\.google\..*\/images\?/
>> /\.google\..*\/products\?/
>
> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
That says:
(www.){0,1}
Match a four character string, beginning with 'www', zero or one time,
and store the match in $1
(google\.)
Match the string 'google.' and store it in $2, Why? We know it will
always be 'google.'.
.*\/
Match zero or more non-newline characters up to, and including, the last
'/' character.
(imgres)
Match 'imgres' and store it in $3. Why? We know it will always be
'imgres'.
(images)
Match 'images' and store it in $4. Why? We know it will always be
'images'.
(products)
Match 'products' and store it in $5. Why? We know it will always be
'products'.
\?{0,1}
Match a '?' character zero or one times.
And finally, you use alternation which says to match either:
(www.){0,1}(google\.).*\/(imgres)
OR:
(images)
OR:
(products)\?{0,1}
John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein
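
The alternation point in the breakdown above can be checked with a short sketch. It compares the posted pattern (with the dot after www escaped) against a version where the three words share one group. The non-Google URL is a made-up example, used only to show the false positive.

```perl
use strict;
use warnings;

# As posted: the top-level | splits the whole regex into three
# independent alternatives, so "images" alone is enough to match.
my $ungrouped = qr/(www\.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/;

# The three words in one group, so | only chooses between them.
my $grouped = qr/(www\.){0,1}(google\.).*\/(imgres|images|products)\?{0,1}/;

# Hypothetical non-Google URL.
my $other = 'www.example.com/images/logo.png';

print "ungrouped: ", ($other =~ $ungrouped ? "match" : "no match"), "\n";
print "grouped:   ", ($other =~ $grouped   ? "match" : "no match"), "\n";
```

The ungrouped pattern matches the non-Google URL through its second alternative, while the grouped one does not because "google." is required first.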
Re: regex for matching Google URLs
1/18/2011, "John W. Krahn" <jwkrahn [at] shaw.ca> wrote:
>Alexey Mishustin wrote:
>>
>> 1/18/2011, "Grant"<emailgrant [at] gmail.com> wrote:
>>
>>> I came up with these but they don't seem to work reliably:
>>>
>>> /\.google\..*\/imgres\?/
>>> /\.google\..*\/images\?/
>>> /\.google\..*\/products\?/
>>
>> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
>
>That says:
>
>(www.){0,1}
>
>Match a four character string, beginning with 'www', zero or one time,
>and store the match in $1
The dot should be escaped, as I already wrote. So: match the string
'www.' zero or one time.
>(google\.)
>
>Match the string 'google.' and store it in $2, Why? We know it will
>always be 'google.'.
>
>.*\/
>
>Match zero or more non-newline characters up to, and including, the last
>'/' character.
>
>(imgres)
>
>Match 'imgres' and store it in $3. Why? We know it will always be
>'imgres'.
>
>(images)
>
>Match 'images' and store it in $4. Why? We know it will always be
>'images'.
>
>(products)
>
>Match 'products' and store it in $5. Why? We know it will always be
>'products'.
>
>\?{0,1}
>
>Match a '?' character zero or one times.
I used brackets not for storing but for combining in order to use the
combined patterns in alternation.
>And finally, you use alternation which says to match either:
>
>(www.){0,1}(google\.).*\/(imgres)
>
>OR:
>
>(images)
>
>OR:
>
>(products)\?{0,1}
Oops. Evidently, I was wrong in this combining... I meant
(imgres)
OR
(images)
OR
(products)
--
Regards,
Alex
Re: regex for matching Google URLs
1/18/2011, "Alexey Mishustin" <shumkar [at] shumkar.ru> wrote:
>I meant
>
>(imgres)
>
>OR
>
>(images)
>
>OR
>
>(products)
Uri wrote the correct alternation for that:
(imgres|images|products)
So, I should write
/(www\.){0,1}(google\.).*\/(imgres|images|products)\?{0,1}/
--
Regards,
Alex
Re: regex for matching Google URLs
>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
AM> 1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>>
AM> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
>>
>> {0,1} is just ? by itself.
AM> Yes, I know. But I like the {a,b} syntax more :) It's more uniform than
AM> ?,+,* etc.
it is noisier and more people know the shortcuts. code so other people
can read your code as it is for them, not yourself.
>> you don't need to grab things that are not used later on. also why grab
>> each trailing word separately which means it will be hard to tell what
>> word was there.
AM> Where did I grab things that are not used later? What do you mean by
AM> trailing word?
look in perldoc perlre and look at the difference between (foo) and
(:?foo).
>> using alternate delimiters means you don't need to escape / which makes
>> it easier to read.
AM> Sure.
see, you agree about easier to read. do the same with your use of {a,b}.
uri
Re: regex for matching Google URLs
>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
AM> I used brackets not for storing but for combining in order to use the
AM> combined patterns in alternation.
the point is that parens (the correct term; brackets are []) will
grab the match inside them and store it in $1 and friends. grouping
without grabbing is more efficient and also tells the reader (that
person again! :) that they shouldn't look for uses of $1 (or whatever
number) after this regex is used.
AM> Oops. Evidently, I was wrong in this combining... I meant
AM> (imgres)
AM> OR
AM> (images)
AM> OR
AM> (products)
nope. you mean (:?imgres|images|products).
uri
Re: regex for matching Google URLs
1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
> AM> /(www.){0,1}(google\.).*\/(imgres)|(images)|(products)\?{0,1}/
> >>
> >> {0,1} is just ? by itself.
>
> AM> Yes, I know. But I like the {a,b} syntax more :) It's more uniform than
> AM> ?,+,* etc.
>
>it is noisier and more people know the shortcuts. code so other people
>can read your code as it is for them, not yourself.
It's interesting, do most people here think so?
> >> you don't need to grab things that are not used later on. also why grab
> >> each trailing word separately which means it will be hard to tell what
> >> word was there.
>
> AM> Where did I grab things that are not used later? What do you mean by
> AM> trailing word?
>
>look in perldoc perlre and look at the difference between (foo) and
>(:?foo).
"This may substantially slow your program. Perl uses the same mechanism
to produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parentheses. (To avoid this cost while retaining the
grouping behaviour, use the extended regular expression (?: ... )
instead.)"
Thanks. I'll keep that in mind.
> >> using alternate delimiters means you don't need to escape / which makes
> >> it easier to read.
>
> AM> Sure.
>
>see, you agree about easier to read. do the same with your use of {a,b}.
I can consider the opinion of most people but I won't change my own
opinion because of that. For me, {a,b} is easier. Clearer and easier.
--
Regards,
Alex.
Re: regex for matching Google URLs
1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>
> AM> I used brackets not for storing but for combining in order to use the
> AM> combined patterns in alternation.
>
>the point is parens
>(the correct term. brackets are [])
Eh... Useful correction.
And what is the correct term. for {} ?
>is they will
>grab the match inside them and store it in $1 and friends. grouping
>without grabbing is more efficient and also tells the reader (that
>person again! :) that they shouldn't look for using $1 (or whatever
>number) after this regex is used.
>
> AM> Oops. Evidently, I was wrong in this combining... I meant
>
> AM> (imgres)
>
> AM> OR
>
> AM> (images)
>
> AM> OR
>
> AM> (products)
>
>nope. you mean (:?imgres|images|products).
Yes, I see it now. Thanks again.
--
Regards,
Alex
Re: regex for matching Google URLs
>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
AM> 1/18/2011, "Alexey Mishustin" <shumkar [at] shumkar.ru> wrote:
>> I meant
>>
>> (imgres)
>>
>> OR
>>
>> (images)
>>
>> OR
>>
>> (products)
AM> Uri wrote the correct alternation for that:
AM> (imgres|images|products)
AM> So, I should write
AM> /(www\.){0,1}(google\.).*\/(imgres|images|products)\?{0,1}/
you are still grabbing and not just grouping. and of course you are
still using {0,1} instead of ?
and finally you are using / for the delimiter when {} looks much better
when you have / in the regex.
uri
Re: regex for matching Google URLs
>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
AM> 1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>>
AM> I used brackets not for storing but for combining in order to use the
AM> combined patterns in alternation.
>>
>> the point is parens
>> (the correct term. brackets are [])
AM> Eh... Useful correction.
AM> And what is the correct term. for {} ?
braces.
() are parentheses or parens for short
[] are (square) brackets
{} are (curly) braces
i don't know the russian versions! :)
uri
Re: regex for matching Google URLs
1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>
> AM> 1/18/2011, "Uri Guttman" <uri [at] StemSystems.com> wrote:
>
> >>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
> >>
> AM> I used brackets not for storing but for combining in order to use the
> AM> combined patterns in alternation.
> >>
> >> the point is parens
>
> >> (the correct term. brackets are [])
> AM> Eh... Useful correction.
>
> AM> And what is the correct term. for {} ?
>
>braces.
>
>() are parentheses or parens for short
>[] are (square) brackets
>{} are (curly) braces
>
>i don't know the russian versions! :)
() круглые скобки - literally, round brackets
[] квадратные скобки - literally, square brackets
{} фигурные скобки - literally, figured brackets
:)
--
Regards,
Alex
Re: regex for matching Google URLs
From: "Uri Guttman" <uri [at] StemSystems.com>
>>>>>> "AM" == Alexey Mishustin <shumkar [at] shumkar.ru> writes:
>
> AM> I used brackets not for storing but for combining in order to use the
> AM> combined patterns in alternation.
>
> the point is parens (the correct term. brackets are []) is they will
> grab the match inside them and store it in $1 and friends. grouping
> without grabbing is more efficient and also tells the reader (that
> person again! :) that they shouldn't look for using $1 (or whatever
> number) after this regex is used.
>
> AM> Oops. Evidently, I was wrong in this combining... I meant
>
> AM> (imgres)
>
> AM> OR
>
> AM> (images)
>
> AM> OR
>
> AM> (products)
>
> nope. you mean (:?imgres|images|products).
>
> uri
The correct syntax is (?:imgres|images|products).
The wrong syntax appeared in 2 messages and it might cause confusion.
Octavian
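
A small sketch of why the order of "?" and ":" matters here. With (?:...) the parentheses only group; with (:?...) they are an ordinary capturing group whose first alternative starts with an optional literal colon. The variable names and the helper captured are made up for illustration.

```perl
use strict;
use warnings;

my $non_capturing = qr/(?:imgres|images|products)/;  # groups only, sets no $1
my $typo          = qr/(:?imgres|images|products)/;  # captures; ':' is an optional literal

# Hypothetical helper: returns $1 after a successful match, else undef.
sub captured {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $text ('imgres', ':imgres') {
    my $typo_capture  = captured($typo, $text);
    my $group_capture = captured($non_capturing, $text);
    printf "%-8s typo form captured %-10s non-capturing form captured %s\n",
        $text,
        defined $typo_capture  ? "'$typo_capture'"  : 'nothing',
        defined $group_capture ? "'$group_capture'" : 'nothing';
}
```

Both forms match the plain words, which is why the typo is easy to miss; the difference shows up in $1 and in the stray colon that the mistyped form also accepts.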