Help extracting strings via awk.
Hi,
I need help extracting urls from a large text file. I don't have
control over the format of the file so it is always different, but the
urls are always in <url>..</url> tags. The text is always on the same
line without line breaks.
asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
adfsdf sd sdgdfg<url>...
The surrounding text is always different and I need the quickest and
most efficient way to extract just the text between the 2 tags and
output it somewhere. Right now I do this with several commands and it
takes a while for a large file, but I know there is probably a quicker
and better way to do this. Please help.
Re: Help extracting strings via awk.
On 12/6/2007 7:20 PM, oleg.rakhmanchik [at] gmail.com wrote:
> Hi,
>
> I need help extracting urls from a large text file. I don't have
> control over the format of the file so it is always different, but the
> urls are always in <url>..</url> tags. The text is always on the same
> line without line breaks.
>
> asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
> adfsdf sd sdgdfg<url>...
>
> The surrounding text is always different and I need the quickest and
> most efficient way to extract just the text between the 2 tags and
> output it somewhere. Right now I do this with several commands and it
> takes a while for a large file, but I know there is probably a quicker
> and better way to do this. Please help.
With GNU awk:
gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
Ed.
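A note for anyone without gawk: RT is a gawk extension, and POSIX leaves a multi-character RS unspecified, so the one-liner above may not behave the same under other awks. Here is a rough sketch of a portable alternative using match() in a loop. It assumes each <url>...</url> pair sits wholly on one line, as the original post says, and the function name extract_urls is made up for illustration.

```shell
# extract_urls: hypothetical helper name; reads text on stdin, prints one URL per line.
# Assumes every <url>...</url> pair is contained in a single input line.
extract_urls() {
    awk '{
        line = $0
        # Repeatedly find the leftmost <url>...</url> pair in what is left of the line.
        while (match(line, /<url>[^<]*<\/url>/)) {
            # Skip the 5-character opening tag and drop the 6-character closing tag.
            print substr(line, RSTART + 5, RLENGTH - 11)
            # Continue scanning after the pair just printed.
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd' | extract_urls
```

On the sample line this prints www.google.com and www.yahoo.com, one per line. It does a single pass over the file, but I have not timed it against the gawk version.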
Re: Help extracting strings via awk.
On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:
> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.
With Perl:
perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file
Regards,
Steffen "goedel" Schuler
Re: Help extracting strings via awk.
On Dec 6, 9:59 pm, Ed Morton <mor... [at] lsupcaemnt.com> wrote:
> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.
These work perfectly, thank you.
Re: Help extracting strings via awk.
Steffen Schuler wrote:
>
> On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:
>
> > With GNU awk:
> >
> > gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> With Perl:
>
> perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file
The Perl version of that gawk program would be:
perl -F'<url>' -lane'BEGIN{$/="</url>"} print $F[-1]' file
John
--
use Perl;
program
fulfillment