Help extracting strings via awk.

Hi,

I need help extracting URLs from a large text file. I don't have
control over the format of the file, so it is always different, but the
URLs are always in <url>..</url> tags. Each tagged URL is always on a
single line, with no line breaks inside it.

asdfsdf<url>www.google.com</url>dfgdgdg<url>www.yahoo.com</url>asd
adfsdf sd sdgdfg<url>...

The surrounding text is always different, and I need the quickest and
most efficient way to extract just the text between the two tags and
output it somewhere. Right now I do this with several commands, and it
takes a while for a large file. I suspect there is a quicker and
better way to do this. Please help.
oleg.rakhmanchik [ Fri, 07 December 2007 02:20 ] [ ID #1887676 ]

Re: Help extracting strings via awk.

With GNU awk:

gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
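
A note for later readers on why this works: RS='</url>' ends each record at a closing tag, -F'<url>' makes the URL the last field ($NF), and the RT test skips a final record that has no closing tag. RT is specific to gawk, and POSIX leaves a multi-character RS unspecified, so other awks may behave differently. Below is a minimal portable sketch that uses only index() and substr(). It assumes a URL never spans a line break, as the original post states, and the function name extract_urls is made up for illustration.

```shell
# Portable sketch: scan each line for <url>...</url> pairs with index()
# and substr() instead of relying on gawk's RS/RT behaviour.
# extract_urls is a hypothetical helper name, not something from the thread.
extract_urls() {
    awk '{
        rest = $0
        # Walk the line left to right, one <url>...</url> pair at a time.
        while ((start = index(rest, "<url>")) > 0) {
            rest = substr(rest, start + 5)     # drop everything through "<url>"
            stop = index(rest, "</url>")
            if (stop == 0) break               # no closing tag on this line
            print substr(rest, 1, stop - 1)    # the text between the tags
            rest = substr(rest, stop + 6)      # drop everything through "</url>"
        }
    }' "$@"
}
```

Usage would be something like: extract_urls file > urls.txt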

Ed.
Ed Morton [ Fri, 07 December 2007 03:59 ] [ ID #1887677 ]

Re: Help extracting strings via awk.

On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:

> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.

With Perl:

perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file

Regards,

Steffen "goedel" Schuler
Steffen Schuler [ Fri, 07 December 2007 06:59 ] [ ID #1887679 ]

Re: Help extracting strings via awk.

On Dec 6, 9:59 pm, Ed Morton <mor... [at] lsupcaemnt.com> wrote:
> With GNU awk:
>
> gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> Ed.

These work perfectly, thank you.
oleg.rakhmanchik [ Fri, 07 December 2007 16:35 ] [ ID #1887697 ]

Re: Help extracting strings via awk.

Steffen Schuler wrote:
>
> On Thu, 06 Dec 2007 20:59:01 -0600, Ed Morton wrote:
>
> > With GNU awk:
> >
> > gawk -F'<url>' -v RS='</url>' 'RT{print $NF}' file
>
> With Perl:
>
> perl -ne 'for (/<url>(.*?)<\/url>/g) {print "$_\n"}' file

The Perl version of that gawk program would be:

perl -F'<url>' -lane'BEGIN{$/="</url>"} print $F[-1]' file


John
--
use Perl;
program
fulfillment
krahnj [ Fri, 07 December 2007 22:25 ] [ ID #1887718 ]