Extract domain name

How do you fetch just the domain name part of a variable in a script? The
variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
"http://sub.domain.com/blahblah/whatever/page.htm".

What I need is to extract just the "domain.com".
Shabam [ Fri, 12 November 2004 17:02 ] [ ID #480265 ]

Re: Extract domain name

[removed nonexistent groups, removed off-topic AOL group, set followups
to c.l.p.m.]

"Shabam" <blislecp [at] hotmail.com> wrote in message
news:3u-dnd1_9JRvQAncRVn-ig [at] adelphia.com...
> How do you fetch just the domain name part of a variable in a script?
> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
> or "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

Try using the Regexp::Common module from CPAN. I seem to recall it has
patterns for matching URIs.

Paul Lalli
Paul Lalli [ Fri, 12 November 2004 17:23 ] [ ID #480267 ]

Re: Extract domain name

Take a look at the URI module. IMHO, it's a good and simple tool for parsing URLs.

use URI;
# host() returns only the host name; authority() would also include any user info and port
($domain = URI->new("http://www.domain.com/blahblah/whatever/page.htm")->host) =~ s/^www\.//i;


Regards,
Andrew

Shabam wrote on 12 November 2004 16:02:

> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

--
Andrew
Andrew Tkachenko [ Fri, 12 November 2004 21:09 ] [ ID #480936 ]

Re: Extract domain name

[ Cross-post trimmed ]

Shabam wrote to :

> How do you fetch just the domain name part of a variable in a script?
> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
> or "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

This is definitely a non-trivial problem. Fortunately, it's been
partially solved already. I'm involved in the SpamAssassin and SURBL
projects, where this became obvious once spammers started obfuscating
URIs and using domains from many different TLDs. For many of those TLDs
it takes a lot of research to determine where to chop the hostname to
get the actual registered domain.

There's much more to it than using a library or regexp.

See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
"industrial strength" solution to this problem, which still has room for
improvement.

- Ryan

--
Ryan Thompson <ryan [at] sasknow.com>

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America
Ryan Thompson [ Fri, 12 November 2004 18:38 ] [ ID #480947 ]

Re: Extract domain name

Sorry, I didn't pay attention to the subdomains in your example.
So, IMHO, it depends on your task: if you can enumerate the possible
TLD values in advance, just split the host name into its parts and keep
only the matched TLD and the SLD.
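
The split-and-match idea might look roughly like the following. This is only a sketch under stated assumptions: the helper name sld_and_tld and the %known_tld allowlist are made up for illustration, the host name is assumed to have been extracted from the URL already, and only single-label TLDs are handled.

```perl
use strict;
use warnings;

# Hypothetical allowlist of the TLDs the task is assumed to care about.
my %known_tld = map { $_ => 1 } qw(com net org);

# Return "sld.tld" for a host name whose last label is a known TLD,
# or undef if the host does not fit that assumption.
sub sld_and_tld {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return unless @labels >= 2;
    return unless $known_tld{ $labels[-1] };
    return join '.', @labels[-2, -1];
}

for my $host (qw(www.domain.com sub.domain.com domain.org www.toyota.co.jp)) {
    my $domain = sld_and_tld($host);
    printf "%-18s => %s\n", $host, defined $domain ? $domain : '(no match)';
}
```

Hosts whose last label is not in the allowlist are rejected rather than guessed at, which keeps the sketch honest about what it cannot handle.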

Regards,
Andrew

Ryan Thompson wrote on 12 November 2004 17:38:

> [ Cross-post trimmed ]
>
> Shabam wrote to :
>
>> How do you fetch just the domain name part of a variable in a script?
>> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
>> or "http://sub.domain.com/blahblah/whatever/page.htm".
>>
>> What I need is to extract just the "domain.com".
>
> This is definitely a non-trivial problem. Fortunately, it's been
> partially solved already. I'm involved in the SpamAssassin and SURBL
> projects, where this really became obvious when spammers started
> obfuscating URIs, and using domains from many different TLDs where it
> takes a lot of research to determine where to chop the hostname to get
> the actual registrar domain.
>
> There's much more to it than using a library or regexp.
>
> See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
> "industrial strength" solution to this problem, which still has room for
> improvement.
>
> - Ryan
>

--
Andrew
Andrew Tkachenko [ Fri, 12 November 2004 21:40 ] [ ID #480949 ]

Re: Extract domain name

Shabam wrote:

> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

The problem is not well defined.

For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
"toshiba.com"? For "http://story.news.yahoo.com", is "news" included or not?
You can't just use the last two components in all cases, such as
"http://www.toyota.co.jp" or "http://www.bbc.co.uk".
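
A quick way to see the point about the last two components is to apply that naive rule to the hosts mentioned here. The helper name last_two_labels is hypothetical, and the code is a sketch of the pitfall, not a recommended solution.

```perl
use strict;
use warnings;

# Naive rule: keep only the last two dot-separated labels of a host name.
sub last_two_labels {
    my ($host) = @_;
    my @labels = split /\./, $host;
    return @labels > 2 ? join('.', @labels[-2, -1]) : $host;
}

for my $host (qw(www.tacp.toshiba.com story.news.yahoo.com www.toyota.co.jp www.bbc.co.uk)) {
    printf "%-22s => %s\n", $host, last_two_labels($host);
}
```

For the last two hosts the rule yields "co.jp" and "co.uk", which are registry suffixes rather than the organizations' own domains.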

-Joe
Joe Smith [ Sun, 14 November 2004 09:22 ] [ ID #482592 ]

Re: Extract domain name

> The problem is not well defined.
>
> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or
> not? You can't just use the last two components in all cases, such as
> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".

What I would need is just the domain name part. In this case it would be
"toshiba.com" only. No subdomains. My domains will be simple
(com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.
Shabam [ Sun, 14 November 2004 12:12 ] [ ID #482900 ]

Re: Extract domain name

Shabam wrote:

>> The problem is not well defined.
>>
>> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
>> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or
>> not? You can't just use the last two components in all cases, such as
>> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".
>
> What I would need is just the domain name part. In this case it would be
> "toshiba.com" only. No subdomains. My domains will be simple
> (com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.
>
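I'm not an expert, but the following regex should work:

$url = "http://www.abc.xyz.toy-0-ota.com";
# Group the TLD alternatives; otherwise "|net|org" are alternatives to the whole capture.
# [^/]* keeps the match inside the host part, and the leading labels are optional.
($domain) = ($url =~ m{^http://(?:[^/]*\.)?([0-9a-zA-Z-]+\.(?:com|net|org))(?=[:/]|\z)}i);
print $domain . "\n";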

Sam
Sam [ Thu, 18 November 2004 08:40 ] [ ID #490386 ]
