Large-scale spidering

Greets,

I've written an industrial-strength search engine library for Perl
(KinoSearch), and now I have clients who want me to work on a
large-scale spidering app for them. Sort of like Nutch for Perl
(<http://lucene.apache.org/nutch>). Putch. :)

What efforts have already been undertaken in this area? A survey of
existing CPAN releases that I should study would be great. I've
written a small-scale spider using LWP::RobotUA. I've scanned over
the WWW::Mechanize docs, but don't yet grasp its full capabilities.
What else?

Thanks,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
marvin [ Fri, 14 April 2006 04:16 ] [ ID #1274648 ]

Re: Large-scale spidering

Plucene maybe? It's up on CPAN.

Justin

jcook713 [ Fri, 14 April 2006 13:26 ] [ ID #1274649 ]

Re: Large-scale spidering

On Apr 14, 2006, at 4:26 AM, J Cook wrote:

> Plucene maybe? It's up on CPAN.

I'm intimately acquainted with Plucene. I actually spent a week or
two hacking on it last August before deciding that its performance
issues could not be resolved without a complete overhaul which would
break the API.

http://www.rectangular.com/kinosearch/benchmarks.html

KinoSearch, like Plucene, is a text search engine library. In order
to write an industrial-strength spider a la Nutch, you need a lot
more than that: HTML::Parser, HTML::LinkExtor, LWP::RobotUA... I've
now discovered WWW::RobotRules::AnyDBM_File, which is going to be
very helpful. But there are a lot of other problems to be solved.
Check-summing page content to eliminate duplicate documents available
via multiple URLs. Managing crawl depth so that a spider doesn't
venture too deep into one domain and forget about all the others.
Eventually, if you want to get fancy, link analysis and PageRank.
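
To make the first two of those problems concrete, here is a minimal
sketch, not taken from KinoSearch, Nutch, or any CPAN release. It
checksums page content with Digest::MD5 so that the same document
reached via a second URL is skipped, and it caps the number of pages
accepted per host so one domain can't swallow the whole crawl. The
names accept_page, content_digest, host_of and $MAX_PAGES_PER_HOST
are all made up for illustration, the in-memory hashes stand in for
whatever on-disk store a real large-scale crawler would need, and the
checksum only catches exact duplicates (after whitespace collapsing),
not near-duplicates.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical limit: most pages the crawler accepts from a single host.
my $MAX_PAGES_PER_HOST = 2;

my %seen_digest;       # content checksum => first URL that served it
my %pages_per_host;    # host => number of pages accepted so far

# Collapse whitespace so trivially different copies hash identically.
# Assumes $content is raw bytes; md5_hex dies on wide characters.
sub content_digest {
    my ($content) = @_;
    my $text = $content;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return md5_hex($text);
}

# Crude host extraction via regex; a real crawler would use the URI module.
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc($1) : undef;
}

# Returns 'ok' if the page should be kept, otherwise the reason it was skipped.
sub accept_page {
    my ( $url, $content ) = @_;
    my $host = host_of($url);
    return 'bad-url' unless defined $host;
    return 'host-budget'
        if ( $pages_per_host{$host} || 0 ) >= $MAX_PAGES_PER_HOST;
    my $digest = content_digest($content);
    return 'duplicate' if exists $seen_digest{$digest};
    $seen_digest{$digest} = $url;
    $pages_per_host{$host}++;
    return 'ok';
}

# Made-up fetch results: [ URL, body ].
my @fetched = (
    [ 'http://example.com/',           "<p>Hello</p>\n" ],
    [ 'http://example.com/index.html', "<p>Hello</p>" ],    # same page, second URL
    [ 'http://example.com/a',          "<p>A</p>" ],
    [ 'http://example.com/b',          "<p>B</p>" ],        # over the host budget
    [ 'http://example.org/',           "<p>Other</p>" ],
);
print "$_->[0]: ", accept_page(@$_), "\n" for @fetched;
```

A per-host page count is only one crude way to manage depth; tracking
link distance from the seed URL, or scheduling hosts round-robin,
would be alternatives.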

LWP::Parallel::RobotUA looks interesting. There's a bunch of stuff
under Bundle::LinkController, but it hasn't been updated in a while.
What else?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
marvin [ Sat, 15 April 2006 19:04 ] [ ID #1275496 ]