Javascript Execution

Hi,

I am working on an HTTP agent that harvests products and prices from
a number of websites. The problem I run into is that some sites use
JavaScript.

After downloading a web page with my Perl script, what I want to do is:
1. Execute any JavaScript on the page.
2. Modify the page according to the JavaScript results.
3. Save the page to a local file.

Hence I want to do the exact thing that a normal web browser does but
instead of writing to a browser window the output shall be written to a file.
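
For illustration only, the three steps above might look roughly like the
following. This is a minimal sketch in JavaScript on Node.js (an assumption;
it is not the Perl agent described here), using Node's built-in vm module as
the script engine. The function name renderInlineScripts, the sample markup
and the output file name are made up for the example, and the only browser
API it emulates is document.write.

```javascript
// Sketch under assumptions: Node.js and its vm module stand in for a real
// browser engine; only document.write is emulated.
const vm = require('node:vm');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Steps 1 and 2: run each inline <script> block and replace it with
// whatever the script passed to document.write.
function renderInlineScripts(html) {
  let written = [];
  // One shared context, so globals set by one script are visible to the next.
  const context = vm.createContext({
    document: { write: (text) => written.push(String(text)) },
  });
  return html.replace(/<script\b[^>]*>([\s\S]*?)<\/script>/gi, (match, source) => {
    written = [];
    vm.runInContext(source, context, { timeout: 200 });
    return written.join('');
  });
}

// Stand-in for a page the agent has already downloaded.
const downloaded =
  '<p>Widget: <script>document.write("EUR " + (4 * 2.5).toFixed(2));</script></p>';

const rendered = renderInlineScripts(downloaded);

// Step 3: save the modified page to a local file.
const outFile = path.join(os.tmpdir(), 'rendered-page.html');
fs.writeFileSync(outFile, rendered, 'utf8');
console.log(rendered);
```

Real pages need far more than this: external scripts (src=...) are simply
dropped by the sketch, and any script that touches the document beyond
document.write will throw.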

Does anyone know of any possible solution or hint?

All suggestions kindly appreciated!

Regards,
Erik Axelkrans
webmaster [ Sat, 17 December 2005 16:19 ] [ ID #1106638 ]

Re: Javascript Execution

> I am working on a http agent which harvests various products and
> prices on
> a number of websites. The problem I run into is that some sites use
> javascript.

First, look at WWW::Mechanize to make most of your job easier.

Second, there is no client that does JavaScript. See the
WWW::Mechanize FAQ

http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize/FAQ.pod

xoxo,
Andy

--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Sat, 17 December 2005 18:05 ] [ ID #1106639 ]

Re: Javascript Execution

* webmaster [at] awwwsol.com wrote:
>I am working on a http agent which harvests various products and prices on
>a number of websites. The problem I run into is that some sites use
>javascript.

There is Win32::IE::Mechanize.

>After downloading a web page from my perl script, what I want to do is:
>1. Execute any existing javascript on the page.
>2. Modify the page according to the javascript results.
>3. Save the page to a local file.

Note that the scripts might not terminate, so you might get the DOM
at a specific point, but there is not necessarily a specific result.
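
One way to cope with scripts that never finish is to give each script a time
budget and take whatever state it has reached when the budget runs out. The
sketch below assumes Node.js and its built-in vm module (neither is mentioned
in this thread); runWithBudget and the two sample scripts are hypothetical.

```javascript
// Sketch under assumptions: Node.js vm module as the engine.
const vm = require('node:vm');

// Run a script with a time budget in milliseconds and report whether it
// finished. The sandbox object holds the script's global variables.
function runWithBudget(source, sandbox, budgetMs) {
  try {
    vm.runInNewContext(source, sandbox, { timeout: budgetMs });
    return { finished: true, sandbox };
  } catch (err) {
    // A timeout, like any other script error, lands here. The sandbox keeps
    // whatever state the script reached before it was stopped.
    return { finished: false, reason: err.message, sandbox };
  }
}

const quick = runWithBudget('total = 2 + 3;', { total: 0 }, 100);
const endless = runWithBudget('count = 1; while (true) {}', { count: 0 }, 100);

console.log(quick.finished, quick.sandbox.total);
console.log(endless.finished, endless.sandbox.count);
```

The second script is cut off after the budget expires, so what you save is a
snapshot at that point rather than a final result.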
--
Björn Höhrmann · mailto:bjoern [at] hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
derhoermi [ Sat, 17 December 2005 18:08 ] [ ID #1106640 ]

Re: Javascript Execution

There are also JavaScript engines available in C and Java
(SpiderMonkey and Rhino, respectively, available on mozilla.org). You
may be able to leverage those.

Chris

On 12/17/05, Bjoern Hoehrmann <derhoermi [at] gmx.net> wrote:
> * webmaster [at] awwwsol.com wrote:
> >I am working on a http agent which harvests various products and prices on
> >a number of websites. The problem I run into is that some sites use
> >javascript.
>
> There is Win32::IE::Mechanize.
>
> >After downloading a web page from my perl script, what I want to do is:
> >1. Execute any existing javascript on the page.
> >2. Modify the page according to the javascript results.
> >3. Save the page to a local file.
>
> Note that the scripts might not terminate, so you might get the DOM
> at a specific point, but there is not necessarily a specific result.
> --
> Björn Höhrmann · mailto:bjoern [at] hoehrmann.de · http://bjoern.hoehrmann.de
> Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
> 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
hartct [ Sat, 17 December 2005 18:16 ] [ ID #1106641 ]

Re: Javascript Execution

* Christopher Hart wrote:
>There are also JavaScript engines available in C and Java
>(SpiderMonkey and Rhino, respectively, available on mozilla.org). You
>may be able to leverage those.

Though note that the engines alone won't help much here; you'd need an
implementation of the various APIs the sites use as well (e.g., the DOM
APIs to manipulate the document). There are of course several such
implementations available that interact well with the two engines, but
it might be difficult to reuse them.
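
A small demonstration of the point, assuming Node.js and its built-in vm
module as the bare engine (an assumption, not one of the two engines named
above): the same page script fails when no document object exists and works
once a stand-in is supplied. The stubDocument object is hypothetical and
implements a single method.

```javascript
// Sketch under assumptions: Node.js vm module as a bare script engine.
const vm = require('node:vm');

const pageScript = 'document.write("Price: " + (19.5 * 2).toFixed(2));';

// 1. Bare engine: there is no `document`, so the script fails at once.
let bareError = null;
try {
  vm.runInNewContext(pageScript, {});
} catch (err) {
  bareError = err.name;
}

// 2. Same engine plus a hypothetical one-method stand-in for the DOM.
const written = [];
const stubDocument = { write: (text) => written.push(String(text)) };
vm.runInNewContext(pageScript, { document: stubDocument });

console.log(bareError);
console.log(written.join(''));
```

Every further API a site calls (getElementById, window.location, timers and
so on) would need its own implementation in the same way.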
--
Björn Höhrmann · mailto:bjoern [at] hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
derhoermi [ Sat, 17 December 2005 18:48 ] [ ID #1106642 ]

Re: Javascript Execution

On Sat, Dec 17, 2005 at 12:16:29PM -0500, Christopher Hart (hartct [at] gmail.com) wrote:
> There are also JavaScript engines available in C and Java
> (SpiderMonkey and Rhino, respectively, available on mozilla.org). You
> may be able to leverage those.

I didn't know about SpiderMonkey. I'm going to have a look at it to see
if it will fit into WWW::Mechanize.

--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Sat, 17 December 2005 20:10 ] [ ID #1106643 ]

Re: Javascript Execution

Thank you all for your responses. WWW::Mechanize is new to me and seems
very good.

What about Mozilla::Mechanize?

Does it interpret JavaScript?

Erik
webmaster [ Sat, 17 December 2005 21:39 ] [ ID #1106644 ]

Re: Javascript Execution

On Sat, 2005-12-17 at 13:10 -0600, Andy Lester wrote:

> I didn't know about SpiderMonkey. I'm going to have a look at it to see
> if it will fit into WWW::Mechanize.

FYI, there's also http://search.cpan.org/dist/JavaScript-SpiderMonkey/
vskytta [ Sat, 17 December 2005 22:39 ] [ ID #1106645 ]

Re: Javascript Execution

You might also look into using a scriptable browser. I think the
Mozilla organization has something along these lines.

On Dec 17, 2005, at 11:16 AM, Christopher Hart wrote:

> There are also JavaScript engines available in C and Java
> (SpiderMonkey and Rhino, respectively, available on mozilla.org). You
> may be able to leverage those.
>
> Chris
>
> On 12/17/05, Bjoern Hoehrmann <derhoermi [at] gmx.net> wrote:
>> * webmaster [at] awwwsol.com wrote:
>>> I am working on a http agent which harvests various products and
>>> prices on
>>> a number of websites. The problem I run into is that some sites use
>>> javascript.
>>
>> There is Win32::IE::Mechanize.
>>
>>> After downloading a web page from my perl script, what I want to
>>> do is:
>>> 1. Execute any existing javascript on the page.
>>> 2. Modify the page according to the javascript results.
>>> 3. Save the page to a local file.
>>
>> Note that the scripts might not terminate, so you might get the DOM
>> at a specific point, but there is not necessarily a specific result.
>> --
>> Björn Höhrmann · mailto:bjoern [at] hoehrmann.de · http://bjoern.hoehrmann.de
>> Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
>> 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>>
jalotta [ Sun, 18 December 2005 00:50 ] [ ID #1107406 ]

Re: Javascript Execution

>
> What about: Mozilla::Mechanize
>
> Does it interpret javascript?

I don't know. I imagine the docs tell you.

--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Sun, 18 December 2005 01:59 ] [ ID #1107407 ]

Re: Javascript Execution

On Sat, 17 Dec 2005, Andy Lester wrote:

> On Sat, Dec 17, 2005 at 12:16:29PM -0500, Christopher Hart (hartct [at] gmail.com) wrote:
> > There are also JavaScript engines available in C and Java
> > (SpiderMonkey and Rhino, respectively, available on mozilla.org). You
> > may be able to leverage those.
>
> I didn't know about SpiderMonkey. I'm going to have a look at it to see
> if it will fit into WWW::Mechanize.

Hi Andy

As I've posted about here before a few times (search Gmane), I actually
did this with my Python port of WWW::Mechanize a few years back, using
SpiderMonkey. My implementation was a first-cut, half-baked thing, but I
did get it working for a few pages. I decided that was enough excitement
for me ;-) I know a few people used it for projects of their own and
improved on it a bit, though (e.g. one guy used it in a college project to
make JS-using pages accessible on non-JS devices, by having a proxy server
and executing the JS there -- nice idea). The code is still available at
wwwsearch.sf.net

I made use of the Perl wrapper of SpiderMonkey to write something very
similar for Python. IIRC, I had to extend it a little over what was in
the Perl thing.

I used an existing HTML DOM, but had to modify both the DOM, and of course
the DOM builder (and add event stuff and browser object model). This is
where the work lies :-) If you intend to try this, and you're not
intimately familiar with the bizarre ways in which people can and do use
<script> tags, I have some email you may want to read (I certainly didn't
understand the issues, so my published code is wrong; a contributor
provided patches & explanations that I never merged in).

Of course:

1. A good, strict, HTML DOM tree builder is not the same as a good browser
DOM builder. It must be very lenient. I'm not up-to-date with current
Perl libraries, but I don't think such a thing exists. Of course, lenient
tree builders like HTML::TreeBuilder exist, but recall that script
execution takes place during DOM building and that script must be able to
access the part-built DOM, so they would need to be 're-targeted' (or even
dynamically mapped, perhaps) to a 'real' DOM tree. Actually, just a week
ago I was looking at reusing the Mozilla DOM & builder in a lightweight
way (ie. without a GUI and probably without Mozilla's URL-fetching code)
-- I'd be interested if other people get this to work (sum total of the
work I did so far was to compile Firefox and run some of its tests, so I
don't know whether it's feasible yet).

2. A generic HTML DOM is not the same thing as a good browser DOM +
browser object model. There are many quirks. And I'm not even sure
there's a good HTML DOM out there for Perl. Anybody know one? AFAIK,
there's no good free browser DOM out there in *any* language other than
C++ (in Firefox and KHTML), though I recall Java's httpunit does some JS
stuff, using Rhino (dunno how well), so clearly whatever DOM they use is
good enough for at least some JS to work.
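
To make point 1 concrete, here is a toy sketch of script execution during
tree building, where a script only sees the part of the document that has
been parsed so far. It assumes Node.js and its built-in vm module, which is
not what I used. The names buildWithScripts and report are hypothetical, the
"DOM" is reduced to a set of element ids, and the tokenizer is a regular
expression rather than a real lenient HTML parser.

```javascript
// Sketch under assumptions: Node.js vm module as the engine; the "DOM" is
// just a set of ids seen so far.
const vm = require('node:vm');

// Walk the markup in document order and run each inline script the moment
// it is reached. Only ids seen before that point are visible to the script.
function buildWithScripts(html) {
  const seenIds = new Set();
  const log = [];
  const context = vm.createContext({
    document: {
      // Heavily reduced getElementById: reports presence only.
      getElementById: (id) => (seenIds.has(id) ? { id } : null),
    },
    report: (text) => log.push(String(text)),
  });

  // Either a whole inline script block, or an opening tag carrying an id.
  const token =
    /<script\b[^>]*>([\s\S]*?)<\/script>|<[a-z][^>]*\bid="([^"]+)"[^>]*>/gi;
  let match;
  while ((match = token.exec(html)) !== null) {
    if (match[1] !== undefined) {
      vm.runInContext(match[1], context, { timeout: 200 });
    } else {
      seenIds.add(match[2]);
    }
  }
  return log;
}

const page = [
  '<div id="header"></div>',
  '<script>',
  '  report("header: " + (document.getElementById("header") !== null));',
  '  report("footer: " + (document.getElementById("footer") !== null));',
  '</script>',
  '<div id="footer"></div>',
].join('\n');

const log = buildWithScripts(page);
console.log(log.join('\n'));
```

The script finds the header but not the footer, because the footer has not
been built yet when the script runs. A builder that parses the whole page
first and runs scripts afterwards would give a different answer.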

On the whole, don't underestimate the work, but I think it's not *too*
hard to make something useful, if not perfection.


John
jjl [ Sun, 18 December 2005 15:45 ] [ ID #1107408 ]