Word frequency analyser

Hi,

Does anyone happen to know if there's a convenient module which will analyse
at least two XML files and list the most frequently-used words?

(It would have to be able to reject tags and certain words such as "the" and
"is")
DVH [ Sun, 23 October 2005 16:51 ] [ ID #1026249 ]

Re: Word frequency analyser

"DVH" <dvh [at] dvhdvhdvhdvdh.dvh> wrote:

> Hi,
>
> Does anyone happen to know if there's a convenient module which will
> analyse at least two XML files and list the most frequently-used
> words?
>
> (It would have to be able to reject tags

XML::Parser, with a Char handler.

> and certain words such as
> "the" and "is")

Split in the handler on non-word characters and use a hash for counting.
Afterwards, delete all occurrences of "the", "is", etc. from the hash.

Note that this is a very simplistic approach: if words are hyphenated,
it counts the parts as two different words.
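
The steps above could be sketched roughly as follows. This is only a minimal sketch, not a finished tool: the helper name make_counter, the top-20 cut-off and the two-word stop list are assumptions made for illustration, and the file names are taken from the command line.

```perl
use strict;
use warnings;
use XML::Parser;

# Build a parser whose Char handler tallies words into the hash ref $count.
# (make_counter is a hypothetical helper name, not part of XML::Parser.)
sub make_counter {
    my ($count) = @_;
    return XML::Parser->new(
        Handlers => {
            # Char receives only character data, so tags are never seen here.
            # Note: expat may deliver one run of text in several calls.
            Char => sub {
                my ($expat, $text) = @_;
                $count->{ lc $_ }++ for grep { length } split /\W+/, $text;
            },
        },
    );
}

my %count;
my $parser = make_counter(\%count);
$parser->parsefile($_) for @ARGV;    # e.g. two or more XML files

# Drop the unwanted words afterwards (illustrative list only).
delete @count{qw(the is)};

# Most frequent first; ties broken alphabetically. Show at most 20 words.
my @words = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
printf "%6d  %s\n", $count{$_}, $_ for grep { defined } @words[0 .. 19];
```

Splitting on \W+ is what causes the hyphenation problem mentioned above, and it also breaks "don't" into "don" and "t"; a more careful word pattern would be needed to avoid that.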

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
I ploink googlegroups.com :-)
John Bokma [ Mon, 24 October 2005 01:49 ] [ ID #1027842 ]

Re: Word frequency analyser

John Bokma <john [at] castleamber.com> writes:

> "DVH" <dvh [at] dvhdvhdvhdvdh.dvh> wrote:
>
>> Hi,
>>
>> Does anyone happen to know if there's a convenient module which will
>> analyse at least two XML files and list the most frequently-used
>> words?
>>
>> (It would have to be able to reject tags
>
> XML::Parser, with a Char handler.
>
>> and certain words such as
>> "the" and "is")
>
> Split in the handler on non-word characters and use a hash for counting.
> Afterwards, delete all occurrences of "the", "is", etc. from the hash.
>
> Note that this is a very simplistic approach: if words are hyphenated,
> it counts the parts as two different words.

Searching for "word frequency" on search.cpan.org turns up some modules
that are designed for this sort of thing, and may take some of the
trickier issues into account.

----Scott.
Scott W Gifford [ Mon, 24 October 2005 17:30 ] [ ID #1027852 ]

Re: Word frequency analyser

On Tue, 25 Oct 2005 01:30:58 +1000, Scott W Gifford wrote:

Hi Folks

A list of stop words, courtesy of MySQL, can be downloaded from:

http://savage.net.au/Ron/mysql-stop-words.txt
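
Once a list like that is saved locally, filtering a word-count hash against it might look something like the sketch below. The helper names read_stop_words and drop_stop_words are made up for illustration, and the code assumes the list is plain text with words separated by whitespace; adjust the parsing if the actual file is laid out differently.

```perl
use strict;
use warnings;

# Read whitespace-separated stop words from a file handle into a lookup hash.
# (Assumed file layout; hypothetical helper name.)
sub read_stop_words {
    my ($fh) = @_;
    my %stop;
    while (my $line = <$fh>) {
        $stop{ lc $_ } = 1 for split ' ', $line;
    }
    return \%stop;
}

# Remove every stop word from a word => count hash, in place.
sub drop_stop_words {
    my ($count, $stop) = @_;
    delete @{$count}{ keys %$stop };
    return $count;
}

# Usage sketch with an in-memory "file"; a real run would open the saved list.
open my $fh, '<', \"the is\nof and\n" or die "open: $!";
my %count = (the => 12, perl => 5, is => 7, hash => 3);
drop_stop_words(\%count, read_stop_words($fh));
print join(', ', sort keys %count), "\n";    # prints "hash, perl"
```

Keeping the stop words in a hash means the list can be swapped or extended without touching the counting code.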
Ron Savage [ Tue, 25 October 2005 11:37 ] [ ID #1029632 ]
