regular expression for special html characters

Hello,

I am trying to convert HTML character entities to the characters they
represent. For example, converting &#8221; to " .

I have the string
$str = "&#8220; test &#8221; ניסיון ";
The string also contains Hebrew letters.

1. First I did:
$str = decode_entities($str);
It converts the entities correctly.
The problem is that the Hebrew comes out wrong.
When I print the value of $str, the Hebrew appears as יסיון

2. Then I decided to write a regular expression that changes only the
HTML entities.
I wrote:
$str = "&#8220; test &#8221; ניסיון ";
$str =~ s/(&#(?=[0-9])*.{2,5};)/decode_entities($1)/ge;
Even though it should act only on the matched substrings, it seems to
affect the Hebrew letters as well.
The Hebrew letters again come out as יסיון
Parts 1 and 2 give the same output.

3. To check the regular expression, I removed the 'e' flag at the
end of the substitution so that I could see what gets matched.
I wrote:
$str = "&#8220; test &#8221; ניסיון ";
$str =~ s/(&#(?=[0-9])*.{2,5};)/decode_entities($1)/g;
The output was:
decode_entities(&#8220;) test decode_entities(&#8221;) ניסיון
The Hebrew came out fine, of course.

4. I can do:
$str =~ s/&#8220;|&#8221;/"/g;
This does not affect the Hebrew, and it converts the entities.
The problem is that the data contains other HTML entities as well.
I would like something more generic that will also work in the
future.
Any ideas are welcome!
Shlomit.

Shlomit Afgin [ Wed, 02 February 2011 10:25 ] [ ID #2054524 ]

Re: regular expression for special html characters

2011/2/2 Shlomit Afgin <Shlomit.Afgin [at] weizmann.ac.il>:
>
>
> Hello,
>
> I tried to convert html special characters to their real character.
> For example, converting &#8221; to " .
>
> I had the string
> $str = "&#8220; test &#8221; ניסיון ";
> The string also contains Hebrew letters.
>

Could Encode work on it?

use Encode;
$new = encode("iso-8859-1",decode("iso-8859-8",$str));

Regards.

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Jeff Pang [ Thu, 03 February 2011 11:52 ] [ ID #2054526 ]
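
A note on the Encode idea above: converting between iso-8859-8 and iso-8859-1 is unlikely to help, but Encode can be used differently. The sketch below assumes the data arrives as raw UTF-8 bytes (an assumption; the thread does not say how $str is read). It decodes the bytes to characters first, expands the entities, and encodes again only when printing. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Assumption: the input is raw UTF-8 bytes. The Hebrew word is spelled
# with \x escapes so the example does not depend on the file's encoding.
my $bytes = "&#8220; test &#8221; "
          . "\xD7\xA0\xD7\x99\xD7\xA1\xD7\x99\xD7\x95\xD7\x9F";

# 1. Turn the UTF-8 bytes into Perl characters.
my $chars = decode('UTF-8', $bytes);

# 2. Expand the entities; the Hebrew letters are ordinary characters now.
my $text = decode_entities($chars);

# 3. Encode back to UTF-8 bytes only when writing the result out.
print encode('UTF-8', $text), "\n";
```

If decode_entities is applied to the undecoded bytes instead, the result mixes wide characters (the quotes) with single bytes (the Hebrew), and printing such a string re-encodes every Hebrew byte separately, which would explain the damaged output described in the question.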

Re: regular expression for special html characters

On 11-02-02 04:25 AM, Shlomit Afgin wrote:
> I tried to convert html special characters to their real character.
> For example, converting &#8221; to " .
>
> I had the string
> $str = "&#8220; test &#8221; ניסיון ";
> The string also contains Hebrew letters.

This seems to work:

#!/usr/bin/perl

use strict;
use warnings;

use encoding( 'utf8' );
use HTML::Entities;

my $str = "&#8220; test &#8221; ניסיון ";
$str = decode_entities( $str );
print "$str\n";

__END__


--
Just my 0.00000002 million dollars worth,
Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.

Shawn H Corey [ Thu, 03 February 2011 14:45 ] [ ID #2054527 ]
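
The encoding pragma used above has since been deprecated in newer Perl releases. A rough equivalent, assuming the script file itself is saved as UTF-8, is to declare the source encoding with "use utf8" and put an encoding layer on STDOUT. This is a sketch of that variant, not a tested replacement for every setup:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use utf8;                               # string literals in this file are UTF-8
use HTML::Entities qw(decode_entities);

binmode STDOUT, ':encoding(UTF-8)';     # encode characters when printing

my $str     = "&#8220; test &#8221; ניסיון ";
my $decoded = decode_entities($str);    # returns a decoded copy of $str
print "$decoded\n";
```

With "use utf8" the Hebrew literal is already a character string, so decode_entities only has to expand the two entities and leaves the letters alone.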

Re: regular expression for special html characters

At 18:52 +0800 03/02/2011, Jeff Pang wrote:

>2011/2/2 Shlomit Afgin <Shlomit.Afgin [at] weizmann.ac.il>:
>
>
> > I tried to convert html special characters to their real character.
> > For example, converting &#8221; to " .
> >
> > I had the string
> > $str = "&#8220; test &#8221; ניסיון ";
> > The string also contains Hebrew letters.
>
>Could Encode work on it?
>
>use Encode;
>$new = encode("iso-8859-1",decode("iso-8859-8",$str));

Heaven forbid!

The HTML entities here are decimal Unicode code points, so all you need
to do in this case is capture the number n and evaluate chr n in a
substitution:


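#!/usr/local/bin/perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';   # print characters as UTF-8
$_ = "&#8220;&#1488;&#8221;";
s~&#(\d+);~chr $1~eg;                 # replace each &#n; with chr(n)
print; # -=> “א”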

JD

John Delacour [ Thu, 03 February 2011 15:56 ] [ ID #2054529 ]
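
The chr substitution above covers decimal entities. As a possible extension, the sketch below also accepts the hexadecimal form (&#x...;). The helper name decode_numeric_entities is made up for this example and is not a library function. Named entities such as &amp; are deliberately left untouched; those would still need HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: expands numeric character references only.
sub decode_numeric_entities {
    my ($text) = @_;
    # $1 captures hex digits (&#x201C;), $2 captures decimal digits (&#8220;).
    $text =~ s{&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));}
              {chr(defined($1) ? hex($1) : $2)}ge;
    return $text;
}

my $out = decode_numeric_entities('&#8220;&#x5D0;&#8221; &amp;');
print "$out\n";
```

The sketch does no range checking, so an absurdly large number in the input would be passed straight to chr. Real data may call for a guard there.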