utf8_decode

(PHP 4, PHP 5, PHP 7, PHP 8)

utf8_decode - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters

Warning

This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

#[\Deprecated]
function utf8_decode(string $string): string

This function converts the string string from UTF-8 to ISO-8859-1. Bytes in the string that are not valid UTF-8, and UTF-8 characters that do not exist in ISO-8859-1 (that is, code points above U+00FF), are replaced with ?.

Note:

Many web pages marked as ISO-8859-1 are actually encoded in the similar Windows-1252, and web browsers interpret ISO-8859-1 pages as Windows-1252. In place of certain ISO-8859-1 control codes, Windows-1252 contains additional printable characters such as the euro sign (€) and curly double quotes (“ ”). This function does not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.
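
One possible alternative for the Windows-1252 case is sketched below. It assumes that the iconv extension is loaded and that the underlying iconv library recognises the name CP1252; the helper name utf8_to_windows1252 is hypothetical.

```php
<?php

/**
 * Hypothetical helper: converts UTF-8 text to Windows-1252 using iconv().
 * Assumes the iconv extension is loaded and recognises the name 'CP1252'.
 */
function utf8_to_windows1252(string $utf8): string
{
    $converted = iconv('UTF-8', 'CP1252', $utf8);
    if ($converted === false) {
        // iconv() returns false (and raises a notice) when the input is not
        // valid UTF-8 or contains characters that Windows-1252 lacks.
        throw new InvalidArgumentException('Input cannot be converted to Windows-1252');
    }
    return $converted;
}

// Euro sign followed by curly double quotes, written as UTF-8 bytes.
echo bin2hex(utf8_to_windows1252("\xE2\x82\xAC\xE2\x80\x9C\xE2\x80\x9D")), "\n";
```

Unlike utf8_decode(), this sketch raises an exception instead of silently substituting '?', which may or may not suit a given application.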

Parameters

string

A UTF-8 encoded string.

Return Values

Returns the ISO-8859-1 translation of string.

Changelog

Version Description
8.2.0 This function has been deprecated.
7.2.0 This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples

Example #1 Basic conversion of a string from UTF-8 to ISO-8859-1

<?php

// Convert the string 'Zoë' from UTF-8 to ISO 8859-1
$utf8_string = "\x5A\x6F\xC3\xAB";
$iso8859_1_string = utf8_decode($utf8_string);
echo bin2hex($iso8859_1_string), "\n";

// Invalid UTF-8 sequences are replaced with '?'
$invalid_utf8_string = "\xC3";
$iso8859_1_string = utf8_decode($invalid_utf8_string);
var_dump($iso8859_1_string);

// Characters that do not exist in ISO 8859-1, such as
// the euro sign '€', are also replaced with '?'
$utf8_string = "\xE2\x82\xAC";
$iso8859_1_string = utf8_decode($utf8_string);
var_dump($iso8859_1_string);

?>

The above example will output:

5a6feb
string(1) "?"
string(1) "?"

Notes

Note: Deprecation and alternatives

As of PHP 8.2.0, this function is deprecated and will be removed in a future version. Existing calls should be reviewed and replaced with suitable alternatives.

Similar functionality is provided by mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.

<?php

$utf8_string = "\xC3\xAB"; // 'ë' (e with diaeresis) in UTF-8
$iso8859_1_string = mb_convert_encoding($utf8_string, 'ISO-8859-1', 'UTF-8');
echo bin2hex($iso8859_1_string), "\n";

$utf8_string = "\xCE\xBB"; // 'λ' (Greek lower-case lambda) in UTF-8
$iso8859_7_string = mb_convert_encoding($utf8_string, 'ISO-8859-7', 'UTF-8');
echo bin2hex($iso8859_7_string), "\n";

$utf8_string = "\xE2\x82\xAC"; // '€' (euro sign) in UTF-8, not present in ISO-8859-1
$windows_1252_string = mb_convert_encoding($utf8_string, 'Windows-1252', 'UTF-8');
echo bin2hex($windows_1252_string), "\n";

?>

The above example will output:

eb
eb
80

Other options, which may be available depending on the extensions installed, are UConverter::transcode() and iconv().

The following all give the same result:

<?php

$utf8_string = "\x5A\x6F\xC3\xAB"; // 'Zoë' in UTF-8

$iso8859_1_string = utf8_decode($utf8_string);
echo bin2hex($iso8859_1_string), "\n";

$iso8859_1_string = mb_convert_encoding($utf8_string, 'ISO-8859-1', 'UTF-8');
echo bin2hex($iso8859_1_string), "\n";

$iso8859_1_string = iconv('UTF-8', 'ISO-8859-1', $utf8_string);
echo bin2hex($iso8859_1_string), "\n";

$iso8859_1_string = UConverter::transcode($utf8_string, 'ISO-8859-1', 'UTF8');
echo bin2hex($iso8859_1_string), "\n";

?>

The above example will output:

5a6feb
5a6feb
5a6feb
5a6feb

Specifying '?' as the 'to_subst' option when using UConverter::transcode() gives the same result as utf8_decode() for strings that are invalid or cannot be represented in ISO 8859-1.

<?php

$utf8_string = "\xE2\x82\xAC"; // '€' (euro sign) does not exist in ISO-8859-1

$iso8859_1_string = UConverter::transcode(
    $utf8_string,
    'ISO-8859-1',
    'UTF-8',
    ['to_subst' => '?']
);

var_dump($iso8859_1_string);

?>

The above example will output:

string(1) "?"
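
When migrating existing calls, the alternatives above could be wrapped in a single function. The following is only a sketch: the name utf8_to_iso8859_1 is hypothetical, both extensions are optional, and the handling of invalid or unrepresentable input may differ in detail from utf8_decode().

```php
<?php

/**
 * Hypothetical replacement for utf8_decode().
 * Prefers mbstring and falls back to iconv; both are optional extensions.
 */
function utf8_to_iso8859_1(string $utf8): string
{
    if (function_exists('mb_convert_encoding')) {
        // mbstring substitutes '?' by default for invalid or unrepresentable
        // characters (unless mb_substitute_character() has been changed).
        return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    }
    if (function_exists('iconv')) {
        $converted = iconv('UTF-8', 'ISO-8859-1', $utf8);
        if ($converted !== false) {
            return $converted;
        }
        // iconv() fails instead of substituting a replacement character.
        throw new InvalidArgumentException('Input cannot be represented in ISO-8859-1');
    }
    throw new RuntimeException('Neither mbstring nor iconv is available');
}

echo bin2hex(utf8_to_iso8859_1("\x5A\x6F\xC3\xAB")), "\n"; // 'Zoë'
```

Checking for the extensions at call time keeps the sketch usable on installations where only one of them is present.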

See Also

  • utf8_encode() - Converts a string from ISO-8859-1 to UTF-8
  • mb_convert_encoding() - Convert a string from one character encoding to another
  • UConverter::transcode() - Convert a string from one character encoding to another
  • iconv() - Convert a string from one character encoding to another

User Contributed Notes 31 notes

info at vanylla dot it
17 years ago
IMPORTANT: when converting UTF-8 data that contains the euro sign, DON'T USE the utf8_decode function.

utf8_decode converts the data to the ISO-8859-1 charset. ISO-8859-1 does not contain the euro sign, so the euro sign is converted to a question mark character '?'.

To convert UTF-8 data containing the euro sign properly, use:

iconv("UTF-8", "CP1252", $data)

alexlevin at kvadro dot net
19 years ago
If you are running Gentoo Linux and encounter problems with some PHP4 applications saying:
Call to undefined function: utf8_decode()
try re-emerging PHP4 with the 'expat' flag enabled.

deceze at gmail dot com
14 years ago
Please note that utf8_decode simply converts a string encoded in UTF-8 to ISO-8859-1. A more appropriate name for it would be utf8_to_iso88591. If your text is already encoded in ISO-8859-1, you do not need this function. If you don't want to use ISO-8859-1, you do not need this function.

Note that UTF-8 can represent many more characters than ISO-8859-1. Trying to convert a UTF-8 string that contains characters that can't be represented in ISO-8859-1 to ISO-8859-1 will garble your text and/or cause characters to go missing. Trying to convert text that is not encoded in UTF-8 using this function will most likely garble the text.

If you need to convert any text from any encoding to any other encoding, look at iconv() instead.
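
Whether a string would survive the conversion can be checked beforehand. The sketch below uses only the PCRE functions; the helper name is_iso8859_1_representable is hypothetical.

```php
<?php

/**
 * Hypothetical helper: true when $utf8 is valid UTF-8 and every character
 * is in the range U+0000..U+00FF, i.e. representable in ISO-8859-1.
 */
function is_iso8859_1_representable(string $utf8): bool
{
    // With the /u modifier preg_match() returns false for invalid UTF-8
    // and 0 when a character outside the class is found.
    return preg_match('/\A[\x{00}-\x{FF}]*\z/u', $utf8) === 1;
}

var_dump(is_iso8859_1_representable("Zo\xC3\xAB"));   // 'Zoë'
var_dump(is_iso8859_1_representable("\xE2\x82\xAC")); // euro sign, U+20AC
var_dump(is_iso8859_1_representable("\xC3"));         // truncated sequence
```

Only the first call reports true; the other two inputs are exactly the cases that utf8_decode() would replace with '?'.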

sam
19 years ago
In addition to yannikh's note, to convert a hex utf8 string

<?php

echo utf8_decode("\x61\xc3\xb6\x61");
// works as expected

$abc="61c3b661";
$newstr = "";
$l = strlen($abc);
for ($i=0;$i<$l;$i+=2){
    $newstr .= "\x".$abc[$i].$abc[$i+1];
}
echo utf8_decode($newstr);
// or varieties  of "\x": "\\x" etc does NOT output what you want

echo utf8_decode(pack('H*',$abc));
// this outputs the correct string, like the first line.

?>
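
The loop in the note above fails because escape sequences such as "\x61" are interpreted when PHP parses a string literal, not when strings are concatenated at run time. As a small sketch, the built-in hex2bin() performs the same job as pack('H*', ...):

```php
<?php

$hex = '61c3b661'; // hex dump of the UTF-8 bytes of 'aöa'

// hex2bin() turns each pair of hex digits into one byte at run time.
$utf8 = hex2bin($hex);

var_dump($utf8 === "\x61\xc3\xb6\x61"); // same bytes as the string literal
var_dump($utf8 === pack('H*', $hex));   // same result as pack()
```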

gabriel arobase gabsoftware dot com
14 years ago
If you want to retrieve some UTF-8 data from your database, you don't need utf8_decode().

Simply do the following query before any SELECT :

$result = mysql_query("SET NAMES utf8");

lukasz dot mlodzik at gmail dot com
18 years ago
Update to MARC13's function utf2iso().
I'm using it to handle AJAX POST calls.
Despite using
http.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded; charset=utf-8');
it still encodes Polish letters as UTF-16 escapes (%uXXXX).

This is only for Polish letters:
 
<?php
function utf16_2_utf8 ($nowytekst) {
        $nowytekst = str_replace('%u0104','Ą',$nowytekst);    //Ą
        $nowytekst = str_replace('%u0106','Ć',$nowytekst);    //Ć
        $nowytekst = str_replace('%u0118','Ę',$nowytekst);    //Ę
        $nowytekst = str_replace('%u0141','Ł',$nowytekst);    //Ł
        $nowytekst = str_replace('%u0143','Ń',$nowytekst);    //Ń
        $nowytekst = str_replace('%u00D3','Ó',$nowytekst);    //Ó
        $nowytekst = str_replace('%u015A','Ś',$nowytekst);    //Ś
        $nowytekst = str_replace('%u0179','Ź',$nowytekst);    //Ź
        $nowytekst = str_replace('%u017B','Ż',$nowytekst);    //Ż

        $nowytekst = str_replace('%u0105','ą',$nowytekst);    //ą
        $nowytekst = str_replace('%u0107','ć',$nowytekst);    //ć
        $nowytekst = str_replace('%u0119','ę',$nowytekst);    //ę
        $nowytekst = str_replace('%u0142','ł',$nowytekst);    //ł
        $nowytekst = str_replace('%u0144','ń',$nowytekst);    //ń
        $nowytekst = str_replace('%u00F3','ó',$nowytekst);    //ó
        $nowytekst = str_replace('%u015B','ś',$nowytekst);    //ś
        $nowytekst = str_replace('%u017A','ź',$nowytekst);    //ź
        $nowytekst = str_replace('%u017C','ż',$nowytekst);    //ż
   return ($nowytekst);
   }    
?>

Everything goes smoothly, but it doesn't replace '%u00D3' ('Ó') and '%u00F3' ('ó'). I have no idea what to do about that.

Remember! The file must be saved in UTF-8 encoding.

Aidan Kehoe <php-manual at parhasard dot net>
22 years ago
The fastest way I've found to check if something is valid UTF-8 is 
<?php 
if (iconv('UTF-8', 'UTF-8', $input) != $input) { 
        /* It's not UTF-8--for me, it's probably CP1252, the Windows
           version of Latin 1, with directed quotation marks and
           the Euro sign.  */
}
 ?>. 
The iconv() C library fails if it's told a string is UTF-8 and it isn't; the PHP one doesn't, it just returns the conversion up to the point of failure, so you have to compare the result to the input to find out if the conversion succeeded.
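
A validity check that does not depend on iconv can be sketched with the PCRE functions alone; the helper name is_valid_utf8 is hypothetical.

```php
<?php

/**
 * Hypothetical helper: reports whether $input is well-formed UTF-8,
 * using only the PCRE functions.
 */
function is_valid_utf8(string $input): bool
{
    // The empty pattern matches any subject; with /u PCRE first validates
    // the subject as UTF-8 and preg_match() returns false if that fails.
    return preg_match('//u', $input) === 1;
}

var_dump(is_valid_utf8("Zo\xC3\xAB"));       // valid UTF-8
var_dump(is_valid_utf8("\x93quoted\x94"));   // CP1252 curly quotes, not UTF-8
```

Where the mbstring extension is installed, mb_check_encoding($input, 'UTF-8') is another option.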

Aleksandr
8 years ago
In addition to note by yannikh at gmeil dot com, another way to decode strings with non-latin chars from unix console like

C=RU, L=\xD0\x9C\xD0\xBE\xD1\x81\xD0\xBA\xD0\xB2\xD0\xB0,

<?php preg_replace_callback('/\\\\x([0-9A-F]{2})/', function($a){ return pack('H*', $a[1]); }, $str); ?>

The code above will output:
C=RU, L=Москва,

christoffer
13 years ago
The preferred way to use this on an array would be with the built in PHP function "array_map()", as for example:
$array = array_map("utf8_decode", $array);

thierry.bo # netcourrier point com
20 years ago
In response to fhoech (22-Sep-2005 11:55), I just tried a simultaneous test with the file UTF-8-test.txt using your regexp, 'j dot dittmer' (20-Sep-2005 06:30) regexp (message #56962), `php-note-2005` (17-Feb-2005 08:57) regexp in his message on `mb-detect-encoding` page (http://us3.php.net/manual/en/function.mb-detect-encoding.php#50087) who is using a regexp from the W3C (http://w3.org/International/questions/qa-forms-utf-8.html), and PHP mb_detect_encoding function.

Here is a summary of the results:

201 lines are valid UTF8 strings using phpnote regexp
203 lines are valid UTF8 strings using j.dittmer regexp
200 lines are valid UTF8 strings using fhoech regexp
239 lines are valid UTF8 strings using mb_detect_encoding

Here are the lines with differences (left to right: phpnote, j.dittmer and fhoech):

Line #70 : NOT UTF8|IS UTF8!|IS UTF8! :2.1.1 1 byte (U-00000000): "" 
Line #79 : NOT UTF8|IS UTF8!|IS UTF8! :2.2.1 1 byte (U-0000007F): "" 
Line #81 : IS UTF8!|IS UTF8!|NOT UTF8 :2.2.3 3 bytes (U-0000FFFF): "&#65535;" | 
Line #267 : IS UTF8!|IS UTF8!|NOT UTF8 :5.3.1 U+FFFE = ef bf be = "&#65534;" |
Line #268 : IS UTF8!|IS UTF8!|NOT UTF8 :5.3.2 U+FFFF = ef bf bf = "&#65535;" | 

Interestingly, you said that your regexp corrected the j.dittmer regexp, which failed on section 5.3, but in my test I get the opposite result?!

I ran this test on windows XP with PHP 4.3.11dev. Maybe these differences come from operating system, or PHP version. 

For mb_detect_encoding I used the command :

mb_detect_encoding($line, 'UTF-8, ISO-8859-1, ASCII');

jamalmarlone at gmail dot com
3 years ago
$string = "Bjørn Johansen";

echo mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');

----
prints: "Bjørn Johansen"

okx dot oliver dot koenig at gmail dot com
11 years ago
// This finally helped me to do the job, thanks to Blackbit, had to modify deprecated ereg:
// original comment: "Squirrelmail contains a nice function in the sources to convert unicode to entities:"

function charset_decode_utf_8 ($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (!preg_match("/[\200-\237]/", $string)
     && !preg_match("/[\241-\377]/", $string)
    ) {
        return $string;
    }

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
        "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
        $string
    );

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
        "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
        $string
    );

    return $string;
}

sashott at gmail dot com
11 years ago
Using utf8_decode alone was not enough for me when fetching page content from another site. Problems appear with alphabets other than standard Latin. For example, some characters (corresponding to the HTML entities &bdquo;, &nbsp; and others) are converted to "?" or to "\xA0" (hex value). You need to do some conversion before calling utf8_decode, and you cannot simply replace those bytes, because they can be part of the two-byte code of another character (UTF-8 uses 2 bytes for these letters). The following is for the Cyrillic alphabet, but it should be very similar for others.

function convertMethod($text){
    //Problem is that utf8_decode convert HTML chars for &bdquo; and other to ? or &nbsp; to \xA0. And you can not replace, that they are in some char bytes and you broke cyrillic (or other alphabet) chars.
    $problem_enc=array(
        'euro',
        'sbquo',
        'bdquo',
        'hellip',
        'dagger',
        'Dagger',
        'permil',
        'lsaquo',
        'lsquo',
        'rsquo',
        'ldquo',
        'rdquo',
        'bull',
        'ndash',
        'mdash',
        'trade',
        'rsquo',
        'brvbar',
        'copy',
        'laquo',
        'reg',
        'plusmn',
        'micro',
        'para',
        'middot',
        'raquo',
        'nbsp'
    );
    $text=mb_convert_encoding($text,'HTML-ENTITIES','UTF-8');
    $text=preg_replace('#(?<!\&ETH;)\&('.implode('|',$problem_enc).');#s','--amp{$1}',$text);
    $text=mb_convert_encoding($text,'UTF-8','HTML-ENTITIES');
    $text=utf8_decode($text);
    $text=mb_convert_encoding($text,'HTML-ENTITIES','UTF-8');
    $text=preg_replace('#\-\-amp\{([^\}]+)\}#su','&$1;',$text);
    $text=mb_convert_encoding($text,'UTF-8','HTML-ENTITIES');
    return $text;
}

If this doesn't work, try putting "die($text);" in a few places to see what happens to the text at each step. It is better to test with a long text. It is quite possible to break characters of another alphabet; in that case "&ETH;" is probably not the right prefix for your alphabet. Put "die($text);" after the first preg_replace and look at the HTML code of the character that precedes "--amp".

punchivan at gmail dot com
17 years ago
EY! The bug is not in the function 'utf8_decode'. The bug is in the function 'mb_detect_encoding'. If you pass a word with a special character at the end, like 'accentué', you get a wrong result (UTF-8), but if you put another character at the end, like 'accentuée', you get the right one. So you should always append an ISO-8859-1 character to your string for this check. My advice is to use a blank space.
I've tried it and it works!

function ISO_convert($array)
{
    $array_temp = array();
     
    foreach($array as $name => $value)
    {
        if(is_array($value))
          $array_temp[(mb_detect_encoding($name." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($name) : $name )] = ISO_convert($value);
        else
          $array_temp[(mb_detect_encoding($name." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($name) : $name )] = (mb_detect_encoding($value." ",'UTF-8,ISO-8859-1') == 'UTF-8' ? utf8_decode($value) : $value );
    }

    return $array_temp; 
}
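
The same recursive walk over keys and values can be written without encoding detection by passing the converter in as a callable. This is only a sketch; the name convert_array_recursive is hypothetical, and the usage example assumes the iconv extension is loaded.

```php
<?php

/**
 * Hypothetical helper: applies $convert to every string key and string value
 * of a (possibly nested) array and returns the converted copy.
 */
function convert_array_recursive(array $data, callable $convert): array
{
    $result = [];
    foreach ($data as $key => $value) {
        $newKey = is_string($key) ? $convert($key) : $key;
        if (is_array($value)) {
            $result[$newKey] = convert_array_recursive($value, $convert);
        } elseif (is_string($value)) {
            $result[$newKey] = $convert($value);
        } else {
            $result[$newKey] = $value; // ints, floats, bools and null are left untouched
        }
    }
    return $result;
}

// Usage with iconv as the converter (assumption: the iconv extension is loaded).
$toLatin1 = static function (string $s): string {
    $converted = iconv('UTF-8', 'ISO-8859-1', $s);
    return $converted === false ? $s : $converted; // keep the input if conversion fails
};

$latin1 = convert_array_recursive(["Zo\xC3\xAB" => ["\xC3\xAB", 42]], $toLatin1);
echo bin2hex(array_key_first($latin1)), "\n"; // key 'Zoë' as ISO-8859-1 bytes
```

Keeping the converter separate means the walk itself does not need to change if mb_convert_encoding() or another function is used instead.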

ludvig dot ericson at gmail dot com
18 years ago
A better way to convert would be to use iconv, see http://www.php.net/iconv -- example:

<?php
$myUnicodeString = "Åäö";
echo iconv("UTF-8", "ISO-8859-1", $myUnicodeString);
?>

Above would echo out the given variable in ISO-8859-1 encoding, you may replace it with whatever you prefer.

Another solution to the issue of misdisplayed glyphs is to simply send the document as UTF-8, and of course send UTF-8 data:

<?php
# Replace text/html with whatever MIME-type you prefer.
header("Content-Type: text/html; charset=utf-8");
?>

MARC13
18 years ago
I did this function to convert data from AJAX call to insert to my database.
It converts UTF-8 from XMLHttpRequest() to ISO-8859-2 that I use in LATIN2 MySQL database.

<?php
function utf2iso($tekst)
{
        $nowytekst = str_replace("%u0104","\xA1",$tekst);    //Ą
        $nowytekst = str_replace("%u0106","\xC6",$nowytekst);    //Ć
        $nowytekst = str_replace("%u0118","\xCA",$nowytekst);    //Ę
        $nowytekst = str_replace("%u0141","\xA3",$nowytekst);    //Ł
        $nowytekst = str_replace("%u0143","\xD1",$nowytekst);    //Ń
        $nowytekst = str_replace("%u00D3","\xD3",$nowytekst);    //Ó
        $nowytekst = str_replace("%u015A","\xA6",$nowytekst);    //Ś
        $nowytekst = str_replace("%u0179","\xAC",$nowytekst);    //Ź
        $nowytekst = str_replace("%u017B","\xAF",$nowytekst);    //Ż

        $nowytekst = str_replace("%u0105","\xB1",$nowytekst);    //ą
        $nowytekst = str_replace("%u0107","\xE6",$nowytekst);    //ć
        $nowytekst = str_replace("%u0119","\xEA",$nowytekst);    //ę
        $nowytekst = str_replace("%u0142","\xB3",$nowytekst);    //ł
        $nowytekst = str_replace("%u0144","\xF1",$nowytekst);    //ń
        $nowytekst = str_replace("%u00F3","\xF3",$nowytekst);    //ó
        $nowytekst = str_replace("%u015B","\xB6",$nowytekst);    //ś
        $nowytekst = str_replace("%u017A","\xBC",$nowytekst);    //ź
        $nowytekst = str_replace("%u017C","\xBF",$nowytekst);    //ż
        
    return ($nowytekst);
}
?>

In my case also the code file that deals with AJAX calls must be in UTF-8 coding.

luka8088 at gmail dot com
18 years ago
simple UTF-8 to HTML conversion:

function utf8_to_html ($data)
    {
    return preg_replace("/([\\xC0-\\xF7]{1,1}[\\x80-\\xBF]+)/e", '_utf8_to_html("\\1")', $data);
    }

function _utf8_to_html ($data)
    {
    $ret = 0;
    foreach((str_split(strrev(chr((ord($data{0}) % 252 % 248 % 240 % 224 % 192) + 128) . substr($data, 1)))) as $k => $v)
        $ret += (ord($v) % 128) * pow(64, $k);
    return "&#$ret;";
    }

Example:
echo utf8_to_html("a b č ć ž こ に ち わ ()[]{}!#$?*");

Output:
a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?*

j dot dittmer at portrix dot net
20 years ago
The regex in the last comment has some typos. This one is
syntactically valid, though I don't know whether it's correct.
You have to concatenate the expression into one long line.

^(
[\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
[\xe0][\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
[\xed][\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
[\xf0][\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
[\xf4][\x80-\x8f][\x80-\xbf]{2}
)*$
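
In PHP the expression does not have to be joined by hand: with the x modifier PCRE ignores whitespace and newlines in the pattern. The sketch below wraps the alternatives from this note unchanged; the names UTF8_PATTERN and matches_utf8_pattern are hypothetical, and \A and \z are used in place of ^ and $ so that a trailing newline gets no special treatment.

```php
<?php

// The alternatives from the note, kept on separate lines. The x modifier
// tells PCRE to ignore whitespace and newlines outside character classes.
const UTF8_PATTERN = '/\A(?:
      [\x00-\x7f]
    | [\xc2-\xdf][\x80-\xbf]
    | \xe0[\xa0-\xbf][\x80-\xbf]
    | [\xe1-\xec][\x80-\xbf]{2}
    | \xed[\x80-\x9f][\x80-\xbf]
    | [\xee-\xef][\x80-\xbf]{2}
    | \xf0[\x90-\xbf][\x80-\xbf]{2}
    | [\xf1-\xf3][\x80-\xbf]{3}
    | \xf4[\x80-\x8f][\x80-\xbf]{2}
)*\z/x';

/** Hypothetical helper: true when $input matches the byte-level pattern. */
function matches_utf8_pattern(string $input): bool
{
    return preg_match(UTF8_PATTERN, $input) === 1;
}

var_dump(matches_utf8_pattern("Zo\xC3\xAB"));    // two-byte sequence
var_dump(matches_utf8_pattern("\xED\xA0\x80"));  // encoded surrogate, rejected
```

On very long inputs a repeated group like this can run into PCRE's backtracking or stack limits, in which case preg_match() returns false and the helper reports no match.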

gto at interia dot pl
22 years ago
Correction to the conversion functions utf82iso88592 and iso885922utf8.
Janusz forgot about "&#324;", and swapped "&#380;" with "&#378;" here and there.

GTo

function utf82iso88592($tekscik) {
     $tekscik = str_replace("\xC4\x85", "&#261;", $tekscik);
     $tekscik = str_replace("\xC4\x84", '&#260;', $tekscik);
     $tekscik = str_replace("\xC4\x87", '&#263;', $tekscik);
     $tekscik = str_replace("\xC4\x86", '&#262;', $tekscik);
     $tekscik = str_replace("\xC4\x99", '&#281;', $tekscik);
     $tekscik = str_replace("\xC4\x98", '&#280;', $tekscik);
     $tekscik = str_replace("\xC5\x82", '&#322;', $tekscik);
     $tekscik = str_replace("\xC5\x81", '&#321;', $tekscik);
     $tekscik = str_replace("\xC5\x84", '&#324;', $tekscik);     
     $tekscik = str_replace("\xC5\x83", '&#323;', $tekscik);
     $tekscik = str_replace("\xC3\xB3", '&#243;', $tekscik);
     $tekscik = str_replace("\xC3\x93", '&#211;', $tekscik);
     $tekscik = str_replace("\xC5\x9B", '&#347;', $tekscik);
     $tekscik = str_replace("\xC5\x9A", '&#346;', $tekscik);
     $tekscik = str_replace("\xC5\xBC", '&#380;', $tekscik);
     $tekscik = str_replace("\xC5\xBB", '&#379;', $tekscik);
     $tekscik = str_replace("\xC5\xBA", '&#378;', $tekscik);
     $tekscik = str_replace("\xC5\xB9", '&#377;', $tekscik);
     return $tekscik;
} // utf82iso88592

function iso885922utf8($tekscik) {
     $tekscik = str_replace("&#261;", "\xC4\x85", $tekscik);
     $tekscik = str_replace('&#260;', "\xC4\x84", $tekscik);
     $tekscik = str_replace('&#263;', "\xC4\x87", $tekscik);
     $tekscik = str_replace('&#262;', "\xC4\x86", $tekscik);
     $tekscik = str_replace('&#281;', "\xC4\x99", $tekscik);
     $tekscik = str_replace('&#280;', "\xC4\x98", $tekscik);
     $tekscik = str_replace('&#322;', "\xC5\x82", $tekscik);
     $tekscik = str_replace('&#321;', "\xC5\x81", $tekscik);
     $tekscik = str_replace('&#324;', "\xC5\x84", $tekscik);
     $tekscik = str_replace('&#323;',"\xC5\x83", $tekscik);
     $tekscik = str_replace('&#243;', "\xC3\xB3", $tekscik);
     $tekscik = str_replace('&#211;', "\xC3\x93", $tekscik);
     $tekscik = str_replace('&#347;', "\xC5\x9B", $tekscik);
     $tekscik = str_replace('&#346;', "\xC5\x9A", $tekscik);
     $tekscik = str_replace('&#380;', "\xC5\xBC", $tekscik);
     $tekscik = str_replace('&#379;', "\xC5\xBB", $tekscik);
     $tekscik = str_replace('&#378;', "\xC5\xBA", $tekscik);
     $tekscik = str_replace('&#377;', "\xC5\xB9", $tekscik);     
     return $tekscik;
} // iso885922utf8

kode68
10 years ago
Updated answer from okx dot oliver dot koenig at gmail dot com for PHP 5.6, since the /e modifier is deprecated.

// This finally helped me to do the job, thanks to Blackbit, had to modify deprecated ereg:
// original comment: "Squirrelmail contains a nice function in the sources to convert unicode to entities:"

function charset_decode_utf_8($string)
    {
        /* Only do the slow convert if there are 8-bit characters */
        if ( !preg_match("/[\200-\237]/", $string) && !preg_match("/[\241-\377]/", $string) )
               return $string;

        // decode three byte unicode characters
          $string = preg_replace_callback("/([\340-\357])([\200-\277])([\200-\277])/",
                    create_function ('$matches', 'return \'&#\'.((ord($matches[1])-224)*4096+(ord($matches[2])-128)*64+(ord($matches[3])-128)).\';\';'),
                    $string);

        // decode two byte unicode characters
          $string = preg_replace_callback("/([\300-\337])([\200-\277])/",
                    create_function ('$matches', 'return \'&#\'.((ord($matches[1])-192)*64+(ord($matches[2])-128)).\';\';'),
                    $string);

        return $string;
    }

Enjoy
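
Both the /e modifier and create_function() have since been removed from PHP. A possible rewrite with a closure is sketched below. The name utf8_to_numeric_entities is hypothetical, four-byte sequences are handled in addition to the two- and three-byte ones, and the input is assumed to be well-formed UTF-8 (overlong forms and encoded surrogates are not rejected).

```php
<?php

/**
 * Hypothetical rewrite of charset_decode_utf_8() for current PHP versions:
 * replaces every multi-byte UTF-8 sequence with a numeric HTML entity.
 */
function utf8_to_numeric_entities(string $string): string
{
    // Fast path: nothing to do when there are no 8-bit bytes.
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }

    $result = preg_replace_callback(
        // Two-, three- and four-byte sequences (lead byte plus continuation bytes).
        '/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/',
        static function (array $m): string {
            $length = strlen($m[0]);
            // Payload bits of the lead byte: 5 for 2 bytes, 4 for 3, 3 for 4.
            $code = ord($m[0][0]) & (0xFF >> ($length + 1));
            for ($i = 1; $i < $length; $i++) {
                // Each continuation byte contributes its low 6 bits.
                $code = ($code << 6) | (ord($m[0][$i]) & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $string
    );

    return $result ?? $string; // preg_replace_callback() returns null on error
}

echo utf8_to_numeric_entities("a \xC4\x8D \xE2\x82\xAC"), "\n"; // a &#269; &#8364;
```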

visus at portsonline dot net
18 years ago
Following code helped me with mixed (UTF8+ISO-8859-1(x)) encodings. In this case, I have template files made and maintained by designers who do not care about encoding and MySQL data in utf8_binary_ci encoded tables.

<?php

class Helper
{
    function strSplit($text, $split = 1)
    {
        if (!is_string($text)) return false;
        if (!is_numeric($split) && $split < 1) return false;

        $len = strlen($text);

        $array = array();

        $i = 0;

        while ($i < $len)
        {
            $key = NULL;

            for ($j = 0; $j < $split; $j += 1)
            {
                $key .= $text{$i};

                $i += 1;
            }

            $array[] = $key;
        }

        return $array;
    }

    function UTF8ToHTML($str)
    {
        $search = array();
        $search[] = "/([\\xC0-\\xF7]{1,1}[\\x80-\\xBF]+)/e";
        $search[] = "/&#228;/";
        $search[] = "/&#246;/";
        $search[] = "/&#252;/";
        $search[] = "/&#196;/";
        $search[] = "/&#214;/";
        $search[] = "/&#220;/";
        $search[] = "/&#223;/";

        $replace = array();
        $replace[] = 'Helper::_UTF8ToHTML("\\1")';
        $replace[] = "ä";
        $replace[] = "ö";
        $replace[] = "ü";
        $replace[] = "Ä";
        $replace[] = "Ö";
        $replace[] = "Ü";
        $replace[] = "ß";

        $str = preg_replace($search, $replace, $str);

        return $str;
    }

    function _UTF8ToHTML($str)
    {
        $ret = 0;

        foreach((Helper::strSplit(strrev(chr((ord($str{0}) % 252 % 248 % 240 % 224 % 192) + 128).substr($str, 1)))) as $k => $v)
            $ret += (ord($v) % 128) * pow(64, $k);
        return "&#".$ret.";";
    }
}

// Usage example:

$tpl = file_get_contents("template.tpl");
/* ... */
$row = mysql_fetch_assoc($result);

print(Helper::UTF8ToHTML(str_replace("{VAR}", $row['var'], $tpl)));

?>

paul.hayes at entropedia.co.uk
19 years ago
I noticed that the UTF-8 to HTML functions below only handle 2-byte codes. I wanted 3-byte support (sorry, I haven't done 4, 5 or 6). I also noticed that the concatenation of the character codes did have the hex prefix 0x and so failed with the large 2-byte codes.

<?
  public function utf2html (&$str) {
    
    $ret = "";
    $max = strlen($str);
    $last = 0;  // keeps the index of the last regular character
    for ($i=0; $i<$max; $i++) {
        $c = $str{$i};
        $c1 = ord($c);
        if ($c1>>5 == 6) {  // 110x xxxx, 110 prefix for 2 bytes unicode
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c1 &= 31; // remove the 3 bit two bytes prefix
            $c2 = ord($str{++$i}); // the next byte
            $c2 &= 63;  // remove the 2 bit trailing byte prefix
            $c2 |= (($c1 & 3) << 6); // last 2 bits of c1 become first 2 of c2
            $c1 >>= 2; // c1 shifts 2 to the right
            $ret .= "&#" . ($c1 * 0x100 + $c2) . ";"; // this is the fastest string concatenation
            $last = $i+1;       
        }
        elseif ($c1>>4 == 14) {  // 1110 xxxx, 1110 prefix for 3 bytes unicode
            $ret .= substr($str, $last, $i-$last); // append all the regular characters we've passed
            $c2 = ord($str{++$i}); // the next byte
            $c3 = ord($str{++$i}); // the third byte
            $c1 &= 15; // remove the 4 bit three bytes prefix
            $c2 &= 63;  // remove the 2 bit trailing byte prefix
            $c3 &= 63;  // remove the 2 bit trailing byte prefix
            $c3 |= (($c2 & 3) << 6); // last 2 bits of c2 become first 2 of c3
            $c2 >>=2; //c2 shifts 2 to the right
            $c2 |= (($c1 & 15) << 4); // last 4 bits of c1 become first 4 of c2
            $c1 >>= 4; // c1 shifts 4 to the right
            $ret .= '&#' . (($c1 * 0x10000) + ($c2 * 0x100) + $c3) . ';'; // this is the fastest string concatenation
            $last = $i+1;       
        }
    }
    $str=$ret . substr($str, $last, $i); // append the last batch of regular characters
} 
?>

php-net at ---NOSPAM---lc dot yi dot org
20 years ago
I've just created this code snippet to improve the user-customizable emails sent by one of my websites.

The goal was to use UTF-8 (Unicode) so that non-english users have all the Unicode benefits, BUT also make life seamless for English (or specifically, English MS-Outlook users).  The niggle: Outlook prior to 2003 (?)  does not properly detect unicode emails.  When "smart quotes" from MS Word were pasted into a rich text area and saved in Unicode, then sent by email to an Outlook user, more often than not, these characters were wrongly rendered as "greek". 

So, the following code snippet replaces a few strategic characters with HTML entities which Outlook XP (and possibly earlier) will render as expected.  [Code based on bits of code from previous posts on this and the htmlentities page]
<?php
    $badwordchars=array(
        "\xe2\x80\x98", // left single quote
        "\xe2\x80\x99", // right single quote
        "\xe2\x80\x9c", // left double quote
        "\xe2\x80\x9d", // right double quote
        "\xe2\x80\x94", // em dash
        "\xe2\x80\xa6" // ellipsis
    );
    $fixedwordchars=array(
        "&#8216;",
        "&#8217;",
        '&#8220;',
        '&#8221;',
        '&mdash;',
        '&#8230;'
    );
    $html=str_replace($badwordchars,$fixedwordchars,$html);
?>

Blackbit
17 years ago
Squirrelmail contains a nice function in the sources to convert unicode to entities:

<?php
function charset_decode_utf_8 ($string) {
      /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
        return $string;

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
    "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
    $string);

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
    "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
    $string);

    return $string;
}
?>

Sadi
18 years ago
Once again about polish letters. If you use fananf's solution, make sure that PHP file is coded with cp1250 or else it won't work. It's quite obvious, however I spent some time before I finally figured that out, so I thought I post it here.

rasmus at flajm dot se
21 years ago
If you don't have the multibyte extension installed, here's a function to decode UTF-16 encoded strings. It supports both BOM-less and BOM'ed strings (big- and little-endian byte order).

<?php
/**
 * Decode UTF-16 encoded strings.
 * 
 * Can handle both BOM'ed data and un-BOM'ed data. 
 * Assumes Big-Endian byte order if no BOM is available.
 * 
 * @param   string  $str  UTF-16 encoded data to decode.
 * @return  string  UTF-8 / ISO encoded data.
 * @access  public
 * @version 0.1 / 2005-01-19
 * @author  Rasmus Andersson {@link http://rasmusandersson.se/}
 * @package Groupies
 */
function utf16_decode( $str ) {
    if( strlen($str) < 2 ) return $str;
    $bom_be = true;
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);
    if( $c0 == 0xfe && $c1 == 0xff ) { $str = substr($str,2); }
    elseif( $c0 == 0xff && $c1 == 0xfe ) { $str = substr($str,2); $bom_be = false; }
    $len = strlen($str);
    $newstr = '';
    for($i=0;$i+1<$len;$i+=2) {
        // Combine the two bytes of each UTF-16 code unit (high byte shifted by 8 bits).
        if( $bom_be ) { $val = (ord($str[$i])   << 8) + ord($str[$i+1]); }
        else {        $val = (ord($str[$i+1]) << 8) + ord($str[$i]); }
        // U+2028 (line separator) becomes a newline; chr() keeps only the low byte,
        // so only code points up to U+00FF survive as ISO-8859-1 characters.
        $newstr .= ($val == 0x2028) ? "\n" : chr($val);
    }
    return $newstr;
}
?>

Ajgor
19 years ago
small upgrade for polish decoding:

function utf82iso88592($text) {
 $text = str_replace("\xC4\x85", 'ą', $text);
 $text = str_replace("\xC4\x84", 'Ą', $text);
 $text = str_replace("\xC4\x87", 'ć', $text);
 $text = str_replace("\xC4\x86", 'Ć', $text);
 $text = str_replace("\xC4\x99", 'ę', $text);
 $text = str_replace("\xC4\x98", 'Ę', $text);
 $text = str_replace("\xC5\x82", 'ł', $text);
 $text = str_replace("\xC5\x81", 'Ł', $text);
 $text = str_replace("\xC3\xB3", 'ó', $text);
 $text = str_replace("\xC3\x93", 'Ó', $text);
 $text = str_replace("\xC5\x9B", 'ś', $text);
 $text = str_replace("\xC5\x9A", 'Ś', $text);
 $text = str_replace("\xC5\xBC", 'ż', $text);
 $text = str_replace("\xC5\xBB", 'Ż', $text);
 $text = str_replace("\xC5\xBA", 'ź', $text);
 $text = str_replace("\xC5\xB9", 'Ź', $text);
 $text = str_replace("\xc5\x84", 'ń', $text);
 $text = str_replace("\xc5\x83", 'Ń', $text);

return $text;
} // utf82iso88592

2ge at NO2geSPAM dot us
20 years ago
Hello all,

I like to use COOL (nice) URIs, for example: http://example.com/try-something
I'm using UTF-8 as input, so I had to write a UTF-8-to-ASCII function to get a nice URI. Here is what I came up with:

<?php
function urlize($url) {
 $search = array('/[^a-z0-9]/', '/--+/', '/^-+/', '/-+$/' );
 $replace = array( '-', '-', '', '');
 return preg_replace($search, $replace, utf2ascii($url));
}     

function utf2ascii($string) {
 $iso88591  = "\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7";
 $iso88591 .= "\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF";
 $iso88591 .= "\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7";
 $iso88591 .= "\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF";
 $ascii = "aaaaaaaceeeeiiiidnooooooouuuuyyy";
 return strtr(mb_strtolower(utf8_decode($string), 'ISO-8859-1'),$iso88591,$ascii);
}

echo urlize("Fucking ?m?l");

?>

I hope this helps someone.

haugas at gmail dot com
18 years ago
If you don't know exactly, how many times your string is encoded, you can use this function:

<?php

function _utf8_decode($string)
{
  $tmp = $string;
  $count = 0;
  while (mb_detect_encoding($tmp)=="UTF-8")
  {
    $tmp = utf8_decode($tmp);
    $count++;
  }
  
  for ($i = 0; $i < $count-1 ; $i++)
  {
    $string = utf8_decode($string);
    
  }
  return $string;
  
}

?>

yannikh at gmeil dot com
20 years ago
I had to tackle a very interesting problem:

I wanted to replace every \xXX in a text with its letter. Unfortunately, the XX values were ASCII and not UTF-8. I solved my problem this way:
<?php preg_replace ('/\\\\x([0-9a-fA-F]{2})/e', "pack('H*',utf8_decode('\\1'))",$v); ?>

fhoech
20 years ago
Sorry, I had a typo in my last comment. Corrected regexp:

^([\\x00-\\x7f]|
[\\xc2-\\xdf][\\x80-\\xbf]|
\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|
[\\xe1-\\xec][\\x80-\\xbf]{2}|
\\xed[\\x80-\\x9f][\\x80-\\xbf]|
\\xef[\\x80-\\xbf][\\x80-\\xbd]|
\\xee[\\x80-\\xbf]{2}|
\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|
[\\xf1-\\xf3][\\x80-\\xbf]{3}|
\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$