Perl / PHP 檢測變數有否 UTF-8 方式

採用 w3c FAQ 的 regular expression 檢驗方式

- Perl

#!/usr/bin/perl
 
$a="測試";
$b="yyy";
 
$ca=is_utf8($a)?"UTF-8":"ASCII";
$cb=is_utf8($b)?"UTF-8":"ASCII";
 
print("[$a] : $ca\n");
print("[$b] : $cb\n");
exit;
 
# 判別是否 UTF-8 字串
sub is_utf8 {
  local($p_string) = @_;
 
	#From http://w3.org/International/questions/qa-forms-utf-8.html
	# It will return true if $p_string is UTF-8, and false otherwise.
	return($p_string =~ m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x);
}

[測試] : UTF-8
[yyy] : UTF-8

採用 use utf8; 與 is_utf8($string) 來判別..

這個方式主要的問題是必須要設定輸出格式<code perl>binmode(STDOUT, ':encoding(utf-8)');</code>，否則會出現警告訊息

#!/usr/bin/perl
use utf8;
 
$a="測試";
$b="yyy";
 
$ca=utf8::is_utf8($a)?"UTF-8":"ASCII";
$cb=utf8::is_utf8($b)?"UTF-8":"ASCII";
 
binmode(STDOUT, ':encoding(utf-8)');
print("[$a] : $ca\n");
print("[$b] : $cb\n");

[測試] : UTF-8
[yyy] : ASCII

- PHP

<?php
$a="測試";
$b="yyy";
 
$ca=is_utf8($a)?"UTF-8":"ASCII";
$cb=is_utf8($b)?"UTF-8":"ASCII";
 
echo("[$a] : $ca\n");
echo("[$b] : $cb\n");
 
function is_utf8($string) {
 
// From http://w3.org/International/questions/qa-forms-utf-8.html
        return preg_match('%^(?:
        [\x09\x0A\x0D\x20-\x7E] # ASCII
        | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
        )*$%xs', $string);
} // function is_utf8
?>

[測試] : UTF-8
[yyy] : UTF-8

因為這個程式碼採用 UTF-8 編碼，所以裡面的字串都會是 UTF-8, 如果改存成 ASCII, 結果也都會變成 ASCII, 所以當判別的字串是開啟外部檔案內容就可以看出不同點.
Exp. 假設原本 UTF-8 程式檔名為 tx.pl 轉成 Big5 編碼存成 txx.pl 執行結果就變成以下

[apache@tryboxap04 tmp]$ iconv -f utf8 -t big5 tx.pl >txx.pl
[apache@tryboxap04 tmp]$ perl txx.pl
[測試] : ASCII
[yyy] : UTF-8

如果只需要判別文字檔案編碼格式，可以直接採用 Linux 內的 file¹⁾ 指令直接獲知.. Exp.:

[apache@tryboxap04 input]$ file 20080415-2.csv
20080415-2.csv: UTF-8 Unicode text, with CRLF line terminators
[apache@tryboxap04 input]$ file 20080415-2.csv.md5
20080415-2.csv.md5: ASCII text
[apache@tryboxap04 tmp]$ file 20080415-2.csv
20080415-2.csv: ISO-8859 text, with CRLF line terminators

參考資料

¹⁾

file 在 CentOS 完整路徑為 /usr/bin/file, RPM: file-4.10-3.0.2.el4