Differences
This shows the differences between the two versions of the page.
tech:check_utf8 [2008/04/18 14:57] – jonathan | tech:check_utf8 [2009/03/03 01:07] (current) – jonathan | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Checking whether a variable is UTF-8 in Perl / PHP ====== | ||
+ | **Uses the W3C [[http:// | ||
+ | |||
+ | ===== - Perl ===== | ||
+ | |||
+ | <code perl> | ||
+ | # | ||
+ | |||
+ | $a=" | ||
+ | $b=" | ||
+ | |||
+ | $ca=is_utf8($a)?" | ||
+ | $cb=is_utf8($b)?" | ||
+ | |||
+ | print(" | ||
+ | print(" | ||
+ | exit; | ||
+ | |||
+ | # Check whether the string is UTF-8 | ||
+ | sub is_utf8 { | ||
+ | my ($p_string) = @_; | ||
+ | | ||
+ | #From http:// | ||
+ | # It will return true if $p_string is UTF-8, and false otherwise. | ||
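+ | return($p_string =~ m/\A( | ||
+ | [\x09\x0A\x0D\x20-\x7E] # ASCII | ||
+ | | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte | ||
+ | | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs | ||
+ | | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte | ||
+ | | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates | ||
+ | | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 | ||
+ | | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 | ||
+ | | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 | ||
+ | )*\z/x); | ||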
+ | } | ||
+ | |||
+ | </ | ||
+ | < | ||
+ | [測試] : UTF-8 | ||
+ | [yyy] : UTF-8 | ||
+ | </ | ||
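To make the alternatives in the pattern easier to follow, here is a minimal sketch that runs the same W3C-style regular expression over a few hand-written byte strings. The sample strings and their labels are illustrative assumptions and do not come from the original page. The sketch shows that the pattern rejects overlong forms, surrogates, code points above U+10FFFF, and also control characters other than tab, LF, and CR.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same byte-level pattern as the is_utf8 sub on this page; returns 1 or 0.
sub is_utf8 {
    my ($s) = @_;
    return $s =~ m/\A(
         [\x09\x0A\x0D\x20-\x7E]
       | [\xC2-\xDF][\x80-\xBF]
       |  \xE0[\xA0-\xBF][\x80-\xBF]
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
       |  \xED[\x80-\x9F][\x80-\xBF]
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}
       | [\xF1-\xF3][\x80-\xBF]{3}
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*\z/x ? 1 : 0;
}

# Illustrative samples (assumptions), written as byte escapes so the result
# does not depend on the encoding this file is saved in.
my @samples = (
    [ 'plain ASCII',           "yyy"              ],
    [ 'U+6E2C as UTF-8',       "\xE6\xB8\xAC"     ],
    [ 'U+1F600 as UTF-8',      "\xF0\x9F\x98\x80" ],
    [ 'overlong 2-byte',       "\xC0\x80"         ],
    [ 'surrogate U+D800',      "\xED\xA0\x80"     ],
    [ 'above U+10FFFF',        "\xF4\x90\x80\x80" ],
    [ 'control character 01',  "\x01"             ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-22s %s\n", $label, is_utf8($bytes) ? 'UTF-8' : 'not UTF-8';
}
```

Because pure ASCII is a subset of UTF-8, the pattern cannot tell an ASCII string apart from a UTF-8 one, which is why both test strings above are reported as UTF-8.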
+ | |||
+ | ++++Detecting with use utf8; and is_utf8($string)..| | ||
+ | **The main problem with this approach is that the output format must be set< | ||
+ | <code perl> | ||
+ | # | ||
+ | use utf8; | ||
+ | |||
+ | $a=" | ||
+ | $b=" | ||
+ | |||
+ | $ca=utf8:: | ||
+ | $cb=utf8:: | ||
+ | |||
+ | binmode(STDOUT, | ||
+ | print(" | ||
+ | print(" | ||
+ | </ | ||
+ | < | ||
+ | [測試] : UTF-8 | ||
+ | [yyy] : ASCII | ||
+ | </ | ||
+ | ++++ | ||
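The `[yyy] : ASCII` result above follows from what `utf8::is_utf8` actually reports: Perl's internal character-string flag, not whether the bytes are well-formed UTF-8. The following is a minimal sketch of that behaviour using only the built-in `utf8::is_utf8` and `utf8::decode` functions. The variable names are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Raw bytes of the UTF-8 encoding of U+6E2C, written as escapes so the
# example does not depend on how this file is saved.
my $bytes = "\xE6\xB8\xAC";

# A plain byte string does not carry the internal flag, even though its
# content is valid UTF-8.
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;

# utf8::decode converts the bytes to characters in place and returns
# false if the input is not well-formed UTF-8.
my $chars      = $bytes;
my $decoded_ok = utf8::decode($chars) ? 1 : 0;
my $flag_after = utf8::is_utf8($chars) ? 1 : 0;

printf "flag before decode : %d\n", $flag_before;
printf "decode succeeded   : %d\n", $decoded_ok;
printf "flag after decode  : %d\n", $flag_after;
printf "length in bytes    : %d\n", length($bytes);
printf "length in chars    : %d\n", length($chars);
```

For checking data read from a file or a form, the byte-level pattern shown earlier on this page is therefore the more direct test, since such input normally arrives as bytes without the flag set.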
+ | |||
+ | |||
+ | |||
+ | ===== - PHP ===== | ||
+ | |||
+ | <code php> | ||
+ | <?php | ||
+ | $a=" | ||
+ | $b=" | ||
+ | |||
+ | $ca=is_utf8($a)?" | ||
+ | $cb=is_utf8($b)?" | ||
+ | |||
+ | echo(" | ||
+ | echo(" | ||
+ | |||
+ | function is_utf8($string) { | ||
+ | |||
+ | // From http:// | ||
+ | return preg_match(' | ||
+ | [\x09\x0A\x0D\x20-\x7E] # ASCII | ||
+ | | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte | ||
+ | | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs | ||
+ | | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte | ||
+ | | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates | ||
+ | | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 | ||
+ | | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 | ||
+ | | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 | ||
+ | )*$%xs', | ||
+ | } // function is_utf8 | ||
+ | ?> | ||
+ | </ | ||
+ | < | ||
+ | [測試] : UTF-8 | ||
+ | [yyy] : UTF-8 | ||
+ | </ | ||
+ | < | ||
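+ | Because this source file is saved in UTF-8, every string literal in it is UTF-8; if the file is saved as ASCII instead, the results all become ASCII. The difference therefore shows up when the string being checked is the content of an external file. \\ | ||
+ | Example: suppose the original UTF-8 script is named tx.pl. After converting it to Big5 and saving it as txx.pl, running it gives the following: | ||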
+ | < | ||
+ | [apache@tryboxap04 tmp]$ iconv -f utf8 -t big5 tx.pl >txx.pl | ||
+ | [apache@tryboxap04 tmp]$ perl txx.pl | ||
+ | [測試] : ASCII | ||
+ | [yyy] : UTF-8 | ||
+ | </ | ||
+ | </ | ||
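The note above can be sketched as a small script that checks the content of external files instead of string literals. It assumes two temporary files created with the core `File::Temp` module. The helper names `write_bytes` and `slurp_bytes` are hypothetical, and the second file holds the bytes B4 FA B8 D5, which are assumed here to be the Big5 encoding of the test string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Same byte-level pattern as the is_utf8 sub on this page; returns 1 or 0.
sub is_utf8 {
    my ($s) = @_;
    return $s =~ m/\A(
         [\x09\x0A\x0D\x20-\x7E]
       | [\xC2-\xDF][\x80-\xBF]
       |  \xE0[\xA0-\xBF][\x80-\xBF]
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
       |  \xED[\x80-\x9F][\x80-\xBF]
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}
       | [\xF1-\xF3][\x80-\xBF]{3}
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*\z/x ? 1 : 0;
}

# Hypothetical helper: write raw bytes to a new temporary file.
sub write_bytes {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh);
    print {$fh} $data;
    close($fh) or die "cannot close $path: $!";
    return $path;
}

# Hypothetical helper: read a whole file as raw bytes, with no decoding.
sub slurp_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    local $/;                       # slurp mode
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Two sample files: the test string as UTF-8, and as (assumed) Big5.
my $utf8_file = write_bytes("\xE6\xB8\xAC\xE8\xA9\xA6\n");
my $big5_file = write_bytes("\xB4\xFA\xB8\xD5\n");

for my $item ([ 'utf8 file', $utf8_file ], [ 'big5 file', $big5_file ]) {
    my ($label, $path) = @$item;
    my $verdict = is_utf8(slurp_bytes($path)) ? 'UTF-8' : 'not UTF-8';
    print "$label: $verdict\n";
}
```

Reading with the `:raw` layer matters here: the pattern works on bytes, so the file content must not be decoded before it is checked.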
+ | |||
+ | <note important> | ||
+ | * If you only need to determine the encoding of a text file, you can simply use the Linux file command((On CentOS the full path of file is / | ||
+ | < | ||
+ | [apache@tryboxap04 input]$ file 20080415-2.csv | ||
+ | 20080415-2.csv: | ||
+ | [apache@tryboxap04 input]$ file 20080415-2.csv.md5 | ||
+ | 20080415-2.csv.md5: | ||
+ | [apache@tryboxap04 tmp]$ file 20080415-2.csv | ||
+ | 20080415-2.csv: | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | ===== References ===== | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | |||
+ | {{tag> |