Handling Japanese characters using UTF-8 in MySql, MySqli, PHP, HTML, CSS, Javascript, CSV, XML

While programming a web interface for the Kanji database I encountered a number of problems specifically related to character handling. I am glad to share the solutions I found by searching in forums, blogs, program documentation, books and by doing a lot of experimenting. I hope these tips can be helpful to other developers working with Japanese or other multibyte Unicode characters. Unicode characters for software developers

HTML

If you see something like ÃƒÂ¤ÃƒÂ¶ÃƒÂ¼ÃƒÅ on your HTML page, your text source is probably correct UTF-8, but your browser is not set to display it as such. Use this metatag as the first metatag in the head section of your HTML pages to declare the UTF-8 characterset:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Or use the shorter tag for new (HTML5 compliant) browsers:
<meta charset="UTF-8">
They should have the same effect: telling the browser which character encoding to use. UTF-8 can encode every character in Unicode, which covers just about every writing system used in the world.

Be sure to use only fonts that actually contain all Japanese characters! See Wikipedia on UTF-8 and webfonts

In HTML5 there is a provision for the use of small "Ruby characters" (furigana) sometimes used for pronunciation guidance of Kanji characters:

hiragana		katakana		romaji
とう	きょう	トウ	キョウ	tō	kyō
東	京	東	京	東	京

("Tokyo")

HTML5 markup: <ruby>東<rp>(</rp><rt>とう</rt><rp>)</rp>京<rp>(</rp><rt>きょう</rt><rp>)</rp></ruby>
Rendering: 東(とう)京(きょう)

If the browser (possibly helped by a suitable plugin) supports the Ruby tags, the Ruby characters are shown as small characters above the Kanji; otherwise they appear in parentheses directly after the character. Wikipedia on Ruby characters HTML 5 doctor on Ruby characters
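
If you generate pages from data, the Ruby markup shown above can be assembled by a small script. The following is only a sketch: rubyMarkup is a hypothetical helper name, and it assumes the base characters and readings are trusted text that needs no HTML escaping.

```javascript
// Hypothetical helper: builds <ruby> markup from [base, reading] pairs.
// Assumes the input contains no characters that need HTML escaping.
function rubyMarkup(pairs) {
  const body = pairs
    .map(([base, reading]) => `${base}<rp>(</rp><rt>${reading}</rt><rp>)</rp>`)
    .join('');
  return `<ruby>${body}</ruby>`;
}

// Produces the same markup as the 東京 example above.
const tokyo = rubyMarkup([['東', 'とう'], ['京', 'きょう']]);
console.log(tokyo);
```

The <rp> elements are kept so that browsers without Ruby support still show the reading in parentheses.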

CSS

If you use a separate linked in CSS (Cascading Style Sheet) file for your webpage, put the following line right at the very beginning of it:
@charset "UTF-8";
Remember to save your stylesheet in UTF-8 using an appropriate text editor! If the HTML is specified as UTF-8, browsers usually assume that all linked resources (unless specified otherwise) have the same encoding.

If you use a stylesheet inside your HTML (inline stylesheet), the UTF-8 charset declaration in the HTML metatags should be sufficient. Otherwise use this syntax:
<style type="text/css">

</style>

Doing so enables you to use non-ASCII characters in fontnames, values etc. without problems. CSS charset

PHP

If the browser is set to display UTF-8 and tries to display text from your PHP source or database that isn't proper UTF-8 you may get something like �� instead of the intended characters.
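
Both kinds of garbage can be reproduced in a few lines, which may help to recognise which side is misconfigured. This sketch is written in JavaScript for Node.js (Buffer does not exist in browsers) and uses the character ä purely as an example.

```javascript
// Node.js only: Buffer is not available in browsers.

// Case 1: correct UTF-8 bytes, but read as Latin-1 (ISO-8859-1).
const utf8Bytes = Buffer.from('ä', 'utf8');      // two bytes: C3 A4
const misread = utf8Bytes.toString('latin1');    // each byte becomes its own character
console.log(misread);                            // Ã¤

// Case 2: Latin-1 bytes, but read as UTF-8.
const latin1Bytes = Buffer.from('ä', 'latin1');  // one byte: E4
const broken = latin1Bytes.toString('utf8');     // E4 alone is not valid UTF-8
console.log(broken === '\uFFFD');                // true: the replacement character
```

Repeating the first mistake (encoding the already misread text as UTF-8 and misreading it again) gives the longer chains like the one shown in the HTML section.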

Include this line at the top of your PHP code to set your database connection to UTF-8:
mysql_set_charset("utf8");
If this function isn't present in your PHP installation you could also try to let the database handle it using:
mysql_query("SET NAMES 'utf8'");
In newer PHP versions, which include the mysqli functions (the old mysql_ functions were removed in PHP 7), make a mysqli connection and use the corresponding mysqli function as in:
$link = mysqli_connect('localhost','my_user','my_password','test');
mysqli_set_charset($link,"utf8");
Or, in object oriented style:
$mysqli->set_charset("utf8");
Be sure to set the characterset right after connecting, before you run any queries on that connection.

Set the PHP character encoding to work with multibyte characters:
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
UTF-8 characters consist of 1 to 4 bytes each, whereas ASCII, for instance, always uses only one byte per character.

To convert a string $string to html entities, use:
htmlentities($string,ENT_COMPAT,"UTF-8");

If you want to split a string into separate UTF-8 multibyte characters you need a function that handles them correctly. PHP 7.4 and later have mb_str_split() built in; on older versions you can define it yourself. Note the /u in the regular expression:
if (!function_exists('mb_str_split')) { // avoid redeclaring the PHP 7.4+ built-in
   function mb_str_split($str) {
      // split multibyte string in characters
      // Split at all positions, not after the start: ^
      // and not before the end: $
      $pattern = '/(?<!^)(?!$)/u';
      return preg_split($pattern,$str);
   }
}

There is a whole range of special PHP functions to work with Unicode multibyte characters: PHP mb functions

If you want to extract only the Kanji characters from a block of text, you can use special regular expressions: /\p{Han}/u for everything that is Han or /\P{Han}/u for everything that is NOT Han.
function extractKanji($str){
$pattern = "/\P{Han}/u"; /* everything that is NOT Han will be replaced with an empty string */
return(preg_replace($pattern,'',$str));
}

By the way, there are also special regular expressions for Hiragana and Katakana:
/\P{Hiragana}/u /* matches everything that is NOT Hiragana (using upper case P) */
/\p{Katakana}/u /* matches everything that is Katakana (using lower case p) */
Information about regular expressions and unicode
PHP and regular expressions and unicode

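To send the results as a CSV file to the user, use this header to set the encoding to UTF-8. The charset belongs in the Content-Type header; Content-Encoding is meant for compression schemes such as gzip, so it should not be set to UTF-8:
header("Content-Type: text/csv; charset=UTF-8");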

Although it should not be necessary, sending a Byte Order Mark (BOM) just before the data stream of a CSV file appears to work better, for instance to put Excel in the right mode for accepting UTF-8 characters:
echo "\xEF\xBB\xBF"; // first send the UTF-8 BOM: Byte Order Mark (= decimal 239 187 191)

To send a page header that tells your browser that utf8 is to be expected (normally not necessary) put this line on top of your php page, just before the content:
header('Content-Type: text/html; charset=UTF-8');

MYSQL

Be sure to set the collation and characterset of your database to:
utf8_general_ci
or:
utf8_unicode_ci
Ci means "case insensitive". If you want case sensitivity you should choose utf8_bin (binary). General is faster than unicode. Unicode can handle expansions, contractions and ignorable characters; general cannot, as it compares characters one by one. But for most applications general works just fine. Collation influences searching and sorting; the characterset determines the values that end up in the output tables. While other settings may also work, this is the most universal as it handles just about any character from any alphabet. Mysql unicode character sets

Using PHP you can set the charset to UTF-8 like this:
$link = mysql_connect('localhost', 'user', 'password');
mysql_set_charset('utf8',$link);

Actually MySQL's utf8 cannot handle all possible Unicode characters, only those consisting of 1, 2 or 3 bytes. If you need all 1, 2, 3 and 4 byte characters, use utf8mb4. For most Japanese text this is not necessary, but a few rare Kanji outside the Basic Multilingual Plane (for example 𠮷, U+20BB7) and emoji do take 4 bytes.

MYSQLI

Setting the characterset to UTF-8 in object oriented style:
$mysqli = new mysqli("localhost", "my_user", "my_password", "test_db");
$mysqli->set_charset("utf8");

In procedural style:
$link = mysqli_connect('localhost', 'my_user', 'my_password', 'test_db');
mysqli_set_charset($link, "utf8");

Javascript

Although it is not always critical for functioning properly, it is advisable to save your Javascript files as UTF-8. Text editors like Notepad++ provide facilities for that. Include the file in your HTML page as follows:
<script src="myscripts.js" charset="UTF-8"></script>
Now you should be able to use UTF-8 characters in for instance alertboxes etc. or even variable and function names if you like. Make sure your HTML is also specified as being UTF-8, see above.

If you send a parameter "myPar=myString" containing UTF-8 characters in the query string, as in the URL myphp.php?myPar=myString, you should use the JavaScript function encodeURIComponent(myString) to properly encode the string. Browsers differ in how they treat unencoded characters in a URL (IE needed the encoding, Firefox and Chrome appeared more forgiving), so it is safest to always encode.
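
A minimal sketch of building such a URL follows. The helper name buildUrl is made up for this example; myphp.php and myPar are the placeholder names used above.

```javascript
// Hypothetical helper: builds "page?key=value&..." with every key and value
// percent-encoded as UTF-8 by encodeURIComponent().
function buildUrl(page, params) {
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  return `${page}?${query}`;
}

const url = buildUrl('myphp.php', { myPar: '東京' });
console.log(url); // myphp.php?myPar=%E6%9D%B1%E4%BA%AC
```

Each Kanji becomes three percent-encoded bytes, which PHP decodes back to UTF-8 automatically when it fills $_GET.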

To send a Byte Order Mark (BOM) in Javascript add the prefix "\uFEFF" as the first character to your file. This is a Unicode escape sequence. Please also read the section on CSV files below.

To distill Kanji characters from a text string you can use a regular expression with the Unicode range (4E00-9FAF) for Kanji characters, written with \uXXXX Unicode escape sequences. By the way, the Hiragana range is 3040-309F and Katakana is 30A0-30FF.
Your script could look like:

function filterKanji(inputStr){
var regexp = /[\u4E00-\u9FAF]/g;
// match() returns null when nothing matches, so fall back to an empty array
return((inputStr.match(regexp) || []).join(''));
}

The string.match() method gives you an array, the join method turns it into a string.
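
As an alternative to hard-coded code point ranges, newer JavaScript engines (ES2018 and later) support Unicode property escapes together with the u flag, similar to the \p{Han} patterns shown in the PHP section. The sketch below assumes such an engine; filterScript is a hypothetical helper name, not part of any library.

```javascript
// Assumes an engine with Unicode property escapes (ES2018+).
// Hypothetical helper: keeps only the characters of one script.
function filterScript(inputStr, script) {
  const patterns = {
    Han: /\p{Script=Han}/gu,
    Hiragana: /\p{Script=Hiragana}/gu,
    Katakana: /\p{Script=Katakana}/gu,
  };
  const regexp = patterns[script];
  if (!regexp) {
    throw new Error('Unsupported script: ' + script);
  }
  // match() returns null when nothing matches
  return (inputStr.match(regexp) || []).join('');
}

const sample = '東京へ行くトウキョウ';
console.log(filterScript(sample, 'Han'));      // 東京行
console.log(filterScript(sample, 'Hiragana')); // へく
console.log(filterScript(sample, 'Katakana')); // トウキョウ
```

Script=Han also covers Kanji outside the 4E00-9FAF range, including those that need surrogate pairs, which a \uXXXX range without the u flag cannot match.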

Notice that most of the string methods in JavaScript are not completely Unicode-aware, like myString.indexOf(), myString.slice(), etc. The myString.length property (it is a property, not a method) also gives unreliable results for strings containing characters outside the Basic Multilingual Plane: JavaScript strings are stored as UTF-16, so such characters (4 bytes in UTF-8) count as two code units. UTF-8 itself uses from 1 up to 4 bytes per character.
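
To make the difference between these counts concrete, here is a small sketch. The function name measure is hypothetical; TextEncoder is assumed to be available (modern browsers and current Node.js versions).

```javascript
// Hypothetical helper: reports three different "lengths" of a string.
function measure(str) {
  return {
    codeUnits: str.length,                           // UTF-16 code units
    codePoints: [...str].length,                     // Unicode code points
    utf8Bytes: new TextEncoder().encode(str).length, // bytes when stored as UTF-8
  };
}

console.log(measure('a'));         // { codeUnits: 1, codePoints: 1, utf8Bytes: 1 }
console.log(measure('東'));        // { codeUnits: 1, codePoints: 1, utf8Bytes: 3 }
console.log(measure('\u{20BB7}')); // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
```

The last example is a Kanji outside the Basic Multilingual Plane, written as an escape here to avoid editor problems.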

A function to remove duplicates from a string that works without string.length:

function get_unique_characters(str){
    // Array.from() splits by code point, so surrogate pairs stay intact
    return(Array.from(str)
        .filter(function(item, pos, self) {
             return(self.indexOf(item) === pos);
         })
        .join(''));
}

String iterator String.prototype[@@iterator]() is Unicode-aware. You can use the spread operator [...str] or Array.from(str) to create an array of symbols, and calculate the string length or access characters by index without breaking the surrogate pair. Notice that these operations have some performance impact.
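
The following sketch shows what "breaking the surrogate pair" looks like and how Array.from(str) avoids it. The helper names charAtCodePoint and sliceCodePoints are made up for this example.

```javascript
// Hypothetical helpers that index and slice by code point instead of
// by UTF-16 code unit.
function charAtCodePoint(str, index) {
  return Array.from(str)[index];
}

function sliceCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

// U+20BB7 followed by 野 and 家; the first Kanji needs a surrogate pair.
const shopName = '\u{20BB7}野家';

console.log(shopName.slice(0, 1) === '\uD842');               // true: only half of the pair
console.log(sliceCodePoints(shopName, 0, 1) === '\u{20BB7}'); // true: the whole character
console.log(charAtCodePoint(shopName, 1));                    // 野
```

Because Array.from() builds a new array on every call, it is better to convert once and reuse the array when you need many lookups on the same string.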

Thorough article about UTF-8 and Javascript
Japanese Regexp
Encoding to and from UTF-8 in Javascript, escape and unescape
encodeURI(), encodeURIComponent, escape

CSV

The Comma Separated Values format is not very well standardized, to say the least. It is a plain text version of a database table, using a delimiter to separate the values (columns) and a newline character to separate the rows. "Standards" appear to vary by country, program and platform. The separator (delimiter) is mostly a semicolon or a tab character, but could also be a comma or a colon. If the delimiter character appears in the data, the value should be escaped, mostly by enclosing it in double quotes. If the comma is the delimiter and your data are 12 12,5 13 they should be written as: "12","12,5","13". To escape double quotes inside a value, double them: I said "Hello world" becomes: "I said ""Hello world"""
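
The quoting rules above can be sketched as follows. The function names escapeCsvField and toCsvRow are hypothetical, and this version only quotes a value when it has to, which is an assumption on my part (quoting every value, as in the example above, is equally valid).

```javascript
// Hypothetical helper: quotes a value only if it contains the delimiter,
// a double quote or a line break, and doubles any embedded double quotes.
function escapeCsvField(value, delimiter = ',') {
  const text = String(value);
  const needsQuotes =
    text.includes(delimiter) || text.includes('"') || /[\r\n]/.test(text);
  return needsQuotes ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical helper: joins escaped values into one CSV row.
function toCsvRow(fields, delimiter = ',') {
  return fields.map((field) => escapeCsvField(field, delimiter)).join(delimiter);
}

console.log(toCsvRow(['12', '12,5', '13']));         // 12,"12,5",13
console.log(toCsvRow(['12', '12,5', '13'], ';'));    // 12;12,5;13
console.log(escapeCsvField('I said "Hello world"')); // "I said ""Hello world"""
```

With a semicolon as delimiter the decimal comma no longer needs quoting, which is one reason that delimiter is popular in countries that use a decimal comma.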

There can also be problems arising from the use of a decimal point or a decimal comma in some countries. And thousands are separated by commas in some countries, by points in others. This can also lead to problems importing a CSV file.

Importing a UTF-8 CSV file mostly works better if a BOM (Byte Order Mark) is sent first, just before the actual data. This puts the receiving program in the right mode to accept the multibyte characters correctly. Some programs however put the BOM in the first field, showing as ï»¿. The Byte Order Mark is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. In UTF-16 and UTF-32 it indicates whether the file uses big-endian or little-endian byte order; in UTF-8, which has no byte-order variants, it is optional and only serves as a signature that marks the file as UTF-8. My experiments showed it is necessary for Excel when importing UTF-8 CSV files. Wikipedia on BOM
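
On the JavaScript side, adding and removing the BOM could look like the sketch below. The names withBom and stripBom are hypothetical; TextEncoder is assumed to be available and is only used here to show the resulting bytes.

```javascript
const BOM = '\uFEFF'; // becomes the bytes EF BB BF when encoded as UTF-8

// Hypothetical helper: adds a BOM unless the text already starts with one.
function withBom(text) {
  return text.startsWith(BOM) ? text : BOM + text;
}

// Hypothetical helper: removes a leading BOM, e.g. before parsing the
// first field of an imported file.
function stripBom(text) {
  return text.startsWith(BOM) ? text.slice(1) : text;
}

const csvText = withBom('id;kanji\n1;東京\n');
const firstBytes = Array.from(new TextEncoder().encode(csvText)).slice(0, 3);
console.log(firstBytes); // [ 239, 187, 191 ]
```

Calling stripBom() on data you read back avoids the stray ï»¿ in the first field mentioned above.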

When importing in Excel, the name of the first column should never be "ID", "id" works fine though. If you use ID as the first field, Excel thinks it's a SYLK (symbolic link) file. Wikipedia on the CSV format

Often things work better if you save the CSV file first, then open Excel or Open Office and import the CSV file as a "text" type of file. This opens a dialog where you can manually set parameters like delimiter, character set (65001 UTF-8) etc. Opening the CSV file by double clicking on it doesn't always work right.

XML

XML (Extensible Markup Language) is a computer language that is standardized a lot better than CSV. It is a tightly structured language used to store different types of data. It may contain a link to a DTD (Document Type Definition) or an XML schema that describes how the data is structured. Excel can import XML databases. If no XML schema is given and the database has a simple structure, Excel makes one up by itself that usually works. Sending a BOM (Byte Order Mark) before the XML file works to put Excel in the right mode for accepting UTF-8 characters; the BOM itself is described in the CSV section above. Wikipedia on XML

Japanese Lorem Ipsum test text

If you want to run some test using random Japanese characters, you could copy (a part of) this text.

ルビンツアウェブアふべからずセシビリテ, どらトモデルプロファイルとセマンティックめよう, 内准剛んアを始めようプロトコルプロファイル展久プロセスド情報セットトとして使ップに, クほにによるウェブコクセシビリティツアク, ンツアクセシビスとレイティクセシビリティでのにするどらセシビリマイクロソフトのため, め「こをウェブコウェブコンテン健二仕エムめようでのにによるビリティイドライン, セシビのなイドラインアキテクチャ, のためプロトコルクリック」どらアクセシビマイクロソフトクセシビリティシトをどら, セシビティのいセシビリサイトをアクセシブでの

エムにによるキュメント拡張可ンツアクセシベルの仕と信を始めてみようセシビ展久, 併団イルのアクプロトコルオサリングツネッィにへの切りえのイベントクセシビリティ併団イ, のイベントコンテンツアクセサイト作成のヒントクセスィに, エムップにアキテクチャルにするためにウェ丸山亮仕を始めてみようツアク, シン可なブコンテユザエよる, アクセシビリティレイテリングンタネット協会にするトとして使レイティングサを始めてみようイビップに, コンテンプロセスドへの切りえどらわった, をマブコンテリア式会インタラクションンツア

イビウェブコシビリティシトを, んア併団イウェブコプロセスドとセマンティックラベラへの切りえジェントのアクセシィに, クセスでのビリティコンテン, 展久拡張可コンテンガイドラインク付けのなクアップオブジェクルにするためににによるウェブコビリティにる拡張可展久, クほおよびそのマリティガイドラインわったクセスにによるイドラインレイティングサンテ, クセスどらンツアクセシボキャブラリングシステム, リア式会クリック」健二仕ルビ拡張可でのおよびそのママイクロソフト, エムク付けビリティトモデルテキストマ, 内准剛んアめ「こを丸山亮仕

併団イ寛会を始めようボキャブラリサイト作成のヒント, ロジスタイルテキストマテストスイトトワク, ンテマルチメアクセシビセシビウェクセスアクセシビビスとレイティ, マルチメをリンクテキスジェントのアクセシルビユザエ, ユザエパスプリファルのアクを始めようプロセスド情報セットルにするために功久ディア, の徴め「こをふべからずまきかずひこのため, んアプロセスドプロトコルインフォテわっため「こをブコンテコンテンどらンツア, エム拡張可クアップコンテンでウェブにと, ウェツアクウェブコテキストマのイベントをマシトをキュメントオサリングツル, およびそのマを始めてみようサイト作成のヒントどらラベラ, その他クほシン可なおよびそのマクセシビリティビリティセシビリへの切りえロジのため, ルビクセスプリファイドラインオブジェク

Apache

By using an .htaccess file in the root of your website on an Apache server it is possible to specify characterset information for specific MIME types. This is sometimes necessary for correct rendering in a browser or RSS reader. For instance serving .html .htm and .php files using UTF-8 as a characterset is possible by using these lines of code:

AddCharset UTF-8 .htm
AddCharset UTF-8 .html
AddCharset UTF-8 .php

To use UTF-8 encoding only for a specific file example.html:

<Files "example.html">
AddCharset UTF-8 .html
</Files>

You can also set MIME type and characterset for files with specific extensions in one go like this:

AddType 'text/html; charset=UTF-8' html

This causes all files with the extension html to be served as MIME type 'text/html' with the characterset set to UTF-8. More on .htaccess and characterset

If you have root access to your Apache server use the following:

In Apache Config - /etc/httpd/conf/httpd.conf:
AddDefaultCharset UTF-8

In PHP Config – /etc/php.ini:
default_charset = "UTF-8"

In MySQL Config - /etc/my.cnf:
[client]
default-character-set=utf8

[mysqld]
character-set-server=utf8
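# MySQL 5.5 and later no longer accept default-character-set and
# default-collation in [mysqld]; use character-set-server and collation-server
collation-server=utf8_unicode_ci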
init-connect='SET NAMES utf8'
character-set-client = utf8

Notice there is no dash between utf and the 8 in MySql. Restart the above services once these updates have been applied.

Useful links

Localizing Japan, a blog about translation, localization and the Japanese language
PHP basics of Japanese multi byte encodings
MySql Chinese, Korean and Japanese character sets FAQ
Comprehensive document on character encoding in computers
FAQ about Japanese language, originally from sci.lang.japan usenet group
Regular expressions and Unicode

Hope this helps somebody sometime, happy programming! Sander Sanders contact: sander.sanders (at) planet.nl