Archive for the ‘utf8’ Category

How to detect UTF8 encoded text

February 6, 2010

What is UTF8 encoding and how to detect it?

You can find UTF8 specification documents at the end of this post.

The long story short UTF8 is a unicode text representation that may use one to four bytes to represent a single text character.

How to detect it?
First the easy solution would be to check for the Byte Order Mark at the beginning. For UTF8 it is: ef bb bf.
However a lot of UTF8 files do not use it and generally Unix/Linux programs do not ust UTF8 with BOM and some of them may not handle it well.

So how to detect UTF8 without BOM?

Basically check up the file if it only contains byte sequences list in the following table:

00..7F	 	 	 
C2..DF	80..BF 	 	 
E0	A0..BF	80..BF 	 
E1..EF	80..BF	80..BF 	 
F0	90..BF	80..BF	80..BF
F1..F3	80..BF	80..BF	80..BF
F4	80..8F	80..BF 	80..BF

I have created little c# library that does exactly this:
http://www.codeplex.com/utf8checker

There are unit tests too using some extensive utf8 sample files.

Feedback and questions welcome!

Rerefences

http://www.unicode.org/versions/corrigendum1.html

http://www.ietf.org/rfc/rfc2279.txt

http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1335

http://www.cl.cam.ac.uk/~mgk25/unicode.html