How to detect UTF-8 encoded text


What is UTF-8 encoding and how do you detect it?

You can find the UTF-8 specification documents at the end of this post.

Long story short, UTF-8 is a Unicode text representation that uses one to four bytes to represent a single character.
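
To make the one-to-four-byte range concrete, here is a small Python sketch. Python is used only for illustration, and the four sample characters are my own choice, one for each possible length:

```python
# Sample characters chosen for illustration, one per UTF-8 length class:
# "A" (U+0041), "é" (U+00E9), "€" (U+20AC) and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    encoded = ch.encode("utf-8").hex(" ")  # bytes.hex(sep) needs Python 3.8+
    print(f"U+{ord(ch):04X} -> {n} byte(s): {encoded}")
```

The higher the code point, the more bytes it takes, and plain ASCII stays at one byte per character.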

How to detect it?
The easy solution is to check for the Byte Order Mark (BOM) at the beginning of the file. For UTF-8 it is: EF BB BF.
However, a lot of UTF-8 files do not use it. Unix/Linux programs generally do not write UTF-8 with a BOM, and some of them may not handle it well.
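
The BOM check can be sketched in a few lines of Python. The function names here are hypothetical and not taken from any library; `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A BOM is only a hint about the start of the file. It says nothing about whether the remaining bytes are valid, and its absence does not mean the file is not UTF-8.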

So how to detect UTF-8 without a BOM?

Basically, check that the file contains only the byte sequences listed in the following table:

Byte 1    Byte 2    Byte 3    Byte 4
00..7F
C2..DF    80..BF
E0        A0..BF    80..BF
E1..EF    80..BF    80..BF
F0        90..BF    80..BF    80..BF
F1..F3    80..BF    80..BF    80..BF
F4        80..8F    80..BF    80..BF
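
As a rough illustration, the table above can be turned into a checker almost line by line. This is a minimal Python sketch, not the C# library's code, and the names `UTF8_TABLE`, `_match_at` and `is_utf8` are made up for this example. It implements the table exactly as listed, so it accepts ED A0..BF sequences (encoded UTF-16 surrogates). Later versions of the Unicode Standard exclude those; splitting the E1..EF row into E1..EC, ED with second byte 80..9F, and EE..EF would tighten the check.

```python
# Each row holds the allowed (low, high) range for every byte of a sequence,
# copied from the table above.
UTF8_TABLE = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def _match_at(data: bytes, pos: int) -> int:
    """Return the length of the table row that matches at pos, or 0 if none."""
    for row in UTF8_TABLE:
        if pos + len(row) > len(data):
            continue  # the sequence would run past the end of the data
        if all(lo <= data[pos + i] <= hi for i, (lo, hi) in enumerate(row)):
            return len(row)
    return 0


def is_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences listed in the table."""
    pos = 0
    while pos < len(data):
        length = _match_at(data, pos)
        if length == 0:
            return False  # byte at pos does not start any allowed sequence
        pos += length
    return True
```

To check a file, read it in binary mode and pass the bytes in, for example `is_utf8(open(path, "rb").read())` with your own `path`. Note that pure ASCII data also passes, so a positive result means the bytes are consistent with UTF-8, not that the file was necessarily written as UTF-8.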

I have created a small C# library that does exactly this:
http://www.codeplex.com/utf8checker

It also comes with unit tests that use some extensive UTF-8 sample files.

Feedback and questions welcome!

References

http://www.unicode.org/versions/corrigendum1.html

http://www.ietf.org/rfc/rfc2279.txt

http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1335

http://www.cl.cam.ac.uk/~mgk25/unicode.html
