Saturday, June 11, 2016

Fun with Unicode and SQL Server

I had to debug a situation recently where some exported data didn't work and I stumbled into the world of Unicode characters. Here are some of the fun facts that I learned

A brief (very loose) interpretation of the history of Unicode

In the beginning there were a bunch if people in the basement of some computer lab somewhere who needed to type words on a computer screen. The realized that with a 7 bit value they would have 128 options in which they could put the entire English alphabet. This included  upper and lower case, numbers, punctuation, and whitespace characters. They called it ASCII (American Standard Code for Information Interchange)  and they saw it was good.

One day they realized that they wanted to communicate with sell computers to people who spoke other languages so they added another bit that allowe them to have 256 characters so they could create the accented and special characters of the most of the European languages.They called it Extended ASCII, and they saw it was good.

But then the heard the cry's of their friends in countries that spoke languages like Japanese, Chinese, and Hebrew who wanted to communicate with buy computers from them and the invented Unicode. And here we are.

Notes about Unicode and SQL Sever

  1. There are lots of variants/encoding for Unicode based on my experimentation it appears that SQL server supports UTF-8 encoding with SQL_Latin1_General_CP1_CI_AS collation.
  2. You need to use a Unicode data types (nchar, nvarchar, ntext)
  3. You need to prefix your string with the UPPERCASE letter N  (ex. N'Some Unicode')
  4. All of the lower 256 ASCII characters are compatible with the Unicode (its backwards compatible)
  5. If you try to cast a Unicode char (not part of the 256 ASCII characters) as ASCII you will get a '?' and there is not way to tell if its not actually a question mark.
  6. Unicode data takes up twice the space that ASCII data takes up. ( varchar(8000) = nvarchar(4000) )
  7. Some of applications cannot import Unicode data files
  8. The way your text editor knows that a file has Unicode data is that the first character has special value (that varies depending of the type of encoding).
  9. Microsoft says that Unicode data processes faster.

Use this code snippet to have 'fun' with unicode


Comments, corrections, clarifications are always appreciated in the comments.

No comments:

Post a Comment