Unicode, UTF and Multi-Language Support
“You can never understand one language until you understand at least two” — Geoffrey Williams.
Have you ever faced a situation where data received from another system looked 'corrupted' or like 'gibberish'? Something like helloẀ or hello???
Read on to learn about it.
Morse Code
While teaching at New York University in 1835, Samuel Morse proved that 'signals' could be transmitted by electric wire. He went on to create a system using just 2 symbols, which came to be known as Morse code!
Morse code was a revolutionary innovation: with just 2 symbols, dots and dashes, messages could travel across continents with virtually no wait time!
Remember, this was the year 1838!
You can explore more here.
Computers
Fast forward to the computer age in the 1950s…
The working of a modern computer is mostly based on electric signals, and more specifically on transistors. A transistor is like an electric switch, i.e. it can be either On or Off. These two states are represented as 1 and 0.
Think of computer storage as being made of billions of these switches, and all they can do is be On or Off. We refer to these switches as bits (binary digits).
American Standard Code for Information Interchange (ASCII)
After the advent of modern computers in the 1950s, manufacturers were looking for a way to encode data in binary, as binary was the only 'language' computers could understand.
To this day most computers only understand binary. Quantum computers are one exception here!
So manufacturers came up with their own versions of character encoding, i.e. their own way of representing English letters, numbers, etc. in binary format.
By 1963, computer manufacturers had over sixty different ways of representing characters, e.g. manufacturer ABC would represent A as 0001 while manufacturer XYZ would represent the same A as 0101, and so on.
As a result, computers from different manufacturers couldn't process each other's data.
Think about it: a text file created by you couldn't be opened on your friend's computer if it came from a different manufacturer!
In May 1961, an IBM engineer, Bob Bemer, sent a proposal to the American National Standards Institute (ANSI) to develop a single code for computer communication.
The idea was that 128 Latin characters — letters, numbers, punctuation marks, and control codes — would each have a standard numeric value.
As each bit can represent 2 states, 7 bits were needed to represent 128 unique characters (2 to the power of 7). ASCII was finalized as a 7-bit code in 1963; in practice each character is stored in 8 bits, which leaves room for a few extra characters.
8 bits are also known as 1 byte.
The result!
Standard Character Encoding: representing every character with a unique number and a binary representation.
As you may have guessed, we now had a standard representation for all English characters. Why only English? At that time most manufacturers and users of computers were in English-speaking countries, so it made business sense to cater to English.
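To see this mapping in action, here is a minimal Python sketch (the characters are just arbitrary examples):

```python
# Inspect the ASCII code and 8-bit binary form of a few characters.
for ch in ["A", "a", "0", "!"]:
    code = ord(ch)                       # numeric code, e.g. 65 for 'A'
    print(ch, code, format(code, "08b"))

# And back again: the number 65 maps to 'A'.
print(chr(65))
```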
Unicode
Everyone in the world should be able to use their own language on phones and computers.
As the use of computers spread to different geographies, the need to support non-English characters grew.
In 1987, Joe Becker from Xerox, along with Lee Collins and Mark Davis from Apple, started exploring a universal character set. After a multi-year collaboration with other industry experts, the first version of Unicode was released in 1991.
The goal was to assign a unique number to every character in every written language.
This was a similar effort to ASCII, except now the scope was almost all written languages on Earth!
A Code Point is a unique number assigned to a character. It is written as U+<hexadecimal code>.
E.g. the English letter A is assigned U+0041. That's equivalent to decimal 65, which is the same code ASCII uses for the letter 'A', so Unicode is backward compatible with ASCII!
You can use this link for conversions.
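If you have Python handy, you can do the same lookups locally; a small sketch (the characters chosen are just examples):

```python
import unicodedata

# Show each character's code point in the familiar U+XXXX form.
for ch in ["A", "µ", "か"]:
    cp = ord(ch)
    print(f"{ch} -> U+{cp:04X} (decimal {cp}), {unicodedata.name(ch)}")

# 'A' is U+0041, i.e. decimal 65 -- the same value ASCII uses.
assert ord("A") == 0x41 == 65
```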
Let's see how various languages are assigned unique numbers, or code points…
The first letter of each language is highlighted with its binary equivalent, e.g. for Arabic the first letter's hexadecimal code is U+FEBE and its binary equivalent is 11111110 10111110.
Is that all? Nah.
Did you notice that for non-English characters 16 bits are being used?
Pause for a moment and think: with ASCII it was easy, every character was represented by a unique number that fit in 8 bits / 1 byte.
Now if a computer receives the binary data 0011000001001011, how does it know whether it is the first letter of the Japanese alphabet か, or 0K (ASCII), or something else?
Use this converter and the ASCII table above if it's not clear.
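To make the ambiguity concrete, here is a small Python sketch: the very same two bytes decode to different text depending on which encoding we assume (UTF-16 big-endian is used here just to illustrate the "16 bits per character" reading):

```python
# The 16 bits 00110000 01001011 are the two bytes 0x30 0x4B.
data = bytes([0b00110000, 0b01001011])

print(data.decode("ascii"))      # '0K' -- two 1-byte ASCII characters
print(data.decode("utf-16-be"))  # 'か' -- one 16-bit code point, U+304B
```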
So how do we handle this ambiguity? Enter UTF.
Unicode Transformation Format (UTF)
This is the final piece of the overall puzzle, and an interesting one. From the Unicode website, it is:
An algorithmic mapping from every Unicode code point to a unique byte sequence.
Basically, it can translate any Unicode character to a matching binary format.
There are many flavors of UTF, e.g. UTF-8, UTF-16, UTF-32, etc. Out of these, UTF-8 is the most widely used, so let's explore that.
UTF-8 is a variable-length encoding and can use 1 to 4 bytes to encode a character.
So here is how the UTF-8 mapping algorithm works (a short code sketch after the list shows these byte patterns):
- If a byte starts with 0, then it represents a single-byte character (e.g. A)
e.g. 01000001
- If a byte starts with 110, then it is the first byte of a 2-byte character (e.g. Ĝ)
e.g. 11000100 10011100
- If a byte starts with 1110, then it is the first byte of a 3-byte character (e.g. か)
e.g. 11100011 10000001 10001011
- If a byte starts with 11110, then it is the first byte of a 4-byte character (e.g. 𐎀)
e.g. 11110000 10010000 10001110 10000000
- For multi-byte characters, all bytes after the first start with 10.
The first byte is known as the leading byte and the rest as continuation bytes.
Please ignore the spaces in the above binary representations; they are only there for readability and are not part of the actual data.
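Here is a short Python sketch that reproduces the byte patterns above, using the same example characters (Python's built-in UTF-8 encoder does the work):

```python
# Print the UTF-8 byte patterns for 1-, 2-, 3- and 4-byte characters.
for ch in ["A", "Ĝ", "か", "𐎀"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"{ch}: {len(encoded)} byte(s) -> {bits}")
```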
Unicode to UTF-8 mapping and range of values, from Wikipedia:

Bytes  Code point bits  First code point  Last code point  Byte pattern
1      7                U+000000          U+00007F         0xxxxxxx
2      11               U+000080          U+0007FF         110xxxxx 10xxxxxx
3      16               U+000800          U+00FFFF         1110xxxx 10xxxxxx 10xxxxxx
4      21               U+010000          U+10FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
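Those ranges translate directly into a small lookup; here is a sketch (the function name utf8_length is just for illustration), cross-checked against Python's own encoder:

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes needed for a code point, per the ranges above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")

# Cross-check against Python's built-in UTF-8 encoder.
for ch in ["A", "µ", "か", "𐎀"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```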
Coming back to gibberish or corrupted data: this usually happens when data contains non-ASCII characters and a program tries to read it as ASCII.
Let's pick a 2-byte character example: µ. The Unicode code point assigned to it is U+00B5 (181 in decimal, 10110101 in binary). As you can see from the mapping above, this falls into the 2-byte range.
Let's try to fit it in as per the UTF-8 rules:
- We first create 'empty' bytes based on the above rules.
- We need to fill the placeholder digits: 6 places in the 2nd byte and 5 places in the 1st byte.
- Start filling bits from right to left!
- Fill any remaining placeholder positions on the left with 0s.
Hence the UTF-8 representation of this character is 11000010 10110101.
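You can verify this result with a few lines of Python (µ is written here by its code point, just to be explicit):

```python
mu = "\u00b5"                    # MICRO SIGN, code point U+00B5 (decimal 181)
encoded = mu.encode("utf-8")
print(encoded)                                      # b'\xc2\xb5'
print(" ".join(format(b, "08b") for b in encoded))  # 11000010 10110101
```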
Now the fun part: if a computer or program receives the data below and 'assumes' it to be ASCII (or, more precisely, an extended single-byte encoding such as Latin-1), how will it interpret it?
11000010 10110101
The first 8 bits, 11000010, are 194 in decimal; this may be shown as Â.
Similarly, the 2nd part, 10110101, is 181 in decimal: µ.
So the editor may incorrectly interpret this as µ instead of µ - sounds like data corruption!!
Please note there are tons of other encodings besides ASCII and UTF, so depending upon the editor's or program's default encoding, you may see different values.
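The same mix-up is easy to reproduce; a small sketch (Latin-1 stands in for "treat each byte as one character", since a strict ASCII decoder would simply refuse bytes above 127):

```python
data = "µ".encode("utf-8")        # b'\xc2\xb5'

print(data.decode("utf-8"))       # 'µ'  -- the correct interpretation
print(data.decode("latin-1"))     # 'µ' -- each byte read as its own character

# A strict ASCII decode raises an error instead of guessing.
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii decode failed:", err)
```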
Conclusion
- Agree on and be aware of the encoding being used (mostly UTF-8 is the default); see the sketch after this list for making it explicit.
- Problems usually happen when we are dealing with non-ASCII characters.
- There may be multiple hops between source and destination. Check all of them and make sure the encoding is consistent.
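In practice, the simplest safeguard is to state the encoding explicitly instead of relying on platform defaults; a minimal Python sketch (the file name is made up):

```python
# Write and read a file with an explicit encoding rather than the platform default.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("résumé µ か")

with open("notes.txt", "r", encoding="utf-8") as f:
    print(f.read())
```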
Thanks for reading. Please share your likes and comments :)