UTF-8 has become the most popular character encoding standard. About 60% of websites use UTF-8, and if you include ASCII, which is a subset of UTF-8, that number rises to 80%. The main reason for using UTF-8 is to display text in languages other than English. It also supports emoji, which have become very popular in text messages.
UTF-8 works by representing characters as binary numbers. Each Unicode character is represented by one to four bytes. The high-order bits of the first byte tell us how many bytes were used to encode the character.
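As a minimal sketch of that idea, the Python function below reads the high-order bits of a leading byte and reports the length of the sequence. The name `utf8_sequence_length` is hypothetical and chosen for illustration. It only inspects the prefix pattern and does not check that the byte is otherwise a valid lead byte.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a lead byte.
    raise ValueError(f"{first_byte:#04x} cannot start a UTF-8 sequence")


for ch in ("A", "©", "ओ", "😀"):
    encoded = ch.encode("utf-8")
    # The length read from the prefix should match the real encoded length.
    print(ch, utf8_sequence_length(encoded[0]), len(encoded))
```

A decoder can use this length to know how many continuation bytes to read before moving on to the next character.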
ASCII characters, which range from 0 to 127, are represented using a single byte. All ASCII characters can be represented using 7 bits. This frees up the first bit of the byte to store 0, which tells us that the character was encoded into a single byte. For example, the capital letter A has a code point of 65. It can be represented using the binary number 1000001, which is stored in the byte as 01000001.
|Bit Utility||Indicator||Bit 1||Bit 2||Bit 3||Bit 4||Bit 5||Bit 6||Bit 7|
|Value||0||1||0||0||0||0||0||1|
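The single-byte case can be checked with a few lines of Python. This is only an illustrative sketch that uses the built-in `ord`, `format` and `str.encode`; the variable names are chosen here and are not part of any API.

```python
code_point = ord("A")              # code point of the capital letter A
bits = format(code_point, "08b")   # the same value written as 8 binary digits
encoded = "A".encode("utf-8")      # the actual UTF-8 bytes

# The first bit is the single-byte indicator; the other seven hold the value.
indicator, payload = bits[0], bits[1:]
print(code_point, bits, indicator, payload, encoded)

# Every ASCII code point encodes to exactly one byte equal to itself.
assert all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128))
```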
2 byte encodings are for characters ranging from 128 to 2047. The first 3 bits of the first byte are always set to 110. The first 2 bits of the second byte are always set to 10. That leaves 11 bits available for the actual character being encoded: 5 in the first byte and 6 in the second. The table below breaks down the 2 byte UTF-8 encoding of the Unicode character ©, the copyright sign, which has a code point of 169. The trademark sign ™ has a code point of 8,482, which is above 2047, so it takes 3 bytes rather than 2.
Similarly, characters from 2,048 to 65,535 take up 3 bytes, and all other characters, from 65,536 up to the last code point at 1,114,111, take 4 bytes.
|Character||Code Point||Encoded Value (Decimal)||Encoded Value (Hex)||Encoded Value (Binary)||Bytes Used||Indicators|
|©||169||49,833||C2A9||1100 0010 1010 1001||2||110 10|
|™||8,482||14,845,090||E284A2||1110 0010 1000 0100 1010 0010||3||1110 10 10|
|ओ||2,323||14,722,195||E0A493||1110 0000 1010 0100 1001 0011||3||1110 10 10|
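To make the bit layout concrete, here is a hedged sketch of an encoder that builds the bytes by hand from a code point. The function name `encode_code_point` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does this work, and it is used below only to cross-check the result.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using explicit bit operations."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ("A", "©", "ओ", "😀"):
    manual = encode_code_point(ord(ch))
    # Cross-check the hand-built bytes against Python's own encoder.
    assert manual == ch.encode("utf-8")
    print(ch, ord(ch), manual.hex().upper(), len(manual))
```

Each branch first writes the indicator bits, then fills the remaining positions with the code point's bits in groups of six, taken from the most significant end.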
It all comes down to the indicator bits, which tell us how many bytes were used to encode a single character. Because binary has only two digits, these prefixes grow longer as the number of bytes increases.
Given below is a list of all our tools that deal with UTF-8 Encoding.
Encode any text in UTF-8