Jun 23, 2019

UTF-8 Encoding

UTF-8 is a character encoding that converts text in any language into bytes. It is a variable-width encoding for Unicode characters: each character is represented by one to four bytes.

Why UTF-8?

UTF-8 has become the most popular character encoding standard. 60% of the websites on the internet use UTF-8, and if you include ASCII, which is a subset of UTF-8, that number goes up to 80%. The main reason for using UTF-8 is to display text in languages other than English. It also supports emojis, which have become very popular in text messages.

How does UTF-8 work?

UTF-8 represents the code point of each Unicode character as a sequence of one to four bytes. The high-order bits of the first byte tell us how many bytes were used to encode the character.

1 Byte UTF-8 Encoding

ASCII characters, which range from 0 to 127, are represented using a single byte. Every ASCII character fits in 7 bits, which frees up the first bit of the byte to store a 0 indicating that the character was encoded in a single byte. For example, the capital letter A has a code point of 65, which is 1000001 in binary. With the leading 0 indicator it becomes the byte 01000001.

Character    A
Binary       0          1      0      0      0      0      0      1
Bit Utility  Indicator  Bit 1  Bit 2  Bit 3  Bit 4  Bit 5  Bit 6  Bit 7
Hex          41
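The single-byte case can be checked with a short Python sketch. Python is used here only for illustration, and the helper name `is_single_byte` is hypothetical; `str.encode` is the language's built-in encoder.

```python
def is_single_byte(byte: int) -> bool:
    """Return True when the high-order (indicator) bit of the byte is 0."""
    return (byte & 0b10000000) == 0


encoded = "A".encode("utf-8")      # built-in UTF-8 encoder
print(len(encoded))                # number of bytes used
print(format(encoded[0], "08b"))   # the byte as 8 binary digits
print(format(encoded[0], "X"))     # the byte in hex
print(is_single_byte(encoded[0]))  # indicator bit check
```

For the letter A this prints 1, 01000001, 41 and True, matching the table above.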

2 Byte UTF-8 Encoding

Two-byte encodings are for characters with code points from 128 to 2047. The first 3 bits of the first byte are always set to 110, and the first 2 bits of the second byte are always set to 10. That leaves 11 bits (5 in the first byte and 6 in the second) for the code point itself. Given below is the breakdown of the two-byte sequence C2 99, which encodes code point 153 (U+0099). The Unicode trademark sign ™ is a different character: its code point is 8482 (U+2122), so it takes 3 bytes in UTF-8.

Character    U+0099
Binary       11000010 10011001
Hex          C299
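The bit layout described above can be sketched with a few bit operations. This is a minimal illustration, not a full encoder: the function name `encode_two_bytes` is hypothetical, and code point 153 is used because its encoding is the byte pair C2 99 shown above.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers code points 128 to 2047")
    first = 0b11000000 | (code_point >> 6)           # 110 prefix + top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10 prefix + low 6 bits
    return bytes([first, second])


encoded = encode_two_bytes(153)
print(encoded.hex().upper())                        # C299
print(" ".join(format(b, "08b") for b in encoded))  # 11000010 10011001
```

The result agrees with Python's built-in encoder, `chr(153).encode("utf-8")`.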

Similarly, characters from 2,048 to 65,535 take 3 bytes, and all remaining characters, from 65,536 to 1,114,111, take 4 bytes.

Character   Encoded Bytes (Decimal)   Encoded Bytes (Hex)   Encoded Bytes (Binary)          Bytes Used   Indicators
A           65                        41                    0100 0001                       1            0
U+0099      49,817                    C299                  1100 0010 1001 1001             2            110 10
ओ           14,722,195                E0A493                1110 0000 1010 0100 1001 0011   3            1110 10 10
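Putting the ranges together, a general encoder might look like the sketch below. The function name `utf8_encode` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does this work; the sketch only shows how the indicator bits and payload bits are combined for each length.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using bit operations."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:        # 0..127: one byte, prefix 0
        return bytes([code_point])
    if code_point < 0x800:       # 128..2047: two bytes, prefix 110
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # 2,048..65,535: three bytes, prefix 1110
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([               # 65,536 and above: four bytes, prefix 11110
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


for cp in (0x41, 0x99, 0x913):
    print(f"U+{cp:04X}", utf8_encode(cp).hex().upper())
```

The loop prints 41, C299 and E0A493, the three hex values in the table above.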

Advantages of UTF-8

  • Works with null-terminated string functions, because no byte of a multi-byte character is ever zero
  • Widely used, for example in HTML, JSON and XML
  • Any Unicode character can be encoded without having to choose a code page
  • Encoding and decoding need only simple bit operations, so both are fast
  • Does not depend on the endianness of the computer
  • Smaller than UTF-16 for text that contains only basic Latin (ASCII) characters

Disadvantages of UTF-8

  • Larger in size for text in languages that need 3 or 4 bytes to be represented
  • Most Japanese, Chinese and Korean characters require 3 bytes in UTF-8 compared to 2 in UTF-16
  • Takes 2x the space to encode Cyrillic and Greek text compared to their dedicated encoding formats
  • Takes 3x the space to encode Hindi and Thai text compared to their dedicated encoding formats
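The size trade-offs listed above can be measured directly. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` and the sample strings are illustrative assumptions, not part of any standard.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so the 2-byte byte order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {"Latin": "hello", "Cyrillic": "привет", "Japanese": "こんにちは"}
for script, text in samples.items():
    print(script, encoded_sizes(text))

# The same Cyrillic word in a dedicated single-byte encoding, for comparison.
print(len("привет".encode("iso-8859-5")))
```

With these samples, the Latin word takes 5 bytes in UTF-8 against 10 in UTF-16, the Japanese word takes 15 against 10, and the Cyrillic word takes 12 bytes in UTF-8 against 6 in ISO-8859-5.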

It all comes down to the indicator bits that tell us how many bytes were used to encode a single character. Because binary has only two digits, these prefixes grow longer as the byte count increases. The prefixes of the first byte for one-, two-, three- and four-byte encodings are:

  • 0
  • 110
  • 1110
  • 11110
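A decoder can read these prefixes to learn how long a sequence is before reading the rest of it. The following is a minimal sketch; the function name `sequence_length` is hypothetical, and it inspects only the first byte without validating the bytes that follow.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence uses, based on its first byte."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and never starts a sequence.
    raise ValueError("not a valid first byte of a UTF-8 sequence")


for first in (0x41, 0xC2, 0xE0, 0xF0):
    print(format(first, "08b"), sequence_length(first))
```

The loop reports lengths 1, 2, 3 and 4 for the four sample bytes.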

UTF-8 Tools

Given below is a list of all our tools that deal with UTF-8 Encoding.

Encode any text in UTF-8