Unicode encoding systems are algorithmic methods that convert Unicode code points into storable byte sequences. The three primary encoding systems are UTF-8 (1–4 bytes per character), UTF-16 (2 or 4 bytes per character), and UTF-32 (4 bytes per character). UTF-8 is the most widely adopted encoding format on the web due to its backward compatibility with ASCII.
Think of it this way. Unicode is like a massive phone directory that gives every character a unique number. But that number alone cannot be saved on your computer. Encoding systems like UTF-8 and UTF-16 decide how that number gets converted into actual bytes that your hard drive, browser, or application can store and read.
Before Unicode existed, the internet was filled with garbled text like this:
�����[ Éf����Õì ÔǵÇ���¢!!
That garbled mess happened because different computers used different character encoding tables. Unicode and its encoding systems were designed to end this problem, and they do so whenever every system in the chain agrees on the encoding.
This guide explains what Unicode encoding systems are, how UTF-8 and UTF-16 work mechanically, their key differences, and which one you should use for your specific project. If you are new to Unicode itself, our complete plain language guide to Unicode and Non-Unicode covers the foundational concepts.
Why Were Encoding Systems Needed? (The Problem Before Unicode)
Before Unicode, computers used a 7-bit encoding called ASCII that could represent only 128 characters — enough for English letters, digits, and basic punctuation. ASCII worked perfectly for English-speaking countries but failed completely for languages like Hindi, Chinese, Arabic, or even German (which needs characters like ö, ü, ß).
To fix this, different regions created their own encodings. Single-byte Extended ASCII tables each used the extra 128 character slots differently, and East Asian languages needed multi-byte encodings of their own. By the late 1990s, there were over 60 different extended character encoding tables in use worldwide: Windows-1252 and ISO-8859-1 for Western European languages, Big5 for Traditional Chinese, and Shift_JIS for Japanese.
The fundamental problem was simple: if you opened a document saved with one encoding table using a different encoding table, the text became unreadable. A Hindi document saved with one encoding would display as random symbols on a computer using a different encoding.
This chaos is exactly why Unicode was created — to unify all characters from all writing systems under one universal standard. And encoding systems like UTF-8 and UTF-16 are how computers actually store and transmit those unified characters. For a deeper explanation of how Unicode operates at the hardware level, see our guide on how Unicode works in computers.
What Is Unicode? (The Standard vs The Encoding)
Unicode is a universal character encoding standard that assigns a unique code point to every character across every writing system in the world. The Unicode Consortium maintains and develops this standard.
Here is the most important thing to understand: Unicode is NOT an encoding. Unicode is a character set — a giant catalogue. It tells you what number each character gets. It does not tell you how to store that number as bytes.
Every character in Unicode receives a hexadecimal code point. For example:
- U+0041 = “A” (Latin capital letter A)
- U+0939 = “ह” (Devanagari letter Ha)
- U+4F60 = “你” (Chinese character for “you”)
- U+1F4A9 = “💩” (Pile of poo emoji)
- U+1F600 = “😀” (Grinning face emoji)
The full Unicode codespace ranges from U+0000 to U+10FFFF. This gives a total of 1,114,112 possible code points organised across 17 planes, where each plane contains 65,536 code points. Not all code points are assigned — the standard reserves space for future additions.
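The arithmetic behind those numbers is easy to check. The short Python sketch below is only an illustration: the constant names and the plane_of helper are made up for this example and do not come from the Unicode Standard.

```python
# Sketch: the codespace arithmetic, using illustrative (hypothetical) names.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF   # last code point in the codespace

total_code_points = PLANE_COUNT * PLANE_SIZE


def plane_of(char: str) -> int:
    """Return the plane number (0-16) that a single character lives in."""
    return ord(char) >> 16  # dividing by 65,536 leaves the plane number


print(total_code_points)                  # size of the whole codespace
print(plane_of("A"), plane_of("😀"))      # BMP is plane 0, emoji sit in plane 1
```

Plane 0 is the Basic Multilingual Plane; everything from plane 1 upward is a supplementary plane.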
The character repertoire of Unicode is synchronised with the international standard ISO/IEC 10646, ensuring global interoperability across all platforms and operating systems. Understanding the key differences between Unicode and Non-Unicode helps clarify why Unicode code points need encoding systems to function in real software.
Key Fact: Unicode version 16.0 (released September 2024) defines 154,998 characters covering 168 modern and historic scripts — including all emoji. New characters and scripts are added with each version update. Source: Unicode.org — Supported Scripts
What Are Encoding Systems in Unicode?
Encoding systems in Unicode are the algorithmic mappings that convert each Unicode code point into a unique sequence of bytes. Without an encoding system, code points remain abstract numbers that computers cannot store, transmit, or process.
Computers operate on binary data — ones and zeros grouped into bytes (8-bit units, also called octets). When you type the letter “A” or the emoji “😀”, your system must convert that character’s code point into a specific byte sequence. The encoding system defines the exact rules for this conversion.
The Unicode Standard defines three official encoding forms:
- UTF-8 — uses 8-bit code units (1 to 4 bytes per character)
- UTF-16 — uses 16-bit code units (2 or 4 bytes per character)
- UTF-32 — uses 32-bit code units (exactly 4 bytes per character)
The term “UTF” stands for Unicode Transformation Format. Each encoding format can represent all 1,114,112 code points in the Unicode codespace. They differ only in how they convert those code points into bytes — not in which characters they can represent.
A critical concept here is the code unit. A code unit is the fixed-size building block that each encoding uses internally. UTF-8’s code unit is 8 bits (1 byte). UTF-16’s code unit is 16 bits (2 bytes). UTF-32’s code unit is 32 bits (4 bytes).
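To see code units in action, here is a minimal Python sketch that counts how many code units the same character needs in each encoding form. The code_units helper is a name invented for this example, and the little-endian codec variants are used only so that Python does not prepend a BOM to the output.

```python
def code_units(text: str) -> dict:
    """Count the code units `text` occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1 byte per code unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }


for ch in ("A", "ह", "😀"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The emoji is a single code point in every case, yet it takes four UTF-8 code units, two UTF-16 code units, and one UTF-32 code unit.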
Together with the three encoding forms, the Unicode Standard specifies seven character encoding schemes: UTF-8, UTF-16, UTF-16BE (big-endian), UTF-16LE (little-endian), UTF-32, UTF-32BE, and UTF-32LE. The “BE” and “LE” variants specify explicit byte order for multi-byte code units.
Key Fact: The Unicode Standard states that any implementation supporting Unicode must support either UTF-8 or UTF-16 (or both). UTF-32 support is optional. This is why these two encodings dominate all modern software. Source: Unicode.org FAQ — UTF-8, UTF-16, UTF-32 & BOM
How UTF-8 Encoding Works
UTF-8 is a variable-width encoding that represents each Unicode code point using 1 to 4 bytes. It is fully backward compatible with ASCII — the first 128 characters encode identically in both systems. This means any valid ASCII text is already valid UTF-8 without any modification.
Here is how UTF-8 determines the number of bytes needed:
Byte Structure Pattern
| Code Point Range | Bytes Used | Bit Pattern | Characters Covered |
|---|---|---|---|
| U+0000 – U+007F | 1 byte | 0xxxxxxx | ASCII (English letters, digits, basic punctuation) |
| U+0080 – U+07FF | 2 bytes | 110xxxxx 10xxxxxx | Latin extended, Greek, Cyrillic, Arabic, Hebrew |
| U+0800 – U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx | Devanagari, Chinese, Japanese, Korean (CJK), most BMP characters |
| U+10000 – U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Most emoji, mathematical alphanumeric symbols, historic scripts |
The system works through leading bits and continuation bytes:
- The first byte (count byte) tells you how many total bytes the character uses. If it starts with 0, it is a single-byte ASCII character. If it starts with 110, it is a 2-byte sequence. If 1110, it is 3 bytes. If 11110, it is 4 bytes.
- Every continuation byte starts with 10. This makes UTF-8 self-synchronising: you can look at any byte in isolation and immediately know whether it is the start of a new character or a continuation of a previous one.
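The leading-bit rules can be sketched in a few lines of Python. This is an illustration rather than a production decoder: byte_kind and char_start are hypothetical helper names, and the code only inspects bit patterns without validating full sequences.

```python
def byte_kind(b: int) -> str:
    """Classify one UTF-8 byte by its leading bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "ascii"          # 0xxxxxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"   # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"         # 110xxxxx
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"         # 1110xxxx
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"         # 11110xxx
    return "invalid"            # 11111xxx never appears in UTF-8


def char_start(data: bytes, index: int) -> int:
    """Step backwards from any byte offset to the first byte of its character."""
    while index > 0 and byte_kind(data[index]) == "continuation":
        index -= 1
    return index


data = "a€😀".encode("utf-8")           # 61 | E2 82 AC | F0 9F 98 80
print([byte_kind(b) for b in data])
print(char_start(data, 2))              # offset 2 is inside the Euro sign
```

Because a continuation byte can never be mistaken for a lead byte, a reader that lands in the middle of a character only has to step back at most three bytes to resynchronise.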
Practical Example: Encoding “€” (Euro Sign) in UTF-8
The Euro sign “€” has the code point U+20AC.
- U+20AC falls in the range U+0800–U+FFFF, so it needs 3 bytes
- Convert 20AC to binary: 0010 0000 1010 1100
- Fill the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx
- Result: 11100010 10000010 10101100
- In hexadecimal: E2 82 AC
Practical Example: Encoding “अ” (Devanagari Letter A) in UTF-8
The Hindi character “अ” has the code point U+0905.
- U+0905 falls in the range U+0800–U+FFFF, so it needs 3 bytes
- Convert 0905 to binary: 0000 1001 0000 0101
- Fill the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx
- Result: 11100000 10100100 10000101
- In hexadecimal: E0 A4 85
This means every Hindi (Devanagari) character consumes 3 bytes in UTF-8.
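The two worked examples above follow the same recipe, so they can be written as one small function. The sketch below is a hand-rolled illustration of the byte templates, not how you would encode text in real code (Python's built-in str.encode does that); utf8_encode is a name chosen for this example.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 by filling the bit templates."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([
            0b1100_0000 | (code_point >> 6),
            0b1000_0000 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b1110_0000 | (code_point >> 12),
            0b1000_0000 | ((code_point >> 6) & 0x3F),
            0b1000_0000 | (code_point & 0x3F),
        ])
    return bytes([                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b1111_0000 | (code_point >> 18),
        0b1000_0000 | ((code_point >> 12) & 0x3F),
        0b1000_0000 | ((code_point >> 6) & 0x3F),
        0b1000_0000 | (code_point & 0x3F),
    ])


print(utf8_encode(0x20AC).hex(" "))   # Euro sign
print(utf8_encode(0x0905).hex(" "))   # Devanagari letter A
```

Comparing the result with the built-in encoder, for example utf8_encode(0x20AC) == "€".encode("utf-8"), is a quick way to confirm the templates were filled correctly.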
Tip: UTF-8 is byte-order independent. Since its code unit is a single byte (8 bits), there are no endianness issues. This eliminates the need for a Byte Order Mark (BOM) and makes UTF-8 ideal for network transmission and cross-platform data exchange.
How UTF-16 Encoding Works
UTF-16 is a variable-width encoding that uses either 2 or 4 bytes per character. Its base code unit is 16 bits (2 bytes). Characters within the Basic Multilingual Plane (BMP) — the first 65,536 code points — encode as a single 16-bit code unit. Characters beyond the BMP require a surrogate pair of two 16-bit code units.
Basic Multilingual Plane (2 Bytes)
For code points U+0000 to U+FFFF (excluding the surrogate range U+D800–U+DFFF), UTF-16 stores the code point directly as a 16-bit integer value. This covers over 60,000 of the most commonly used characters including:
- All Latin, Greek, Cyrillic, Arabic, Hebrew, and Devanagari characters
- All CJK (Chinese, Japanese, Korean) ideographs in common use
- Most mathematical symbols and punctuation marks
Surrogate Pairs (4 Bytes)
For characters beyond U+FFFF (supplementary planes), UTF-16 uses a pair of 16-bit code units called a surrogate pair:
- High surrogate: A code unit in the range U+D800–U+DBFF (1,024 values)
- Low surrogate: A code unit in the range U+DC00–U+DFFF (1,024 values)
Together, these two code units can represent 1,024 × 1,024 = 1,048,576 supplementary characters. The surrogate ranges are completely disjoint from valid single-unit character ranges, so there is never ambiguity when decoding.
Practical Example: Encoding “अ” (Devanagari Letter A) in UTF-16
The Hindi character “अ” has the code point U+0905.
- U+0905 falls within the BMP (U+0000–U+FFFF)
- It encodes directly as a single 16-bit code unit: 09 05
- Total: 2 bytes
Compare this with UTF-8 where the same character requires 3 bytes. For Hindi-dominant text, UTF-16 uses 33% less space per character.
Practical Example: Encoding “😀” (U+1F600) in UTF-16
- Subtract 0x10000 from the code point: 1F600 – 10000 = F600
- Convert F600 to binary (20 bits): 0000 1111 0110 0000 0000
- High 10 bits: 0000111101 → Add to 0xD800: 0xD83D
- Low 10 bits: 1000000000 → Add to 0xDC00: 0xDE00
- Result: D8 3D DE 00 (4 bytes as two 16-bit code units)
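The surrogate arithmetic can be sketched in Python as a pair of helpers. The function names to_surrogate_pair and from_surrogate_pair are invented for this illustration; in everyday code the built-in UTF-16 codecs do this work for you.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
print(f"U+{from_surrogate_pair(high, low):X}")
```

The reverse function shows why decoding is unambiguous: the two 10-bit halves slot back together into exactly one code point.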
Byte Order and BOM
UTF-16 has an endianness concern because its code unit is 2 bytes wide. The same character can be stored as either:
- Big-endian (UTF-16BE): Most significant byte first
- Little-endian (UTF-16LE): Least significant byte first
To indicate which byte order is used, files often begin with a Byte Order Mark (BOM) — the character U+FEFF placed at the start of the file. If a decoder reads FE FF at the beginning, the file is big-endian. If it reads FF FE, the file is little-endian.
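A BOM check is only a comparison of the first two bytes. The Python sketch below uses the BOM constants from the standard codecs module; detect_utf16_byte_order is a hypothetical helper written for this example, and it deliberately ignores other encodings (a UTF-32 little-endian BOM also begins with FF FE, for instance).

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by a UTF-16 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown (no BOM)"


# The same character, U+0905, stored in both byte orders with a BOM in front.
big = codecs.BOM_UTF16_BE + "अ".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "अ".encode("utf-16-le")

print(big.hex(" "), "->", detect_utf16_byte_order(big))
print(little.hex(" "), "->", detect_utf16_byte_order(little))
```

Python's generic "utf-16" codec performs the same check when decoding, so both byte strings above decode back to the same character.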
Tip: A common pitfall in UTF-16 string processing is treating every 16-bit code unit as an individual character. Surrogate pairs must always be handled as a single unit. Splitting text in the middle of a surrogate pair creates invalid lone surrogates that corrupt the data permanently.
UTF-8 vs UTF-16: Complete Comparison
Here is a direct comparison of UTF-8 and UTF-16 across every critical attribute:
| Attribute | UTF-8 | UTF-16 |
|---|---|---|
| Code unit size | 8 bits (1 byte) | 16 bits (2 bytes) |
| Bytes per character | 1–4 bytes | 2 or 4 bytes |
| ASCII compatibility | Yes — identical to ASCII | No — “A” becomes 00 41 instead of 41 |
| Byte order dependency | None | Yes — requires BOM or explicit BE/LE |
| BOM requirement | Optional (not recommended) | Required for unmarked byte order |
| Minimum bytes per character | 1 byte | 2 bytes |
| Encoding for BMP characters | 1–3 bytes (varies) | 2 bytes (fixed for entire BMP) |
| Encoding for supplementary characters | 4 bytes | 4 bytes (surrogate pair) |
| Self-synchronising | Yes | Yes |
| Web usage | 98%+ of all websites | Rare on the web |
| Used internally by | Go, Rust; the default text encoding on Linux and macOS, and Python 3's default source-file encoding | Java, JavaScript, C#, .NET, Windows API |
| Null-byte safety | Yes — no embedded null bytes | No — ASCII chars produce null in high byte |
| Storage efficiency (English/Latin) | More efficient (1 byte per character) | Less efficient (2 bytes per character) |
| Storage efficiency (CJK) | Less efficient (3 bytes per character) | More efficient (2 bytes per character) |
| Storage efficiency (Hindi/Devanagari) | 3 bytes per character | 2 bytes per character |
| Conversion between them | Completely lossless | Completely lossless |
Key Fact: Conversion between UTF-8, UTF-16, and UTF-32 is completely lossless because all three encoding formats represent the identical codespace (U+0000–U+10FFFF). They differ only in byte representation, not in character coverage. No data is ever lost during conversion. Source: Unicode.org FAQ — UTF-8, UTF-16, UTF-32 & BOM
Storage Efficiency by Language and Script
The encoding efficiency of UTF-8 versus UTF-16 depends entirely on the type of content being encoded. Here is a breakdown by script:
| Script / Character Type | UTF-8 Bytes | UTF-16 Bytes | More Efficient |
|---|---|---|---|
| ASCII / Basic Latin (A, B, 1, 2, @) | 1 byte | 2 bytes | UTF-8 |
| Extended Latin (é, ñ, ü, ø) | 2 bytes | 2 bytes | Equal |
| Devanagari — Hindi (अ, ब, क, ह) | 3 bytes | 2 bytes | UTF-16 |
| Tamil (அ, ஆ, இ) | 3 bytes | 2 bytes | UTF-16 |
| Bengali (অ, আ, ই) | 3 bytes | 2 bytes | UTF-16 |
| Greek, Cyrillic, Arabic, Hebrew | 2 bytes | 2 bytes | Equal |
| Chinese, Japanese, Korean (CJK) | 3 bytes | 2 bytes | UTF-16 |
| Emoji in the supplementary planes (😀, 🎉, 💩) | 4 bytes | 4 bytes | Equal |
| Mathematical alphanumeric symbols, historic scripts | 4 bytes | 4 bytes | Equal |
For an Indian audience, this table reveals something important. Hindi, Tamil, Bengali, Telugu, and the other major Indian language scripts use 3 bytes per character in UTF-8 but only 2 bytes in UTF-16.
However, web pages do not contain only text in one language. They also include HTML tags, CSS class names, JavaScript code, URLs, and JSON data — all of which are pure ASCII. This ASCII overhead typically accounts for 40–60% of a web page’s total text content.
In practice, even a Hindi-language web page has enough ASCII overhead from markup and code that UTF-8 remains the more efficient choice for overall page size when serving web content.
Tip: For web applications serving Indian languages (Hindi, Tamil, Bengali, Telugu, Kannada), UTF-8 remains the correct choice because HTML markup, JSON APIs, and URL strings are ASCII-based. The total byte savings from 1-byte ASCII characters outweigh the extra byte per Devanagari character in most real-world pages.
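You can measure this trade-off yourself. The Python sketch below compares a bare Hindi word with the same word wrapped in a small HTML tag; the sample markup and the sizes helper are made up for illustration, and real pages will have different ratios.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


plain = "नमस्ते"                              # Devanagari only
markup = '<p class="greeting">नमस्ते</p>'     # same word inside ASCII markup

for label, sample in (("plain text", plain), ("with markup", markup)):
    utf8_size, utf16_size = sizes(sample)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For the bare word UTF-16 is smaller, but as soon as ASCII markup surrounds it the balance tips towards UTF-8, because every tag character costs 1 byte instead of 2.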
Why UTF-8 Dominates the Web
UTF-8 is used by over 98% of all websites on the internet. Five specific technical reasons explain this dominance:
1. Backward compatibility with ASCII: Any existing ASCII text is already valid UTF-8 without any modification. When the web transitioned from ASCII-era encodings to Unicode, UTF-8 required zero changes to existing English content. This made adoption painless and risk-free.
2. Byte-order independence: UTF-8 has no endianness issues whatsoever. It works identically on big-endian and little-endian processors without needing a Byte Order Mark. Data can be transmitted between any two systems without byte-swapping concerns.
3. Null-byte safety: UTF-8 never produces null bytes (0x00) for non-null characters. This is critical because C-style null-terminated strings and many Unix system calls use 0x00 as a string terminator. UTF-16 breaks these systems because ASCII characters like “A” produce a null byte in their high byte position (0x00 0x41).
4. Official W3C and IETF recommendation: The W3C (World Wide Web Consortium) explicitly recommends UTF-8 as the default encoding for all web content. The JSON specification (RFC 8259) mandates UTF-8 for JSON exchanged between systems. The HTML Standard requires documents to be encoded as UTF-8, although browsers may still fall back to a legacy encoding when no charset is declared, which is why an explicit declaration matters. XML uses UTF-8 as its default encoding.
5. Compact storage for web content: HTML tags (<div>, <p>, <span>), CSS properties, JavaScript syntax, URL paths, and HTTP headers are all pure ASCII. In UTF-8, each of these characters consumes only 1 byte, exactly half what UTF-16 would require for the same content.
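The null-byte point is easy to demonstrate. In the Python sketch below, c_string_view is a hypothetical stand-in for any C-style routine that stops reading at the first 0x00 byte; it is an illustration of the failure mode, not a real system call.

```python
def c_string_view(data: bytes) -> bytes:
    """Mimic a C-style reader: everything after the first null byte is ignored."""
    return data.split(b"\x00", 1)[0]


text = "Hi"
utf8 = text.encode("utf-8")        # 48 69
utf16 = text.encode("utf-16-le")   # 48 00 69 00

print(b"\x00" in utf8, b"\x00" in utf16)
print(c_string_view(utf8))     # the whole string survives
print(c_string_view(utf16))    # truncated after the first character
```

The UTF-8 bytes pass through untouched, while the UTF-16 bytes are cut off as soon as the reader meets the zero byte that pads the first ASCII character.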
Key Fact: The W3C officially states: “Everyone developing content, whether content authors or programmers, should use the UTF-8 character encoding, unless there are very special reasons for using something else.” This recommendation covers HTML, XML, CSS, and all web technologies. Source: W3C — Character Encoding Declarations
Why Java, JavaScript, and Windows Use UTF-16
Java was designed in the early 1990s when the Unicode standard was still in version 1.0. At that time, Unicode was conceived as a purely 16-bit character set — all characters were expected to fit within 65,536 code points. The fixed-width encoding UCS-2 (2 bytes per character) was the implementation standard.
Java adopted UCS-2 as its internal string representation because it offered constant-time character indexing: the character at index 100 always starts exactly 200 bytes into the string.
JavaScript inherited this 16-bit model because it was heavily influenced by Java during its creation in 1995 at Netscape. The language’s string model was built around 16-bit code units.
Windows adopted UCS-2 during the Windows NT era (early 1990s) for its internal APIs. This is also why the language for non-Unicode programs setting exists in Windows — it handles legacy applications that never adopted Unicode.
After Unicode expanded beyond 65,536 characters in version 2.0 (July 1996), all three systems eventually moved from UCS-2 to UTF-16 by adding surrogate pair support. Their internal architecture remained 16-bit based, but they could now represent the full Unicode range.
This creates a practical consequence that developers face daily:
    // JavaScript
    let emoji = '💩';
    console.log(emoji.length);
    // Output: 2 (not 1!)
JavaScript’s .length property counts UTF-16 code units, not actual characters. The emoji “💩” (U+1F4A9) requires a surrogate pair (\uD83D\uDCA9), so JavaScript reports its length as 2.
    // JavaScript
    let emoji = '💩';
    console.log([...emoji].length);
    // Output: 1 (correct!)
Using the spread operator iterates the string by code point, so a surrogate pair counts once and the emoji is reported as a single character.
Tip: In JavaScript, use Array.from(string).length or the spread operator [...string].length to count code points instead of UTF-16 code units. The .length property over-counts emoji and other supplementary characters because it counts 16-bit code units. Neither approach counts graphemes: for user-perceived characters that span several code points, use Intl.Segmenter.
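The same distinction can be reproduced outside JavaScript. Python strings are indexed by code point, so the sketch below has to compute the UTF-16 code unit count explicitly; both helper names are invented for this example.

```python
def utf16_code_units(text: str) -> int:
    """What JavaScript's .length would report: the number of 16-bit code units."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """What the spread operator would count: the number of code points."""
    return len(text)


for sample in ("A", "अ", "💩"):
    print(repr(sample), utf16_code_units(sample), code_points(sample))
```

Only the supplementary character shows a difference between the two counts, which is exactly the case where .length surprises JavaScript developers.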
What Is UTF-32? (Brief Overview)
UTF-32 is a fixed-width encoding that uses exactly 4 bytes (32 bits) for every single character, regardless of which character it represents. The letter “A” takes 4 bytes. A Chinese character takes 4 bytes. An emoji takes 4 bytes. A Hindi “अ” takes 4 bytes.
The advantage of UTF-32 is simplicity. There is a direct one-to-one mapping between code points and code units. Character indexing becomes a constant-time operation: the character at index 50 always starts at byte offset 200. No variable-length decoding logic is needed.
The disadvantage is enormous storage waste. For English text, UTF-32 uses 4 times more space than UTF-8. Even for CJK text, it uses 2 times more space than UTF-16. As one Unicode expert puts it, a blog post in UTF-32 takes approximately four times more space than the same content in UTF-8.
Because of this inefficiency, UTF-32 (also called UCS-4) is rarely used for file storage or network transmission. Some text-processing applications use it internally during operations where random character access matters, then convert back to UTF-8 or UTF-16 for output.
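Here is a minimal Python sketch of that fixed-width indexing. The utf32_char_at helper is hypothetical, and the little-endian codec is chosen only to keep a BOM out of the buffer.

```python
def utf32_char_at(buffer: bytes, index: int) -> str:
    """Fetch the character at `index` from UTF-32LE bytes by pure arithmetic."""
    start = index * 4                       # every character is exactly 4 bytes
    return buffer[start:start + 4].decode("utf-32-le")


text = "Aअ😀z"
buffer = text.encode("utf-32-le")

print(len(buffer))                  # 4 characters x 4 bytes
print(utf32_char_at(buffer, 2))     # jumps straight to the emoji
```

No scanning is needed to reach the emoji, but the ASCII letters around it each occupy 4 bytes, which is the storage cost described above.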
When to Use UTF-8 vs UTF-16 (Practical Recommendations)
Use UTF-8 When:
- Building websites, web applications, or REST APIs
- Storing data in JSON, XML, CSV, or YAML format
- Working with files containing primarily English or Latin text
- Developing cross-platform applications for Linux, macOS, and Windows
- Configuring database encoding for web applications (use utf8mb4 in MySQL)
- Sending data over HTTP, SMTP (email), or any internet protocol
- Writing source code files in any programming language
- Starting any new project where no specific constraint exists
- Serving Indian language web content (Hindi, Tamil, Bengali, Telugu)
Use UTF-16 When:
- Making Windows API calls that require wide strings (LPWSTR)
- Developing applications that interface with Java or .NET libraries internally
- Processing text that is overwhelmingly CJK with minimal ASCII content
- Maintaining compatibility with legacy systems that mandate UTF-16
- Working within runtime environments (JVM or CLR) where strings are natively UTF-16
- Building desktop applications exclusively for Windows using the Win32 API
The practical rule is simple: default to UTF-8 for everything unless a specific technical requirement forces UTF-16. Developers working with enterprise systems like SAP can refer to our guide on Unicode vs Non-Unicode in SAP for platform-specific encoding decisions.
Tip: For all new web projects, databases, and APIs in India, whether serving content in Hindi, Tamil, Bengali, Marathi, or English, UTF-8 is the correct default choice. Use utf8mb4 (not utf8) in MySQL to support the full Unicode range including emoji. The standard MySQL utf8 charset only supports up to 3 bytes per character and cannot store emoji or supplementary characters.
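If you want to know whether existing text would survive a 3-byte charset, a quick check is enough. The Python sketch below does not talk to MySQL at all; needs_utf8mb4 is a hypothetical helper that simply looks for characters whose UTF-8 form is 4 bytes long.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


for sample in ("Hello", "नमस्ते", "नमस्ते 😀"):
    longest = max(len(ch.encode("utf-8")) for ch in sample)
    print(repr(sample), "longest character:", longest, "bytes,",
          "needs utf8mb4:", needs_utf8mb4(sample))
```

Hindi text on its own fits in 3 bytes per character; it is the emoji that pushes a string past the limit.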
Common Encoding Errors and How to Avoid Them
When text is decoded using the wrong encoding, the result is mojibake — garbled or corrupted characters that appear instead of the intended text. The term comes from Japanese (文字化け) and literally means “character transformation.”
Here is what mojibake looks like in practice. If you save Hindi text “नमस्ते” in UTF-8 but your application reads it as Latin-1 (ISO-8859-1), you might see:
à¤¨à¤®à¤¸à¥à¤¤à¥
This is a common problem in India when database connections use the wrong charset, when email clients misinterpret encoding headers, or when legacy systems interact with modern Unicode applications.
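This kind of mojibake can be reproduced, and in this particular case undone, with two lines of Python. The sketch assumes the wrong decoder really was Latin-1, which maps every byte to a character; with other encodings some bytes may be lost and the repair will not always work.

```python
original = "नमस्ते"

# Wrong: UTF-8 bytes interpreted as Latin-1, one character per byte.
garbled = original.encode("utf-8").decode("latin-1")

# Repair: turn the characters back into the same bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled))   # each 3-byte character became 3 characters
print(repaired == original)
```

The bytes never changed during any of this; only the interpretation did, which is why fixing the declared encoding is usually enough to cure mojibake.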
4 Common Encoding Error Scenarios:
1. Saving a file in UTF-16 but reading it as UTF-8: The decoder interprets UTF-16 byte sequences as UTF-8 multi-byte patterns, producing completely garbled output with replacement characters (�).
2. Missing BOM in a UTF-16 file: Without a Byte Order Mark, the reading system cannot determine whether bytes are big-endian or little-endian. Characters get their bytes swapped, producing entirely wrong characters.
3. Treating UTF-16 surrogate pairs as individual characters: If a program splits a string in the middle of a surrogate pair, it creates invalid lone surrogates. This corrupts emoji and supplementary characters permanently.
4. Mixing encodings across a data pipeline: The database stores text in UTF-8, but the application connection uses Latin-1. Every non-ASCII character (Hindi, Tamil, emoji) appears as garbled symbols or question marks. Developers working with SQL Server face this issue frequently when choosing between Unicode and Non-Unicode data types in SQL Server.
How to Prevent Encoding Errors:
- Always declare encoding explicitly in HTML: <meta charset="UTF-8">
- Set database connection charset explicitly: SET NAMES utf8mb4 in MySQL
- Ensure encoding consistency across database, backend application, and frontend
- Use HTTP Content-Type header with charset: Content-Type: text/html; charset=utf-8
- Validate byte sequences before processing untrusted input
- Test with multi-script content — mix Hindi, English, CJK, and emoji in your test data
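For the validation step, strict decoding is often all you need. The Python sketch below relies on the built-in UTF-8 decoder, which rejects truncated sequences, overlong forms, and encoded surrogates; is_valid_utf8 is a helper name made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")    # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "valid mixed text": "नमस्ते 😀".encode("utf-8"),
    "truncated 3-byte sequence": b"\xe0\xa4",
    "overlong encoding": b"\xc0\xaf",
    "encoded lone surrogate": b"\xed\xa0\x80",
}
for label, payload in samples.items():
    print(label, "->", is_valid_utf8(payload))
```

Rejecting bad input at the boundary is usually safer than decoding with errors="replace", which silently swaps damaged bytes for replacement characters.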
Key Fact: Many web pages historically declared ISO-8859-1 as their encoding but actually used the similar Windows-1252 encoding. Modern browsers now treat any ISO-8859-1 declaration as Windows-1252 to prevent display errors. This legacy confusion is one reason the industry standardised on UTF-8. Source: Wikipedia — Windows-1252
Lossless Conversion Between UTF-8 and UTF-16
Conversion between UTF-8, UTF-16, and UTF-32 is completely lossless. No data is lost when you convert text from one Unicode encoding format to another.
This is because all three encoding systems represent the exact same codespace — U+0000 to U+10FFFF. They are different byte representations of identical information. Converting between them is like converting temperature between Celsius and Fahrenheit — the actual value does not change, only the representation changes.
Every major programming language provides built-in functions for encoding conversion:
    # Python
    text = "नमस्ते 😀"
    utf8_bytes = text.encode('utf-8')      # Convert to UTF-8 bytes
    utf16_bytes = text.encode('utf-16')    # Convert to UTF-16 bytes (Python prepends a BOM)
    restored = utf8_bytes.decode('utf-8')  # Back to string, no data lost

    // JavaScript (Node.js)
    const text = "नमस्ते 😀";
    const utf8Buffer = Buffer.from(text, 'utf-8');     // UTF-8 bytes
    const utf16Buffer = Buffer.from(text, 'utf16le');  // UTF-16 little-endian bytes, no BOM

    // Java
    import java.nio.charset.StandardCharsets;

    String text = "नमस्ते 😀";
    byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
    byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);  // big-endian, with a BOM
The key principle: since all UTF encodings cover the full Unicode range, conversion between any two of them preserves every single character without exception.
If you work with Unicode and non-Unicode text formats regularly, our Unicode to Anu converter handles conversions between different encoding systems for Telugu script.
Frequently Asked Questions
Is UTF-8 better than UTF-16?
UTF-8 is more efficient for web content and Latin-heavy text because it uses only 1 byte per ASCII character instead of UTF-16’s mandatory 2 bytes. UTF-16 is more storage-efficient when content is dominated by CJK or Indian language characters (2 bytes versus UTF-8’s 3 bytes). Neither is universally better — the optimal choice depends on your content type, platform, and system requirements. For web development, UTF-8 is the industry standard.
Can UTF-8 represent all Unicode characters?
Yes. UTF-8 can encode all 1,114,112 Unicode code points using its variable-width 1-to-4-byte structure. There is no character in the Unicode Standard — including emoji, ancient scripts, and mathematical symbols — that UTF-8 cannot represent.
What is a surrogate pair in UTF-16?
A surrogate pair is a combination of two 16-bit code units — a high surrogate (range U+D800–U+DBFF) followed by a low surrogate (range U+DC00–U+DFFF) — that together represent a single character from the supplementary planes (U+10000–U+10FFFF). Characters like emoji (😀, 💩, 🎉) and rare historical scripts require surrogate pairs in UTF-16.
Why is UTF-8 the default encoding for HTML?
UTF-8 is the standard HTML encoding because it is backward compatible with ASCII, byte-order independent, and universally supported by all modern browsers. The W3C officially recommends UTF-8 for all web content, and the HTML Standard requires documents to be encoded as UTF-8. Browsers may still fall back to a legacy encoding when nothing is declared, so always declare the encoding explicitly via the <meta charset> tag.
How many bytes does an emoji use in UTF-8 and UTF-16?
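Most emoji require 4 bytes in both UTF-8 and UTF-16. Most emoji occupy code points in the supplementary planes (U+10000 and above). In UTF-8, this means a 4-byte sequence using the 11110xxx pattern. In UTF-16, this means a 4-byte surrogate pair consisting of two 16-bit code units (high surrogate + low surrogate). A few older emoji, such as ❤ (U+2764), sit in the BMP and take 3 bytes in UTF-8 and 2 bytes in UTF-16, plus the bytes of any variation selector that follows them.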
What is the difference between Unicode and UTF-8?
Unicode is a standard that assigns unique code points (numbers) to characters across all writing systems. UTF-8 is one of three encoding formats that convert those code points into actual byte sequences for storage and transmission. Unicode defines what number each character receives. UTF-8 defines how that number is stored as bytes in memory or on disk. You can think of Unicode as the dictionary and UTF-8 as the handwriting style used to write entries from that dictionary.
How many bytes does a Hindi character take in UTF-8?
Most Devanagari characters (used for Hindi, Marathi, Sanskrit, Nepali) occupy 3 bytes in UTF-8. They fall in the Unicode range U+0900–U+097F, which maps to the 3-byte encoding pattern. In UTF-16, the same characters require only 2 bytes because they are within the Basic Multilingual Plane.
What is a grapheme in Unicode?
A grapheme is what users perceive as a single visible character, even though it may consist of multiple Unicode code points internally. For example, the character “क्ष” in Hindi is a single grapheme (one visual unit) but is composed of three code points: क (U+0915) + ् (U+094D) + ष (U+0937). This is called an extended grapheme cluster. Proper text processing must account for grapheme boundaries, not just code point boundaries.
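You can inspect the code points behind that grapheme with Python's standard unicodedata module. This sketch stops at code points on purpose: the Python standard library has no grapheme cluster segmentation, so counting graphemes would need a third-party library.

```python
import unicodedata

# The conjunct is written with escapes to make its three code points explicit.
conjunct = "\u0915\u094D\u0937"   # क + ् + ष

print(conjunct, "has", len(conjunct), "code points")
for ch in conjunct:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

On screen the three code points render as one visual unit, which is why len() alone is not a reliable measure of what a reader perceives as a character.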
Key Takeaways
- Unicode is a character standard (154,998 characters across 168 scripts as of version 16.0); UTF-8 and UTF-16 are encoding formats that convert those characters into storable bytes
- UTF-8 uses 1–4 bytes per character and is backward compatible with ASCII — it powers over 98% of all websites globally
- UTF-16 uses 2 or 4 bytes per character and serves as the internal encoding for Java, JavaScript, C#, and the Windows operating system
- UTF-8 is more storage-efficient for English, Latin, and mixed-language web content; UTF-16 is more efficient for CJK and Indian language text in isolation
- Conversion between UTF-8 and UTF-16 is completely lossless — both represent the full Unicode codespace without any character loss
- Default to UTF-8 for all new web projects, APIs, databases, and cross-platform applications — this applies whether you are serving content in Hindi, Tamil, Bengali, or English
- Always declare encoding explicitly in HTML (<meta charset="UTF-8">), database connections (utf8mb4), and HTTP headers to prevent mojibake and encoding mismatches
Understanding encoding systems is fundamental for any developer working with multilingual text. For more Unicode resources, tools, and guides, visit unicode-to-nonunicode.com.