How Unicode Works in Computers
A Simple, Clear Explanation & Guide
Ever wondered how your computer displays Telugu, Chinese, emoji, and English, all at the same time? Unicode is the answer. Here's exactly how it works.
Table of Contents
- What Is Unicode? (The 30-Second Version)
- Before Unicode: The Chaos of Code Pages
- Code Points: Every Character Gets a Number
- How Computers Actually Store Unicode Text
- UTF-8, UTF-16, UTF-32: What's the Difference?
- A Real Binary Example: Letter "A" Step by Step
- Fonts and Rendering: From Code Point to Character on Screen
- Unicode vs Non-Unicode: Why Both Still Exist
- Frequently Asked Questions
- Related Articles & Tools
What Is Unicode? (The 30-Second Version)
Imagine a giant master list that gives every single character, from every language ever written by humans, its own unique identification number. That's Unicode.
Before Unicode, computers had to use different, incompatible systems for different languages. Japanese text would break on a Spanish computer. Arabic would turn into gibberish on an English machine. Unicode solved this by creating one unified numbering system that all computers, operating systems, and applications can agree on.
One Global Standard
Unicode covers 154,998 characters across 168 writing systems (as of version 16.0), from modern English to ancient Sumerian cuneiform.
Every Character = A Number
The letter "A" is U+0041. The Telugu character "అ" is U+0C05. Even emoji have numbers: 😀 is U+1F600.
Universally Agreed Upon
Maintained by the Unicode Consortium, a non-profit that includes Apple, Google, Microsoft, Adobe, and others.
Want a complete guide on what Unicode is and how it compares to non-Unicode systems? Read our full article: What is Unicode and Non-Unicode? A Complete Plain Language Guide
Before Unicode: The Chaos of Code Pages
To understand why Unicode matters, you need to understand the mess it replaced. Before Unicode became universal in the 1990s and 2000s, every computer system used a code page: a small, vendor- or region-specific table that mapped numbers (bytes) to characters.
The problem? Different code pages assigned the same number to completely different characters.
| Byte Value | Windows-1252 (Western) | Windows-1251 (Cyrillic) | Windows-1256 (Arabic) |
|---|---|---|---|
| 0xC0 | À (A with grave) | А (Cyrillic A) | ہ (Arabic Heh Goal) |
| 0xE9 | é (e with acute) | й (Cyrillic Short I) | é (e with acute, same as Windows-1252) |
| 0xD0 | Ð (Eth) | Р (Cyrillic Er) | ذ (Arabic Thal) |
When a French document (using Windows-1252) was opened on a Russian computer (using Windows-1251), every accented letter turned into an unrelated Cyrillic character. This is called mojibake, Japanese for "character transformation", and it was a daily headache for anyone working with international text.
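The effect is easy to reproduce. The sketch below decodes the three byte values from the table under each code page, then simulates the French-document case. It assumes Python's built-in codec names `cp1252`, `cp1251`, and `cp1256` for the three Windows code pages; the helper name `decode_with_each` is made up for illustration.

```python
# The three byte values from the table, as one raw byte string.
raw = bytes([0xC0, 0xE9, 0xD0])

# Python's codec names for the three Windows code pages.
CODE_PAGES = ["cp1252", "cp1251", "cp1256"]


def decode_with_each(data: bytes) -> dict[str, str]:
    """Decode the same bytes under every code page in CODE_PAGES."""
    return {name: data.decode(name) for name in CODE_PAGES}


results = decode_with_each(raw)
for name, text in results.items():
    print(f"{name}: {text!r}")

# Text saved as Windows-1252, then wrongly read back as Windows-1251.
garbled = "café".encode("cp1252").decode("cp1251")
print(garbled)
```

The bytes never change; only the table used to interpret them does.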
This same problem shows up in modern workflows when you're dealing with legacy fonts like Anu Script for Telugu or Kruti Dev for Hindi. These fonts use their own private character maps, which is exactly why tools like our Unicode to ANU Converter are still needed today.
Code Points: Every Character Gets a Number
The foundation of Unicode is simple: every character gets a unique identification number called a code point. Code points are written in the format U+XXXX, where the X values are four to six hexadecimal digits.
Examples of Unicode Code Points
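A few code points can be looked up directly in most languages. This is a minimal sketch using Python's built-in `ord()` and `chr()`; the helper name `code_point_label` is hypothetical, chosen only for this example.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# Print each character with its U+ label and its decimal value.
for ch in ["A", "é", "అ", "😀"]:
    print(ch, code_point_label(ch), ord(ch))

# chr() goes the other way: from a number back to the character.
print(chr(0x0C05))
```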
How Many Code Points Exist?
Unicode's total capacity is 1,114,112 possible code points, organised into 17 groups of 65,536 called planes. As of Unicode 16.0, 154,998 of those positions are assigned to characters. Most of the rest are unassigned and available for future additions.
| Plane | Range | Name | Contains |
|---|---|---|---|
| Plane 0 | U+0000–U+FFFF | BMP | All common characters: Latin, Cyrillic, Arabic, Hebrew, most CJK, Telugu, Devanagari, symbols (the most used plane) |
| Plane 1 | U+10000–U+1FFFF | SMP | Historic scripts, musical notation, emoji, rare symbols |
| Plane 2 | U+20000–U+2FFFF | SIP | Rare and historic CJK characters |
| Plane 3 | U+30000–U+3FFFF | TIP | Additional rare and historic CJK characters |
| Planes 4–13 | U+40000–U+DFFFF | Unassigned | Reserved for future use |
| Plane 14 | U+E0000–U+EFFFF | SSP | Tags and variation selectors |
| Planes 15–16 | U+F0000–U+10FFFF | PUA | Private Use Area: custom/vendor characters |
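Because every plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. The sketch below checks the total capacity and finds the plane of a few characters; `plane_of` and the two constants are illustrative names, not part of any library.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) that a character's code point falls in."""
    return ord(ch) >> 16


# Total size of the Unicode code space.
print(PLANE_SIZE * PLANE_COUNT)

for ch in ["A", "అ", "😀"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```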
How Computers Actually Store Unicode Text
Computers store everything as binary numbers: sequences of 0s and 1s grouped into bytes. To store the text "Hello అ 😀" in a file, your computer needs to convert each character's code point into a specific sequence of bytes.
This conversion process is called encoding. Think of it as translation: the code point is the "idea" and the encoded bytes are the "spoken words." Different encodings are like different languages for expressing the same idea.
The Journey from Character to Storage
1. Say you type the letter "A". Your keyboard sends a signal to the OS.
2. The OS knows "A" = U+0041 (decimal 65). This is the abstract identity of the character.
3. Using UTF-8, U+0041 becomes the single byte 0x41 = binary 01000001.
4. The file or database now contains those bytes. The character is stored.
When you open the file, the reverse happens: bytes are decoded to code points, the OS finds the right glyph in the active font, and renders it on screen.
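The round trip can be sketched in a few lines. This example skips the keyboard and rendering stages and only covers code point lookup, encoding, storage, and decoding; the temporary file stands in for whatever file or database would hold the text.

```python
import os
import tempfile

text = "A"

# Look up the abstract code point, then encode it as UTF-8 bytes.
code_point = ord(text)
encoded = text.encode("utf-8")

# Store the raw bytes in a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(encoded)
    path = f.name

# Reverse direction: read the bytes back and decode them to text.
with open(path, "rb") as f:
    stored = f.read()
decoded = stored.decode("utf-8")
os.remove(path)

print(f"U+{code_point:04X}", encoded.hex(), format(encoded[0], "08b"), decoded)
```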
UTF-8, UTF-16, UTF-32: What's the Difference?
All three encoding formats can represent every Unicode character. The difference is how many bytes they use and in what situations each one makes sense.
| Character | Code Point | UTF-8 Bytes | UTF-16 Bytes (little-endian) | UTF-32 Bytes (little-endian) |
|---|---|---|---|---|
| A | U+0041 | 1 byte: 41 | 2 bytes: 41 00 | 4 bytes: 41 00 00 00 |
| é | U+00E9 | 2 bytes: C3 A9 | 2 bytes: E9 00 | 4 bytes: E9 00 00 00 |
| అ (Telugu) | U+0C05 | 3 bytes: E0 B0 85 | 2 bytes: 05 0C | 4 bytes: 05 0C 00 00 |
| 😀 (Emoji) | U+1F600 | 4 bytes: F0 9F 98 80 | 4 bytes: 3D D8 00 DE (surrogate pair) | 4 bytes: 00 F6 01 00 |
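The byte sequences in the table can be reproduced with Python's built-in codecs. This sketch assumes little-endian byte order without a byte order mark, which is what the `utf-16-le` and `utf-32-le` codecs produce; `hex_bytes` is a made-up helper name.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as spaced uppercase hex."""
    return ch.encode(encoding).hex(" ").upper()


# One output row per character: code point, then UTF-8, UTF-16, UTF-32 bytes.
for ch in ["A", "é", "అ", "😀"]:
    row = [hex_bytes(ch, enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")]
    print(f"U+{ord(ch):04X}", *row, sep=" | ")
```

Using the plain `utf-16` or `utf-32` codec names instead would prepend a byte order mark, so the output would no longer match the table byte for byte.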
For SQL Server developers, understanding UTF-16 is especially important because the nvarchar and nchar data types use UTF-16 encoding internally. Read our detailed guide: Unicode vs Non-Unicode in SQL Server: Developer's Guide
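One UTF-16 detail worth knowing when sizing UTF-16 storage: a character above U+FFFF does not fit in a single 16-bit unit and is split into two, called a surrogate pair. The following is a sketch of that arithmetic as defined by the UTF-16 encoding form; the function name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Count the 16-bit code units Python's own encoder produces for the emoji.
units = len("😀".encode("utf-16-le")) // 2
print(units)
```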
A Real Binary Example: Letter "A" Step by Step
Let's trace the letter "A" all the way from your keyboard to the raw bits stored on disk, using UTF-8.
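As a rough illustration, the trace can be written out in code. Code points below U+0080 encode to a single UTF-8 byte whose value equals the code point, so "A" ends up as one byte with a leading 0 bit. The function `trace` and its dictionary keys are invented for this example.

```python
def trace(ch: str) -> dict[str, str]:
    """Collect each representation of a single character on its way to disk."""
    cp = ord(ch)
    data = ch.encode("utf-8")
    return {
        "character": ch,
        "code point": f"U+{cp:04X}",
        "decimal": str(cp),
        "utf-8 hex": data.hex(" ").upper(),
        "utf-8 bits": " ".join(f"{byte:08b}" for byte in data),
    }


for step, value in trace("A").items():
    print(f"{step:>10}: {value}")
```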
Now Let's Trace "అ" (Telugu A) in UTF-8
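Code points from U+0800 to U+FFFF use UTF-8's three-byte pattern, `1110xxxx 10xxxxxx 10xxxxxx`, where the x positions are filled with the 16 bits of the code point. The sketch below packs those bits by hand for U+0C05 and compares the result with Python's built-in encoder; `encode_three_byte` is an illustrative name, not a library function.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte pattern covers U+0800..U+FFFF only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    byte1 = 0b11100000 | (code_point >> 12)          # 1110xxxx: top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    byte3 = 0b10000000 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    return bytes([byte1, byte2, byte3])


manual = encode_three_byte(0x0C05)
print(" ".join(f"{b:08b}" for b in manual))
print(manual.hex(" ").upper())

# The hand-packed bytes should match the built-in encoder.
print(manual == "అ".encode("utf-8"))
```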
Fonts and Rendering: From Code Point to Character on Screen
A common misconception is that Unicode draws characters. It doesn't; fonts do. Unicode just assigns a number to each character. It's the font file that contains the actual visual shape (called a glyph) for each code point.
Here's how the rendering pipeline works:
1. Your browser, text editor, or OS reads bytes and converts them to code points using the file's declared encoding.
2. For complex scripts like Telugu, Arabic, or Devanagari, a shaping engine (like HarfBuzz) maps sequences of code points to the correct conjunct and ligature forms and applies directional rules (right-to-left for Arabic).
3. The OS checks the active font file for a glyph matching each code point. If no glyph is found, you see □ (the "tofu" box) or a glyph from a fallback font.
4. The font rasteriser converts vector outlines into pixels at the right size, with anti-aliasing for smooth edges.
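Only the first stage of this pipeline can be shown with a standard library alone. The sketch below decodes bytes to code points and looks up two properties the Unicode Character Database records for each one: its name and its bidirectional class (`L` for left-to-right, `AL` for Arabic letters). It illustrates that Unicode carries identity and behaviour but no shapes; glyph lookup and rasterising need a font library and are not shown. The function `describe` is a made-up helper.

```python
import unicodedata


def describe(data: bytes, encoding: str = "utf-8") -> list[tuple[str, str, str]]:
    """Decode bytes, then report each code point's name and bidi class."""
    text = data.decode(encoding)
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.bidirectional(ch),
        )
        for ch in text
    ]


# Latin A, Telugu A, Arabic Alef, encoded as UTF-8 bytes.
sample = "A\u0c05\u0627".encode("utf-8")
for label, name, bidi in describe(sample):
    print(label, name, bidi)
```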
Unicode vs Non-Unicode: Why Both Still Exist
If Unicode is so much better, why do non-Unicode systems still exist in 2026? The answer is simple: legacy systems, existing workflows, and the cost of migration.
| Dimension | Unicode | Non-Unicode |
|---|---|---|
| Character Capacity | 154,000+ characters assigned, about 1.1 million possible | 256 per SBCS code page |
| Multilingual Support | All languages in one document | One language family per document |
| Storage per ASCII char | 1 byte (UTF-8) / 2 bytes (UTF-16) | 1 byte (SBCS) |
| SQL Server | nvarchar, nchar (UTF-16) | varchar, char (code page-based) |
| Web standard | 98%+ of websites use UTF-8 | Rarely used on the modern web |
| DTP & Legacy Fonts | Compatible with modern fonts (OTF, TTF) | Required for Anu Script, Kruti Dev, Nudi |
| SAP Systems | Required for SAP Unicode systems (post-2005) | Used in pre-2005 SAP installations |
Need to Convert Telugu Unicode to Non-Unicode?
Use our free online tool: no sign-up, no installation. Works with Anu Script, PageMaker, CorelDraw, and all legacy DTP formats.
Frequently Asked Questions
Related Articles & Tools
Explore more guides on unicode-to-nonunicode.com: everything about Unicode, non-Unicode, and text encoding explained in plain language.