
If you have ever opened a text file and seen something like Ã© instead of é, or if your code suddenly breaks with a UnicodeDecodeError, then you are dealing with an encoding issue. Do not worry. This guide from unicode-to-nonunicode.com will help you find the problem and fix it step by step.
In this complete guide, you will learn:
- The 5 most common encoding problem symptoms
- Command line and Python methods to detect encoding
- How to find invisible characters and security threats
- Step-by-step fixes for mojibake, BOM issues, and CSV import problems
- What to do when automatic detection fails
Let us start.
What Are Encoding Issues? (5 Symptoms Users Actually Experience)
Before you fix anything, you need to know what kind of problem you have. Here are five common symptoms people face when dealing with text file encoding issues.
Symptom 1: Mojibake – “Ã©” instead of “é”
You see weird symbols, question mark diamonds, or random letters with accents. This is called mojibake (a Japanese word that means “character transformation”). It happens when text is decoded with a different encoding from the one it was saved in, most often a UTF-8 file opened as Windows-1252.
Real world example: You download a CSV file from a database. When you open it, Hernández becomes HernÃ¡ndez. Your data looks completely broken.
Why this happens: Your computer is reading the bytes of a UTF-8 file but trying to display them using a different encoding map. Understanding how Unicode works in computers helps explain why these byte-level mismatches create garbled text.
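The mismatch can be reproduced in a few lines. This is a minimal sketch using only Python's built-in codecs (cp1252 is Python's name for Windows-1252); the variable names are illustrative.

```python
# Encode "é" as UTF-8: one character becomes two bytes.
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes)                      # b'\xc3\xa9'

# Decode those two bytes with the wrong map (Windows-1252):
# each byte is treated as its own character.
wrong = utf8_bytes.decode("cp1252")
print(wrong)                           # Ã©

# Decode the same bytes with the right map.
right = utf8_bytes.decode("utf-8")
print(right)                           # é
```

The bytes never change. Only the map used to interpret them does, which is why the damage is usually reversible.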
Symptom 2: Invisible Characters Breaking Your Code
Your code looks perfect. The strings appear identical. But when you compare them, Python says False. This happens because of zero width spaces (U+200B) – characters you cannot see but your computer can.
Example:
password1 = "admin123"
password2 = "admin\u200b123" # Contains zero width space
print(password1 == password2) # False
Both look the same on screen. But they are different. This breaks login systems, data validation, and API calls.
Symptom 3: String Comparisons Fail Silently
Sometimes the problem is not invisible spaces but homoglyphs – characters that look identical but come from different scripts.
Example: The Latin letter a (U+0061) and the Cyrillic letter а (U+0430) look exactly the same. A fake domain name like pаypal.com uses the Cyrillic а. You cannot see the difference, but browsers and databases can. This is one of the key differences between Unicode and non-Unicode systems — Unicode assigns unique code points to visually similar characters from different scripts.
Symptom 4: Your File Works on Mac But Not Windows
You create a file on your Mac. Everything works fine. You send it to a Windows user. Suddenly, they see broken characters or the file does not open at all.
Why this happens:
- Mac and Linux use LF (line feed) for new lines
- Windows uses CRLF (carriage return + line feed)
- Some Windows applications add a BOM (Byte Order Mark) that breaks Unix tools
If you are experiencing this on Windows, you may need to adjust the language for non-Unicode programs in Windows settings to match the encoding of your file.
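To see which of these differences a file actually contains, you can inspect its raw bytes. The sketch below is one possible approach, not a standard tool: describe_line_endings is a hypothetical helper, and it assumes an ASCII-compatible encoding such as UTF-8 or Windows-1252 (it would miscount in UTF-16).

```python
import codecs

def describe_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR endings and report a UTF-8 BOM."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf      # LF not preceded by CR
    cr = data.count(b"\r") - crlf      # CR not followed by LF
    return {
        "has_utf8_bom": data.startswith(codecs.BOM_UTF8),
        "crlf": crlf,
        "lf": lf,
        "cr": cr,
        "mixed": sum(1 for n in (crlf, lf, cr) if n) > 1,
    }

# Sample bytes: UTF-8 BOM, two CRLF endings, one bare LF ending.
sample = b"\xef\xbb\xbfname,city\r\nAna,Lima\nBo,Oslo\r\n"
print(describe_line_endings(sample))
# {'has_utf8_bom': True, 'crlf': 2, 'lf': 1, 'cr': 0, 'mixed': True}
```

For a real file, read it with open(path, 'rb') and pass the bytes to the function.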
Symptom 5: CSV or JSON Imports Fail Unexpectedly
You have a CSV file. It imports correctly when the file is small. But when the file is large, the import fails with an encoding error.
According to a Metabase issue report, a CSV file imported correctly when saved as UTF-8-BOM but failed when saved as plain UTF-8. This happens because the encoding detection system only reads the first few bytes. If those bytes look normal, but later bytes contain different characters, the guess is wrong and the import fails.
Key takeaway: Identify your symptom above, then jump to the relevant detection method below.
Method 1: Detect Encoding Using Command Line (Fast and Built-in)
If you are on a Linux or Mac computer, you already have tools to detect encoding. You do not need to install anything.
The file Command on Linux and macOS
Open your terminal and run this command:
file -i filename.txt
Example output:
filename.txt: text/plain; charset=utf-8
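The -i flag tells the file command to show MIME type and encoding information. On macOS, the bundled file command uses -I (capital I) for this; the long option --mime works on both systems. The command works by reading the beginning of your file and looking for byte patterns.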
Limitation: The file command can only detect encoding. It cannot find invisible characters like zero width spaces, bidirectional overrides, or homoglyphs.
Using chardetect (Python-based)
If you have Python installed, you can use the chardet library. Install it first:
pip install chardet
Then run:
chardetect mystery_file.csv
Example output:
mystery_file.csv: utf-8 with confidence 0.99
The confidence score tells you how sure the detector is. A score of 0.99 means 99% confidence.
Quick Comparison: Which Method to Use When
| Scenario | Best Tool | Why |
|---|---|---|
| Quick check on a Linux server | file -i | Built into the system, no installation needed |
| You are already using Python | chardetect | Gives you a confidence score |
| You need the highest accuracy | chardetect | 99.3% accuracy on standard test files |
| You are on Windows without Python | Try online detector or install WSL | Windows does not have a built-in encoding command |
Fact 1: The chardet library version 7.4.0 achieves 99.3% accuracy on a test set of 2,517 files. It is the most accurate encoding detector available today. (Source)
Method 2: Detect Encoding Using Python Libraries (Most Accurate)
If you are a developer or you work with many files, you should use Python libraries. They give you more control and better accuracy.
Using chardet (99.3% Accuracy)
import chardet
# Open in binary mode so chardet receives the raw bytes
with open('unknown_file.txt', 'rb') as file:
    raw_data = file.read()
result = chardet.detect(raw_data)
print(f"Encoding: {result['encoding']}")
print(f"Confidence: {result['confidence']}")
The file must be opened in binary mode ('rb'). This tells Python to read raw bytes instead of trying to decode them.
Why choose chardet: In the benchmark table below, chardet 7.4.0 scores 11.1 percentage points higher than chardet 6.0.0 and 13.9 percentage points higher than the alternative library charset-normalizer.
Fact 2: chardet 7.4.0 processes 551 files per second on a standard computer. That is 47 times faster than older versions. (Source)
Using charset-normalizer (Alternative Approach)
from charset_normalizer import detect
with open('unknown_file.txt', 'rb') as file:
    result = detect(file.read())
print(f"Encoding: {result['encoding']}")
print(f"Confidence: {result['confidence']}")
The main difference is that charset-normalizer also tries to detect the language of the text. This can be useful if your file contains mixed languages.
Fact 3: charset-normalizer 3.4.6 achieves 85.4% accuracy, compared to chardet 7.4.0 at 99.3%. For most users, chardet remains the better choice. (Source)
Complete Accuracy and Speed Benchmark Table
| Detector | Accuracy | Speed (files/second) | Memory Usage |
|---|---|---|---|
| chardet 7.4.0 (mypyc version) | 99.3% | 551 files/sec | 52.9 MB |
| chardet 6.0.0 | 88.2% | 12 files/sec | 29.5 MB |
| charset-normalizer 3.4.6 | 85.4% | 376 files/sec | 78.8 MB |
| cchardet 2.1.19 | 55.9% | 2,005 files/sec | Not available |
Key takeaway: If you need maximum accuracy, use chardet. If you need maximum speed and can tolerate lower accuracy, use cchardet.
Method 3: Detect Invisible and Dangerous Characters (Security Focus)
This section is extremely important. Most guides about encoding detection completely ignore invisible characters and security threats. But these problems can break your code or even allow hackers to attack your system.
What Are Zero-Width Spaces? (U+200B)
A zero-width space is exactly what it sounds like – a space character that takes up no visual width. You cannot see it, but your computer can. This makes it very dangerous.
How it breaks your code:
user_input = "admin" # Normal string
database_value = "admin\u200b" # Contains zero-width space at the end
if user_input == database_value:
    print("Access granted")
else:
    print("Access denied") # This runs, even though both look the same
Where zero-width spaces hide:
- Text copied from PDF files
- Messages from messaging apps
- Data pasted from web pages
- User input in web forms
How to detect them: Use the straudit tool. We will cover this below.
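If you only need a quick check inside your own code, the standard library can locate these characters too. The sketch below is an assumption-laden stand-in, not how straudit works: find_invisible and strip_invisible are hypothetical helpers that treat every character in Unicode category Cf (format characters, which includes U+200B) as invisible.

```python
import unicodedata

def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for every category-Cf character."""
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

def strip_invisible(text: str) -> str:
    """Drop every category-Cf character (see the caveat below)."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

database_value = "admin\u200b"
print(find_invisible(database_value))              # [(5, 'U+200B', 'ZERO WIDTH SPACE')]
print(strip_invisible(database_value) == "admin")  # True
```

Stripping is a blunt instrument. Zero-width joiners are legitimate in emoji sequences and in scripts such as Persian, so it is usually safer to report them on user-visible text and strip them only from identifiers such as usernames.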
BiDi Override Characters and Trojan Source Attacks (CVE-2021-42574)
This is a real security vulnerability discovered by researchers at Cambridge University. It affects almost every programming language.
What is a BiDi override? BiDi stands for bidirectional text. Normally, English text goes left to right, and Arabic or Hebrew text goes right to left. BiDi override characters force the text direction to change, even in the middle of a line.
How attackers use this: They can write code that looks safe in a code review but actually does something completely different when compiled.
The attack example:
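What the reviewer sees (simplified to one line):
/* begin admins only */ if (isAdmin) { start_admin_privileges(); /* end admins only */ }
What the compiler actually processes (control characters omitted):
/* begin admins only if (isAdmin) { */ start_admin_privileges(); /* end admins only } */
The isAdmin check appears to guard the dangerous function call. But hidden BiDi characters reorder how the line is displayed. In the real character order, the if statement and its closing brace sit inside the comments, so the compiler ignores them and the call runs for every user.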
Fact 4: CVE-2021-42574 (Trojan Source vulnerability) affects all programming languages that support Unicode. The vulnerability was discovered by researchers at the University of Cambridge. (Source)
How to detect BiDi overrides:
straudit --security source_code.py
The tool will warn you if it finds any BiDi override characters.
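For illustration, a similar check can be sketched with the standard library alone. This is not straudit's implementation; find_bidi_controls is a hypothetical helper that looks for the nine explicit directional formatting characters (embeddings, overrides, and isolates) that the Trojan Source technique relies on.

```python
# Explicit BiDi formatting characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, abbreviation) for each BiDi control."""
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                findings.append((line_no, col, BIDI_CONTROLS[char]))
    return findings

# A comment that hides a right-to-left override (RLO) character.
snippet = "x = 1\n# harmless \u202e comment\n"
print(find_bidi_controls(snippet))   # [(2, 12, 'RLO')]
```

Source code rarely needs these characters outside string literals and comments written in right-to-left languages, so any hit in a code review deserves a closer look.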
Homoglyphs – When “a” is Not Actually “a”
A homoglyph is a character that looks like another character but has a different Unicode code point.
Real example:
- Latin letter a = U+0061
- Cyrillic letter а = U+0430
- Greek letter α = U+03B1
All three look almost identical. But a computer sees them as completely different characters.
Attack scenario: An attacker registers the domain pаypal.com using the Cyrillic а. They send you a phishing email. You click the link. The domain looks exactly like the real PayPal domain. But it is a fake website that steals your password.
How to detect homoglyphs:
straudit --explain suspicious_text.txt
The tool will show you each character, its Unicode code point, and its script (Latin, Cyrillic, Greek, etc.). Mixed scripts are highlighted as warnings.
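A rough version of the mixed-script check can be sketched in plain Python. Note the assumption: Python's standard library does not expose the Unicode Script property, so the hypothetical helper scripts_used approximates it with the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, and so on).

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def looks_mixed(text: str) -> bool:
    """True when letters from more than one script appear together."""
    return len(scripts_used(text)) > 1

print(scripts_used("paypal"))        # {'LATIN'}
print(looks_mixed("p\u0430ypal"))    # True: \u0430 is CYRILLIC SMALL LETTER A
```

This heuristic flags any mixture, including legitimate ones such as a Russian sentence that mentions an English brand name. It suits short identifiers such as domain labels and usernames better than free text.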
Using straudit to Detect All Issues in One Pass
straudit is a single command line tool that checks everything – encoding, invisible characters, security issues, line endings, and more.
Complete file audit:
straudit my_file.txt
What it checks in one command:
- File encoding (with confidence score)
- BOM (Byte Order Mark) presence
- BiDi override characters (Trojan Source attacks)
- Homoglyphs (Cyrillic, Greek lookalikes)
- Mixed scripts (Latin + Cyrillic + Greek in same file)
- Zero-width spaces and joiners
- Mixed line endings (LF vs CRLF)
- Non-printable and control characters
Comparison: straudit vs other tools
| Feature | straudit | file command | chardet | cat -v |
|---|---|---|---|---|
| Encoding detection | Yes | Yes | Yes | No |
| Invisible characters | Yes | No | No | Partial (shows raw bytes only) |
| BiDi attack detection | Yes | No | No | No |
| Homoglyph detection | Yes | No | No | No |
| Mixed script detection | Yes | No | No | No |
| Line ending check | Yes | No | No | No |
Method 4: Fix Encoding Issues Once Detected
Now that you have detected the problem, here is how to fix it.
Converting Files with iconv
iconv is a command line tool available on Linux, Mac, and Windows (through WSL). It converts files from one encoding to another.
Basic syntax:
iconv -f from_encoding -t to_encoding input.txt > output.txt
Real world examples:
# Convert Windows-1252 to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 corrupted_file.csv > fixed_file.csv
# Convert ISO-8859-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 european_text.txt > utf8_text.txt
# Convert UTF-16 to UTF-8
iconv -f UTF-16 -t UTF-8 japanese_file.txt > fixed.txt
If you need to understand the differences between these encoding formats, our guide on encoding systems: Unicode UTF-8 vs UTF-16 explains them in detail.
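On systems without iconv, the same conversion can be sketched in Python. convert_encoding is a hypothetical helper, and the file names in the demo are placeholders created in a temporary directory.

```python
import os
import tempfile
from pathlib import Path

def convert_encoding(src: str, dst: str, from_encoding: str,
                     to_encoding: str = "utf-8") -> None:
    """Re-encode a text file; raises UnicodeDecodeError on undecodable bytes."""
    # newline="" leaves the original line endings untouched.
    with open(src, "r", encoding=from_encoding, newline="") as source:
        text = source.read()
    with open(dst, "w", encoding=to_encoding, newline="") as target:
        target.write(text)

# Demo on a throwaway file containing "Hernández" saved as Windows-1252.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.csv")
    dst = os.path.join(tmp, "fixed.csv")
    Path(src).write_bytes(b"Hern\xe1ndez\r\n")
    convert_encoding(src, dst, "cp1252")
    converted = Path(dst).read_bytes()

print(converted)   # b'Hern\xc3\xa1ndez\r\n'
```

Like iconv, this only works if you name the correct source encoding. Converting from the wrong one either raises an error or silently produces mojibake.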
Removing BOM from UTF-8 Files
The Byte Order Mark (BOM) is a special character at the start of some UTF-8 files. It helps Windows applications identify the file as UTF-8. But it breaks Unix tools, JSON parsers, and many web applications.
The BOM problem in real life: According to a Metabase support issue, a CSV file imported correctly when saved as UTF-8-BOM, but failed completely when saved as plain UTF-8. The solution was to always use UTF-8-BOM for CSV files that will be imported into Metabase.
How to remove BOM using Python:
# Read with utf-8-sig, which strips the BOM if one is present
with open('file_with_bom.csv', 'r', encoding='utf-8-sig', newline='') as source:
    content = source.read()
# Write with plain utf-8, which adds no BOM
with open('file_without_bom.csv', 'w', encoding='utf-8', newline='') as target:
    target.write(content)
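Tip: When reading, the utf-8-sig encoding in Python removes the BOM if one is present and otherwise behaves like plain utf-8. When writing, it always adds a BOM. Read with utf-8-sig for maximum compatibility, and write with utf-8 when the BOM must be gone.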
Fixing Mojibake with ftfy
ftfy (Fixes Text For You) is a Python library specifically designed to fix mojibake and other Unicode problems.
Installation:
pip install ftfy
Usage:
from ftfy import fix_text
garbled_text = "Ã©tÃ©" # This should be "été"
fixed_text = fix_text(garbled_text)
print(fixed_text) # Output: été
Another example:
garbled = "HernÃ¡ndez"
fixed = fix_text(garbled)
print(fixed) # Output: Hernández
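For the simplest case, UTF-8 text that was decoded once as Windows-1252, the repair can also be sketched without any library by reversing the mistake. undo_mojibake is a hypothetical helper, not part of ftfy, and it covers far fewer cases than ftfy does.

```python
def undo_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Reverse one round of wrong-codec decoding; return text unchanged on failure."""
    try:
        # Turn the garbled characters back into the original bytes,
        # then decode those bytes with the encoding they were written in.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# "Hern\u00c3\u00a1ndez" is the garbled form: Ã (U+00C3) followed by ¡ (U+00A1).
print(undo_mojibake("Hern\u00c3\u00a1ndez"))   # Hernández
# Clean input does not survive the round trip, so it is returned as-is.
print(undo_mojibake("Hern\u00e1ndez"))         # Hernández
```

The round trip is a guess. Text that legitimately contains a sequence such as Ã© would be changed, which is the kind of false positive ftfy's heuristics are designed to avoid.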
Using clean-text Library
The clean-text library includes a specific function called fix_bad_unicode().
Installation:
pip install clean-text
Usage:
from cleantext import clean
text = "HernÃ¡ndez"
# to_ascii and lower default to True, which would return "hernandez"
cleaned = clean(text, fix_unicode=True, to_ascii=False, lower=False)
print(cleaned) # Output: Hernández
Tip: The fix_unicode parameter repairs common mojibake patterns automatically.
Unicode Normalization (NFC vs NFD)
Some characters can be written in two different ways in Unicode. For example, the letter é can be:
- A single character: U+00E9 (called NFC – Normalization Form Composed)
- Two characters: e (U+0065) + combining accent (U+0301) (called NFD – Normalization Form Decomposed)
Both look the same, but computers see them as different. This breaks string searches, sorting, and comparisons. For a deeper understanding of why this happens, read our guide on what is Unicode and non-Unicode.
How to normalize in Python:
import unicodedata
text = "café" # May contain composed or decomposed characters
# Normalize to composed form (NFC) – use this for storage
normalized_nfc = unicodedata.normalize('NFC', text)
# Normalize to decomposed form (NFD) – sometimes needed for comparisons
normalized_nfd = unicodedata.normalize('NFD', text)
Recommendation: Always normalize text to NFC before storing it in a database. This ensures consistent comparisons.
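The effect is easy to demonstrate. In this short sketch the two spellings of "café" are written with explicit escapes so the difference is visible in the source; same_text is a hypothetical helper.

```python
import unicodedata

composed = "caf\u00e9"       # é as a single code point
decomposed = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 4 5

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))   # True
```

Normalizing at the boundary, when text enters your system, means the rest of the code can use ordinary == comparisons.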
Platform-Specific Encoding Problems (And Their Fixes)
Different platforms handle encoding differently. Here are the most common problems and their solutions.
CSV Import Encoding Problems
The problem: You try to import a CSV file. When the file is small, it works. When the file is large, you get Chinese characters or question marks.
Why this happens: The system only reads the first few bytes of your file to guess the encoding. If those first bytes look like UTF-8, it assumes the whole file is UTF-8. But later bytes might be in a different encoding.
Solution using Python:
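# Read as UTF-8, tolerating a BOM if one is already present
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as source_file:
    content = source_file.read()
# Write as UTF-8 with a BOM, as the tip below recommends for these importers
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as target_file:
    target_file.write(content)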
Tip for CSV files: Always save CSV files as UTF-8-BOM (not plain UTF-8) when you know the file will be imported into systems like Metabase, Excel, or Power BI.
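Before importing a large file, it can help to confirm that the whole file, not just its beginning, is valid UTF-8. The sketch below shows the failure mode described above with made-up data; first_invalid_utf8 is a hypothetical helper, and it assumes the data fits in memory.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None

# 1,000 clean ASCII rows, then one row with a Windows-1252 byte (0xE1 = á).
data = b"id,name\n" * 1000 + b"1,Hern\xe1ndez\n"

print(first_invalid_utf8(data[:4096]))   # None: the first 4 KB look like UTF-8
print(first_invalid_utf8(data))          # 8006: the bad byte is further in
```

A detector that samples only the first few kilobytes would call this file UTF-8. Decoding the whole file strictly finds the exact offset to inspect.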
JSON UnicodeDecodeError
The error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 123: character maps to <undefined>
The fix: Always specify encoding='utf-8' when opening JSON files.
import json
# Correct way
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
# Wrong way (can cause UnicodeDecodeError on Windows)
with open('data.json', 'r') as file:
    data = json.load(file)
Why this works: On Windows, the default system encoding is not UTF-8. So you must explicitly tell Python to use UTF-8.
Windows Console UTF-8 Display Issues
The problem: You print UTF-8 text in Windows Command Prompt, and you see garbled characters like ├® instead of é.
The fix: Change the console code page to UTF-8.
chcp 65001
What is code page 65001? It is Windows’ identifier for UTF-8. After running this command, the console will correctly display UTF-8 characters.
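Permanent fix: The chcp command only changes the current console session. To make Python itself use UTF-8 everywhere, add PYTHONUTF8=1 to your Windows system environment variables. This enables Python's UTF-8 mode, which is available in Python 3.7 and later.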
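To check which encodings Python is currently using, a small diagnostic like the one below can help. It is a sketch using standard library calls; encoding_report is a hypothetical helper, and the values it prints depend on your machine, so no expected output is shown.

```python
import locale
import sys

def encoding_report() -> dict:
    """Collect the encodings Python uses when none is given explicitly."""
    return {
        # What open() falls back to when encoding= is omitted.
        "open_default": locale.getpreferredencoding(False),
        # What print() uses when writing to the console.
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when PYTHONUTF8=1 (or -X utf8) is in effect.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }

print(encoding_report())
```

Running it before and after setting PYTHONUTF8=1 shows whether the setting has taken effect in the shell you are using.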
What to Do When Encoding Detection Fails (Fallback Methods)
Sometimes, even the best tools cannot detect the encoding with high confidence. Here is why and what to do.
Why Detection Sometimes Fails (Encoding Ambiguity)
Different encodings can decode the same bytes in valid ways. For example, the byte 0x80 means:
- In Windows-1252: Euro symbol (€)
- In ISO-8859-1: Unused control character
- In some other encodings: A different symbol
The detector has to guess. Sometimes the guess is wrong.
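Fact 5: ISO-8859-1 and Windows-1252 are identical for 224 of the 256 possible byte values (87.5%). They only differ in the range 0x80 to 0x9F, where ISO-8859-1 has control characters and Windows-1252 has printable symbols such as € and curly quotes. This is why detectors often return low confidence scores for files that could be either encoding. (Source)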
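When a detector cannot decide, one pragmatic fallback is to try a short list of candidate encodings in a fixed order. The sketch below is one possible ordering, not a rule: decode_with_fallbacks is a hypothetical helper, and the candidate list assumes Western European data.

```python
def decode_with_fallbacks(data: bytes,
                          candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Try each codec strictly, in order; return (text, codec_that_worked)."""
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

print(decode_with_fallbacks("caf\u00e9".encode("utf-8")))  # ('café', 'utf-8-sig')
print(decode_with_fallbacks(b"price: \x80 5"))             # ('price: € 5', 'cp1252')
print(decode_with_fallbacks(b"\x81"))                      # ('\x81', 'latin-1')
```

The order matters. Strict UTF-8 goes first because text in a legacy encoding rarely passes UTF-8 validation by accident. Python's cp1252 rejects the five byte values that Windows-1252 leaves undefined, and latin-1 goes last because it accepts every byte and therefore never fails.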
Manual Detection by Byte Patterns
When auto-detection fails, look for Byte Order Marks (BOMs). A BOM is a special byte sequence at the very start of a file.
Common BOM signatures in hexadecimal:
| Encoding | BOM Signature (Hex) |
|---|---|
| UTF-8 with BOM | EF BB BF |
| UTF-16 Little Endian | FF FE |
| UTF-16 Big Endian | FE FF |
| UTF-32 Little Endian | FF FE 00 00 |
| UTF-32 Big Endian | 00 00 FE FF |
How to view the hex of a file:
xxd filename.txt | head -n 5
The first line shows the first 16 bytes of your file. If you see ef bb bf at the beginning, your file is UTF-8 with BOM.
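The same check can be scripted. This sketch compares the start of the data against the BOM constants in Python's codecs module; sniff_bom is a hypothetical helper, and the labels it returns simply mirror the table above.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 Little Endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 Big Endian"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian"),
]

def sniff_bom(data: bytes):
    """Return the encoding label for a leading BOM, or None if there is none."""
    for signature, label in BOMS:
        if data.startswith(signature):
            return label
    return None

print(sniff_bom(b"\xef\xbb\xbfname,city"))   # UTF-8 with BOM
print(sniff_bom(b"\xff\xfeh\x00i\x00"))      # UTF-16 Little Endian
print(sniff_bom(b"name,city"))               # None
```

A missing BOM proves nothing, since most UTF-8 files have none. A BOM that is present is a strong hint, which is why it is worth checking before any statistical guess.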
Automating Encoding Detection (CI/CD Integration)
If you maintain a codebase or a data pipeline, you should automate encoding detection to prevent problems before they happen.
Using straudit --check in CI Pipelines
The --check mode makes straudit exit with an error code if any issue is found. This is perfect for CI/CD (Continuous Integration / Continuous Deployment).
GitHub Actions example:
- name: Check for encoding and security issues
  run: |
    pip install straudit
    straudit --check src/**/*.py
    # If straudit finds any issue, it exits with code 1
    # This fails the build, preventing bad files from entering your codebase
Why you should do this: It prevents Trojan Source attacks, invisible character bugs, and encoding-related crashes from ever reaching production.
JSON Output for Automation
straudit -o json my_file.txt
The output is valid JSON that you can parse with other tools.
Use cases:
- Send alerts to monitoring systems
- Log encoding issues for audit trails
- Integrate with Python, Node.js, or any language that can parse JSON
Comparison of Encoding Detection Tools
Here is a complete comparison to help you choose the right tool for your needs.
| Tool | Best For | Accuracy | Finds Invisible Characters? | Finds Security Issues? | Works in CI/CD? | Platform |
|---|---|---|---|---|---|---|
| chardet | Highest accuracy encoding detection | 99.3% | No | No | Yes | Python 3.10 and above |
| straudit | Complete security + encoding audit | Good | Yes | Yes | Yes | Python (standard library only) |
| file -i | Quick command line check | Medium | No | No | No | Linux, macOS |
| charset-normalizer | Language detection + encoding | 85.4% | No | No | Yes | Python |
| iconv | Converting files | N/A | No | No | No | Cross-platform (command line) |
Recommendation from unicode-to-nonunicode.com: Use chardet when you only need encoding detection. Use straudit when you also need security auditing (invisible characters, BiDi overrides, homoglyphs, mixed scripts). If you work with enterprise systems, you may also want to understand how encoding affects platforms like SAP or SQL Server.
Frequently Asked Questions (FAQ)
Q1: How do I know if my file is UTF-8 encoded?
A: Use the file -i filename.txt command on Linux (file -I filename.txt on macOS). On Windows, use chardetect filename.txt (you need to install chardet first). Both tools will tell you the encoding; chardetect also gives you a confidence score.
Q2: What is the difference between UTF-8 and UTF-8 with BOM?
A: UTF-8 with BOM includes a special 3-byte signature (EF BB BF) at the start of the file. This helps Windows applications identify the file as UTF-8. However, the BOM breaks Unix tools, many JSON parsers, and some web applications. According to a Metabase issue report, some systems require UTF-8-BOM while others require plain UTF-8. Test your specific platform.
Q3: Why do I see the � (replacement character) in my text?
A: The � character (Unicode code point U+FFFD) appears when your text viewer meets a byte sequence that is not valid in the encoding it assumes, and shows a placeholder instead. This usually means your file is saved in a different encoding than what your viewer expects. Use chardet to detect the correct encoding, then convert the file using iconv.
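A short sketch of where the replacement character comes from, using Python's lenient decoding mode (errors="replace") to stand in for what a text viewer does:

```python
# "café" saved as ISO-8859-1: é is the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")
print(latin1_bytes)                                # b'caf\xe9'

# A lenient UTF-8 decode swaps the invalid byte for U+FFFD instead of failing.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown == "caf\ufffd")                        # True

# The original byte is no longer in `shown`; go back to the raw bytes
# and decode them with the correct encoding.
print(latin1_bytes.decode("latin-1"))              # café
```

Once text containing U+FFFD has been saved, the original bytes are lost. Always repair from the original file rather than from a copy that already shows the replacement character.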
Q4: Can invisible characters affect code execution?
A: Yes, absolutely. Zero-width spaces break string comparisons and password validation. BiDi override characters (CVE-2021-42574) can completely change how code executes while making it look safe in a code review. The Trojan Source vulnerability affects all major programming languages. Always scan your code files with tools like straudit before committing them.
Q5: What is the most accurate encoding detector?
A: According to the official benchmark, chardet version 7.4.0 achieves 99.3% accuracy on a test set of 2,517 files. This is the highest accuracy among all detectors. The next closest is charset-normalizer at 85.4%. Use chardet for maximum accuracy.
Q6: How do I fix a CSV file that shows Chinese characters when I import it?
A: This usually happens because your CSV file is saved as plain UTF-8, but the import system expects UTF-8-BOM. Open your CSV file in a text editor like Notepad++, go to the Encoding menu, select “UTF-8-BOM” (not just “UTF-8”), save the file, and try importing again. Alternatively, use the Python utf-8-sig encoding as shown in the CSV section above.
Q7: My Python code works on Mac but breaks on Windows. Why?
A: Windows uses a different default system encoding (usually Windows-1252) while Mac and Linux use UTF-8. Always specify encoding='utf-8' when opening files in Python. Example: open('file.txt', 'r', encoding='utf-8'). This makes your code work the same way on all platforms.