How to Detect Encoding Issues in Text Files (Unicode Problems Fix)

If you have ever opened a text file and seen something like Ã© instead of é, or if your code suddenly breaks with a UnicodeDecodeError, then you are dealing with an encoding issue. Do not worry. This guide from unicode-to-nonunicode.com will help you find the problem and fix it step by step.

In this complete guide, you will learn:

The 5 most common encoding problem symptoms
Command line and Python methods to detect encoding
How to find invisible characters and security threats
Step-by-step fixes for mojibake, BOM issues, and CSV import problems
What to do when automatic detection fails

Let us start.

What Are Encoding Issues? (5 Symptoms Users Actually Experience)

Before you fix anything, you need to know what kind of problem you have. Here are five common symptoms people face when dealing with text file encoding issues.

Symptom 1: Mojibake – “Ã©” instead of “é”

You see weird symbols, question mark diamonds, or random letters with accents. This is called mojibake (a Japanese word that means “character transformation”). The classic case is a file saved as UTF-8 being opened as Windows-1252.

Real world example: You download a CSV file from a database. When you open it, Hernández becomes HernÃ¡ndez. Your data looks completely broken.

Why this happens: Your computer is reading the bytes of a UTF-8 file but trying to display them using a different encoding map. Understanding how Unicode works in computers helps explain why these byte-level mismatches create garbled text.
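The mismatch described above can be reproduced in a few lines. This is a minimal sketch that assumes the file was written as UTF-8 and read as Windows-1252 (Python's cp1252 codec); the variable names are illustrative only.

```python
# Reproduce the mojibake from the example: UTF-8 bytes read as Windows-1252.
original = "Hernández"
utf8_bytes = original.encode("utf-8")      # á is stored as the two bytes C3 A1
garbled = utf8_bytes.decode("cp1252")      # each byte is shown as its own character
print(garbled)                             # HernÃ¡ndez

# Reversing the same two steps recovers the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)                            # Hernández
```

The round trip only works while the garbled text is still intact. Once a tool has replaced unknown bytes with ? or �, the original characters cannot be recovered this way.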

Symptom 2: Invisible Characters Breaking Your Code

Your code looks perfect. The strings appear identical. But when you compare them, Python says False. This happens because of zero width spaces (U+200B) – characters you cannot see but your computer can.

Example:

Python

password1 = "admin123"
password2 = "admin\u200b123"  # Contains zero width space
print(password1 == password2)  # False

Both look the same on screen. But they are different. This breaks login systems, data validation, and API calls.
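One way to make such characters visible is to list every character in Unicode's "format" category (Cf), which includes zero width spaces and joiners. The sketch below uses only the standard library; find_format_chars is a hypothetical helper name, not part of any tool mentioned in this guide.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, code point, name) for every invisible format character."""
    hits = []
    for index, char in enumerate(text):
        # Category "Cf" covers zero width spaces, joiners, and BiDi controls.
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

print(find_format_chars("admin\u200b123"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]
```

An empty list means no format characters were found. It does not rule out other problems such as homoglyphs.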

Symptom 3: String Comparisons Fail Silently

Sometimes the problem is not invisible spaces but homoglyphs – characters that look identical but come from different scripts.

Example: The Latin letter a (U+0061) and the Cyrillic letter а (U+0430) look exactly the same. A fake domain name like pаypal.com uses the Cyrillic а. You cannot see the difference, but browsers and databases can. This is one of the key differences between Unicode and non-Unicode systems — Unicode assigns unique code points to visually similar characters from different scripts.

Symptom 4: Your File Works on Mac But Not Windows

You create a file on your Mac. Everything works fine. You send it to a Windows user. Suddenly, they see broken characters or the file does not open at all.

Why this happens:

Mac and Linux use LF (line feed) for new lines
Windows uses CRLF (carriage return + line feed)
Some Windows applications add a BOM (Byte Order Mark) that breaks Unix tools

If you are experiencing this on Windows, you may need to adjust the language for non-Unicode programs in Windows settings to match the encoding of your file.
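To check which of these differences a particular file has, you can inspect its raw bytes. The following is a rough sketch using only the standard library; describe_bytes and the sample data are made up for illustration.

```python
import codecs

def describe_bytes(data):
    """Report UTF-8 BOM presence and line-ending counts for raw file bytes."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf   # LF characters not preceded by CR
    return {
        "utf8_bom": data.startswith(codecs.BOM_UTF8),
        "crlf": crlf,
        "lf": lf_only,
        "mixed": crlf > 0 and lf_only > 0,
    }

# Sample bytes: a BOM, one Windows line ending, one Unix line ending
sample = codecs.BOM_UTF8 + b"name,city\r\nAna,Lima\n"
print(describe_bytes(sample))
# {'utf8_bom': True, 'crlf': 1, 'lf': 1, 'mixed': True}
```

For a real file, pass in the result of open(path, 'rb').read().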

Symptom 5: CSV or JSON Imports Fail Unexpectedly

You have a CSV file. It imports correctly when the file is small. But when the file is large, the import fails with an encoding error.

According to a Metabase issue report, a CSV file imported correctly when saved as UTF-8-BOM but failed when saved as plain UTF-8. One common cause of failures like this is that the encoding detector samples only the beginning of the file. If those bytes look normal but later bytes contain different characters, the guess is wrong.
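When a large file fails but a small one works, it helps to find exactly where the first bad byte is. The sketch below decodes the whole content strictly as UTF-8 and reports the offset of the first invalid sequence; first_invalid_utf8 is a hypothetical helper and the sample data is synthetic.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None

# 8,000 clean ASCII bytes, then one name encoded as Windows-1252
clean_head = b"id,name\n" * 1000
sample = clean_head + "José".encode("cp1252") + b"\n"
print(first_invalid_utf8(sample))        # 8003
print(first_invalid_utf8(clean_head))    # None
```

A detector that samples only the first few kilobytes would see nothing unusual in this data, which matches the symptom described above.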

Key takeaway: Identify your symptom above, then jump to the relevant detection method below.

Method 1: Detect Encoding Using Command Line (Fast and Built-in)

If you are on a Linux or Mac computer, you already have tools to detect encoding. You do not need to install anything.

The file Command on Linux and macOS

Open your terminal and run this command:

Bash

file -i filename.txt

Example output:

text

filename.txt: text/plain; charset=utf-8

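The -i flag tells the file command to show the MIME type and character encoding. This is the GNU/Linux spelling; on macOS the equivalent flag is uppercase -I (file -I filename.txt). The command works by sampling the beginning of your file and looking for byte patterns, so the result is an educated guess.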

Limitation: The file command can only detect encoding. It cannot find invisible characters like zero width spaces, bidirectional overrides, or homoglyphs.

Using chardetect (Python-based)

If you have Python installed, you can use the chardet library. Install it first:

Bash

pip install chardet

Then run:

Bash

chardetect mystery_file.csv

Example output:

text

mystery_file.csv: utf-8 with confidence 0.99

The confidence score tells you how sure the detector is. A score of 0.99 means 99% confidence.

Quick Comparison: Which Method to Use When

Scenario	Best Tool	Why
Quick check on a Linux server	file -i	Built into the system, no installation needed
You are already using Python	chardetect	Gives you a confidence score
You need the highest accuracy	chardetect	99.3% accuracy on standard test files
You are on Windows without Python	Try online detector or install WSL	Windows does not have a built-in encoding command

Fact 1: The chardet library version 7.4.0 achieves 99.3% accuracy on a test set of 2,517 files. It is the most accurate encoding detector available today. (Source)

Method 2: Detect Encoding Using Python Libraries (Most Accurate)

If you are a developer or you work with many files, you should use Python libraries. They give you more control and better accuracy.

Using chardet (99.3% Accuracy)

Python

import chardet

with open('unknown_file.txt', 'rb') as file:
    raw_data = file.read()
    result = chardet.detect(raw_data)
    print(f"Encoding: {result['encoding']}")
    print(f"Confidence: {result['confidence']}")

The file must be opened in binary mode ('rb'). This tells Python to read raw bytes, not try to decode them.

Why choose chardet: In the benchmark table below, chardet 7.4.0 scores 11.1 percentage points higher than chardet 6.0.0 and 13.9 percentage points higher than the alternative library charset-normalizer.

Fact 2: chardet 7.4.0 processes 551 files per second on a standard computer. That is 47 times faster than older versions. (Source)

Using charset-normalizer (Alternative Approach)

Python

from charset_normalizer import detect

with open('unknown_file.txt', 'rb') as file:
    result = detect(file.read())
    print(f"Encoding: {result['encoding']}")
    print(f"Confidence: {result['confidence']}")

The main difference is that charset-normalizer also tries to detect the language of the text. This can be useful if your file contains mixed languages.

Fact 3: charset-normalizer 3.4.6 achieves 85.4% accuracy, compared to chardet 7.4.0 at 99.3%. For most users, chardet remains the better choice. (Source)

Complete Accuracy and Speed Benchmark Table

Detector	Accuracy	Speed (files/second)	Memory Usage
chardet 7.4.0 (mypyc version)	99.3%	551 files/sec	52.9 MB
chardet 6.0.0	88.2%	12 files/sec	29.5 MB
charset-normalizer 3.4.6	85.4%	376 files/sec	78.8 MB
cchardet 2.1.19	55.9%	2,005 files/sec	Not available

Key takeaway: If you need maximum accuracy, use chardet. If you need maximum speed and can tolerate lower accuracy, use cchardet.

Method 3: Detect Invisible and Dangerous Characters (Security Focus)

This section is extremely important. Most guides about encoding detection completely ignore invisible characters and security threats. But these problems can break your code or even allow hackers to attack your system.

What Are Zero-Width Spaces? (U+200B)

A zero-width space is exactly what it sounds like – a space character that takes up no visual width. You cannot see it, but your computer can. This makes it very dangerous.

How it breaks your code:

Python

user_input = "admin"           # Normal string
database_value = "admin\u200b" # Contains zero-width space at the end

if user_input == database_value:
    print("Access granted")
else:
    print("Access denied")     # This runs, even though both look the same

Where zero-width spaces hide:

Text copied from PDF files
Messages from messaging apps
Data pasted from web pages
User input in web forms

How to detect them: Use the straudit tool. We will cover this below.

BiDi Override Characters and Trojan Source Attacks (CVE-2021-42574)

This is a real security vulnerability discovered by researchers at Cambridge University. It affects almost every programming language.

What is a BiDi override? BiDi stands for bidirectional text. Normally, English text goes left to right, and Arabic or Hebrew text goes right to left. BiDi override characters force the text direction to change, even in the middle of a line.

How attackers use this: They can write code that looks safe in a code review but actually does something completely different when compiled.

The attack example:

text

What the reviewer sees:
/* if (isAdmin) { */ start_admin_privileges(); /* } */

What the compiler actually runs:
start_admin_privileges();

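The comment appears to wrap the dangerous function call. But hidden BiDi characters reorder how the line is displayed, so the order you see on screen is not the order of the characters the compiler reads. Text that looks like it sits inside a comment can actually be live code.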

Fact 4: CVE-2021-42574 (Trojan Source vulnerability) affects all programming languages that support Unicode. The vulnerability was discovered by researchers at the University of Cambridge. (Source)

How to detect BiDi overrides:

Bash

straudit --security source_code.py

The tool will warn you if it finds any BiDi override characters.
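If you want a quick check without installing anything, the BiDi embedding, override, and isolate characters can also be found with a plain scan. This is a minimal standard-library sketch, not a replacement for a full audit; find_bidi_controls is a hypothetical name and the sample string is contrived.

```python
# BiDi control characters associated with Trojan Source (CVE-2021-42574)
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source):
    """Return (line number, column, abbreviation) for each BiDi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                hits.append((line_no, column, BIDI_CONTROLS[char]))
    return hits

code = 'access = "user"\n# check \u202e admin \u2066\n'
print(find_bidi_controls(code))
# [(2, 9, 'RLO'), (2, 17, 'LRI')]
```

Legitimate right-to-left text can contain these characters too, so treat a hit in source code as something to review, not as proof of an attack.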

Homoglyphs – When “a” is Not Actually “a”

A homoglyph is a character that looks like another character but has a different Unicode code point.

Real example:

Latin letter a = U+0061
Cyrillic letter а = U+0430
Greek letter α = U+03B1

All three look almost identical. But a computer sees them as completely different characters.

Attack scenario: An attacker registers the domain pаypal.com using the Cyrillic а. They send you a phishing email. You click the link. The domain looks exactly like the real PayPal domain. But it is a fake website that steals your password.

How to detect homoglyphs:

Bash

straudit --explain suspicious_text.txt

The tool will show you each character, its Unicode code point, and its script (Latin, Cyrillic, Greek, etc.). Mixed scripts are highlighted as warnings.
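The same idea can be approximated in plain Python. The standard library does not expose a character's script directly, so this sketch assumes the first word of the Unicode character name (LATIN, CYRILLIC, GREEK) is a good enough stand-in. That is a heuristic, and scripts_used is a hypothetical helper.

```python
import unicodedata

def scripts_used(text):
    """Approximate the scripts in text from the first word of each character name."""
    scripts = set()
    for char in text:
        if char.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(char, "UNKNOWN").split()[0])
    return scripts

real = "paypal"
fake = "p\u0430ypal"          # second letter is Cyrillic а (U+0430)
print(scripts_used(real))     # {'LATIN'}
print(len(scripts_used(fake)) > 1)   # True: Latin and Cyrillic are mixed
```

A word that mixes scripts is not always malicious, but in a domain name or identifier it is worth a closer look.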

Using straudit to Detect All Issues in One Pass

straudit is a single command line tool that checks everything – encoding, invisible characters, security issues, line endings, and more.

Complete file audit:

Bash

straudit my_file.txt

What it checks in one command:

File encoding (with confidence score)
BOM (Byte Order Mark) presence
BiDi override characters (Trojan Source attacks)
Homoglyphs (Cyrillic, Greek lookalikes)
Mixed scripts (Latin + Cyrillic + Greek in same file)
Zero-width spaces and joiners
Mixed line endings (LF vs CRLF)
Non-printable and control characters

Comparison: straudit vs other tools

Feature	straudit	file command	chardet	cat -v
Encoding detection	Yes	Yes	Yes	No
Invisible characters	Yes	No	Partial	No
BiDi attack detection	Yes	No	No	No
Homoglyph detection	Yes	No	No	No
Mixed script detection	Yes	No	No	No
Line ending check	Yes	No	No	No

Method 4: Fix Encoding Issues Once Detected

Now that you have detected the problem, here is how to fix it.

Converting Files with iconv

iconv is a command line tool available on Linux, Mac, and Windows (through WSL). It converts files from one encoding to another.

Basic syntax:

Bash

iconv -f from_encoding -t to_encoding input.txt > output.txt

Real world examples:

Bash

# Convert Windows-1252 to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 corrupted_file.csv > fixed_file.csv

# Convert ISO-8859-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 european_text.txt > utf8_text.txt

# Convert UTF-16 to UTF-8
iconv -f UTF-16 -t UTF-8 japanese_file.txt > fixed.txt

If you need to understand the differences between these encoding formats, our guide on encoding systems: Unicode UTF-8 vs UTF-16 explains them in detail.
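Where iconv is not available, the same conversion can be sketched in Python. The example below assumes you already know the source encoding; convert_encoding and the file names are illustrative, and the sample file is created in a temporary directory just for the demonstration.

```python
import tempfile
from pathlib import Path

def convert_encoding(source, target, from_encoding, to_encoding="utf-8"):
    """Re-encode a text file, failing loudly instead of guessing on bad bytes."""
    raw = Path(source).read_bytes()
    text = raw.decode(from_encoding)      # strict decoding raises on invalid bytes
    Path(target).write_bytes(text.encode(to_encoding))

# Demonstration with a throwaway file written as Windows-1252
workdir = Path(tempfile.mkdtemp())
legacy = workdir / "legacy_sample.csv"
fixed = workdir / "fixed_sample.csv"
legacy.write_bytes("name\nHernández\n".encode("cp1252"))

convert_encoding(legacy, fixed, "cp1252")
print(fixed.read_text(encoding="utf-8"))
```

Working with bytes on both sides keeps the original line endings untouched, which mirrors what iconv does.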

Removing BOM from UTF-8 Files

The Byte Order Mark (BOM) is a special character at the start of some UTF-8 files. It helps Windows applications identify the file as UTF-8. But it breaks Unix tools, JSON parsers, and many web applications.

The BOM problem in real life: According to a Metabase support issue, a CSV file imported correctly when saved as UTF-8-BOM, but failed completely when saved as plain UTF-8. The solution was to always use UTF-8-BOM for CSV files that will be imported into Metabase.

How to remove BOM using Python:

Python

# Read the file with utf-8-sig, which strips a leading BOM if one is present
# newline='' keeps the original line endings unchanged
with open('file_with_bom.csv', 'r', encoding='utf-8-sig', newline='') as source:
    content = source.read()

# Write the content back as plain UTF-8, without a BOM
with open('file_without_bom.csv', 'w', encoding='utf-8', newline='') as target:
    target.write(content)

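Tip: When reading, Python's utf-8-sig encoding strips a leading BOM if one is present and otherwise behaves like utf-8. When writing, it always adds a BOM. Read with utf-8-sig when you are unsure whether a BOM is present, and write with plain utf-8 when the BOM must go.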

Fixing Mojibake with ftfy

ftfy (Fixes Text For You) is a Python library specifically designed to fix mojibake and other Unicode problems.

Installation:

Bash

pip install ftfy

Usage:

Python

from ftfy import fix_text

garbled_text = "Ã©tÃ©"  # This should be "été"
fixed_text = fix_text(garbled_text)
print(fixed_text)  # Output: été

Another example:

Python

from ftfy import fix_text

garbled = "HernÃ¡ndez"
fixed = fix_text(garbled)
print(fixed)  # Output: Hernández

Using clean-text Library

The clean-text library includes a specific function called fix_bad_unicode().

Installation:

Bash

pip install clean-text

Usage:

Python

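from cleantext import clean

text = "HernÃ¡ndez"
# clean() lowercases and transliterates to ASCII by default, so turn both off
cleaned = clean(text, fix_unicode=True, to_ascii=False, lower=False)
print(cleaned)  # Output: Hernández

Tip: The fix_unicode parameter repairs common mojibake patterns automatically. Set to_ascii=False and lower=False if you want accents and capitalization preserved.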

Unicode Normalization (NFC vs NFD)

Some characters can be written in two different ways in Unicode. For example, the letter é can be:

A single character: U+00E9 (called NFC – Normalization Form Composed)
Two characters: e (U+0065) + combining accent (U+0301) (called NFD – Normalization Form Decomposed)

Both look the same, but computers see them as different. This breaks string searches, sorting, and comparisons. For a deeper understanding of why this happens, read our guide on what is Unicode and non-Unicode.

How to normalize in Python:

Python

import unicodedata

text = "café"  # May contain composed or decomposed characters

# Normalize to composed form (NFC) – use this for storage
normalized_nfc = unicodedata.normalize('NFC', text)

# Normalize to decomposed form (NFD) – sometimes needed for comparisons
normalized_nfd = unicodedata.normalize('NFD', text)

Recommendation: Always normalize text to NFC before storing it in a database. This ensures consistent comparisons.
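To see why normalization matters for comparisons, here is a small sketch that builds both forms of "café" explicitly with escape sequences. The helper name same_text is an illustration, not a standard function.

```python
import unicodedata

composed = "caf\u00e9"       # é as a single code point
decomposed = "cafe\u0301"    # e followed by a combining acute accent

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 4 5

def same_text(left, right):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

print(same_text(composed, decomposed))  # True
```

Normalizing at the point where text enters your system means later comparisons and lookups do not each need this extra step.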

Platform-Specific Encoding Problems (And Their Fixes)

Different platforms handle encoding differently. Here are the most common problems and their solutions.

CSV Import Encoding Problems

The problem: You try to import a CSV file. When the file is small, it works. When the file is large, you get Chinese characters or question marks.

Why this happens: The system only reads the first few bytes of your file to guess the encoding. If those first bytes look like UTF-8, it assumes the whole file is UTF-8. But later bytes might be in a different encoding.

Solution using Python:

Python

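# Read the CSV; utf-8-sig strips an existing BOM so it is not duplicated
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as source_file:
    content = source_file.read()

# Write it back as UTF-8 with a BOM, which importers such as Excel look for
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as target_file:
    target_file.write(content)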

Tip for CSV files: Always save CSV files as UTF-8-BOM (not plain UTF-8) when you know the file will be imported into systems like Metabase, Excel, or Power BI.

JSON UnicodeDecodeError

The error message:

text

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 123: character maps to <undefined>

The fix: Always specify encoding='utf-8' when opening JSON files.

Python

import json

# Correct way
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Wrong way (causes UnicodeDecodeError on Windows)
with open('data.json', 'r') as file:
    data = json.load(file)

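Why this works: On Windows, Python's default encoding for open() is usually the system code page (often Windows-1252), not UTF-8. Without an explicit encoding, Python decodes the UTF-8 bytes with the wrong map and fails on bytes that the code page does not define, such as 0x9d. So you must explicitly tell Python to use UTF-8.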

Windows Console UTF-8 Display Issues

The problem: You print UTF-8 text in Windows Command Prompt, and you see garbled characters like ├® instead of é.

The fix: Change the console code page to UTF-8.

cmd

chcp 65001

What is code page 65001? It is Windows’ identifier for UTF-8. After running this command, the console will correctly display UTF-8 characters.

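Permanent fix: chcp 65001 only applies to the current console session. For Python, set PYTHONUTF8=1 in your Windows system environment variables. This turns on Python's UTF-8 mode, which makes UTF-8 the default for file and console input and output.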
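To confirm what Python is actually using on a given machine, you can print its current encoding settings. This is a small diagnostic sketch using the standard library; encoding_report is a hypothetical helper and the values it prints depend on your system.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python is currently using on this machine."""
    return {
        "stdout": sys.stdout.encoding,                       # what print() encodes to
        "filesystem": sys.getfilesystemencoding(),           # used for file names
        "default_open": locale.getpreferredencoding(False),  # open() without encoding=
        "utf8_mode": bool(sys.flags.utf8_mode),              # True with PYTHONUTF8=1 or -X utf8
    }

for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

If default_open shows something like cp1252, any open() call without an explicit encoding will use that code page.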

What to Do When Encoding Detection Fails (Fallback Methods)

Sometimes, even the best tools cannot detect the encoding with high confidence. Here is why and what to do.

Why Detection Sometimes Fails (Encoding Ambiguity)

Different encodings can decode the same bytes in valid ways. For example, the byte 0x80 means:

In Windows-1252: Euro symbol (€)
In ISO-8859-1: Unused control character
In some other encodings: A different symbol

The detector has to guess. Sometimes the guess is wrong.

Fact 5: ISO-8859-1 and Windows-1252 are identical for 224 of the 256 possible byte values (87.5%). They differ only in the 32 values from 0x80 to 0x9F. This is why detectors often return low confidence scores for files that could be either encoding. (Source)
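A quick way to see exactly where the two encodings diverge is to decode every possible byte with both and compare the results. The sketch below uses Python's built-in cp1252 and latin-1 codecs; differing_bytes is a hypothetical helper.

```python
def differing_bytes(first="cp1252", second="latin-1"):
    """List byte values that the two single-byte encodings decode differently."""
    differing = []
    for value in range(256):
        byte = bytes([value])
        decoded = []
        for encoding in (first, second):
            try:
                decoded.append(byte.decode(encoding))
            except UnicodeDecodeError:
                decoded.append(None)   # byte is undefined in this encoding
        if decoded[0] != decoded[1]:
            differing.append(value)
    return differing

diff = differing_bytes()
print(len(diff), hex(min(diff)), hex(max(diff)))   # 32 0x80 0x9f
print(bytes([0x80]).decode("cp1252"))              # €
```

If a file contains no bytes in that range, the two encodings give identical text and the choice between them does not matter.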

Manual Detection by Byte Patterns

When auto-detection fails, look for Byte Order Marks (BOMs). A BOM is a special byte sequence at the very start of a file.

Common BOM signatures in hexadecimal:

Encoding	BOM Signature (Hex)
UTF-8 with BOM	EF BB BF
UTF-16 Little Endian	FF FE
UTF-16 Big Endian	FE FF
UTF-32 Little Endian	FF FE 00 00
UTF-32 Big Endian	00 00 FE FF

How to view the hex of a file:

Bash

xxd filename.txt | head -n 5

The first line shows the first 16 bytes of your file. If you see ef bb bf at the beginning, your file is UTF-8 with BOM.
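The manual check above can be turned into a small fallback routine: look for a BOM first, then try a short list of candidate encodings in strict mode. This is only a sketch under stated assumptions. The candidate list ("utf-8", then "cp1252") and the name guess_encoding are choices made for illustration, and a successful decode does not prove the guess is right.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def guess_encoding(data, fallbacks=("utf-8", "cp1252")):
    """Return a codec name based on a BOM, else the first fallback that decodes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for name in fallbacks:
        try:
            data.decode(name)   # strict decoding: any invalid byte raises
            return name
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding(b"\xef\xbb\xbfname,city"))   # utf-8-sig
print(guess_encoding(b"\xff\xfeh\x00i\x00"))      # utf-16
print(guess_encoding(b"caf\xe9"))                 # cp1252
```

UTF-8 is tried before Windows-1252 because valid UTF-8 is rarely produced by accident, while almost any byte sequence decodes as Windows-1252.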

Automating Encoding Detection (CI/CD Integration)

If you maintain a codebase or a data pipeline, you should automate encoding detection to prevent problems before they happen.

Using straudit --check in CI Pipelines

The --check mode makes straudit exit with an error code if any issue is found. This is perfect for CI/CD (Continuous Integration / Continuous Deployment).

GitHub Actions example:

YAML

- name: Check for encoding and security issues
  run: |
    pip install straudit
    straudit --check src/**/*.py
  # If straudit finds any issue, it exits with code 1
  # This fails the build, preventing bad files from entering your codebase

Why you should do this: It prevents Trojan Source attacks, invisible character bugs, and encoding-related crashes from ever reaching production.

JSON Output for Automation

Bash

straudit -o json my_file.txt

The output is valid JSON that you can parse with other tools.

Use cases:

Send alerts to monitoring systems
Log encoding issues for audit trails
Integrate with Python, Node.js, or any language that can parse JSON

Comparison of Encoding Detection Tools

Here is a complete comparison to help you choose the right tool for your needs.

Tool	Best For	Accuracy	Finds Invisible Characters?	Finds Security Issues?	Works in CI/CD?	Platform
chardet	Highest accuracy encoding detection	99.3%	No	No	Yes	Python 3.10 and above
straudit	Complete security + encoding audit	Good	Yes	Yes	Yes	Python (standard library only)
file -i	Quick command line check	Medium	No	No	No	Linux, macOS
charset-normalizer	Language detection + encoding	85.4%	No	No	Yes	Python
iconv	Converting files	N/A	No	No	No	Cross-platform (command line)

Recommendation from unicode-to-nonunicode.com: Use chardet when you only need encoding detection. Use straudit when you also need security auditing (invisible characters, BiDi overrides, homoglyphs, mixed scripts). If you work with enterprise systems, you may also want to understand how encoding affects platforms like SAP or SQL Server.

Frequently Asked Questions (FAQ)

Q1: How do I know if my file is UTF-8 encoded?

A: On Linux, run file -i filename.txt (on macOS, use file -I filename.txt). On Windows, use chardetect filename.txt (you need to install chardet first). Both tools report the detected encoding; chardetect also gives you a confidence score.

Q2: What is the difference between UTF-8 and UTF-8 with BOM?

A: UTF-8 with BOM includes a special 3-byte signature (EF BB BF) at the start of the file. This helps Windows applications identify the file as UTF-8. However, the BOM breaks Unix tools, many JSON parsers, and some web applications. According to a Metabase issue report, some systems require UTF-8-BOM while others require plain UTF-8. Test your specific platform.

Q3: Why do I see the � (replacement character) in my text?

A: The � character (Unicode code point U+FFFD) appears when your text viewer meets a byte sequence that is not valid in the encoding it is using. In other words, your file is saved in a different encoding than your viewer expects. Use chardet to detect the correct encoding, then convert the file using iconv.

Q4: Can invisible characters affect code execution?

A: Yes, absolutely. Zero-width spaces break string comparisons and password validation. BiDi override characters (CVE-2021-42574) can completely change how code executes while making it look safe in a code review. The Trojan Source vulnerability affects all major programming languages. Always scan your code files with tools like straudit before committing them.

Q5: What is the most accurate encoding detector?

A: According to the official benchmark, chardet version 7.4.0 achieves 99.3% accuracy on a test set of 2,517 files. This is the highest accuracy among all detectors. The next closest is charset-normalizer at 85.4%. Use chardet for maximum accuracy.

Q6: How do I fix a CSV file that shows Chinese characters when I import it?

A: This usually happens because your CSV file is saved as plain UTF-8, but the import system expects UTF-8-BOM. Open your CSV file in a text editor like Notepad++, go to Encoding menu, select “UTF-8-BOM” (not just “UTF-8”), save the file, and try importing again. Alternatively, use the Python utf-8-sig encoding as shown in the CSV section above.

Q7: My Python code works on Mac but breaks on Windows. Why?

A: Windows uses a different default system encoding (usually Windows-1252) while Mac and Linux use UTF-8. Always specify encoding='utf-8' when opening files in Python. Example: open('file.txt', 'r', encoding='utf-8'). This makes your code work the same way on all platforms.
