Common Problems in Unicode Conversion and Their Solutions

Unicode conversion is one of those invisible infrastructure tasks that nobody thinks about — until it breaks. A garbled customer name in a database. Reversed Urdu text on a wedding card. Question marks replacing Telugu conjunct characters in a newspaper headline. A crashed SSIS pipeline blocking an entire data warehouse refresh. These are not hypothetical scenarios — they are daily realities for thousands of professionals working across Unicode and non-Unicode systems.

In this comprehensive guide, we catalog the most common problems in Unicode conversion and provide actionable, tested solutions for each one. Whether you are a developer, designer, data analyst, or content manager, this reference will help you diagnose and fix encoding issues before they cost you time, money, or credibility.

Problem 1: Question Marks (?) Replacing Characters After Conversion

What Happens

You convert Unicode text to a non-Unicode format and some or all characters turn into question marks (?). This is the single most common Unicode conversion complaint.

Root Cause

The target non-Unicode code page does not contain the Unicode characters you are converting. Single-byte code pages like Windows-1252 can represent at most 256 characters. When a character from Unicode (which defines more than 154,000) has no equivalent in the target code page, the conversion routine substitutes a question mark as a fallback.
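
The fallback behavior is easy to reproduce. The following is a minimal Python sketch, not the code of any particular converter: the function names to_code_page and count_lost are hypothetical, and cp1252 is only an assumed target.

```python
def to_code_page(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text the way a lossy converter does: unmappable characters become b'?'."""
    return text.encode(codepage, errors="replace")


def count_lost(text: str, codepage: str = "cp1252") -> int:
    """Count characters that have no slot in the target code page."""
    lost = 0
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost += 1
    return lost


# "Café" plus a short Telugu word, written with escapes so the sample is unambiguous
sample = "Caf\u00e9 \u0c05\u0c2e\u0c4d\u0c2e"
print(to_code_page(sample))
print(count_lost(sample), "character(s) would be lost")
```

Running count_lost before a conversion gives an early warning: any non-zero result means question marks will appear in the output.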

Solution

Immediate Fix: Identify which characters are being replaced and check if they exist in the target code page. Use a code page reference chart or the Windows Character Map utility.

Permanent Fix: Switch your target system to UTF-8 encoding wherever possible. Modern applications (browsers, databases, operating systems) all support UTF-8 natively. Reserve non-Unicode conversion only for legacy systems that absolutely require it.

For Indian Languages: If you need to convert Telugu, Hindi, Urdu, or Kannada text to legacy non-Unicode fonts (Anu Script, Kruti Dev, Nudi, InPage), use a specialized conversion tool that maps Unicode code points to the correct glyph positions in the target font. Generic code page conversion will not work for these scripts. The Unicode to Non Unicode Converter at unicode-to-nonunicode.com provides script-specific character mapping.

Problem 2: Text Reversal in Right-to-Left Scripts

What Happens

Urdu, Arabic, Hebrew, or Persian text appears backwards after conversion. “سلام” becomes “م ل ا س” or “ما را” reads as “ار ما.”

Root Cause

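Unicode handles bidirectional text through the Unicode Bidirectional Algorithm, which relies on each character's built-in direction property and, where needed, on special control characters (Right-to-Left Mark U+200F, Left-to-Right Mark U+200E, Right-to-Left Override U+202E). Non-Unicode code pages, and the legacy applications built around them, generally do not implement this algorithm or carry these control characters. During conversion, the bidirectional information is lost, and the text renders in the default left-to-right direction.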

Solution

Step 1: Before conversion, identify RTL sections in your document.

Step 2: Add RTL direction markers that the target application can understand. In HTML, use <span dir="rtl">...</span>. In Word, use the right-to-left text direction button.

Step 3: After conversion to non-Unicode, manually verify RTL rendering in the target application.

Step 4: For InPage Urdu documents, the converter tool should handle RTL directionality automatically. If not, use InPage’s built-in right-to-left text direction setting.
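
Step 1 can be partly automated. The sketch below is one possible approach, assuming the document is available as a Python string. The helper names has_rtl and rtl_lines are hypothetical. It flags lines containing characters whose Unicode bidirectional class is strongly right-to-left ("R" for Hebrew, "AL" for Arabic-script letters).

```python
import unicodedata

# Strong right-to-left bidirectional classes defined by Unicode
RTL_CLASSES = {"R", "AL"}


def has_rtl(text: str) -> bool:
    """True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def rtl_lines(document: str):
    """Return (line_number, line) pairs for lines that contain RTL characters."""
    return [
        (number, line)
        for number, line in enumerate(document.splitlines(), start=1)
        if has_rtl(line)
    ]


for number, line in rtl_lines("Invoice 42\n\u0633\u0644\u0627\u0645\nTotal"):
    print(f"line {number} needs an RTL check")
```

The flagged line numbers tell you where to apply direction markers in Step 2 and what to verify in Step 3.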

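Pro Tip: When copying text from a Unicode source (Word, browser) to a non-Unicode target (InPage, legacy DTP), paste into a plain text editor (Notepad++) first, then copy from Notepad++ to the target application. This strips rich-text formatting and font information that can cause unexpected rendering. Note that invisible Unicode direction controls are ordinary characters and may survive this round trip, so check for them if direction changes persist.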
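
If hidden direction controls are suspected, they can also be removed explicitly before pasting. This is a minimal sketch, with strip_bidi_controls as a hypothetical helper name. It deletes the Unicode marks, embeddings, overrides, and isolates listed in the constant, and leaves all other characters untouched.

```python
# Invisible Unicode direction controls: marks, embeddings, overrides, isolates
BIDI_CONTROLS = dict.fromkeys(
    map(ord, "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")
)


def strip_bidi_controls(text: str) -> str:
    """Return text with all direction control characters deleted."""
    return text.translate(BIDI_CONTROLS)


cleaned = strip_bidi_controls("\u202eabc\u202c")
print(repr(cleaned))
```

Removing these characters also removes any intentional direction hints, so verify the rendering in the target application afterwards.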

Problem 3: SSIS “Cannot Convert Between Unicode and Non-Unicode String Data Types”

What Happens

Your SSIS package fails at validation with the error: “Column X cannot convert between unicode and non-unicode string data types.” The entire data flow is blocked.

Root Cause

SSIS is strict about data type matching. Your source component (Excel, Flat File, XML) outputs Unicode strings (DT_WSTR), while your destination (SQL Server table, text file) expects non-Unicode strings (DT_STR). SSIS does not perform this conversion automatically.

Solution

Fix A — Data Conversion Transformation (Easiest):

  1. Open your SSIS package in Visual Studio
  2. Drag a Data Conversion Transformation onto the Data Flow canvas
  3. Place it between the source and destination components
  4. Open the editor, select the Unicode columns, and set the output type to DT_STR
  5. Specify the correct code page (1252 for English/Western European)
  6. Map the converted output columns in the destination

Fix B — Derived Column Transformation (More Control):

  1. Add a Derived Column Transformation to your data flow
  2. Create an expression for each column: (DT_STR, 100, 1252)[ColumnName]
  3. Map the derived columns to the destination

Fix C — SQL CAST in Source Query (Best Performance):
Change the source’s Data Access Mode to “SQL Command” and cast columns directly:

SELECT
    CAST(CustomerName AS VARCHAR(100)) AS CustomerName,
    CAST(Email AS VARCHAR(255)) AS Email
FROM dbo.Customers

Fix D — Change Destination to Unicode (Most Future-Proof):
Alter your destination table columns to NVARCHAR instead of VARCHAR. This eliminates the conversion need entirely and prepares your database for global content.

For more detailed SSIS troubleshooting, see our guide on Unicode to Non-Unicode conversion errors.

Problem 4: SQL Server Implicit Conversion Killing Query Performance

What Happens

Your queries run slowly — much slower than expected. Execution plans show “Index Scan” instead of “Index Seek.” A simple lookup that should take milliseconds takes seconds.

Root Cause

When you compare or join a VARCHAR (non-Unicode) column with an NVARCHAR (Unicode) value or parameter, data type precedence forces SQL Server to convert the VARCHAR side to NVARCHAR. If that side is the indexed column, every row has to be converted before it can be compared. This is called implicit conversion, and it prevents the query optimizer from using indexes efficiently.

Solution

Fix 1 — Use the N Prefix for Unicode Literals:

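-- ❌ BAD: Without N the literal is VARCHAR; characters outside the database code page
-- turn into '?', and the value must be implicitly converted to match an NVARCHAR column
SELECT * FROM Customers WHERE City = 'حیدرآباد'

-- ✅ GOOD: N'...' is already NVARCHAR, so it matches an NVARCHAR column and allows an index seek
SELECT * FROM Customers WHERE City = N'حیدرآباد'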

Fix 2 — Match Parameter Types to Column Types:
If your stored procedure accepts a @Name VARCHAR(50) parameter but the column is NVARCHAR(50), change the parameter type:

CREATE PROCEDURE SearchCustomer
    @Name NVARCHAR(50)  -- Match the column type
AS
BEGIN
    SELECT * FROM Customers WHERE CustomerName = @Name
END

Fix 3 — Find and Fix Existing Implicit Conversions:

  1. Open SQL Server Management Studio
  2. Enable Actual Execution Plan (Ctrl + M)
  3. Run your query
  4. Look for yellow warning triangles on SELECT, JOIN, or WHERE operations
  5. Hover over the warning — it will say “Type conversion in expression may affect CardinalityEstimate”
  6. Fix the data type mismatch in your query or table schema

Fix 4 — Use COLLATE Explicitly:
If you must compare different collations, use the COLLATE clause:

SELECT * FROM Products
WHERE ProductName COLLATE Latin1_General_CI_AS = N'Special Item'

Impact

Fixing implicit conversions on large tables (millions of rows) can improve query performance by 10x to 100x. The storage cost difference between VARCHAR and NVARCHAR (1 byte vs. 2 bytes per character) is negligible compared to the performance cost of a full table scan.

Problem 5: SAP Unicode Conversion Errors During S/4HANA Migration

What Happens

During SAP S/4HANA migration, the Unicode conversion phase fails with errors related to custom ABAP programs, cluster tables, or third-party integrations.

Root Cause

SAP S/4HANA requires a Unicode-enabled system. If your existing SAP ERP system runs on a non-Unicode code page configuration, a full Unicode conversion must be completed before migration. This conversion affects every layer of the SAP stack — database, ABAP runtime, custom code, and interfaces.

Solution

Step 1 — Run UCCHECK:
Execute the UCCHECK transaction to scan all custom ABAP programs for Unicode compliance. Fix all reported errors before proceeding.

Step 2 — Clean Cluster Tables:
Run the SDBI_CLUSTER_CHECK report to identify and clean up cluster tables that may cause conversion failures.

Step 3 — Delete Match Code IDs:
Execute report TWTOOL01 to remove Match Code IDs that are not Unicode-compatible.

Step 4 — Set Up Sandbox:
Create a sandbox environment and perform the full Unicode conversion there first. Test every business process, custom report, and third-party integration.

Step 5 — Handle Open Dataset Encoding:
Review all ABAP OPEN DATASET statements and update the encoding parameter:

OPEN DATASET lv_file FOR OUTPUT IN TEXT MODE ENCODING UTF-8.

Step 6 — Execute Production Conversion:
Using R3load, perform the export/import conversion in your production system during a planned maintenance window.

Step 7 — Verify Third-Party Products:
Confirm that all integrated systems, interfaces, and add-ons are Unicode-compatible.

⚠ Critical Planning Note: Unicode conversion typically increases SAP database size by 30% to 70% because UTF-16 uses 2 bytes per character versus 1 byte for single-byte code pages. Plan your storage capacity accordingly.

Problem 6: Excel CSV Import Corrupting Non-English Text

What Happens

You export data from a system to CSV and open it in Excel. All non-English characters (Telugu, Hindi, Urdu, Chinese, Arabic) display as garbled text, question marks, or boxes.

Root Cause

Excel’s default behavior when opening a CSV file is to use the system’s default code page (usually Windows-1252 for English-language systems). It does not auto-detect UTF-8 encoding unless you explicitly import it through the Data → Get Data wizard.

Solution

Fix 1 — Use Excel’s Text Import Wizard:

  1. Open Excel (do NOT double-click the CSV file)
  2. Go to Data → Get Data → From Text/CSV
  3. Select your CSV file
  4. In the File Origin dropdown, choose 65001: Unicode (UTF-8)
  5. Click Load

Fix 2 — Add UTF-8 BOM to the CSV:
The Byte Order Mark (BOM) is a special character sequence (EF BB BF) at the start of a UTF-8 file that signals Excel to use UTF-8 encoding. Add it by:

  1. Opening the CSV in Notepad++
  2. Going to Encoding → Encode in UTF-8-BOM
  3. Saving the file
  4. Opening it in Excel (double-click now works correctly)
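
If the CSV is generated by a script, the BOM can be written at export time instead of being added by hand. The following is a minimal sketch using Python's standard csv module. The function name write_excel_friendly_csv is hypothetical, and the "utf-8-sig" codec is what emits the EF BB BF prefix.

```python
import csv


def write_excel_friendly_csv(path, rows):
    """Write rows as UTF-8 CSV with a leading BOM so Excel detects the encoding."""
    # "utf-8-sig" prepends the byte order mark; newline="" lets csv control line endings
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

Calling write_excel_friendly_csv("customers.csv", [["Name", "City"], ["అమ్మ", "حیدرآباد"]]) would produce a file that opens correctly on double-click, under the same assumption as the manual steps above.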

Fix 3 — Use Power Query:

  1. Open Excel → Data → Get Data → From File → From Text/CSV
  2. Select your file
  3. Set encoding to UTF-8 in the preview window
  4. Click Transform Data → Close & Load

For a complete walkthrough, see our Excel Unicode conversion guide.

Problem 7: Legacy Font Rendering Breaks After System Update

What Happens

After a Windows update, your Telugu Anu Script, Hindi Kruti Dev, or Urdu InPage fonts suddenly display incorrectly. Text that was fine yesterday now shows boxes, wrong characters, or missing glyphs.

Root Cause

Windows updates can change the “Language for non-Unicode programs” system locale, which affects how legacy applications interpret non-Unicode character byte values. The update may have reset this setting back to English (US), breaking the mapping for your Indian language fonts.

Solution

Step 1: Open Settings → Time & Language → Language & Region → Administrative Language Settings

Step 2: Click Change system locale under “Language for non-Unicode programs”

Step 3: Select the language matching your non-Unicode fonts (Telugu, Hindi, Urdu, Kannada, etc.)

Step 4: Check the box for “Beta: Use Unicode UTF-8 for worldwide language support” only if your applications support it (most legacy DTP tools do not)

Step 5: Click OK and restart your computer

Step 6: Verify that fonts display correctly in your target application

Pro Tip: After major Windows updates, always check this setting. It is the #1 cause of “my fonts broke after update” complaints.
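
A quick script can confirm which code page legacy programs currently see. This is a sketch rather than an official diagnostic: ansi_code_page is a hypothetical helper name. On Windows it calls the Win32 GetACP function through ctypes, and on other platforms it falls back to the locale's preferred encoding.

```python
import locale
import sys


def ansi_code_page() -> str:
    """Return the code page used for non-Unicode programs (best effort)."""
    if sys.platform == "win32":
        import ctypes

        # GetACP returns the active ANSI code page number, e.g. 1252
        return f"cp{ctypes.windll.kernel32.GetACP()}"
    # Non-Windows fallback: the locale's preferred text encoding
    return locale.getpreferredencoding(False)


print("Active legacy code page:", ansi_code_page())
```

Recording this value before a Windows update makes it easy to tell afterwards whether the system locale was reset.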

Problem 8: Python/Programming Language String Encoding Errors

What Happens

Your Python script crashes with:

UnicodeEncodeError: 'ascii' codec can't encode character '\u0c05' in position 0: ordinal not in range(128)

Or in JavaScript:

InvalidCharacterError: The string contains invalid characters

Root Cause

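Python 2 defaults to ASCII encoding, which only supports 128 characters. Any character outside this range (including all Indian language characters, emoji, and most non-English text) triggers an encoding error. In Python 3, strings are Unicode, but open() without an encoding argument typically falls back to the platform's locale encoding (often cp1252 on Windows), so file I/O still needs an explicit encoding. In browser JavaScript, InvalidCharacterError is typically raised by btoa(), which accepts only characters in the Latin-1 range.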

Solution

Python Fix:

# Always specify encoding when opening files
with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Placeholder: replace with your own conversion step
converted_text = text

# For writing
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(converted_text)

# For printing to console (Python 3.7+)
import sys
sys.stdout.reconfigure(encoding='utf-8')

JavaScript Fix:

// Ensure your HTML page declares UTF-8
// <meta charset="UTF-8">

// For Node.js file operations
const fs = require('fs');
const text = fs.readFileSync('data.txt', 'utf8');
const convertedText = text; // placeholder: apply your own conversion step here
fs.writeFileSync('output.txt', convertedText, 'utf8');

// For web APIs, set Content-Type header
// Content-Type: text/plain; charset=utf-8

Java Fix:

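import java.io.*;
import java.nio.charset.StandardCharsets;

// When reading files (StandardCharsets avoids UnsupportedEncodingException)
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8));

// When writing files
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream("output.txt"), StandardCharsets.UTF_8));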

Problem 9: Database Migration Character Set Collisions

What Happens

During a database migration (MySQL to PostgreSQL, Oracle to SQL Server, etc.), text columns with non-English content become corrupted. Accented characters, Indian script characters, and special symbols are lost or transformed into incorrect values.

Root Cause

Different database systems use different default character sets and collations. MySQL’s latin1, SQL Server’s SQL_Latin1_General_CP1_CI_AS, and PostgreSQL’s UTF8 all handle character encoding differently. During migration, if the target character set does not support the source characters, data is silently corrupted.

Solution

Step 1 — Audit Source Character Set:

-- MySQL
SHOW VARIABLES LIKE 'character_set%';

-- SQL Server
SELECT name, collation_name FROM sys.databases;

-- PostgreSQL
SELECT datname, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype FROM pg_database;

Step 2 — Ensure Target Supports All Source Characters:
Always use UTF-8 (or UTF-16 for SQL Server NVARCHAR) as the target character set.
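
When a non-Unicode target cannot be avoided, it helps to know in advance exactly which characters will not survive. The sketch below assumes the column values have already been fetched into a Python list. The helper name unsupported_chars and the sample rows are hypothetical.

```python
def unsupported_chars(rows, target_encoding):
    """Return the sorted, distinct characters in rows that target_encoding cannot store."""
    missing = set()
    for value in rows:
        for ch in value:
            try:
                ch.encode(target_encoding)
            except UnicodeEncodeError:
                missing.add(ch)
    return sorted(missing)


sample_rows = ["Jos\u00e9", "\u0c05\u0c2e\u0c4d\u0c2e"]  # a Latin name and a Telugu word
print("latin-1 cannot store:", unsupported_chars(sample_rows, "latin-1"))
print("utf-8 cannot store:", unsupported_chars(sample_rows, "utf-8"))
```

An empty result for the target encoding means the migration cannot lose characters at the encoding level. A non-empty result lists exactly what needs a wider character set.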

Step 3 — Export with Explicit Encoding:

# MySQL export
mysqldump --default-character-set=utf8mb4 dbname > dump.sql

# PostgreSQL export
pg_dump --encoding=UTF8 dbname > dump.sql

Step 4 — Import with Explicit Encoding:

# MySQL import
mysql --default-character-set=utf8mb4 dbname < dump.sql

# SQL Server import (via BCP)
bcp dbname.dbo.tablename in data.txt -c -C 65001 -T

Step 5 — Verify Character Integrity:
Run spot checks on known non-English data:

SELECT CustomerName FROM Customers WHERE CustomerID = 1234;
-- Verify the output matches the original
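
Spot checks can be scripted once the same rows have been read from both databases. This is a minimal sketch assuming each side is a dictionary keyed by primary key. The names same_text and find_mismatches are hypothetical. Values are compared after NFC normalization so that harmless differences in how accents are composed do not raise false alarms.

```python
import unicodedata


def same_text(original: str, migrated: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", original) == unicodedata.normalize("NFC", migrated)


def find_mismatches(source_rows: dict, target_rows: dict) -> list:
    """Return the keys whose migrated value is missing or differs from the source."""
    return [
        key
        for key, value in source_rows.items()
        if key not in target_rows or not same_text(value, target_rows[key])
    ]


source = {1: "Jos\u00e9", 2: "\u0c05\u0c2e\u0c4d\u0c2e"}
target = {1: "Jose\u0301", 2: "????"}  # row 2 was corrupted during migration
print("Rows to investigate:", find_mismatches(source, target))
```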

Problem 10: Copy-Paste Between Applications Destroys Text

What Happens

You copy Urdu or Telugu text from a web browser and paste it into Word, InPage, or a design tool. The pasted text is garbled, reversed, or shows placeholder boxes.

Root Cause

The Windows clipboard stores data in multiple formats simultaneously (plain text, rich text, HTML, Unicode text). When you paste, the receiving application chooses which format to use. If it chooses a non-Unicode format or the wrong code page, the text corrupts.

Solution

Fix 1 — Paste as Plain Text:
Instead of regular paste (Ctrl+V), use Paste Special → Unformatted Text or Ctrl+Shift+V (in many applications). Then apply the target font manually.

Fix 2 — Use an Intermediate Text Editor:

  1. Copy text from source
  2. Paste into Notepad++ (which preserves Unicode)
  3. Set encoding to UTF-8
  4. Copy from Notepad++ to target application

Fix 3 — Convert Before Pasting:
Use the Unicode to Non Unicode Converter to convert your text to the target non-Unicode font format first, then paste the converted result.

Quick Reference: Common Problems and Their Primary Fixes

Problem | Primary Solution
Question marks replacing characters | Use UTF-8 target encoding or specialized script converter
RTL text reversal | Add RTL direction markers; verify in target app
SSIS conversion error | Add Data Conversion Transformation (DT_WSTR → DT_STR)
SQL Server slow queries | Use the N prefix (N'...') for Unicode literals; match parameter types
SAP migration failures | Run UCCHECK; clean cluster tables; test in sandbox
Excel CSV corruption | Import via Data → Get Data with UTF-8 encoding
Fonts break after update | Reset Windows system locale for non-Unicode programs
Python encoding errors | Specify encoding='utf-8' in all file I/O operations
Database migration corruption | Export/import with explicit UTF-8 encoding
Copy-paste destroys text | Paste as plain text; use intermediate text editor

Frequently Asked Questions

Can I reverse a Unicode to non-Unicode conversion?

It depends. If the non-Unicode target code page contains all the characters from your original Unicode text, the conversion is reversible. However, if any characters were replaced with question marks or lost during conversion, those characters cannot be recovered from the non-Unicode version alone. Always keep your original Unicode file.
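
Reversibility can be tested before committing to a conversion. The sketch below covers plain code page conversion only, not glyph-mapped legacy fonts, and is_reversible is a hypothetical helper name. It encodes to the target code page, decodes back, and checks that nothing changed.

```python
def is_reversible(text: str, codepage: str) -> bool:
    """True if text survives a round trip through codepage unchanged."""
    encoded = text.encode(codepage, errors="replace")  # unmappable characters become '?'
    return encoded.decode(codepage) == text


print(is_reversible("Caf\u00e9", "cp1252"))  # Latin text that cp1252 can hold
print(is_reversible("\u0c05\u0c2e\u0c4d\u0c2e", "cp1252"))  # Telugu text that it cannot
```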

Why does UTF-8 use different numbers of bytes per character?

UTF-8 is a variable-length encoding. ASCII characters (English letters, numbers, basic punctuation) use 1 byte. Characters from most world scripts (Telugu, Hindi, Arabic, Chinese) use 2–3 bytes. Emoji and rare characters use 4 bytes. This design makes UTF-8 efficient for English text while still supporting every language.
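
The byte counts are easy to confirm. This is a small illustrative sketch, with utf8_len as a hypothetical helper name. It measures one sample character from each range.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "Latin A": "A",
    "Arabic alef": "\u0627",
    "Telugu a": "\u0c05",
    "Grinning face emoji": "\U0001F600",
}
for label, ch in samples.items():
    print(f"{label} (U+{ord(ch):04X}): {utf8_len(ch)} byte(s)")
```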

Should I use VARCHAR or NVARCHAR in SQL Server?

Use NVARCHAR for any column that may store non-English text (Indian languages, Arabic, Chinese, emoji, etc.). Use VARCHAR only for columns that will contain strictly ASCII characters (English letters, numbers, basic symbols). The storage difference (1 byte vs. 2 bytes per character) is usually negligible compared to the risk of data corruption with VARCHAR for multilingual content.

How do I test if my conversion was successful?

Create a test document containing every character your target language uses — including vowels, consonants, conjuncts, diacritics, numbers, and punctuation. Convert this test document and visually compare the output against the original. Any character that does not match indicates a problem with the conversion mapping.

What is the difference between UTF-8 and UTF-16?

Both are Unicode encoding formats. UTF-8 uses 1–4 bytes per character and is the dominant web standard (used by 98%+ of websites). UTF-16 uses 2 or 4 bytes per character and is the default for Windows, Java, and SQL Server. UTF-8 is more space-efficient for ASCII-heavy text; UTF-16 provides more consistent character sizes.
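
The space trade-off depends on the script, which a short comparison makes concrete. In this sketch, encoded_sizes is a hypothetical helper name, and the "utf-16-le" codec is used so that no byte order mark is counted in the totals.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (without BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = "hello"
telugu = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"  # the word "Telugu" in Telugu script
print("English:", encoded_sizes(english))
print("Telugu: ", encoded_sizes(telugu))
```

ASCII text is half the size in UTF-8, while text in scripts such as Telugu is smaller in UTF-16. Neither encoding wins on size for every language.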

Conclusion

Unicode conversion problems are universal, but so are the solutions. The key principles apply across every platform, language, and tool: always specify encoding explicitly, verify output against the original, and keep your Unicode source files as the definitive backup. When you encounter a conversion issue, start with the problem descriptions in this guide, identify the root cause, and apply the matching solution.

For reliable, script-specific Unicode to non-Unicode conversion — covering Telugu, Hindi, Urdu, Kannada, and many other languages — visit unicode-to-nonunicode.com for free online conversion tools and detailed technical guides.
