Thursday, April 30, 2026

Why encoding='utf-8' Is More Than Just a Bug Fix

We’ve all been there. You’re building a RAG pipeline or trying to load a fresh dataset into a notebook. You run f.read() and suddenly—Boom.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2...

For years, as a Mechanical Engineer by training, my response was pure patchwork. I’d Google the error, find a Stack Overflow thread telling me to add encoding='utf-8' to my open() call, and move on once the red text disappeared. It worked like magic.
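To make that error less magical, here is a minimal sketch that reproduces it without any file at all. The sample text is my own assumption, not the data that originally bit me; the point is only that the bytes are fine and the decoder was told the wrong encoding. Passing encoding='utf-8' to open() runs the same decode step shown here.

```python
# "it’s" with a curly apostrophe (U+2019), as it would sit on disk in UTF-8.
# The apostrophe becomes the three bytes e2 80 99.
raw = "it\u2019s".encode("utf-8")

try:
    raw.decode("ascii")          # what happens when ASCII is assumed
except UnicodeDecodeError as err:
    message = str(err)
    print(message)               # mentions byte 0xe2, like the error above

# Naming the real encoding makes the very same bytes readable.
text = raw.decode("utf-8")
print(text)
```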

But recently, while digging into speech recognition and pattern books, I realized I was treating a masterpiece of optimization like a simple bug fix. I never formally studied encodings, but once you look at the “hardware” logic behind them, it’s a beautiful story of solving a global data traffic jam.

The 7-Bit Prison: A Relic of the 1960s

Back in the 1960s, computing was a “walled garden.” Everything was built around ASCII — a system designed primarily for English teletype machines. To save every precious bit of expensive memory, ASCII only used 7 bits of a byte, keeping that 8th bit permanently locked at zero.

It was lean, sure, but it was incredibly narrow-minded. If you wanted to write a Spanish phrase like Señor or use the Devanagari script (the beautiful characters behind Hindi and Sanskrit), ASCII simply didn’t have the “slots” to hold them. You were literally trapped in a 128-character world.
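A small sketch of that limit, using Señor as the test word. The variable names are mine, chosen only for illustration.

```python
# Every ASCII character has a code below 128, so it fits in 7 bits.
ascii_codes = [ord(ch) for ch in "Senor"]
print(ascii_codes)
print(all(code < 128 for code in ascii_codes))

# "ñ" (written here as \u00f1) sits at code point 241, outside the 7-bit range.
try:
    "Se\u00f1or".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```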

The “Global ID” Spreadsheet

To break out of that prison, we created Unicode.

The best way to visualize Unicode isn’t as a file format, but as a massive, abstract spreadsheet. In this spreadsheet, every character in human history — from ancient Sumerian cuneiform to that “Grinning Face” emoji on your phone — is assigned a unique ID called a Code Point.

For example, the letter a is assigned the ID U+0061. It’s a beautiful, inclusive library that now supports over 150,000 characters and 168 different scripts. But as an engineer, the first thing I thought was: “Wait. If we have over a million potential IDs, how do we save them without making every simple text file four times larger?”
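You can peek into that spreadsheet from Python. This is just a sketch using the standard unicodedata module; describe is a hypothetical helper name I made up for the example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using Python's Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# a, ñ, न and the robot emoji, written as escapes so the source stays unambiguous
for ch in ["a", "\u00f1", "\u0928", "\U0001F916"]:
    print(describe(ch))
```

The code point is only the ID. How that ID turns into bytes is a separate decision, which is where the encodings come in.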

UTF-8: The “Chameleon” of Engineering

This is where the engineering gets truly clever.

If we used a fixed-width system like UTF-32 (4 bytes for every single character), a simple English sentence would be 75% “wasted” zeros. It would be like using a semi-truck to deliver a single envelope.

UTF-8 is the solution — a variable-length masterpiece. It’s essentially a chameleon that changes its size based on the character it’s carrying:

  • For the “Standard” stuff: It uses just 1 byte for the first 128 characters (U+0000 to U+007F), making it perfectly backward-compatible with those old ASCII systems.
  • For the “Global” stuff: Accented Latin letters (as in Spanish) and Arabic take 2 bytes, while scripts like Devanagari (Hindi) expand to 3.
  • For the “Fun” stuff: Emojis and rare symbols take up 4 bytes.
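The size tiers in the list above can be checked one character at a time. A minimal sketch, with sample characters of my own choosing; utf-32-le is used so that no byte order mark inflates the fixed-width count.

```python
# One sample character per tier: a, ñ, न, and the robot emoji.
samples = ["a", "\u00f1", "\u0928", "\U0001F916"]

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
utf32_sizes = [len(ch.encode("utf-32-le")) for ch in samples]

for ch, small, fixed in zip(samples, utf8_sizes, utf32_sizes):
    print(f"{ch!r:<14} UTF-8: {small} byte(s)   UTF-32: {fixed} bytes")
```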

But the real “Mechanical Engineer” in me loves the Self-Synchronization feature. In some older multi-byte encodings, a single corrupted byte could throw off the decoding of everything that followed. UTF-8 is built so that if a byte fails, the decoder can just “skip” ahead and find the start of the next character by looking at the top bits of each byte: continuation bytes always begin with 10, and lead bytes never do. It’s resilient, it’s efficient, and it’s why your RAG pipelines actually work when you feed them multilingual data.
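Here is a rough sketch of that recovery trick. The helpers is_continuation and resync are names I invented for illustration, not part of any library, and real decoders do more validation than this.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

def resync(data: bytes, start: int) -> int:
    """Return the index of the first character boundary at or after start."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i

data = "नमस्ते RAG".encode("utf-8")

# Pretend we landed in the middle of the first 3-byte character.
boundary = resync(data, 1)
recovered = data[boundary:].decode("utf-8")
print(boundary, recovered)

# Corrupt one byte: only the damaged character is lost, the rest still decodes.
damaged = bytearray(data)
damaged[1] = 0xFF                      # 0xFF never appears in valid UTF-8
decoded = bytes(damaged).decode("utf-8", errors="replace")
print(decoded)
```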

The Proof in the Bytes: A Python Experiment

I ran a quick script to compare how these encodings “weigh” the same information.

def check_encoding_impact(text):
    """Print how many bytes each encoding needs for the same text."""
    print(f"\nTarget Text: {text}")
    print("-" * 30)
    for enc in ['ascii', 'utf-8', 'utf-16', 'utf-32']:
        try:
            encoded_data = text.encode(enc)
            print(f"{enc.upper():<8} | {len(encoded_data)} bytes")
        except UnicodeEncodeError:
            # The text contains characters this encoding cannot represent
            print(f"{enc.upper():<8} | not supported")

# Test 1: Standard English
check_encoding_impact("Hello")

# Test 2: Multilingual (Hindi)
check_encoding_impact("नमस्ते")

# Test 3: The Emoji Tax
check_encoding_impact("RAG 🤖")
OUTPUT --

Target Text: Hello
------------------------------
ASCII    | 5 bytes
UTF-8    | 5 bytes
UTF-16   | 12 bytes
UTF-32   | 24 bytes

Target Text: नमस्ते
------------------------------
ASCII    | not supported
UTF-8    | 18 bytes
UTF-16   | 14 bytes
UTF-32   | 28 bytes

Target Text: RAG 🤖
------------------------------
ASCII    | not supported
UTF-8    | 8 bytes
UTF-16   | 14 bytes
UTF-32   | 24 bytes
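One detail in these numbers is worth a closer look: “Hello” is 5 characters, yet UTF-16 reports 12 bytes rather than 10. The extra 2 bytes are a byte order mark (BOM) that Python prepends when you ask for plain 'utf-16' or 'utf-32'. A short sketch of the difference, using the codecs constants from the standard library:

```python
import codecs

with_bom = "Hello".encode("utf-16")        # platform byte order, BOM prepended
without_bom = "Hello".encode("utf-16-le")  # explicit byte order, no BOM

print(len(with_bom), len(without_bom))
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# UTF-32 behaves the same way, with a 4-byte BOM.
print(len("Hello".encode("utf-32")), len("Hello".encode("utf-32-le")))
```

So the fixed-width encodings carry a small constant overhead on top of their per-character cost, which matters most for short strings like these.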

import os

# One combined string with three lines of text
# English + Spanish + Chinese
multilingual_content = (
    "Hello, How are yu ? Have a great day ahead !\n"  # English
    "¡Hola! ¿Cómo estás? ¡Que tengas un gran día!\n"  # Spanish
    "你好,你好嗎?祝你有美好的一天!"  # Chinese
)

def save_and_measure_global(text):
    formats = ['utf-8', 'utf-16', 'utf-32']

    print("--- Global Multilingual Test ---")
    print(f"Total Characters (including newlines): {len(text)}")
    print(f"{'Encoding':<10} | {'File Size (Bytes)':<18}")
    print("-" * 35)

    for fmt in formats:
        filename = f"global_test_{fmt}.txt"

        # Writing the combined string to disk
        with open(filename, 'w', encoding=fmt) as f:
            f.write(text)

        # Measuring the physical footprint
        file_size = os.path.getsize(filename)
        print(f"{fmt.upper():<10} | {file_size:<18}")

        # Clean up
        os.remove(filename)

if __name__ == "__main__":
    save_and_measure_global(multilingual_content)
OUTPUT --

--- Global Multilingual Test ---
Total Characters (including newlines): 106
Encoding   | File Size (Bytes)
-----------------------------------
UTF-8      | 146
UTF-16     | 218
UTF-32     | 436

This experiment moves the discussion from theory to bytes on disk: encoding isn’t just about avoiding “broken” characters, it’s also a storage decision. UTF-8 wins comfortably for English-heavy data, but the Hindi test shows the tradeoff flipping. Each Devanagari character costs 3 bytes in UTF-8, because part of every byte is spent marking lead and continuation bytes (a “signaling tax”), versus 2 bytes in UTF-16. That makes the Hindi string 14 bytes in UTF-16 against 18 in UTF-8, roughly 22% smaller. In the mixed English, Spanish, and Chinese file the result goes the other way: UTF-8 needs 146 bytes against 218 for UTF-16, about 33% smaller. Modern encodings are a game of strategic tradeoffs, and the right choice depends on the linguistic footprint of your global RAG pipeline or localized application.
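For anyone who wants to check the savings implied by the outputs above, the arithmetic is short. The byte counts are copied from the two experiments; percent_smaller is a hypothetical helper name used only for this sketch.

```python
def percent_smaller(smaller: int, larger: int) -> float:
    """How much smaller one size is than another, as a percentage."""
    return round(100 * (larger - smaller) / larger, 1)

# Hindi string: UTF-16 (14 bytes) vs UTF-8 (18 bytes)
hindi_saving = percent_smaller(14, 18)

# Mixed English/Spanish/Chinese file: UTF-8 (146 bytes) vs UTF-16 (218 bytes)
mixed_saving = percent_smaller(146, 218)

print(hindi_saving, mixed_saving)
```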

Reflection: The Hidden Infrastructure

As engineers, we often obsess over the “sexy” stuff — LLM parameters and vector databases. But all of it rests on these tiny, byte-level protocols.

There is no such thing as a “plain text file” without an encoding. Every time you specify encoding='utf-8', you aren't just fixing an error; you are participating in a global standard that allows an engineer in Bangalore, a developer in Seattle, and a machine from 1965 to understand each other.
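A tiny sketch of what “no plain text without an encoding” means in practice: the same bytes read under two different assumptions. The choice of Latin-1 as the wrong guess is mine, purely for illustration.

```python
raw = "Se\u00f1or".encode("utf-8")   # the bytes on disk: b'Se\xc3\xb1or'

as_utf8 = raw.decode("utf-8")        # the encoding the writer used
as_latin1 = raw.decode("latin-1")    # a wrong guess by the reader

print(as_utf8)     # Señor
print(as_latin1)   # SeÃ±or
```

Nothing fails in the second case, which is arguably worse than an error: the bytes decode cleanly into the wrong characters.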

It’s a reminder that sometimes, the most elegant engineering is the stuff we never see — until it fails.

Sunday, April 19, 2026

This Is for Those Who Feel Stuck

If you are currently pursuing a Master’s, a CFA, or something similar—that’s great. You already have a short-to-mid-term goal to work toward, and I wish you the best of luck!

This post is for those who don’t know what to aim for, or who feel a bit stuck and unable to move toward a goal.

Five or six years ago, I felt what we now call a "mid-life crisis." I assumed it was a one-time event that would just pass. Oh boy, I couldn’t have been more wrong. It keeps coming back—and it is perfectly fine if you feel stuck momentarily.

I would say: Go back to basics. Go back to your roots.

Often in our day-to-day work, we lose touch with the fundamentals we once studied. It is a great idea to revisit your old reference books or walk through your old code and repositories. I am sure there are topics you struggled with back then—maybe chapters you kept as "optional" just to pass the course. Go back to those chapters now. Prepare as if you were studying for college exams, or look into new developments in your core area of expertise. Reinforce your foundation with new technology.

This may not directly unlock your "next step," but it re-strengthens your foundation. When you eventually figure out a goal in a few weeks, that reinforced foundation will ensure you achieve it much faster. Being an engineer, I’ll use a personal example: Integrals. I used to solve complex integrals in my sleep; now, when I see one, I find myself wondering how to even approach it. Picking that back up is a powerful way to reset.

Think of Creed II (or the Rocky movies). When Adonis is lost, Rocky takes him to the desert, away from the noise, the ego, and the comfort. He makes him go back to the basics to master what he once knew. It’s the same in Rocky IV. You cut off the distractions and return to the foundation. This is where you get rid of the doubt and the noise.



The Second Path: Learn something entirely new.

This is the opposite approach, inspired by the likes of Alex Hormozi. Learn something totally unrelated to what you have learned so far to build new capabilities. This requires you to be ready to "suck" at something. You have to be willing to be bad at something for a long time before you can be great at it.

If you are a Doctor, maybe learn about AI. If you are an Engineer, learn about sales or philosophy.

A word of caution: When I say "learn something new," I don't mean a two-hour crash course or a quick YouTube tutorial. Learn from the base. If you are learning finance, pick up academic reference books or join a rigorous 6–9 month course. This deep dive might even land you on a new path you never considered.

In this fast-paced world, these ideas are "slow grinders." They require time, effort, and patience. However, both approaches help fire up your "learning neurons," which improves your thinking and helps you find clarity in the moments you feel most stuck.

This is for the people who want to play the long game. When you’re in a crisis, a pause—spent either returning to basics or learning something new without the pressure of a specific goal—will go a long way.

Trust me, it always pays off in the long run.