Monday, June 22, 2026

Vibe Surgeon

 


Doctor and GenAI -- I like to take the extreme example of a doctor and imagine different scenarios of doctors using GenAI (LLMs) the way engineers do, or the way we expect engineers to.

Maybe the stakes are very different, and I do believe that based on sensitivity and cost impact, different AI and analytics models can be valuable. But for a second, let us go back to the doctor in an operating theatre.

So, "vibe coding" -- imagine a Vibe Surgeon. A doc comes into the OT and uploads all reports -- maybe X-rays and a live video feed -- and asks AI to guide the surgery. Starting from which surgery to perform.

Now, say our doc knows just a little bit about tool use and how the human body works, and starts the surgery. Then, after the initial cut, he suddenly sees that this particular patient has some complications. Our AI goes into thinking mode and, after 5 minutes, lists 5 root causes and maybe 5 more things to look for, and so on. Our Surgeon follows it religiously and sees things worsening.


So here is my hypothesis, or one explanation: LLMs are trained largely on internet data, and they predict the next word, then the next, and so on. Common structures are bound to get higher probability, so they pop up more often. One can tune some sampling parameters (temperature, for example) to make the model more creative, and it does that by picking less likely words, sometimes from very different domains, that may or may not make sense. Meanwhile, our patient's situation is worsening, and it turns out this is one of those edge cases that even our GenAI is not able to fully understand.

I don't want to comment on how it ends.

Marvel fan or not -- if you have watched Dr. Strange, then this paragraph may make more sense. Remember how our egotistical Dr. Strange looks at a few cases and says -- A, B, C -- simple, or not worth my time, or anyone else can do it -- and then picks some unique case and he is like, "Yes, this is worth my time," and all that. We need such experts. And with the way things are going, these experts are going to be fewer and fewer. They are already a rare species, and 10 years from now? Phew. Just can't imagine.

Lesson or something -- well -- thanks for reading, but I don't have a specific takeaway for you. Just my thoughts, building a persona of a Vibe Surgeon, and I would love to hear from you.

Thursday, May 7, 2026

SomeDays

 𝙎𝙤𝙢𝙚𝙙𝙖𝙮𝙨

𝙇𝙞𝙛𝙚 𝙞𝙨 𝙬𝙝𝙖𝙩 𝙝𝙖𝙥𝙥𝙚𝙣𝙨 𝙗𝙚𝙩𝙬𝙚𝙚𝙣 𝙩𝙝𝙚𝙨𝙚 𝙎𝙤𝙢𝙚𝘿𝙖𝙮𝙨 !

Somedays I feel like the King of the world

Somedays I am the Loser of the Year


Somedays I would cry over all the challenges

Somedays I would blush around all day for no reason



Somedays I feel blessed that we met

Somedays I feel depressed that you left


Somedays I want to dance in the pouring rain

Somedays I want to stay in blankets on a winter morning


Somedays I want to floor a sports car

Somedays I want to cuddle up with a Teddy


Somedays I would gulp down multiple scoops of ice cream

Somedays I would fast only on bitter black coffee


Somedays I don't sweat after a 10 miler

Somedays I am not able to complete 2 mile runs


Somedays I would walk 5kms to save 10 bucks

Somedays I would spend thousands for a fancy mobile cover


Somedays I would buckle up and tie up

Somedays I would roam around in shorts


𝙇𝙞𝙛𝙚 𝙞𝙨 𝙬𝙝𝙖𝙩 𝙝𝙖𝙥𝙥𝙚𝙣𝙨 𝙗𝙚𝙩𝙬𝙚𝙚𝙣 𝙩𝙝𝙚𝙨𝙚 𝙎𝙤𝙢𝙚𝘿𝙖𝙮𝙨 !


Friday, May 1, 2026

With 60% Matches Complete, Has IPL 2026's Top 4 Already Locked Horns?

Using Monte Carlo Simulations to Predict IPL 2026 Playoff Scenarios

Analysis based on results through match 42, completed on April 30.

TL;DR: The Numbers Don't Lie

With 28 matches still to play in IPL 2026, I ran 1 million Monte Carlo simulations to calculate each team's playoff probability. Here's what the data says:




The Key Insight

Yes, RCB has less than a 50% chance to qualify, but so do SRH and RR, who are tied with them on points!

PBKS has essentially locked one playoff spot (and potentially top 2) with 70% probability. But for the remaining three spots, we have a three-way battle between RCB, SRH, and RR, each sitting at 12 points with roughly equal odds of about 50%. Counting ties on points for 4th place, their odds come to roughly 85%, and then it comes down to net run rate (NRR).

GT lurks in the shadows with 10 points and a 56% chance to tie or better 4th place, and they will have to win by big margins.

Now, let me show you how I got these numbers — and why accounting for team form changes everything.

The Story: When Simple Math Isn't Enough

Last year, with just 14 matches remaining in the IPL season, cricket Twitter was ablaze with RCB playoff predictions. A widely-circulated analysis claimed RCB had a whopping 98% chance of making the top 4. My manager, a die-hard cricket fan, was ecstatic and shared the paper with the entire team.

Curious about the methodology, I decided to verify these claims myself. The approach was straightforward: simulate all possible outcomes of the remaining 14 matches. Each match has 2 possible outcomes (either team can win), so with 14 matches, we have 2^14 = 16,384 total scenarios. Running through all combinations on my laptop took just a few seconds, and the results were clear.
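
The exhaustive approach can be sketched in a few lines. This is a toy illustration rather than the original analysis: the team names, points, and fixtures are made up, a win is assumed to be worth 2 points, and washouts are ignored. A tie on points for the last spot is counted as qualifying.

```python
from itertools import product

# Hypothetical points table and remaining fixtures (illustration only).
points = {"A": 12, "B": 12, "C": 10, "D": 8, "E": 8}
fixtures = [("A", "B"), ("C", "D"), ("E", "A"), ("B", "C"), ("D", "E")]


def enumerate_top4(points, fixtures, team, spots=4):
    """Count scenarios in which `team` finishes in, or ties for, the top `spots`."""
    qualified = 0
    total = 0
    # Each fixture has 2 outcomes, so there are 2**len(fixtures) scenarios.
    for outcome in product((0, 1), repeat=len(fixtures)):
        table = dict(points)
        for pair, winner_idx in zip(fixtures, outcome):
            table[pair[winner_idx]] += 2  # assumed: 2 points per win
        # Teams strictly ahead of `team`; ties would be settled by net run rate.
        ahead = sum(1 for p in table.values() if p > table[team])
        qualified += ahead < spots
        total += 1
    return qualified, total


qualified, total = enumerate_top4(points, fixtures, "C")
print(f"C qualifies (or ties) in {qualified} of {total} scenarios")
```

The loop count doubles with every extra fixture, which is why this is comfortable at 14 matches and painful at 28.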

Fast forward to IPL 2026.

With 28 matches still to be played, the same brute-force approach would require evaluating 2^28 = 268,435,456 scenarios — over 268 million combinations! Even on a modern machine, this would take considerable time and memory. More importantly, it's computationally wasteful.

Enter Monte Carlo simulation — a smarter way to estimate probabilities without checking every single possibility.


What is Monte Carlo Simulation?

Monte Carlo simulation is a statistical technique that uses random sampling to estimate outcomes when the solution space is too large to explore exhaustively.

In our IPL context:

  • Instead of simulating all 268 million scenarios, we randomly sample (say) 1 million scenarios
  • For each scenario, we randomly decide the winner of each of the 28 remaining matches
  • After all simulations, we calculate: "In how many scenarios did RCB finish in the top 4?"
  • If RCB made top 4 in 475,000 out of 1,000,000 simulations, their probability is 47.5%

The beauty: With enough simulations (typically 1 million), the results converge to the "true" probability — but we only need to run a tiny fraction of all possible scenarios.
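
The sampling idea above can be sketched as follows. It is a minimal example under stated assumptions, not the notebook's code: the points table and fixtures are hypothetical, every match is a 50-50 coin flip, a win is worth 2 points, and a tie on points for the last spot counts as qualifying.

```python
import random

# Hypothetical points table and remaining fixtures (illustration only).
points = {"A": 12, "B": 12, "C": 10, "D": 8, "E": 8}
fixtures = [("A", "B"), ("C", "D"), ("E", "A"), ("B", "C"), ("D", "E")]


def simulate_top4(points, fixtures, team, n_sims=20_000, spots=4, seed=42):
    """Estimate the probability that `team` finishes in, or ties for, the top `spots`."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    hits = 0
    for _ in range(n_sims):
        table = dict(points)
        for home, away in fixtures:
            # Coin flip: either side wins with probability 0.5.
            winner = home if rng.random() < 0.5 else away
            table[winner] += 2  # assumed: 2 points per win
        ahead = sum(1 for p in table.values() if p > table[team])
        hits += ahead < spots
    return hits / n_sims


prob = simulate_top4(points, fixtures, "C")
print(f"Estimated qualification probability for C: {prob:.1%}")
```

Only the number of sampled scenarios drives the cost here, not the number of possible scenarios, so the same loop works for 28 remaining matches.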


Convergence Analysis: How Many Simulations Do We Need?

One key question with Monte Carlo methods: How many simulations are enough?

To answer this, I ran the playoff predictions with different simulation counts: 100, 1,000, 100K, 500K, 1M, and 2M simulations. Here's what the convergence looks like: [can be debated though]

2M simulations is a little under 1% (about 1/134th) of the roughly 268M possible scenarios.


Plot showing % of making it to top 4 or tie for top 4 position



From the plot above, we see that as the number of simulations increases (X-axis), the probabilities stabilize. The table below shows the difference between 1M and 2M simulations.




Verdict: By 100,000 simulations, probabilities have largely stabilized. Running 1M or 2M simulations provides marginal improvements in precision. For our analysis, 1 million simulations offers an excellent balance between accuracy and computation time (~5-10 seconds).
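
One way to sanity-check convergence is to try the method on a problem where the exact answer is known. The sketch below uses a made-up toy question (the chance of winning at least 5 of 7 coin-flip matches), computes the exact value from the binomial distribution, and compares it with Monte Carlo estimates at increasing sample sizes. The question and the sample sizes are assumptions chosen for illustration.

```python
import random
from math import comb

MATCHES, WINS_NEEDED = 7, 5  # hypothetical toy problem

# Exact probability from the binomial distribution, used as ground truth.
exact = sum(comb(MATCHES, k) for k in range(WINS_NEEDED, MATCHES + 1)) / 2**MATCHES


def estimate(n_sims, seed=0):
    """Monte Carlo estimate of P(at least WINS_NEEDED wins in MATCHES coin flips)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        wins = sum(rng.random() < 0.5 for _ in range(MATCHES))
        hits += wins >= WINS_NEEDED
    return hits / n_sims


print(f"exact = {exact:.4f}")
for n in (100, 1_000, 10_000, 100_000):
    p = estimate(n)
    print(f"n={n:>7,}  estimate={p:.4f}  abs error={abs(p - exact):.4f}")
```

The error generally shrinks as n grows, though not monotonically for any single seed, which matches the "largely stabilized" behaviour described above.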


Understanding Uncertainty: Standard Error

Every Monte Carlo simulation has inherent uncertainty. How confident can we be in our 47.7% estimate for RCB?

The answer lies in standard error — a measure of precision for our probability estimates:

Standard Error (SE) = √[p(1-p)/n]

Where:

  • p = estimated probability (e.g., 0.8723 for RCB)
  • n = number of simulations




What This Means:

Key Insight: With 1 million simulations, the standard error on RCB's 87.23% estimate is about 0.03 percentage points, giving a 95% confidence interval of roughly [87.16%, 87.30%], which is very precise!

To halve the standard error, you need 4x more simulations (due to the √n relationship). The law of diminishing returns kicks in quickly.
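
The formula is easy to check numerically. The sketch below plugs in the example values from this section (p = 0.8723, n = 1,000,000); the function names are my own, and 1.96 is the usual normal-approximation multiplier for a 95% interval.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of a simulated proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def confidence_interval(p, n, z=1.96):
    """Normal-approximation confidence interval (z = 1.96 for about 95%)."""
    half_width = z * standard_error(p, n)
    return p - half_width, p + half_width


p, n = 0.8723, 1_000_000
se = standard_error(p, n)
low, high = confidence_interval(p, n)
print(f"SE     = {se:.6f}")
print(f"95% CI = [{low:.2%}, {high:.2%}]")

# Quadrupling the number of simulations halves the standard error.
ratio = standard_error(p, 4 * n) / se
print(f"SE ratio at 4x simulations = {ratio:.2f}")
```

The interval works out to about ±0.07 percentage points around the estimate, and the ratio confirms the √n relationship.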







The Problem with 50-50 Assumptions

So far, we've assumed every match is a coin flip — each team has exactly 50% chance to win. While this is a conservative baseline, it ignores a crucial factor: current team form.

Consider this:

  • SRH has won their last 5 matches (100% recent form)
  • LSG has lost their last 5 matches (0% recent form)

Should we really treat an SRH vs LSG match as 50-50? Obviously not.

Form-Based Simulation: A More Realistic Approach

Instead of coin flips, we can use recent match results to estimate win probabilities:

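Step 1: Calculate each team's win percentage over their last N matches (a tunable parameter; we use N=4). The results below are listed oldest to newest, and only the most recent N count.

SRH: W-W-W-W-W → 100% form
LSG: L-L-L-L-L → 0% form
RCB: W-W-L-W-L → 50% form (last 4: W-L-W-L)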

Step 2: When two teams face off, normalize their forms to get win probabilities

SRH vs LSG:
  - SRH gets: 100/(100+0) = 100% win chance
  - LSG gets: 0/(100+0) = 0% win chance

RCB vs PBKS (75% form):
  - RCB gets: 50/(50+75) = 40% win chance
  - PBKS gets: 75/(50+75) = 60% win chance

Step 3: Run Monte Carlo simulation using these weighted probabilities instead of 50-50
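
The three steps can be sketched as below. This is a minimal version under assumptions, not the notebook's code: the function names are mine, results are strings listed oldest to newest, and the 50-50 fallback when both teams have 0% form is my own choice, since the normalization would otherwise divide by zero.

```python
import random


def form_pct(results, window=4):
    """Step 1: win percentage over the last `window` results (oldest first)."""
    recent = results[-window:]
    return 100 * recent.count("W") / len(recent)


def win_probability(form_a, form_b):
    """Step 2: normalize two form values into team A's win probability."""
    total = form_a + form_b
    if total == 0:
        return 0.5  # assumption: two winless teams are treated as a coin flip
    return form_a / total


def simulate_match(team_a, team_b, forms, rng):
    """Step 3: draw a winner using the form-weighted probability."""
    p_a = win_probability(forms[team_a], forms[team_b])
    return team_a if rng.random() < p_a else team_b


forms = {
    "SRH": form_pct("WWWWW"),
    "LSG": form_pct("LLLLL"),
    "RCB": form_pct("WWLWL"),
    "PBKS": 75.0,  # taken from the worked example above
}
rng = random.Random(7)
print(forms)
print("RCB win chance vs PBKS:", win_probability(forms["RCB"], forms["PBKS"]))
print("SRH vs LSG winner:", simulate_match("SRH", "LSG", forms, rng))
```

Note that a 100% vs 0% matchup becomes a certainty under this scheme. Widening the window or blending form with a 50-50 prior would soften that, at the cost of another tunable parameter.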


Results: 50-50 vs Form-Based Predictions

Here's how playoff probabilities change when we account for current team form:





Major Takeaways:

  1. RR is the big winner (+10%): their own form is only 50%, but their opponents' average form is 25%, so RR stand to gain a lot if current form holds.
  2. SRH see a small improvement (5%): they have 100% form, but they will be playing stronger opponents than RR, so they will have to fight hard to keep their spot.
  3. PBKS remain strong: already leading the table with good form, they maintain ~98% playoff odds.




Final Verdict: The Playoff Race as of May 1, 2026

Form-Based Playoff Probabilities (1M Simulations):







The Storylines:

✅ PBKS: The Frontrunners -- With 13 points and strong form (98%), they're virtual locks for the playoffs. Only a catastrophic collapse keeps them out.

🔥 SRH: Form is Everything -- On paper, tied with RCB and RR at 12 points. But their 5-match winning streak makes them the favorites among this trio, although they do face some tough opponents.

⚠️ RCB: The Title Says It All -- Despite being tied for 2nd place on points, RCB's inconsistent form (alternating W-L pattern) drops them to 4th on our list.

📉 RR: Form Slump Hurts -- Like RCB, they have 12 points but only 50% form in their last 4 matches. Their playoff hopes are fading fast, but they may be playing some out-of-form opponents, who may still fight back for some glory points.

🎲 GT: The Dark Horse -- 2 points behind at 10, but 60% form keeps them in the race with a 1-in-5 shot.


❌ The Rest: Mathematically Alive, Practically Done -- CSK might show some probability, but it looks like a tall mountain for them to climb. DC and KKR have 3% odds. MI and LSG are essentially eliminated (<0.01%).





Conclusion: Data-Driven Cricket Analysis

What started as a simple question — "Will RCB make the playoffs?" — turned into a fascinating exploration of:

  • Monte Carlo simulation as a practical tool for complex probability problems
  • Convergence analysis to understand how many simulations are "enough"
  • Standard error to quantify uncertainty in our estimates
  • Form-based predictions that go beyond coin-flip assumptions



Try It Yourself

All code and data used in this analysis are available in this Jupyter notebook:

Feel free to:

  • Adjust the form window (last 3 vs last 4 vs last 5 matches)
  • Run your own simulations with different team points
  • Explore scenario analysis: "What if RCB wins their next 2 matches?"

Cricket + Data Science = ❤️


Last Updated: May 1, 2026 | Simulations: 1,000,000 | Method: Monte Carlo with Form-Based Win Probabilities

Thursday, April 30, 2026

Why encoding='utf-8' is More Than Just a Bug Fix

We’ve all been there. You’re building a RAG pipeline or trying to load a fresh dataset into a notebook. You run f.read() and suddenly—Boom.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2...

For years, as a Mechanical Engineer by training, my response was purely “patch-work.” I’d Google the error, find a StackOverflow thread telling me to add encoding='utf-8' to my code, and move on once the red text disappeared. It worked like magic.
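
For context, here is a small reproduction of that error and the fix. It is a sketch with a made-up sample string and a temporary file; the curly apostrophe is the kind of character whose UTF-8 encoding starts with the byte 0xe2.

```python
import os
import tempfile

# Curly quotes encode to three bytes in UTF-8, the first of which is 0xe2.
text = "It’s a “smart” quote"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the wrong codec reproduces the error.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Reading with the codec the file was written in round-trips cleanly.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print("ASCII read failed with:", ascii_error)
print("UTF-8 read matches original:", round_trip == text)
```

The fix works because the bytes on disk were UTF-8 all along; the argument only tells Python which decoder to use instead of falling back to a platform default.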

But recently, while digging into speech recognition and pattern books, I realized I was treating a masterpiece of optimization like a simple bug fix. I never formally studied encodings, but once you look at the “hardware” logic behind them, it’s a beautiful story of solving a global data traffic jam.

The 7-Bit Prison: A Relic of the 1960s

Back in the 1960s, computing was a “walled garden.” Everything was built around ASCII — a system designed primarily for English teletype machines. To save every precious bit of expensive memory, ASCII only used 7 bits of a byte, keeping that 8th bit permanently locked at zero.

It was lean, sure, but it was incredibly narrow-minded. If you wanted to write a Spanish phrase like Señor or use the Devanagari script (the beautiful characters behind Hindi and Sanskrit), ASCII simply didn’t have the “slots” to hold them. You were literally trapped in a 128-character world.
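
A quick way to see that limit from Python is shown below. The helper name is my own, and the sample words are just illustrations.

```python
def fits_in_ascii(text):
    """Return True if every character has a code point below 128."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for word in ["Senor", "Señor", "नमस्ते"]:
    print(f"{word!r}: ASCII-safe = {fits_in_ascii(word)}")

# Every ASCII byte fits in 7 bits, so the top bit of each byte is 0.
ascii_bytes = "Hello".encode("ascii")
print([f"{b:08b}" for b in ascii_bytes])
```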

The “Global ID” Spreadsheet

To break out of that prison, we created Unicode.

The best way to visualize Unicode isn’t as a file format, but as a massive, abstract spreadsheet. In this spreadsheet, every character in human history — from ancient Sumerian cuneiform to that “Grinning Face” emoji on your phone — is assigned a unique ID called a Code Point.

For example, the letter a is assigned the ID U+0061. It’s a beautiful, inclusive library that now supports over 150,000 characters and 168 different scripts. But as an engineer, the first thing I thought was: “Wait. If we have over a million potential IDs, how do we save them without making every simple text file four times larger?”
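
Python exposes this "spreadsheet" directly through `ord`, `chr`, and the standard `unicodedata` module. The sketch below looks up a few sample characters; the `code_point` helper is my own naming.

```python
import sys
import unicodedata


def code_point(ch):
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["a", "ñ", "न", "😀"]:
    print(ch, code_point(ch), unicodedata.name(ch))

# Going the other way: from ID back to character.
print(chr(0x61))

# The code space runs from U+0000 to U+10FFFF, a little over 1.1 million slots.
print(sys.maxunicode + 1)
```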

UTF-8: The “Chameleon” of Engineering

This is where the engineering gets truly clever.

If we used a fixed-width system like UTF-32 (4 bytes for every single character), a simple English sentence would be 75% “wasted” zeros. It would be like using a semi-truck to deliver a single envelope.

UTF-8 is the solution — a variable-length masterpiece. It’s essentially a chameleon that changes its size based on the character it’s carrying:

  • For the “Standard” stuff: It uses just 1 byte for the first 128 characters (code points 0 to 127), making it perfectly backward-compatible with those old ASCII systems.
  • For the “Global” stuff: When it hits accented Spanish letters or Arabic, it smoothly expands to 2 bytes; Devanagari (Hindi) and most Chinese characters take 3.
  • For the “Fun” stuff: Emojis and rare symbols take up 4 bytes.
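
The size tiers above can be checked one character at a time. The sample characters are my own picks, one per tier; the binary dump also shows the prefix bits that mark where each character starts.

```python
samples = {
    "a": "Latin letter",
    "ñ": "accented Latin letter",
    "न": "Devanagari letter",
    "🤖": "emoji",
}


def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    raw = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in raw)
    print(f"{ch} ({label}): {utf8_length(ch)} byte(s) -> {bits}")
```

In the binary output, single-byte characters start with 0, lead bytes start with 110, 1110, or 11110 depending on length, and every following byte starts with 10.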

But the real “Mechanical Engineer” in me loves the Self-Synchronization feature. In some older multi-byte encodings, one corrupted or missing byte could turn everything after it into gibberish. UTF-8 is built so that if a byte fails, a decoder can just “skip” ahead and find the start of the next character by looking at the top bits of each byte: continuation bytes always start with 10, while bytes that begin a character never do. It’s resilient, it’s efficient, and it’s why your RAG pipelines actually work when you feed them multilingual data.
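
Here is a small sketch of that resynchronization. It simulates damage by dropping the first byte of an encoded Hindi word, then skips continuation bytes (those whose top two bits are 10) until it reaches a byte that starts a character. The `resync` helper is hypothetical, written for this illustration.

```python
def resync(data, start=0):
    """Return the index of the first byte at or after `start` that begins a character."""
    i = start
    # Continuation bytes look like 10xxxxxx; skip them.
    while i < len(data) and (data[i] & 0b1100_0000) == 0b1000_0000:
        i += 1
    return i


encoded = "नमस्ते".encode("utf-8")
damaged = encoded[1:]  # simulate losing the first byte in transit

offset = resync(damaged)
recovered = damaged[offset:].decode("utf-8")
print(f"Skipped {offset} orphaned byte(s), recovered: {recovered}")

# Python's own decoder can do the same recovery, marking the damage it skipped.
lenient = damaged.decode("utf-8", errors="replace")
print(lenient)
```

Only the damaged character is lost; everything after it decodes normally.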

The Proof in the Bytes: A Python Experiment

I ran a quick script to compare how these encodings “weigh” the same information.

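def check_encoding_impact(text):
    """Print how many bytes `text` occupies in each encoding."""
    print(f"\nTarget Text: {text}")
    print("-" * 30)
    # Note: the 'utf-16' and 'utf-32' codecs prepend a byte-order mark (BOM)
    for enc in ['ascii', 'utf-8', 'utf-16', 'utf-32']:
        try:
            encoded_data = text.encode(enc)
            print(f"{enc.upper():<8} | {len(encoded_data)} bytes")
        except UnicodeEncodeError:
            # ASCII cannot represent characters outside its 128 code points
            print(f"{enc.upper():<8} | not supported")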

# Test 1: Standard English
check_encoding_impact("Hello")

# Test 2: Multilingual (Hindi)
check_encoding_impact("नमस्ते")

# Test 3: The Emoji Tax
check_encoding_impact("RAG 🤖")

Output:

Target Text: Hello
------------------------------
ASCII    | 5 bytes
UTF-8    | 5 bytes
UTF-16   | 12 bytes
UTF-32   | 24 bytes

Target Text: नमस्ते
------------------------------
ASCII    | not supported
UTF-8    | 18 bytes
UTF-16   | 14 bytes
UTF-32   | 28 bytes

Target Text: RAG 🤖
------------------------------
ASCII    | not supported
UTF-8    | 8 bytes
UTF-16   | 14 bytes
UTF-32   | 24 bytes

import os

# One combined string with three lines of text
# English + Spanish + Chinese
multilingual_content = (
    "Hello, How are yu ? Have a great day ahead !\n"  # English
    "¡Hola! ¿Cómo estás? ¡Que tengas un gran día!\n"  # Spanish
    "你好,你好嗎?祝你有美好的一天!"  # Chinese
)

def save_and_measure_global(text):
    """Write `text` to disk in each encoding and report the file size."""
    formats = ['utf-8', 'utf-16', 'utf-32']

    print("--- Global Multilingual Test ---")
    print(f"Total Characters (including newlines): {len(text)}")
    print(f"{'Encoding':<10} | {'File Size (Bytes)':<18}")
    print("-" * 35)

    for fmt in formats:
        filename = f"global_test_{fmt}.txt"

        # Writing the combined string to disk.
        # Sizes include the BOM for UTF-16/UTF-32 and depend on the
        # platform's newline translation in text mode.
        with open(filename, 'w', encoding=fmt) as f:
            f.write(text)

        # Measuring the physical footprint
        file_size = os.path.getsize(filename)
        print(f"{fmt.upper():<10} | {file_size:<18}")

        # Clean up
        os.remove(filename)

if __name__ == "__main__":
    save_and_measure_global(multilingual_content)

Output:

--- Global Multilingual Test ---
Total Characters (including newlines): 106
Encoding   | File Size (Bytes)
-----------------------------------
UTF-8      | 146
UTF-16     | 218
UTF-32     | 436

This experiment moves the discussion from theory to hardware: encoding isn’t just about avoiding “broken” characters, it is also a storage decision. UTF-8 wins for English-heavy data, but its lead and continuation bits act as a “signaling tax” on scripts like Devanagari, where every character needs 3 bytes. In the first experiment, the Hindi word takes 18 bytes in UTF-8 and 14 in UTF-16 (about 22% less), while the mixed English, Spanish, and Chinese file goes the other way: 146 bytes in UTF-8 against 218 in UTF-16 (about 33% less). Modern encodings are a game of strategic tradeoffs, and choosing the right one can change your storage and memory overhead by 20 to 30% depending on the linguistic footprint of your global RAG pipeline or localized application.
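
For short strings, part of the UTF-16 and UTF-32 footprint is the byte-order mark rather than the text itself. The sketch below separates the two by using the `-le` codec variants, which do not write a BOM. The helper name is my own, and the sample strings are the ones from the first experiment.

```python
def encoded_sizes(text):
    """Byte counts per encoding; the -le variants omit the byte-order mark."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for label, text in [("English", "Hello"), ("Hindi", "नमस्ते"), ("Emoji", "RAG 🤖")]:
    print(f"{label:<8} {encoded_sizes(text)}")

# The plain 'utf-16' codec adds a 2-byte BOM, and 'utf-32' adds 4 bytes.
bom_overhead = len("Hello".encode("utf-16")) - len("Hello".encode("utf-16-le"))
print("UTF-16 BOM overhead:", bom_overhead, "bytes")
```

On large files the BOM is negligible, so the per-character cost is what decides which encoding is smaller for a given script.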

Reflection: The Hidden Infrastructure

As engineers, we often obsess over the “sexy” stuff — LLM parameters and vector databases. But all of it rests on these tiny, byte-level protocols.

There is no such thing as a “plain text file” without an encoding. Every time you specify encoding='utf-8', you aren't just fixing an error; you are participating in a global standard that allows an engineer in Bangalore, a developer in Seattle, and a machine from 1965 to understand each other.

It’s a reminder that sometimes, the most elegant engineering is the stuff we never see — until it fails.

Sunday, April 19, 2026

This is for those who Feel Stuck

If you are currently pursuing a Master’s, a CFA, or something similar—that’s great. You already have a short-to-mid-term goal to work toward, and I wish you the best of luck!

This post is for those who don’t know what to aim for, or who feel a bit stuck and unable to move toward a goal.

Five or six years ago, I felt what we now call a "mid-life crisis." I assumed it was a one-time event that would just pass. Oh boy, I couldn’t have been more wrong. It keeps coming back—and it is perfectly fine if you feel stuck momentarily.

I would say: Go back to basics. Go back to your roots.

Often in our day-to-day work, we lose touch with the fundamentals we once studied. It is a great idea to revisit your old reference books or walk through your old code and repositories. I am sure there are topics you struggled with back then—maybe chapters you kept as "optional" just to pass the course. Go back to those chapters now. Prepare as if you were studying for college exams, or look into new developments in your core area of expertise. Reinforce your foundation with new technology.

This may not directly unlock your "next step," but it re-strengthens your foundation. When you eventually figure out a goal in a few weeks, that reinforced foundation will ensure you achieve it much faster. Being an engineer, I’ll use a personal example: Integrals. I used to solve complex integrals in my sleep; now, when I see one, I find myself wondering how to even approach it. Picking that back up is a powerful way to reset.

Think of Creed 2 (or the Rocky movies) -- When Adonis is lost, Rocky takes him to the desert—away from the noise, the ego, and the comfort. He makes him go back to the basics to master what he once knew. It’s the same in Rocky 4. You cut off the distractions and return to the foundation. This is where you get rid of the doubt and the noise.



The Second Path: Learn something entirely new.

This is the opposite approach, inspired by the likes of Alex Hormozi. Learn something totally unrelated to what you have learned so far to build new capabilities. This requires you to be ready to "suck" at something. You have to be willing to be bad at something for a long time before you can be great at it.

If you are a Doctor, maybe learn about AI. If you are an Engineer, learn about sales or philosophy.

A word of caution: When I say "learn something new," I don't mean a two-hour crash course or a quick YouTube tutorial. Learn from the base. If you are learning finance, pick up academic reference books or join a rigorous 6–9 month course. This deep dive might even land you on a new path you never considered.

In this fast-paced world, these ideas are "slow grinders." They require time, effort, and patience. However, both approaches help fire up your "learning neurons," which improves your thinking and helps you find clarity in the moments you feel most stuck.

This is for the people who want to play the long game. When you’re in a crisis, a pause—spent either returning to basics or learning something new without the pressure of a specific goal—will go a long way.

Trust me, it always pays off in the long run.