Sandeep Shah

Thursday, May 7, 2026

SomeDays

𝙎𝙤𝙢𝙚𝙙𝙖𝙮𝙨

𝙇𝙞𝙛𝙚 𝙞𝙨 𝙬𝙝𝙖𝙩 𝙝𝙖𝙥𝙥𝙚𝙣𝙨 𝙗𝙚𝙩𝙬𝙚𝙚𝙣 𝙩𝙝𝙚𝙨𝙚 𝙎𝙤𝙢𝙚𝘿𝙖𝙮𝙨 !

Somedays I feel like the King of the world

Somedays I am the Loser of the Year

Somedays I would cry over all the challenges

Somedays I would blush around all day for no reason

Somedays I feel blessed that we met

Somedays I feel depressed that you left

Somedays I want to dance in the pouring rain

Somedays I want to stay in blankets on a winter morning

Somedays I want to floor a sports car

Somedays I want to cuddle up with a Teddy

Somedays I would gulp down multiple scoops of ice cream

Somedays I would fast only on bitter black coffee

Somedays I don't sweat after a 10 miler

Somedays I am not able to complete 2 mile runs

Somedays I would walk 5kms to save 10 bucks

Somedays I would spend thousands for a fancy mobile cover

Somedays I would buckle up and tie up

Somedays I would roam around in shorts

𝙇𝙞𝙛𝙚 𝙞𝙨 𝙬𝙝𝙖𝙩 𝙝𝙖𝙥𝙥𝙚𝙣𝙨 𝙗𝙚𝙩𝙬𝙚𝙚𝙣 𝙩𝙝𝙚𝙨𝙚 𝙎𝙤𝙢𝙚𝘿𝙖𝙮𝙨 !

Friday, May 1, 2026

With 60% Matches Complete, Has IPL 2026's Top 4 Already Locked Horns?

Using Monte Carlo Simulations to Predict IPL 2026 Playoff Scenarios
Analysis based on results through match 42, completed on 30 April.

TL;DR: The Numbers Don't Lie

With 28 matches still to play in IPL 2026, I ran 1 million Monte Carlo simulations to calculate each team's playoff probability. Here's what the data says:

The Key Insight

Yes, RCB has less than a 50% chance of qualifying, but so do SRH and RR, who are tied with them on points!

PBKS has essentially locked up one playoff spot (and potentially a top-2 finish) with 70% probability. For the remaining three spots, we have a three-way battle between RCB, SRH, and RR, each sitting at 12 points with roughly equal odds of about 50%. Counting ties for 4th place, their odds rise to roughly 85%, and at that point it comes down to net run rate (NRR).

GT lurks in the shadows with 10 points and a 56% chance of tying for 4th place or better, and they will have to win by big margins.

Now, let me show you how I got these numbers — and why accounting for team form changes everything.

The Story: When Simple Math Isn't Enough

Last year, with just 14 matches remaining in the IPL season, cricket Twitter was ablaze with RCB playoff predictions. A widely-circulated analysis claimed RCB had a whopping 98% chance of making the top 4. My manager, a die-hard cricket fan, was ecstatic and shared the paper with the entire team.

Curious about the methodology, I decided to verify these claims myself. The approach was straightforward: simulate all possible outcomes of the remaining 14 matches. Each match has 2 possible outcomes (either team can win), so with 14 matches, we have 2^14 = 16,384 total scenarios. Running through all combinations on my laptop took just a few seconds, and the results were clear.
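The brute-force idea can be sketched in a few lines. This is a minimal illustration on a made-up four-team league, not the original analysis: the team names, points, fixtures, and the `top_n_probability` helper are all hypothetical, and ties on points are resolved in the team's favour instead of by NRR.

```python
from itertools import product

# Hypothetical mini-league (not real IPL data): points so far and remaining fixtures.
points = {"A": 8, "B": 6, "C": 6, "D": 4}
fixtures = [("A", "B"), ("C", "D"), ("B", "C")]

def top_n_probability(points, fixtures, team, n=2):
    """Enumerate every win/loss outcome and return the share where `team` is in the top n."""
    hits = 0
    total = 0
    for outcome in product((0, 1), repeat=len(fixtures)):
        table = dict(points)
        for (first, second), result in zip(fixtures, outcome):
            winner = first if result == 0 else second
            table[winner] += 2  # 2 points for a win
        # Teams strictly ahead on points; ties count in the team's favour (assumption).
        ahead = sum(1 for p in table.values() if p > table[team])
        if ahead < n:
            hits += 1
        total += 1
    return hits / total

print(f"Scenarios enumerated: {2 ** len(fixtures)}")
print(f"P(B finishes in the top 2) = {top_n_probability(points, fixtures, 'B')}")
```

With 14 remaining matches, the same loop would visit 2^14 = 16,384 scenarios, which is why it finishes in seconds.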

Fast forward to IPL 2026.

With 28 matches still to be played, the same brute-force approach would require evaluating 2^28 = 268,435,456 scenarios — over 268 million combinations! Even on a modern machine, this would take considerable time and memory. More importantly, it's computationally wasteful.

Enter Monte Carlo simulation — a smarter way to estimate probabilities without checking every single possibility.

What is Monte Carlo Simulation?

Monte Carlo simulation is a statistical technique that uses random sampling to estimate outcomes when the solution space is too large to explore exhaustively.

In our IPL context:

Instead of simulating all 268 million scenarios, we randomly sample (say) 1 million scenarios
For each scenario, we randomly decide the winner of each of the 28 remaining matches
After all simulations, we calculate: "In how many scenarios did RCB finish in the top 4?"
If RCB made top 4 in 475,000 out of 1,000,000 simulations, their probability is 47.5%

The beauty: With enough simulations (typically 1 million), the results converge to the "true" probability — but we only need to run a tiny fraction of all possible scenarios.
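The sampling steps above can be sketched as follows. This is an illustrative toy, not the notebook's code: the four-team league, the `simulate_top_n` helper, the seed, and the tie handling (ties count in the team's favour) are all assumptions made for the example.

```python
import random

# Hypothetical mini-league (not real IPL data).
points = {"A": 8, "B": 6, "C": 6, "D": 4}
fixtures = [("A", "B"), ("C", "D"), ("B", "C")]

def simulate_top_n(points, fixtures, team, n=2, n_sims=20_000, seed=42):
    """Estimate P(team finishes in the top n) by sampling random match outcomes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_sims):
        table = dict(points)
        for first, second in fixtures:
            # Coin flip: each side wins with probability 0.5
            winner = first if rng.random() < 0.5 else second
            table[winner] += 2
        ahead = sum(1 for p in table.values() if p > table[team])
        if ahead < n:
            hits += 1
    return hits / n_sims

estimate = simulate_top_n(points, fixtures, "B")
print(f"Estimated P(B finishes in the top 2) = {estimate:.3f}")
```

For this toy league the exact answer from full enumeration is 0.75, so the estimate should land close to that value.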

Convergence Analysis: How Many Simulations Do We Need?

One key question with Monte Carlo methods: How many simulations are enough?

To answer this, I ran the playoff predictions with different simulation counts: 100, 1,000, 100K, 500K, 1M, and 2M simulations. Here's what the convergence looks like (the right cut-off is debatable):

For scale, 2M simulations is a little under 1% (about 1/134th) of the roughly 268M possible outcomes.

[Plot: each team's probability of finishing in the top 4 or tying for 4th place, against the number of simulations]

The plot shows that as the number of simulations increases (X-axis), the probabilities stabilize. The table below shows the difference between 1M and 2M simulations.

Verdict: By 100,000 simulations, probabilities have largely stabilized. Running 1M or 2M simulations provides marginal improvements in precision. For our analysis, 1 million simulations offers an excellent balance between accuracy and computation time (~5-10 seconds).
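A convergence check of this kind can be sketched on a problem small enough to have a known exact answer. The example below is a hypothetical stand-in, not the IPL model: it estimates the chance that a team wins at least 2 of 3 coin-flip matches (exactly 0.5) and prints how the error shrinks as the simulation count grows. The function name and seed are made up for illustration.

```python
import random

def estimate_qualify_probability(n_sims, seed=0):
    """Estimate P(a team wins at least 2 of its 3 remaining coin-flip matches)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        wins = sum(rng.random() < 0.5 for _ in range(3))
        if wins >= 2:
            hits += 1
    return hits / n_sims

EXACT = 0.5  # 4 of the 8 equally likely outcomes contain at least 2 wins

for n in (100, 1_000, 10_000, 100_000):
    est = estimate_qualify_probability(n)
    print(f"{n:>7} sims: estimate={est:.4f}  abs error={abs(est - EXACT):.4f}")
```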

Understanding Uncertainty: Standard Error

Every Monte Carlo simulation has inherent uncertainty. How confident can we be in our 47.7% estimate for RCB?

The answer lies in standard error — a measure of precision for our probability estimates:

Standard Error (SE) = √[p(1-p)/n]

Where:

p = estimated probability (e.g., 0.8723 for RCB)
n = number of simulations

What This Means:

Key Insight: With 1 million simulations, an estimate of 87.23% (the example value above) has a standard error of about 0.03 percentage points, giving a 95% confidence interval of roughly [87.16%, 87.30%]. Very precise!

To halve the standard error, you need 4x more simulations (due to the √n relationship). The law of diminishing returns kicks in quickly.
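The formula is easy to check numerically. The sketch below uses the example values from the text (p = 0.8723, n = 1,000,000); the helper names are made up for illustration, and the interval uses the usual normal approximation with z = 1.96.

```python
import math

def standard_error(p, n):
    """Standard error of a probability estimate p from n independent simulations."""
    return math.sqrt(p * (1 - p) / n)

def confidence_interval(p, n, z=1.96):
    """Normal-approximation interval; z=1.96 corresponds to roughly 95% confidence."""
    se = standard_error(p, n)
    return p - z * se, p + z * se

p, n = 0.8723, 1_000_000
low, high = confidence_interval(p, n)
print(f"SE = {standard_error(p, n):.6f}")
print(f"95% CI = [{low:.4%}, {high:.4%}]")
# Quadrupling n halves the standard error
print(f"SE with 4x the simulations = {standard_error(p, 4 * n):.6f}")
```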

The Problem with 50-50 Assumptions

So far, we've assumed every match is a coin flip, with each team having exactly a 50% chance to win. While this is a conservative baseline, it ignores a crucial factor: current team form.

Consider this:

SRH has won their last 5 matches (100% recent form)
LSG has lost their last 5 matches (0% recent form)

Should we really treat an SRH vs LSG match as 50-50? Obviously not.

Form-Based Simulation: A More Realistic Approach

Instead of coin flips, we can use recent match results to estimate win probabilities:

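Step 1: Calculate each team's win percentage over their last N matches (a tunable parameter; we use N=4). The strings below list the last five results, oldest first; with N=4 only the four most recent count, which is how W-W-L-W-L works out to 50%.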

SRH: W-W-W-W-W → 100% form
LSG: L-L-L-L-L → 0% form
RCB: W-W-L-W-L → 50% form

Step 2: When two teams face off, normalize their forms to get win probabilities

SRH vs LSG:
  - SRH gets: 100/(100+0) = 100% win chance
  - LSG gets: 0/(100+0) = 0% win chance

RCB vs PBKS (75% form):
  - RCB gets: 50/(50+75) = 40% win chance
  - PBKS gets: 75/(50+75) = 60% win chance

Step 3: Run Monte Carlo simulation using these weighted probabilities instead of 50-50
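The three steps can be sketched like this. It is a minimal illustration, not the notebook's code: the helper names are hypothetical, result strings are assumed to run oldest to newest, and the coin-flip fallback when both teams have 0% form is an assumption added to avoid dividing by zero.

```python
import random

def form(results, n=4):
    """Win fraction over the most recent n results, e.g. 'WWLWL' (oldest first)."""
    recent = results[-n:]
    return recent.count("W") / len(recent)

def win_probability(form_a, form_b):
    """Normalize two form values into team A's win probability."""
    total = form_a + form_b
    if total == 0:
        return 0.5  # assumption: two winless teams are treated as a coin flip
    return form_a / total

def simulate_match(team_a, team_b, forms, rng):
    """Pick a winner using form-weighted odds instead of 50-50."""
    p_a = win_probability(forms[team_a], forms[team_b])
    return team_a if rng.random() < p_a else team_b

# Form values from the examples in the text (PBKS given directly as 75%)
forms = {"SRH": form("WWWWW"), "LSG": form("LLLLL"), "RCB": form("WWLWL"), "PBKS": 0.75}

print(f"RCB beats PBKS with probability {win_probability(forms['RCB'], forms['PBKS']):.2f}")
print(f"SRH vs LSG winner: {simulate_match('SRH', 'LSG', forms, random.Random(7))}")
```

One design consequence worth noting: pure normalization gives a team on 0% form no chance at all against any opponent with a win, so a real model might want to smooth the form values.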

Results: 50-50 vs Form-Based Predictions

Here's how playoff probabilities change when we account for current team form:

Major Takeaways:

RR is the big winner (+10%): their own form is only 50%, but their remaining opponents average just 25% form, so RR stand to gain a lot if current form holds.
SRH see a small improvement (+5%): they have 100% form, but they face stronger opponents than RR, so they will have to fight hard to keep their spot.
PBKS remain strong: already leading the table with good form, they maintain ~98% playoff odds.

Final Verdict: The Playoff Race as of May 1, 2026

Form-Based Playoff Probabilities (1M Simulations):

The Storylines:
✅ PBKS: The Frontrunners. With 13 points, strong form, and ~98% playoff odds, they're virtual locks for the playoffs. Only a catastrophic collapse keeps them out.
🔥 SRH: Form is Everything. On paper, they are tied with RCB and RR at 12 points, but their 5-match winning streak makes them the favorites among this trio, even though they face some tough opponents.
⚠️ RCB: The Title Says It All. Despite being tied for 2nd place on points, RCB's inconsistent form (an alternating W-L pattern) drops them to 4th on our list.
📉 RR: Form Slump Hurts. Like RCB, they have 12 points but only 50% form over their last 4 matches. Their playoff hopes are fading fast, though they may be playing some out-of-form opponents, who may still fight back for some glory points.
🎲 GT: The Dark Horse. 2 points behind at 10, but 60% form keeps them in the race with a 1-in-5 shot.

❌ The Rest: Mathematically Alive, Practically Done. CSK might show some probability, but it looks like a tall mountain to climb for them. DC and KKR have 3% odds. MI and LSG are essentially eliminated (<0.01%).




Conclusion: Data-Driven Cricket Analysis

What started as a simple question, "Will RCB make the playoffs?", turned into a fascinating exploration of:
Monte Carlo simulation as a practical tool for complex probability problems
Convergence analysis to understand how many simulations are "enough"
Standard error to quantify uncertainty in our estimates
Form-based predictions that go beyond coin-flip assumptions

Try It Yourself

All code and data used in this analysis are available in this Jupyter notebook:
GitHub Repository: IPL Playoff Analysis
Interactive Notebook: Open in Google Colab / Binder
Feel free to:
Adjust the form window (last 3 vs last 4 vs last 5 matches)
Run your own simulations with different team points
Explore scenario analysis: "What if RCB wins their next 2 matches?"
Cricket + Data Science = ❤️
Last Updated: May 1, 2026 | Simulations: 1,000,000 | Method: Monte Carlo with Form-Based Win Probabilities

Thursday, April 30, 2026

Why encoding='utf-8' is More Than Just a Bug Fix

We’ve all been there. You’re building a RAG pipeline or trying to load a fresh dataset into a notebook. You run f.read() and suddenly—Boom.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2...

For years, as a Mechanical Engineer by training, my response was purely “patch-work.” I’d Google the error, find a StackOverflow thread telling me to add encoding='utf-8' to my code, and move on once the red text disappeared. It worked like magic.

But recently, while digging into speech recognition and pattern books, I realized I was treating a masterpiece of optimization like a simple bug fix. I never formally studied encodings, but once you look at the “hardware” logic behind them, it’s a beautiful story of solving a global data traffic jam.

The 7-Bit Prison: A Relic of the 1960s

Back in the 1960s, computing was a “walled garden.” Everything was built around ASCII — a system designed primarily for English teletype machines. To save every precious bit of expensive memory, ASCII only used 7 bits of a byte, keeping that 8th bit permanently locked at zero.

It was lean, sure, but it was incredibly narrow-minded. If you wanted to write a Spanish phrase like Señor or use the Devanagari script (the beautiful characters behind Hindi and Sanskrit), ASCII simply didn’t have the “slots” to hold them. You were trapped in a 128-character world.

The “Global ID” Spreadsheet

To break out of that prison, we created Unicode.

The best way to visualize Unicode isn’t as a file format, but as a massive, abstract spreadsheet. In this spreadsheet, every character in human history — from ancient Sumerian cuneiform to that “Grinning Face” emoji on your phone — is assigned a unique ID called a Code Point.

For example, the letter a is assigned the ID U+0061. It’s a beautiful, inclusive library that now supports over 150,000 characters and 168 different scripts. But as an engineer, the first thing I thought was: “Wait. If we have over a million potential IDs, how do we save them without making every simple text file four times larger?”

UTF-8: The “Chameleon” of Engineering

This is where the engineering gets truly clever.

If we used a fixed-width system like UTF-32 (4 bytes for every single character), a simple English sentence would be 75% “wasted” zeros. It would be like using a semi-truck to deliver a single envelope.

UTF-8 is the solution — a variable-length masterpiece. It’s essentially a chameleon that changes its size based on the character it’s carrying:

For the “Standard” stuff: It uses just 1 byte for the first 128 characters (code points 0 to 127), making it perfectly backward-compatible with those old ASCII systems.
For the “Global” stuff: Accented Latin letters (like the ñ in Señor) and Arabic take 2 bytes, while scripts like Devanagari (Hindi) and Chinese take 3 bytes.
For the “Fun” stuff: Emojis and rare symbols take up 4 bytes.

But the real “Mechanical Engineer” in me loves the Self-Synchronization feature. In some older multi-byte encodings, one corrupted byte could turn everything after it into gibberish. UTF-8 is built so that if a byte is corrupted, a decoder can skip ahead and find the start of the next character by looking at the leading bits of each byte. It’s resilient, it’s efficient, and it’s why your RAG pipelines actually work when you feed them multilingual data.
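Self-synchronization can be seen directly in the bit patterns: every continuation byte in UTF-8 has the form 10xxxxxx, so any byte that does not match that pattern starts a new character. The sketch below is only an illustration; `next_char_start` is a made-up helper, not a standard library function.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after `pos` that starts a character."""
    # Continuation bytes look like 10xxxxxx, so skip over them
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

data = "aनb".encode("utf-8")  # 1 byte + 3 bytes + 1 byte
print(list(data))
print(next_char_start(data, 2))  # index 2 is in the middle of the Devanagari character

# Corrupt one byte and decode leniently: only the damaged character is lost
damaged = bytearray("नमस्ते".encode("utf-8"))
damaged[1] = 0xFF  # 0xFF never appears in valid UTF-8
print(bytes(damaged).decode("utf-8", errors="replace"))
```

Landing in the middle of a character costs at most a few skipped bytes, and the text after the damaged character still decodes cleanly.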

The Proof in the Bytes: A Python Experiment

I ran a quick script to compare how these encodings “weigh” the same information.

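def check_encoding_impact(text):
    """Print how many bytes `text` takes in several encodings."""
    print(f"\nTarget Text: {text}")
    print("-" * 30)
    for enc in ['ascii', 'utf-8', 'utf-16', 'utf-32']:
        try:
            # Python's 'utf-16' and 'utf-32' codecs prepend a byte-order mark (2 and 4 bytes)
            encoded_data = text.encode(enc)
            print(f"{enc.upper():<8} | {len(encoded_data)} bytes")
        except UnicodeEncodeError:
            # Raised when the text contains characters the encoding cannot represent
            print(f"{enc.upper():<8} | not supported")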

# Test 1: Standard English
check_encoding_impact("Hello") 

# Test 2: Multilingual (Hindi)
check_encoding_impact("नमस्ते") 

# Test 3: The Emoji Tax
check_encoding_impact("RAG 🤖")

OUTPUT --

Target Text: Hello
------------------------------
ASCII    | 5 bytes
UTF-8    | 5 bytes
UTF-16   | 12 bytes
UTF-32   | 24 bytes

Target Text: नमस्ते
------------------------------
ASCII    | not supported
UTF-8    | 18 bytes
UTF-16   | 14 bytes
UTF-32   | 28 bytes

Target Text: RAG 🤖
------------------------------
ASCII    | not supported
UTF-8    | 8 bytes
UTF-16   | 14 bytes
UTF-32   | 24 bytes

The data shows the engineering tradeoffs in action. UTF-8 is leanest for English (identical to ASCII at 5 bytes), but it carries a “signaling tax” for non-Latin scripts like Hindi: each Devanagari character takes 3 bytes, for 18 bytes in total. UTF-16 is more compact for the Devanagari text because it stores each of these characters in a single 2-byte unit (12 bytes, plus a 2-byte byte-order mark, for 14). The UTF-16 and UTF-32 figures include the byte-order mark that Python prepends (2 and 4 bytes), which is why “Hello” costs 12 and 24 bytes rather than 10 and 20.

import os

# One combined string with three lines of text
# English + Spanish + Chinese
multilingual_content = (
    "Hello, How are yu ? Have a great day ahead !\n"  # English
    "¡Hola! ¿Cómo estás? ¡Que tengas un gran día!\n"   # Spanish
    "你好，你好嗎？祝你有美好的一天！"                 # Chinese
)

def save_and_measure_global(text):
    formats = ['utf-8', 'utf-16', 'utf-32']
    
    print("--- Global Multilingual Test ---")
    print(f"Total Characters (including newlines): {len(text)}")
    print(f"{'Encoding':<10} | {'File Size (Bytes)':<18}")
    print("-" * 35)

    for fmt in formats:
        filename = f"global_test_{fmt}.txt"
        
        # Writing the combined string to disk
        with open(filename, 'w', encoding=fmt) as f:
            f.write(text)
        
        # Measuring the physical footprint
        file_size = os.path.getsize(filename)
        print(f"{fmt.upper():<10} | {file_size:<18}")
        
        # Clean up
        os.remove(filename)

if __name__ == "__main__":
    save_and_measure_global(multilingual_content)

--- Global Multilingual Test ---
Total Characters (including newlines): 106
Encoding   | File Size (Bytes) 
-----------------------------------
UTF-8      | 146               
UTF-16     | 218               
UTF-32     | 436

This experiment moves the discussion from theory to hardware, showing that encoding isn’t just about avoiding “broken” characters; it’s also a storage decision. For this mixed English, Spanish, and Chinese file, UTF-8 is the smallest at 146 bytes, about a third smaller than UTF-16 (218 bytes). The earlier Hindi-only example flipped the result, with UTF-16 about 22% smaller than UTF-8 (14 vs. 18 bytes). The file sizes here include the byte-order mark for UTF-16 and UTF-32, and they are consistent with a run on Windows, where text mode writes each \n as \r\n; on Linux or macOS the numbers would be slightly lower. Modern encodings are a game of strategic tradeoffs: depending on the linguistic footprint of your global RAG pipeline or localized application, the choice of encoding changed storage by roughly 20 to 35% in these examples.

Reflection: The Hidden Infrastructure

As engineers, we often obsess over the “sexy” stuff — LLM parameters and vector databases. But all of it rests on these tiny, byte-level protocols.

There is no such thing as a “plain text file” without an encoding. Every time you specify encoding='utf-8', you aren't just fixing an error; you are participating in a global standard that allows an engineer in Bangalore, a developer in Seattle, and a machine from 1965 to understand each other.
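A small sketch makes the point concrete: the bytes on disk are the same, and only the encoding chosen at read time decides what text comes out. The example string is the Señor from earlier; the variable names are just for illustration.

```python
raw = "Señor".encode("utf-8")  # the bytes that would end up on disk

print(raw)                     # the ñ is stored as two bytes
print(raw.decode("utf-8"))     # decoded with the encoding it was written in
print(raw.decode("latin-1"))   # same bytes, wrong encoding: garbled text

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # The familiar error: bytes above 127 have no meaning in ASCII
    print(f"ascii fails: {exc.reason}")
```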

It’s a reminder that sometimes, the most elegant engineering is the stuff we never see — until it fails.

Sunday, April 19, 2026

This is for those who Feel Stuck

If you are currently pursuing a Master’s, a CFA, or something similar—that’s great. You already have a short-to-mid-term goal to work toward, and I wish you the best of luck!

This post is for those who don’t know what to aim for, or who feel a bit stuck and unable to move toward a goal.

Five or six years ago, I felt what we now call a "mid-life crisis." I assumed it was a one-time event that would just pass. Oh boy, I couldn’t have been more wrong. It keeps coming back—and it is perfectly fine if you feel stuck momentarily.

I would say: Go back to basics. Go back to your roots.

Often in our day-to-day work, we lose touch with the fundamentals we once studied. It is a great idea to revisit your old reference books or walk through your old code and repositories. I am sure there are topics you struggled with back then—maybe chapters you kept as "optional" just to pass the course. Go back to those chapters now. Prepare as if you were studying for college exams, or look into new developments in your core area of expertise. Reinforce your foundation with new technology.

This may not directly unlock your "next step," but it re-strengthens your foundation. When you eventually figure out a goal in a few weeks, that reinforced foundation will ensure you achieve it much faster. Being an engineer, I’ll use a personal example: Integrals. I used to solve complex integrals in my sleep; now, when I see one, I find myself wondering how to even approach it. Picking that back up is a powerful way to reset.

Think of Creed 2 (or the Rocky movies): when Adonis is lost, Rocky takes him to the desert, away from the noise, the ego, and the comfort. He makes him go back to the basics to master what he once knew. It’s the same in Rocky 4. You cut off the distractions and return to the foundation. This is where you get rid of the doubt and the noise.

The Second Path: Learn something entirely new.

This is the opposite approach—inspired by the likes of Alex Hormozi. Learn something totally irrelevant to what you have learned so far to build new capabilities. This requires you to be ready to "suck" at something. You have to be willing to be bad at something for a long time before you can be great at it.

If you are a Doctor, maybe learn about AI. If you are an Engineer, learn about sales or philosophy.

A word of caution: When I say "learn something new," I don't mean a two-hour crash course or a quick YouTube tutorial. Learn from the base. If you are learning finance, pick up academic reference books or join a rigorous 6–9 month course. This deep dive might even land you on a new path you never considered.

In this fast-paced world, these ideas are "slow grinders." They require time, effort, and patience. However, both approaches help fire up your "learning neurons," which improves your thinking and helps you find clarity in the moments you feel most stuck.

This is for the people who want to play the long game. When you’re in a crisis, a pause—spent either returning to basics or learning something new without the pressure of a specific goal—will go a long way.

Trust me, it always pays off in the long run.

Sunday, March 8, 2026

𝗙𝗼𝗼𝗱 𝗳𝗼𝗿 𝗧𝗵𝗼𝘂𝗴𝗵𝘁 - 𝗦𝗮𝗻𝗱𝘆’𝘀 𝘁𝗮𝗸𝗲 𝗼𝗻 𝗟𝗟𝗠 - 𝗣𝗮𝗿𝘁 𝟮

If your near-and-dear one was having a health issue, who would you go to?

• 𝘛𝘩𝘦 𝘣𝘦𝘴𝘵 𝘈𝘨𝘦𝘯𝘵𝘪𝘤 𝘈𝘐 𝘋𝘰𝘤𝘵𝘰𝘳 𝘰𝘶𝘵 𝘵𝘩𝘦𝘳𝘦

• 𝘈 "𝘷𝘪𝘣𝘦-𝘤𝘰𝘥𝘦𝘥" 𝘋𝘰𝘤𝘵𝘰𝘳

• 𝘈𝘯 𝘦𝘹𝘱𝘦𝘳𝘵, 𝘢𝘤𝘵𝘶𝘢𝘭 𝘥𝘰𝘤𝘵𝘰𝘳

• 𝘈𝘯 𝘢𝘤𝘵𝘶𝘢𝘭 𝘥𝘰𝘤𝘵𝘰𝘳 𝘸𝘪𝘵𝘩 𝘈𝘐 𝘵𝘰𝘰𝘭𝘴 𝘢𝘵 𝘵𝘩𝘦𝘪𝘳 𝘥𝘪𝘴𝘱𝘰𝘴𝘢𝘭

One could say this is an extreme case and perhaps not worth a comparison, but I want to drop this here as food for thought.

I work in the AI/ML domain and I see its potential. I am all for change and adoption, but I am not yet fully bought into a 100% replacement.

Check out the article I wrote two years ago on LLMs: Sandy’s take on LLM and RAG so Far

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗗𝗲𝗯𝘁

I am somewhat starting to like the term "Technical Debt." Let me use myself and some of my colleagues as examples.

I have been using GitHub Copilot for almost six months now. In the early days, I would prompt it, refine it, and let it create entire repositories and solutions for me—my day-to-day tasks and everything else. Once in a while, things would go wrong, and it would take me days to fix them. Also, when I started looking closer, I realized there could be some unnecessary code blocks mixed in with some really smart ones.

After a learning curve, I now go function-by-function or block-by-block, and I have my own way of testing the accuracy of the outputs. The majority of my tasks involve processing large chunks of data and making inferences from them—sometimes processing and passing them to domain experts or top management for decision-making. In this case, I have to be triply sure of what I deliver, so I go step-by-step.

For sure, a week's worth of tasks can now be done in a day or two, and I can generate code that is much more scalable and reusable. The point is: I wouldn't let it run on "auto-pilot" for my entire task just yet.

𝗩𝗶𝗯𝗲 𝗖𝗼𝗱𝗶𝗻𝗴

"Vibe coding"—well, for sure, some have successfully done it. Having gone through it myself, I would classify those successes as 0.01% or even less. If you think about it, no matter the field or the task, you will always find some outliers—those who defy the norm. As for the majority (including myself), we either aren't sure what we are doing or need more practice with the tools.

I like to use the example of Excel a lot. Corporate employees know how to use Excel, but how much one achieves with it depends on their skills and the effort they took to master it. I remember in my Quantitative Finance course, I was doing heavy Python coding for some bond pricing, and the instructor—an expert—did it in Excel right then and there with us. It's the same with my cousin; in five minutes or so, he made an entire loan repayment and amortization sheet for me in Excel.

LLM tools and agents are getting much smarter and faster, but without knowing the basics of the task at hand, things might spiral out of control, and that "Technical Debt" would keep on growing.

𝗜𝘁 𝗶𝘀 𝗔𝗹𝗹 𝗔𝗯𝗼𝘂𝘁 𝗡𝗮𝗿𝗿𝗮𝘁𝗶𝘃𝗲𝘀

Apple, now Anthropic, even recent politics and whatnot: throughout history, it has always been about the narratives one sets and how fast people catch on to them. Yes, you then need a product to support the narrative.

This reminds me of Freedom 251 (India). It was advertised as a smartphone for ₹251. It looked like a scam, but the narrative it set got tons of bookings based on that story alone.

𝗧𝗵𝗲 𝗘𝗰𝗼𝗻𝗼𝗺𝗶𝗰 𝗟𝗼𝗼𝗽

I read this somewhere—imagine AI and robots can do everything and take over most jobs. If people don't have jobs, they have no income to buy stuff—both essentials and non-essentials. If that happens, who will these AI companies sell to? Who will buy the robots?

It is said that an equilibrium will be reached, but one cannot expect everything to go "all-in" in an instant. The world runs on consumerism.

Lastly, in one of his interviews, I heard Jamie Dimon, the CEO of JPMorgan Chase, say (as best I recollect):

We have autonomous driving—does that mean you take 2 million drivers out of work and the next best job they have pays only $25,000 a year? No, you can do it gradually, or have the government pitch in to say, "No, you can't do that," or "Let us do it sensibly."

Anyway, don’t get me wrong—I am all for AI, the change, and the new ways of working, as well as the new skills and job openings that will be brought about by it.

Thinking that AI can do it all? I am still not sold on it.