Introduction: The "Hello World" of Phase 2
Welcome back to the practicum diary.
If you’ve been following along (or if you’re just joining this chaotic journey now), you know the mission: We are trying to teach an AI to understand "Why" instead of just "What."
We are building a Hybrid LLM—a Frankenstein monster that stitches together the linguistic fluency of a Transformer (BERT) with the logical rigor of a Structural Causal Model (SCM). The goal? To stop AI from thinking that yellow fingers cause lung cancer just because they often appear together. We want it to understand that smoking causes both.
🧠 The Theory: Why Correlation ≠ Causation
Standard AI models are correlation machines. They see patterns. Causal AI tries to see mechanisms.
The dotted red line represents the spurious correlation that a standard model might learn. Our goal is to teach the model the solid arrows.
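To make that concrete, here is a tiny sketch (using networkx purely for illustration, not our project code) of the causal graph for the smoking example: smoking causes both yellow fingers and lung cancer, and there is deliberately no arrow from the fingers to the cancer.

```python
# Minimal illustration (not our project code): the smoking example as a causal DAG.
import networkx as nx

scm_graph = nx.DiGraph()
scm_graph.add_edge("smoking", "yellow_fingers")   # solid arrow: real mechanism
scm_graph.add_edge("smoking", "lung_cancer")      # solid arrow: real mechanism
# Note: there is deliberately NO edge yellow_fingers -> lung_cancer.
# That missing edge is the spurious correlation a pure pattern-matcher learns anyway.

print(list(scm_graph.edges()))
# [('smoking', 'yellow_fingers'), ('smoking', 'lung_cancer')]
print(nx.has_path(scm_graph, "yellow_fingers", "lung_cancer"))  # False: no causal path
```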
In the last few blogs, we lived in the comfortable world of Theory. We wrote proposals, we read Judea Pearl’s The Book of Why, and we designed beautiful diagrams of how our system should work.
Well, the honeymoon phase is over.
Over the last two weeks, we moved from Phase 1 (Foundation) to Phase 2 (Implementation). We stopped reading papers and started writing Python code. We loaded datasets, fired up GPUs, and ran our first experiments.
And guess what? It broke.
But that’s where the real story begins. This blog is about the gritty reality of turning a research proposal into running code.
Chapter 1: The Dataset Deep Dive (Balanced COPA)
Before we could build our "Causal Brain," we needed a test to prove it works. You can’t just ask an AI, "Are you causal now?" You need a benchmark.
We chose COPA (Choice of Plausible Alternatives). But not just any COPA. We are using Balanced COPA.
The Technical Stuff
Standard COPA is a dataset where the model is given a premise and two alternatives, and it must pick the more plausible cause or effect.
Premise: The man turned on the tap.
Question: What happened as a result?
Choice 1: Water flowed out. (Correct)
Choice 2: The sink broke. (Incorrect)
The problem with standard COPA is that LLMs are lazy. They cheat. They often pick the answer that simply "sounds" more related, leaning on superficial word-level cues (correlations) rather than doing any causal reasoning.
Balanced COPA fixes this by introducing Mirrored Pairs. For every question, there is a "mirror" version with a modified premise under which the previously incorrect alternative becomes the correct one, which neutralizes simple keyword matching.
The Practical Experience
Loading this wasn't just import dataset. We had to inspect the schema carefully.
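If you want to poke at the data yourself, a minimal load-and-inspect sketch looks roughly like this. The Hugging Face dataset ID and the exact field names below are assumptions on my part, so check the hub page for the version you actually use.

```python
# Rough sketch: load Balanced COPA and peek at the schema.
# The dataset ID and field names are assumptions; verify against the hub.
from datasets import load_dataset

copa = load_dataset("pkavumba/balanced-copa")
print(copa)                      # splits and sizes
example = copa["train"][0]
print(example.keys())            # e.g. premise, choice1, choice2, question, label, ...
print(example)
```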
We spent a good chunk of time just writing the preprocessing function. In BERT, you can't just feed in three sentences. You have to format them like this:
[CLS] Premise [SEP] Choice 1 [SEP]
And then run a second pass for:
[CLS] Premise [SEP] Choice 2 [SEP]
This "siamese network" approach forces the model to score both options independently.
Chapter 2: The Baseline (Setting the Bar)
Every good scientist needs a control group. Before we could inject our fancy Causal Graph, we needed to know how a "dumb" model performs.
We set up a BERT-base-uncased model. This is our "Correlation Machine." It doesn't know causality; it only knows statistics.
The Hypothesis vs. The Reality
We expected BERT to land around 60-70% accuracy. The reality? Before we even got a number, getting the training loop to run was a lesson in PyTorch patience (a sketch of the loop follows the list below).
- Device checks: ensuring the tensors were actually on the GPU.
- Tokenization: We had to ensure our max length wasn't chopping off the important parts of the sentences.
- The "Story" of the Loss Curve: Watching the training loss go down is the most satisfying thing in AI. It started high, dipped, and then... plateaued.
We established our baseline. This number is now the "Floor." Our Hybrid model must beat this number. If it doesn't, we have failed. (No pressure).
Chapter 3: The Hybrid Experiment (and the Failure)
Here is the meat of the update. This is where things got interesting—and frustrating.
The Architecture
We designed a system with a tunable parameter: Alpha ($\alpha$).
🎛️ Interactive Demo: The Alpha Parameter
Adjust the slider to see how the Hybrid model balances between BERT (Correlation) and SCM (Causality).
- $\alpha = 0.0$: Pure BERT (100% Correlation).
- $\alpha = 1.0$: Pure Causal SCM (100% Logic).
- $\alpha = 0.5$: The Hybrid (50% Stats + 50% Logic); the mix itself is sketched just below.
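Under the hood, the mix is just a weighted sum of the two scores, something like this (the names are illustrative, and the scores are assumed to be on comparable scales):

```python
# Illustrative alpha mix: names and normalisation are assumptions, not our exact code.
import torch

def hybrid_scores(bert_scores: torch.Tensor, scm_scores: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend correlation (BERT) and causal (SCM) scores for the two choices."""
    # alpha = 0.0 -> pure BERT, alpha = 1.0 -> pure SCM, alpha = 0.5 -> 50/50.
    return (1.0 - alpha) * bert_scores + alpha * scm_scores

bert_scores = torch.tensor([0.7, 0.3])   # BERT prefers choice 1
scm_scores = torch.tensor([0.2, 0.8])    # the SCM prefers choice 2
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_scores(bert_scores, scm_scores, alpha).argmax().item())
```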
We theorized that as we increased Alpha from 0 toward 0.5, accuracy would go up. The SCM should act as a "Logic Police," catching BERT when it makes a stupid mistake.
The "48% Anomaly"
We ran the experiment. We waited. We checked the logs. And we saw something strange.
For Alpha 0.0, 0.2, 0.4, 0.6, 0.8... the accuracy was identical. It was stuck at roughly 48%. It was flat. The line wasn't moving.
Only when we hit Alpha 1.0 (Pure Causal) did the number change at all, jumping to about 50%.
The Diagnosis
We sat there staring at the screen. Why is the Causal Module doing nothing? Then it hit us: The Causal Graph was empty.
Or, more accurately, our Extraction Logic was failing. We were trying to build the causal graph automatically using rule-based extraction (looking for words like "because", "so", "therefore"). But COPA sentences are often subtle.
Because our extractor missed the connections, the SCM graph had no edges. A graph with no edges provides zero information, so the hybrid score effectively collapsed to $(1 - \alpha) \cdot \text{BERT} + \alpha \cdot 0$. For any $\alpha < 1$, the ranking of the two choices was decided by BERT alone (hence the flat line), and at $\alpha = 1.0$ we were left with an empty SCM guessing at roughly chance level.
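To show how little it takes for this to fail, here is a toy version of the kind of keyword-based extractor described above. It is a simplified illustration, not our actual extraction module.

```python
# Toy illustration of keyword-based causal extraction (not our actual module).
CAUSAL_MARKERS = ["because", "so", "therefore"]

def extract_causal_edge(sentence: str):
    """Return a (cause, effect) pair if an explicit marker is found, else None."""
    lowered = sentence.lower()
    for marker in CAUSAL_MARKERS:
        if f" {marker} " in lowered:
            left, right = lowered.split(f" {marker} ", 1)
            if marker == "because":
                return (right.strip(" ."), left.strip(" ."))   # effect BECAUSE cause
            return (left.strip(" ."), right.strip(" ."))       # cause SO/THEREFORE effect
    return None  # no marker means no edge, which means an empty graph

print(extract_causal_edge("The ground was wet because it rained."))
# ('it rained', 'the ground was wet')
print(extract_causal_edge("The man turned on the tap."))
# None; most COPA premises look like this, hence our empty graph
```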
💡 The Lesson: Garbage In, Garbage Out
We were so focused on the integration (the Alpha parameter, the pipeline) that we neglected the input (the graph construction). A Hybrid Neuro-Symbolic system is only as strong as its Symbolic component.
Chapter 4: Side Quests (University Life)
While wrestling with empty graphs, the rest of the MSc didn't stop.
- The Ethics Assignment (Samsung Case Study): We analyzed a real-world case using Liffick’s Analysis Method. It reminded me that what we are building has real consequences. Accuracy isn't just a scoreboard; it's an ethical obligation.
- The Literature Review: We condensed 40+ papers into a coherent narrative. Status: Done, dusted, and submitted.
Chapter 5: The Road Ahead (Fixing the Graph)
So, where does that leave us? We are currently in Debug Mode. The priority for the next sprint is clear: Better Causal Extraction.
We cannot rely on simple "keyword matching". It's too brittle. The new plan involves:
- Manual/Semi-Automatic Construction: Hand-crafting graphs for a subset of data to prove the concept.
- Prompt-Based Extraction: Using GPT-4 or Llama 3 to extract causal triples (a rough sketch of the prompting idea follows this list).
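A first sketch of what the prompt-based route might look like is below. The prompt wording, the output format, and the parsing are all placeholders; we have not settled on any of them yet.

```python
# Sketch of prompt-based causal-triple extraction. The prompt, expected output
# format, and parsing below are placeholders, not a finalised design.
EXTRACTION_PROMPT = """Read the sentence and list any causal relations it implies.
Answer with one triple per line in the form: cause -> effect.
If there is no causal relation, answer: NONE.

Sentence: {sentence}
"""

def parse_triples(llm_output: str):
    """Turn 'cause -> effect' lines from the LLM into (cause, effect) tuples."""
    triples = []
    for line in llm_output.strip().splitlines():
        if "->" in line:
            cause, effect = line.split("->", 1)
            triples.append((cause.strip(), effect.strip()))
    return triples

# Whatever client we end up using (GPT-4, Llama 3, ...) would fill llm_output.
llm_output = "the man turned on the tap -> water flowed out"
print(parse_triples(llm_output))  # [('the man turned on the tap', 'water flowed out')]
```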
Conclusion
This month was a reality check. We went from "We are going to solve Causal AI!" to "Why is my graph empty?"
But that is exactly what a practicum is supposed to be. If it worked the first time, it wouldn't be research; it would just be development. We found the bottleneck. We have the baseline. We have the dataset. Now, we just need to bridge the gap.
Stay tuned for Blog 9, where I'll hopefully be sharing a chart where the line actually goes up.
Next milestone: Fix Causal Extraction logic. Run the Hybrid experiment again. Survive the Dublin rain. 🌧️
📝 Author's Note for the Reader (Beginner Friendly)
- LLM: Large Language Model (like ChatGPT or BERT). Good at talking, bad at logic.
- SCM: Structural Causal Model. A graph of nodes and arrows that maps out logic (A causes B). Good at logic, can't talk.
- Hybrid: Smashing them together so we get an AI that talks well and makes sense.
- Epoch: One full run through the training data.
- Alpha: A volume dial. Turn it left for "Pure Stats", turn it right for "Pure Logic".