Introduction: The "Hello World" of Phase 2
Welcome back to the practicum diary.
If you’ve been following along (or if you’re just joining this chaotic journey now), you know the mission: We are trying to teach an AI to understand "Why" instead of just "What."
We are building a Hybrid LLM—a Frankenstein monster that stitches together the linguistic fluency of a Transformer (BERT) with the logical rigor of a Structural Causal Model (SCM). The goal? To stop AI from thinking that yellow fingers cause lung cancer just because they often appear together. We want it to understand that smoking causes both.
🧠 The Theory: Why Correlation ≠ Causation
Standard AI models are correlation machines. They see patterns. Causal AI tries to see mechanisms.
The dotted red line represents the spurious correlation that a standard model might learn. Our goal is to teach the model the solid arrows.
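To make that concrete, here is a tiny sketch (using networkx purely for illustration, not our project code) of the causal graph for the smoking example: smoking causes both yellow fingers and lung cancer, and there is deliberately no arrow from the fingers to the cancer.

```python
# Minimal illustration (not our project code): the smoking example as a causal DAG.
import networkx as nx

scm_graph = nx.DiGraph()
scm_graph.add_edge("smoking", "yellow_fingers")   # solid arrow: real mechanism
scm_graph.add_edge("smoking", "lung_cancer")      # solid arrow: real mechanism
# Note: there is deliberately NO edge yellow_fingers -> lung_cancer.
# That missing edge is the spurious correlation a pure pattern-matcher learns anyway.

print(list(scm_graph.edges()))
# [('smoking', 'yellow_fingers'), ('smoking', 'lung_cancer')]
print(nx.has_path(scm_graph, "yellow_fingers", "lung_cancer"))  # False: no causal path
```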
In the last few blogs, we lived in the comfortable world of Theory. We wrote proposals, we read Judea Pearl’s The Book of Why, and we designed beautiful diagrams of how our system should work.
Well, the honeymoon phase is over.
Over the last two weeks, we moved from Phase 1 (Foundation) to Phase 2 (Implementation). We stopped reading papers and started writing Python code. We loaded datasets, fired up GPUs, and ran our first experiments.
And guess what? It broke.
But that’s where the real story begins. This blog is about the gritty reality of turning a research proposal into running code.
Chapter 1: The Dataset Deep Dive (Balanced COPA)
Before we could build our "Causal Brain," we needed a test to prove it works. You can’t just ask an AI, "Are you causal now?" You need a benchmark.
We chose COPA (Choice of Plausible Alternatives). But not just any COPA. We are using Balanced COPA.
The Technical Stuff
Standard COPA is a dataset where the model is given a premise and two alternatives, and it must pick the more plausible cause or effect.
Premise: The man turned on the tap.
Question: What happened as a result?
Choice 1: Water flowed out. (Correct)
Choice 2: The sink broke. (Incorrect)
The problem with standard COPA is that LLMs are lazy. They cheat. They often pick the answer that simply "sounds" more related, leaning on superficial word-level cues (correlations) rather than doing any causal reasoning.
Balanced COPA fixes this by introducing Mirrored Pairs. For every question, there is a "mirror" version with a modified premise under which the previously incorrect alternative becomes the correct one, which neutralizes simple keyword matching.
The Practical Experience
Loading this wasn't just import dataset. We had to inspect the schema carefully.
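If you want to poke at the data yourself, a minimal load-and-inspect sketch looks roughly like this. The Hugging Face dataset ID and the exact field names below are assumptions on my part, so check the hub page for the version you actually use.

```python
# Rough sketch: load Balanced COPA and peek at the schema.
# The dataset ID and field names are assumptions; verify against the hub.
from datasets import load_dataset

copa = load_dataset("pkavumba/balanced-copa")
print(copa)                      # splits and sizes
example = copa["train"][0]
print(example.keys())            # e.g. premise, choice1, choice2, question, label, ...
print(example)
```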
We spent a good chunk of time just writing the preprocessing function. In BERT, you can't just feed in three sentences. You have to format them like this:
[CLS] Premise [SEP] Choice 1 [SEP]
And then run a second pass for:
[CLS] Premise [SEP] Choice 2 [SEP]
This "siamese network" approach forces the model to score both options independently.
Chapter 2: The Baseline (Setting the Bar)
Every good scientist needs a control group. Before we could inject our fancy Causal Graph, we needed to know how a "dumb" model performs.
We set up a BERT-base-uncased model. This is our "Correlation Machine." It doesn't know causality; it only knows statistics.
The Hypothesis vs. The Reality
We expected BERT to land around 60-70% accuracy. The reality? Before we even got a number, getting the training loop to run was a lesson in PyTorch patience (a sketch of the loop follows the list below).
- Device checks: ensuring the tensors were actually on the GPU.
- Tokenization: We had to ensure our max length wasn't chopping off the important parts of the sentences.
- The "Story" of the Loss Curve: Watching the training loss go down is the most satisfying thing in AI. It started high, dipped, and then... plateaued.
We established our baseline. This number is now the "Floor." Our Hybrid model must beat this number. If it doesn't, we have failed. (No pressure).
Chapter 3: The Hybrid Experiment (and the Failure)
Here is the meat of the update. This is where things got interesting—and frustrating.
The Architecture
We designed a system with a tunable parameter: Alpha ($\alpha$).
🎛️ Interactive Demo: The Alpha Parameter
Adjust the slider to see how the Hybrid model balances between BERT (Correlation) and SCM (Causality).
- $\alpha = 0.0$: Pure BERT (100% Correlation).
- $\alpha = 1.0$: Pure Causal SCM (100% Logic).
- $\alpha = 0.5$: The Hybrid (50% Stats + 50% Logic); the mix itself is sketched just below.
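Under the hood, the mix is just a weighted sum of the two scores, something like this (the names are illustrative, and the scores are assumed to be on comparable scales):

```python
# Illustrative alpha mix: names and normalisation are assumptions, not our exact code.
import torch

def hybrid_scores(bert_scores: torch.Tensor, scm_scores: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend correlation (BERT) and causal (SCM) scores for the two choices."""
    # alpha = 0.0 -> pure BERT, alpha = 1.0 -> pure SCM, alpha = 0.5 -> 50/50.
    return (1.0 - alpha) * bert_scores + alpha * scm_scores

bert_scores = torch.tensor([0.7, 0.3])   # BERT prefers choice 1
scm_scores = torch.tensor([0.2, 0.8])    # the SCM prefers choice 2
for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_scores(bert_scores, scm_scores, alpha).argmax().item())
```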
We theorized that as we increased Alpha from 0 toward 0.5, accuracy would go up. The SCM should act as a "Logic Police," catching BERT when it makes a stupid mistake.
The "48% Anomaly"
We ran the experiment. We waited. We checked the logs. And we saw something strange.
For Alpha 0.0, 0.2, 0.4, 0.6, 0.8... the accuracy was identical. It was stuck at roughly 48%. It was flat. The line wasn't moving.
Only when we hit Alpha 1.0 (Pure Causal) did the number change at all, jumping to about 50%.
The Diagnosis
We sat there staring at the screen. Why is the Causal Module doing nothing? Then it hit us: The Causal Graph was empty.
Or, more accurately, our Extraction Logic was failing. We were trying to build the causal graph automatically using rule-based extraction (looking for words like "because", "so", "therefore"). But COPA sentences are often subtle.
Because our extractor missed the connections, the SCM graph had no edges. A graph with no edges provides zero information, so the hybrid score effectively collapsed to $(1 - \alpha) \cdot \text{BERT} + \alpha \cdot 0$. For any $\alpha < 1$, the ranking of the two choices was decided by BERT alone (hence the flat line), and at $\alpha = 1.0$ we were left with an empty SCM guessing at roughly chance level.
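To show how little it takes for this to fail, here is a toy version of the kind of keyword-based extractor described above. It is a simplified illustration, not our actual extraction module.

```python
# Toy illustration of keyword-based causal extraction (not our actual module).
CAUSAL_MARKERS = ["because", "so", "therefore"]

def extract_causal_edge(sentence: str):
    """Return a (cause, effect) pair if an explicit marker is found, else None."""
    lowered = sentence.lower()
    for marker in CAUSAL_MARKERS:
        if f" {marker} " in lowered:
            left, right = lowered.split(f" {marker} ", 1)
            if marker == "because":
                return (right.strip(" ."), left.strip(" ."))   # effect BECAUSE cause
            return (left.strip(" ."), right.strip(" ."))       # cause SO/THEREFORE effect
    return None  # no marker means no edge, which means an empty graph

print(extract_causal_edge("The ground was wet because it rained."))
# ('it rained', 'the ground was wet')
print(extract_causal_edge("The man turned on the tap."))
# None; most COPA premises look like this, hence our empty graph
```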
💡 The Lesson: Garbage In, Garbage Out
We were so focused on the integration (the Alpha parameter, the pipeline) that we neglected the input (the graph construction). A Hybrid Neuro-Symbolic system is only as strong as its Symbolic component.
Chapter 4: Side Quests (University Life)
While wrestling with empty graphs, the rest of the MSc didn't stop.
- The Ethics Assignment (Samsung Case Study): We analyzed a real-world case using Liffick’s Analysis Method. It reminded me that what we are building has real consequences. Accuracy isn't just a scoreboard; it's an ethical obligation.
- The Literature Review: We condensed 40+ papers into a coherent narrative. Status: Done, dusted, and submitted.
Chapter 5: The Road Ahead (Fixing the Graph)
So, where does that leave us? We are currently in Debug Mode. The priority for the next sprint is clear: Better Causal Extraction.
We cannot rely on simple "keyword matching". It's too brittle. The new plan involves:
- Manual/Semi-Automatic Construction: Hand-crafting graphs for a subset of data to prove the concept.
- Prompt-Based Extraction: Using GPT-4 or Llama 3 to extract causal triples (a rough sketch of the prompting idea follows this list).
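A first sketch of what the prompt-based route might look like is below. The prompt wording, the output format, and the parsing are all placeholders; we have not settled on any of them yet.

```python
# Sketch of prompt-based causal-triple extraction. The prompt, expected output
# format, and parsing below are placeholders, not a finalised design.
EXTRACTION_PROMPT = """Read the sentence and list any causal relations it implies.
Answer with one triple per line in the form: cause -> effect.
If there is no causal relation, answer: NONE.

Sentence: {sentence}
"""

def parse_triples(llm_output: str):
    """Turn 'cause -> effect' lines from the LLM into (cause, effect) tuples."""
    triples = []
    for line in llm_output.strip().splitlines():
        if "->" in line:
            cause, effect = line.split("->", 1)
            triples.append((cause.strip(), effect.strip()))
    return triples

# Whatever client we end up using (GPT-4, Llama 3, ...) would fill llm_output.
llm_output = "the man turned on the tap -> water flowed out"
print(parse_triples(llm_output))  # [('the man turned on the tap', 'water flowed out')]
```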
Conclusion
This month was a reality check. We went from "We are going to solve Causal AI!" to "Why is my graph empty?"
But that is exactly what a practicum is supposed to be. If it worked the first time, it wouldn't be research; it would just be development. We found the bottleneck. We have the baseline. We have the dataset. Now, we just need to bridge the gap.
Stay tuned for Blog 9, where I'll hopefully be sharing a chart where the line actually goes up.
Next milestone: Fix Causal Extraction logic. Run the Hybrid experiment again. Survive the Dublin rain. 🌧️
📝 Author's Note for the Reader (Beginner Friendly)
- LLM: Large Language Model (like ChatGPT or BERT). Good at talking, bad at logic.
- SCM: Structural Causal Model. A graph of nodes and arrows that maps out logic (A causes B). Good at logic, can't talk.
- Hybrid: Smashing them together so we get an AI that talks well and makes sense.
- Epoch: One full run through the training data.
- Alpha: A volume dial. Turn it left for "Pure Stats", turn it right for "Pure Logic".