
Conversation

cetagostini
Contributor

@cetagostini cetagostini commented Sep 19, 2025

Description

This PR introduces two experimental causal discovery algorithms designed for fast, interpretable exploration of causal structures in tabular and time-series data.

Both methods are inspired by the constraint-based PC/FCI family (Spirtes, Glymour, Scheines, 2000) but replace the traditional partial correlation tests with a Bayes factor criterion based on ΔBIC (Kass & Raftery, 1995; Schwarz, 1978). This makes them simple and computationally efficient while still grounded in probability theory.
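To make the ΔBIC criterion concrete, here is a minimal sketch (illustrative function names and a Gaussian-OLS setup, not the PR's implementation): the test compares the BIC of regressing x on a conditioning set Z against regressing x on Z plus y, and BIC_null − BIC_alt approximates 2·log of the Bayes factor in favor of dependence.

```python
import numpy as np

def bic_ols(X, y):
    """BIC of a Gaussian OLS regression of y on X (intercept added)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sigma2 = np.mean((y - X1 @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X1.shape[1] + 1  # regression coefficients plus the noise variance
    return -2 * loglik + k * np.log(n)

def ci_by_delta_bic(x, y, Z, bf_thresh=1.0):
    """Declare x and y conditionally independent given Z via delta-BIC.

    BIC_null - BIC_alt approximates 2*log(Bayes factor) for the model in
    which y helps predict x (Kass & Raftery, 1995).
    """
    Z = np.asarray(Z, dtype=float).reshape(len(x), -1)
    bic_null = bic_ols(Z, x)                       # x ~ Z
    bic_alt = bic_ols(np.column_stack([Z, y]), x)  # x ~ Z + y
    two_log_bf = bic_null - bic_alt
    return two_log_bf < bf_thresh                  # weak evidence -> independent
```

With a common cause z of x and y, the test should report independence once z is conditioned on, while a direct link keeps the pair dependent even marginally.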

Includes comprehensive docstrings, public APIs, and extensive tests for both algorithms, covering edge rules, forbidden edges, time series support, and error handling.

Warning

  1. Both algorithms are experimental and should be seen as initial structure-learning tools rather than definitive causal discovery.
  2. They perform well for linear Gaussian/Gamma-like settings and provide interpretable graphs suitable for guiding further analysis.
  3. For more complex or nonlinear systems, or where latent confounding is expected, more sophisticated methods (e.g. full FCI, PCMCI+, or Bayesian graph models) may be necessary.

Why this approach?

Target-first: aligns with many applied settings (e.g. economics, epidemiology, marketing) where the focus is on causes of a particular outcome. Rather than exploring a full grid of possible node combinations, we know for a fact that edges must ultimately point toward a specific outcome node, and that certain node relations are not allowed or do not make sense in the ecosystem.

Draft example:
Code here - Only available to PyMC team

To-do's:

Note

These methods return a skeleton graph with partial orientations (e.g. into the target variable and lag-based constraints). They should be considered as initial structure learners rather than complete CDAG/CPDAG estimators. Full orientation (e.g. Meek rules, v-structure identification) is possible as a later extension.

Related Issue

Checklist


📚 Documentation preview 📚: https://pymc-marketing--1947.org.readthedocs.build/en/1947/

@cetagostini cetagostini self-assigned this Sep 19, 2025
@cetagostini cetagostini marked this pull request as draft September 19, 2025 21:22

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 97.90698% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.54%. Comparing base (54de250) to head (be58888).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pymc_marketing/mmm/causal.py 97.90% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1947      +/-   ##
==========================================
+ Coverage   93.41%   93.54%   +0.12%     
==========================================
  Files          67       67              
  Lines        8521     8949     +428     
==========================================
+ Hits         7960     8371     +411     
- Misses        561      578      +17     

☔ View full report in Codecov by Sentry.

@cetagostini cetagostini added this to the 0.16.0 milestone Sep 19, 2025
@cetagostini
Contributor Author

Failing test is related to #1949, not this PR.

@juanitorduz juanitorduz modified the milestones: 0.16.0, 0.17.0 Sep 21, 2025
@cetagostini cetagostini marked this pull request as ready for review September 22, 2025 20:46
@juanitorduz
Collaborator

@carlosagostini I'm not familiar with these methods, so I'm unsure how to provide feedback. I can give you a code PR review, but is there anyone else who could potentially double-check the logic?

@cetagostini
Contributor Author

cetagostini commented Sep 24, 2025

> @carlosagostini I'm not familiar with these methods, so I'm unsure how to provide feedback. I can give you a code PR review, but is there anyone else who could potentially double-check the logic?

Sure, both are inspired by more formal methods, but they are still experimental. I'll add a notebook soon showing all the recent PRs, this one included. Let's kick off with the code review.

@daniel-saunders-phil
Contributor

Is the use of Bayes factors in the Spirtes, Glymour, Scheines algorithm something you have developed, or is it an approach based on some published paper?

I'm also wondering whether there are existing causal discovery libraries that perform similar roles (like implementations of causal discovery algorithms) or whether those libraries would appreciate this algorithm? Do you know what the landscape is like?

@cetagostini
Contributor Author

cetagostini commented Sep 26, 2025

> Is the use of Bayes factors in the Spirtes, Glymour, Scheines algorithm something you have developed, or is it an approach based on some published paper?
>
> I'm also wondering whether there are existing causal discovery libraries that perform similar roles (like implementations of causal discovery algorithms) or whether those libraries would appreciate this algorithm? Do you know what the landscape is like?

@daniel-saunders-phil good point. Short answer: the algorithms in place are not grounded in a published paper.

Why? I have spent quite a while playing with causal discovery, mostly for marketing. Sadly, all these algorithms are great for biology, environmental science, medicine, or engineering, but not for marketing.

On raw marketing data they produce nonsensical answers, because in marketing we violate many of the assumptions those algorithms require. For example, some are not even suitable for time series, others cannot handle dependent processes, and so on.

Does this mean causal discovery is impossible for marketing? My opinion is no: marketing problems are quite structural. We know for a fact that certain relationships can't exist, that all edges must ultimately point toward a specific target or KPI node, and we are certain about which latent processes may be influencing certain nodes.

All this knowledge can reduce the search space, letting the algorithms explore it and return the most plausible skeleton behind the data.

That's what I am doing here, because I don't know of any other library doing the same. DoWhy and causal-learn are quite famous (they have traditional PC/GES/LiNGAM), but at least in DoWhy there is nothing connected to or made for marketing; I'm not sure about causal-learn, but if they had it I would probably have found it already.

In conclusion, the class is a target-oriented skeleton-discovery algorithm inspired by the PC/FCI family, not a regular PC or FCI. Like PC, it discovers an undirected skeleton by removing edges when conditional independences are found. Like FCI, it allows forbidden edges as a way of encoding prior knowledge.

What are the differences? It does not perform full orientation (no Meek rules, no PAG). It adds a target-edge rule mechanism (any, conservative, full conditioning set), which biases discovery toward identifying direct drivers of a chosen target. It uses a Bayes factor (ΔBIC) conditional independence test instead of frequentist tests.
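As a toy illustration of that target-first, constraint-based recipe (hypothetical names, not the PR's code): start from a complete graph minus the forbidden edges, delete an edge whenever some conditioning set renders its endpoints independent, and finally orient every surviving edge touching the target into the target.

```python
import numpy as np
from itertools import combinations

def discover_skeleton(data, target, ci_test, forbidden_edges=(), max_cond=2):
    """Toy target-first skeleton search in the PC style.

    data: dict mapping variable name -> 1-D array.
    ci_test(x, y, Z) -> True when x and y are independent given columns Z.
    Returns (undirected_edges, edges_oriented_into_target).
    """
    nodes = list(data)
    n_obs = len(data[nodes[0]])
    forbidden = {frozenset(e) for e in forbidden_edges}
    # Start from the complete graph minus forbidden edges (prior knowledge).
    edges = {frozenset(e) for e in combinations(nodes, 2)} - forbidden
    for size in range(max_cond + 1):
        for edge in list(edges):
            a, b = sorted(edge)
            others = [v for v in nodes if v not in edge]
            for Z in combinations(others, size):
                Zmat = (np.column_stack([data[v] for v in Z])
                        if Z else np.empty((n_obs, 0)))
                if ci_test(data[a], data[b], Zmat):
                    edges.discard(edge)  # a separating set was found
                    break
    # Partial orientation: surviving edges touching the target point into it;
    # everything else stays undirected (no Meek rules, no v-structures).
    directed, undirected = set(), set()
    for e in edges:
        a, b = sorted(e)
        if target in e:
            directed.add((a, b) if b == target else (b, a))
        else:
            undirected.add((a, b))
    return undirected, directed
```

Any conditional-independence callable works here, including the ΔBIC criterion the PR describes.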

The time-series one, in addition, considers time as a dimension, picking lags from each series and checking whether they break conditional independence.
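A minimal sketch of the lag handling (hypothetical helper, not the PR's code): build lagged copies of each series and run the same conditional-independence checks on the widened frame, with the convention that an edge can only go from a lagged column toward a contemporaneous one.

```python
import pandas as pd

def add_lags(df, lags):
    """Append shifted copies c_lag1..c_lagK of every column in df.

    Rows without full history are dropped. Edges leaving a lagged column
    can only point forward in time, which is the lag-based orientation
    constraint mentioned above.
    """
    out = {c: df[c] for c in df.columns}
    for c in df.columns:
        for k in range(1, lags + 1):
            out[f"{c}_lag{k}"] = df[c].shift(k)
    return pd.DataFrame(out).dropna().reset_index(drop=True)
```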

I have been experimenting with this for a few months with clients and internally, and I have a few examples showing the limitations and capabilities (my plan is to add them in another PR). Ultimately they return skeletons, and users will need to decide the directionality of most of the DAG. Coming next will be a PR adding falsification and sensitivity tests to check whether the DAG is reliable.

Collaborator

@juanitorduz juanitorduz left a comment


Thanks @cetagostini . Here is a first review on the code (but not really the logic as this goes beyond my area of expertise)

import numpy as np
import pandas as pd
import pytensor
import pytensor.tensor as tt
Collaborator


we usually use pt.

Collaborator


@cetagostini would you mind changing this to pt for harmony in the code base? Usually tt reminds us of the old Theano times :)

target_edge_rule: str = "any",
bf_thresh: float = 1.0,
forbidden_edges: Sequence[tuple[str, str]] | None = None,
):
Collaborator


Can we add docstrings for this init method? Also, can we use Pydantic to validate the call and use Field, ensuring target is within a list of possible values (say with a Literal type)?
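A hypothetical sketch of the requested validation (parameter names taken from the diff above; the exact Literal values are my assumption based on the "any, conservative, full conditioning set" rules described in the thread):

```python
from typing import Annotated, Literal
from pydantic import Field, validate_call

@validate_call
def tbfpc_options(
    target_edge_rule: Literal["any", "conservative", "full"] = "any",
    bf_thresh: Annotated[float, Field(gt=0)] = 1.0,
) -> dict:
    """Validated option bundle for the (hypothetical) TBFPC init.

    Pydantic rejects unknown edge rules and non-positive thresholds
    before any discovery code runs.
    """
    return {"target_edge_rule": target_edge_rule, "bf_thresh": bf_thresh}
```

The same Annotated/Field pattern can be applied directly to `__init__` arguments when the class itself is decorated.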

@juanitorduz juanitorduz requested a review from Copilot October 4, 2025 08:49

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces two experimental causal discovery algorithms (TBFPC and TBF_FCI) designed for fast, interpretable exploration of causal structures in tabular and time-series data. Both methods are inspired by the PC/FCI family but use a Bayes factor criterion based on ΔBIC instead of traditional partial correlation tests.

Key changes:

  • Implementation of TBFPC algorithm for cross-sectional causal discovery with target-focused edge rules
  • Implementation of TBF_FCI algorithm for time-series causal discovery with lag support and contemporaneous relationships
  • Comprehensive test coverage for both algorithms including API validation, error handling, and edge case testing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pymc_marketing/mmm/causal.py Implements the core TBFPC and TBF_FCI causal discovery algorithms with comprehensive docstrings and public APIs
tests/mmm/test_causal.py Adds extensive test coverage for both algorithms including parameterized tests for different configurations and error conditions


@juanitorduz
Collaborator

I asked Copilot for a review as well 🤗

Introduces the TBFPC (Target-first Bayes Factor PC) and TBF_FCI (Target-first Bayes Factor Temporal PC) classes for causal discovery using Bayes factor conditional independence tests. Includes comprehensive docstrings, public APIs, and extensive tests for both algorithms, covering edge rules, forbidden edges, time series support, and error handling.
@cetagostini cetagostini force-pushed the cetagostini/causal_discovery_utilities branch from b7e097c to 2db43b0 Compare October 5, 2025 15:30
Collaborator

@juanitorduz juanitorduz left a comment


Little ask 🙏

import numpy as np
import pandas as pd
import pytensor
import pytensor.tensor as tt
Collaborator


@cetagostini would you mind changing this to pt for harmony in the code base? Usually tt reminds us of the old Theano times :)

@cetagostini
Contributor Author

I did 😄 but the LLM reverted it; looks like they like Theano more @juanitorduz 🫠
