Introduce TBFPC and TBF_FCI causal discovery algorithms #1947
Conversation
Codecov Report

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1947      +/-   ##
==========================================
+ Coverage   93.41%   93.54%    +0.12%
==========================================
  Files          67       67
  Lines        8521     8949      +428
==========================================
+ Hits         7960     8371      +411
- Misses        561      578       +17
```

☔ View full report in Codecov by Sentry.
Failing test is related to #1949, not this PR.
@carlosagostini I'm not familiar with these methods, so I'm unsure how to provide feedback. I can give you a code PR review, but is there anyone else who could potentially double-check the logic?
Sure, both are inspired by more formal methods, but they are still experimental. I'll add a notebook soon, showing more of all the recent PRs, this one included. Let's kick off with the code review.
Is the use of Bayes factors in the Spirtes, Scheines, Glymour algorithm something you have developed, or is it an approach based on a published paper? I'm also wondering whether there are existing causal discovery libraries that perform similar roles (like implementations of causal discovery algorithms), or whether those libraries would appreciate this algorithm. Do you know what the landscape is like?
@daniel-saunders-phil good point. Short answer: the algorithms in place are not grounded in a published paper. Why? I have spent quite a while playing with causal discovery, mostly for marketing. Sadly, all these algorithms are great for biology, environmental science, medicine, or engineering, but not for marketing. On raw marketing data they produce nonsensical answers, because in marketing we violate many of the assumptions those algorithms require. For example, some are not even suitable for time series, others are not suitable for dependent processes, and so on.

Does this mean it is impossible to do causal discovery for marketing? My opinion is no; marketing problems are quite structural. We know for a fact that certain relationships can't exist, all edges must ultimately direct to a specific target or KPI node, and we are certain about which latent processes can be influencing certain nodes. All this knowledge can reduce the search space, let the algorithms investigate over it, and return the most plausible skeleton behind the data. That's what I am doing here, because I don't know of any other library doing the same. DoWhy and causal-learn are quite famous (they have traditional PC/GES/LiNGAM), but at least with DoWhy there is nothing quite connected to or made for marketing. I'm not sure about causal-learn, but if they had it, I would probably have found it already.

Conclusion: the class is a target-oriented skeleton-discovery algorithm inspired by the PC/FCI family, not a regular PC or FCI. Like PC, it discovers an undirected skeleton by removing edges when conditional independences are found. Like FCI, it allows forbidden edges as a way of encoding prior knowledge. What are the differences?

- It does not perform full orientation (no Meek rules, no PAG).
- It adds a target-edge rule mechanism (any, conservative, full conditioning set), which biases discovery toward identifying direct drivers of a chosen target.
- It uses a Bayes factor (ΔBIC) conditional independence test instead of frequentist tests.
- The time-series one, in addition, considers time as a dimension, picking lags from certain series and checking whether they break conditional independence.

I have been experimenting with this for a few months with clients, and internally I have a few examples showing limitations and capabilities (my plan is to add them in another PR). Ultimately the algorithms return skeletons, and users will need to decide the directionality of most of the DAG. Coming next will be a PR to add falsification and sensitivity tests to check whether the DAG is reliable or not.
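For readers who want to see the flavor of such a test, here is a minimal, self-contained sketch of a ΔBIC-based conditional-independence decision using plain NumPy: regress x on the conditioning set z with and without y, and compare the BICs of the two Gaussian linear models. The function names, the sign convention, and the thresholding rule are illustrative assumptions, not the PR's actual API.

```python
# Minimal sketch of a Bayes-factor (ΔBIC) conditional-independence check.
# Illustrative only: names and sign conventions are assumptions, not the PR's API.
import numpy as np


def gaussian_bic(y: np.ndarray, X: np.ndarray) -> float:
    """BIC of a Gaussian linear regression of y on X (intercept added).

    Constant terms that cancel in a BIC difference are dropped.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1]
    return n * np.log(rss / n) + k * np.log(n)


def delta_bic_independent(x, y, z=None, bf_thresh: float = 1.0) -> bool:
    """Declare x independent of y given z when adding y barely improves the BIC."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    z = np.empty((n, 0)) if z is None else np.asarray(z, float).reshape(n, -1)
    bic_without = gaussian_bic(x, z)                     # model: x ~ z
    bic_with = gaussian_bic(x, np.column_stack([z, y]))  # model: x ~ z + y
    delta_bic = bic_without - bic_with                   # > 0 favours keeping the edge
    return delta_bic < bf_thresh
```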
Thanks @cetagostini. Here is a first review of the code (but not really of the logic, as this goes beyond my area of expertise).
pymc_marketing/mmm/causal.py
Outdated
import numpy as np
import pandas as pd
import pytensor
import pytensor.tensor as tt
We usually use pt.
@cetagostini would you mind changing this to pt to have harmony in the code base? tt usually reminds us of the old Theano times :)
target_edge_rule: str = "any",
bf_thresh: float = 1.0,
forbidden_edges: Sequence[tuple[str, str]] | None = None,
):
Can we add docstrings for this init method? Also, can we use Pydantic to validate the call and use Field, ensuring target is within a list of possible values (say with a Literal type)?
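A minimal sketch of what that could look like, assuming Pydantic v2 (`validate_call`, `Field`) and a `Literal` for `target_edge_rule`; the allowed literal values and the attribute handling are placeholders for illustration, not the PR's actual implementation.

```python
# Sketch of the suggested validation, using Pydantic v2.
# The Literal values and the constructor body are assumptions for illustration.
from collections.abc import Sequence
from typing import Annotated, Literal

from pydantic import Field, validate_call


class TBFPC:
    @validate_call
    def __init__(
        self,
        target: str,
        target_edge_rule: Literal["any", "conservative", "full"] = "any",
        bf_thresh: Annotated[float, Field(gt=0)] = 1.0,
        forbidden_edges: Sequence[tuple[str, str]] | None = None,
    ) -> None:
        """Target-first Bayes Factor PC (constructor arguments validated by Pydantic)."""
        self.target = target
        self.target_edge_rule = target_edge_rule
        self.bf_thresh = bf_thresh
        self.forbidden_edges = list(forbidden_edges or [])


# An invalid value now raises pydantic.ValidationError at construction time:
# TBFPC(target="sales", target_edge_rule="sometimes")
```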
Pull Request Overview
This PR introduces two experimental causal discovery algorithms (TBFPC and TBF_FCI) designed for fast, interpretable exploration of causal structures in tabular and time-series data. Both methods are inspired by the PC/FCI family but use a Bayes factor criterion based on ΔBIC instead of traditional partial correlation tests.
Key changes:
- Implementation of TBFPC algorithm for cross-sectional causal discovery with target-focused edge rules
- Implementation of TBF_FCI algorithm for time-series causal discovery with lag support and contemporaneous relationships
- Comprehensive test coverage for both algorithms including API validation, error handling, and edge case testing
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
pymc_marketing/mmm/causal.py | Implements the core TBFPC and TBF_FCI causal discovery algorithms with comprehensive docstrings and public APIs |
tests/mmm/test_causal.py | Adds extensive test coverage for both algorithms including parameterized tests for different configurations and error conditions |
I asked Copilot for a review as well 🤗
Introduces the TBFPC (Target-first Bayes Factor PC) and TBF_FCI (Target-first Bayes Factor Temporal PC) classes for causal discovery using Bayes factor conditional independence tests. Includes comprehensive docstrings, public APIs, and extensive tests for both algorithms, covering edge rules, forbidden edges, time series support, and error handling.
Force-pushed from b7e097c to 2db43b0.
Little ask 🙏
pymc_marketing/mmm/causal.py
Outdated
import numpy as np
import pandas as pd
import pytensor
import pytensor.tensor as tt
@cetagostini would you mind changing this to pt to have harmony in the code base? tt usually reminds us of the old Theano times :)
I did 😄 but the LLM reverted it; looks like they like Theano more @juanitorduz 🫠
Description
This PR introduces two experimental causal discovery algorithms designed for fast, interpretable exploration of causal structures in tabular and time-series data.
Both methods are inspired by the constraint-based PC/FCI family (Spirtes, Glymour, Scheines, 2000) but replace the traditional partial correlation tests with a Bayes factor criterion based on ΔBIC (Kass & Raftery, 1995; Schwarz, 1978). This makes them simple and computationally efficient, while still grounded in probability theory.
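For context, a quick reminder of the standard Schwarz/BIC approximation linking the two quantities (the exact sign convention used for ΔBIC inside this PR is an assumption here): with $\mathrm{BIC}_m = -2\ln\hat{L}_m + k_m\ln n$, the Bayes factor of the model including the extra edge ($M_1$) against the restricted model ($M_0$) satisfies

$$
2\ln \mathrm{BF}_{10} \approx \mathrm{BIC}_0 - \mathrm{BIC}_1 = \Delta\mathrm{BIC},
$$

so thresholding $\Delta\mathrm{BIC}$ is, up to the approximation, thresholding the log Bayes factor.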
Includes comprehensive docstrings, public APIs, and extensive tests for both algorithms, covering edge rules, forbidden edges, time series support, and error handling.
Warning
Why this approach?
Target-first: aligns with many applied settings (e.g. economics, epidemiology, marketing) where the focus is on the causes of a particular outcome. We don't want to explore a full grid of possible node combinations; we know for a fact that nodes must ultimately direct to a specific target node, and certain node relations are not allowed or don't make sense in the ecosystem.
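A hypothetical sketch of how such knowledge could be encoded, using the constructor arguments visible in this PR (`target_edge_rule`, `bf_thresh`, `forbidden_edges`). The import path matches the changed file, but the column names, the `target` keyword, and the commented-out `fit` call are guesses for illustration only.

```python
# Hypothetical usage sketch: encoding marketing domain knowledge as constraints.
# Column names, the `target` argument, and the `fit` call are illustrative guesses.
import numpy as np
import pandas as pd

from pymc_marketing.mmm.causal import TBFPC

rng = np.random.default_rng(0)
n = 200
tv = rng.normal(size=n)
search = 0.5 * tv + rng.normal(size=n)
promos = rng.normal(size=n)
sales = 0.8 * tv + 0.4 * search + 0.3 * promos + rng.normal(size=n)
df = pd.DataFrame(
    {"tv_spend": tv, "search_spend": search, "promotions": promos, "sales": sales}
)

# Edges we know cannot exist: the KPI never causes media spend in this view.
forbidden = [("sales", "tv_spend"), ("sales", "search_spend"), ("sales", "promotions")]

model = TBFPC(
    target="sales",
    target_edge_rule="any",   # bias discovery toward direct drivers of the target
    bf_thresh=1.0,            # ΔBIC threshold for the Bayes-factor CI test
    forbidden_edges=forbidden,
)
# model.fit(df)               # hypothetical: the actual public method name may differ
```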
Draft example:
Code here - Only available to PyMC team
To-do's:
Note
These methods return a skeleton graph with partial orientations (e.g. into the target variable and lag-based constraints). They should be considered as initial structure learners rather than complete CDAG/CPDAG estimators. Full orientation (e.g. Meek rules, v-structure identification) is possible as a later extension.
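To make the lag-based constraints concrete, here is a small sketch (not the PR's internal code) of how lagged copies of each series can be added and how time order can be encoded as forbidden edges from contemporaneous variables into their lagged copies.

```python
# Sketch of turning a time-series DataFrame into a lag-augmented design and
# deriving time-order constraints. Helper names are illustrative, not the PR's API.
import pandas as pd


def add_lags(df: pd.DataFrame, lags: int) -> pd.DataFrame:
    """Append col_lag{k} columns for k = 1..lags and drop the first `lags` rows."""
    out = df.copy()
    for col in df.columns:
        for k in range(1, lags + 1):
            out[f"{col}_lag{k}"] = df[col].shift(k)
    return out.dropna().reset_index(drop=True)


def forbidden_future_to_past(columns: list[str], lags: int) -> list[tuple[str, str]]:
    """Contemporaneous variables can never point into lagged copies (the past)."""
    lagged = [f"{c}_lag{k}" for c in columns for k in range(1, lags + 1)]
    return [(c, lag_col) for c in columns for lag_col in lagged]
```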
Related Issue
Checklist
pre-commit.ci autofix to auto-fix.

📚 Documentation preview 📚: https://pymc-marketing--1947.org.readthedocs.build/en/1947/