Sigma

  • A conditional inference tree library for Python, designed for quick, robust, and easy diagnostics of data in production
  • Fully transparent and statistically rigorous, with significance tests and confidence intervals at every step
  • Source-available, production-ready with minimal dependencies and a simple API

Source-Available

Sigma is freely available under a source-available license. Contributions, issues, and feedback are welcome.

View on GitHub

Key Features

Instant Diagnostic

On a fresh dataset, one fit is often enough to surface the prediction potential, key predictors, leading interactions, and obvious data-quality issues. This avoids committing time to a full modeling effort. All feature types are supported, so an initial diagnostic needs no upstream preprocessing pipeline.

When a production model starts drifting or failing to pick up a new behavior, Sigma quickly reveals where and how the data or model is shifting, without diverting significant data-scientist time.

Point Sigma at an existing model to compare its predictions against observed outcomes. The resulting tree pinpoints exactly where, why and how the model fails.

Full Transparency

Sigma's model is a tree, not a black box: a few compact, readable rules, inspectable end-to-end.

Splits use significance tests (Hothorn, Hornik, and Zeileis, 2006) with multiplicity correction, so signals are separated from noise on statistical grounds. Tree branches are statistically validated, and growth stops once no signal remains. CART's bias toward features with many categories is also removed.

Leaf predictions come with confidence intervals. Several variants (e.g., Jeffreys, Clopper-Pearson, Bayesian bootstrap, Student-t) let you pick the coverage guarantee you need, so uncertainty stays explicit at every leaf.
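For binomial leaves, the interval variants named above are standard constructions and can be reproduced with SciPy alone (an illustration of the methods, not Sigma's API):

```python
from scipy.stats import beta, binomtest

k, n = 37, 120            # e.g. successes among the observations in one leaf
conf = 0.95
a = 1 - conf

# Clopper-Pearson ("exact") and Wilson, via scipy's binomtest
cp = binomtest(k, n).proportion_ci(confidence_level=conf, method="exact")
wilson = binomtest(k, n).proportion_ci(confidence_level=conf, method="wilson")

# Jeffreys: quantiles of the Beta(k + 1/2, n - k + 1/2) posterior
jeffreys = (beta.ppf(a / 2, k + 0.5, n - k + 0.5),
            beta.ppf(1 - a / 2, k + 0.5, n - k + 0.5))

print(cp, wilson, jeffreys)
```

Clopper-Pearson is the conservative choice (guaranteed coverage at the cost of width); Jeffreys and Wilson trade a little worst-case coverage for tighter intervals, per Brown, Cai, and DasGupta (2001).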

Production-Ready

Sigma installs with pip install ars-sigma and depends only on the industry-standard NumPy, SciPy, and scikit-learn. It runs on Python 3.10+ with no compiled extensions to build.

RegressionTree, ClassificationTree, and SurvivalTree implement the scikit-learn API, fitting into your existing pipelines, cross-validation, and serialization without extra wiring. Pandas DataFrames are supported throughout.

Sample weights are supported end-to-end, and fits are reproducible from a fixed random_state. Trained models are plain Python state that pickles cleanly and moves from a training notebook to a batch job or serverless runtime with no platform-specific build step.
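The round trip described here is plain pickle. A sketch of the pattern, using a scikit-learn tree as a stand-in for a Sigma tree (any estimator whose fitted state is ordinary Python and NumPy objects behaves the same way):

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# Stand-in estimator; a fitted Sigma tree would follow the same pattern.
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

blob = pickle.dumps(model)        # serialize in the training notebook ...
restored = pickle.loads(blob)     # ... deserialize in the batch job

assert np.array_equal(model.predict(X), restored.predict(X))
```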

Grounded in Research

Every statistical method in Sigma comes from a peer-reviewed paper.

Hothorn, T., & Zeileis, A. (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905-3909. jmlr.org/papers/v16/hothorn15a

Patil, V. V., & Kulkarni, H. V. (2012). Comparison of Confidence Intervals for the Poisson Mean: Some New Aspects. REVSTAT - Statistical Journal, 10(2), 211-227. doi:10.57805/revstat.v10i2.117

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. doi:10.1198/106186006X133933

Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2006). A Lego System for Conditional Inference. The American Statistician, 60(3), 257-263. doi:10.1198/000313006X118430

Olsson, U. (2005). Confidence Intervals for the Mean of a Log-Normal Distribution. Journal of Statistics Education, 13(1). doi:10.1080/10691898.2005.11910638

Hothorn, T., & Lausen, B. (2003). On the Exact Distribution of Maximally Selected Rank Statistics. Computational Statistics & Data Analysis, 43(2), 121-137. doi:10.1016/S0167-9473(02)00225-6

Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science, 16(2), 101-133. doi:10.1214/ss/1009213286

Agresti, A., & Coull, B. A. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119-126. doi:10.1080/00031305.1998.10480550

Newcombe, R. G. (1998). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17(8), 857-872. doi:10.1002/sim.777

DiCiccio, T. J., & Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189-228. doi:10.1214/ss/1032280214

Frequently Asked Questions

What kinds of challenges is Sigma built for?

Sigma turns raw tabular data into a readable, statistically grounded binary tree. It is built to give a quick, fully automatic, statistically robust first look at a dataset. It is not designed to compete on raw predictive performance with state-of-the-art models, which are complex and time-consuming to build. It is a good fit for reliable diagnostics on new datasets, one-off segmentation studies, and model-drift investigations, and it helps examine the outputs of more complex models before relying on them.

How does Sigma differ from scikit-learn's decision trees?

scikit-learn's decision trees tend to be biased in their variable selection, require manual tuning of depth and minimum-samples thresholds to stay readable, and routinely grow into hairy, hard-to-interpret structures. Under the hood they pick splits by exhaustively optimizing a purity measure, which biases selection toward features with many possible split points and gives no statistically grounded stopping rule. Sigma uses permutation-based conditional-inference splits instead: at every node it tests each candidate variable for a real association with the response, picks the strongest, and stops the moment none is statistically significant. The tree finds its own depth automatically: no manual cutoff, no rule of thumb on samples per leaf. The single knob is the false-positive rate of each split test (the alpha level, 5% by default); raise it for a deeper exploratory tree, lower it for a stricter one. The result is a tree you can trust as a diagnostic.

Can I use Sigma for predictive modeling?

Yes, but I have something better. Fitting a Sigma tree gives you a robust first-approximation predictive model with very little effort: no tuning, no feature engineering, no encoding of categoricals, and a tree you can read end to end. It is a sound baseline you can deploy and trust. The tradeoff is subtlety for simplicity. A Sigma tree captures the dominant structure in the data but misses the finer interactions and gradients a richer model could exploit. When you need that accuracy without giving up interpretability, reach for Tau, my state-of-the-art engine for tabular data. It turns weak signals into outstanding predictions.

Does Sigma work with scikit-learn pipelines and Pandas DataFrames?

Yes. ClassificationTree, RegressionTree, and SurvivalTree implement the standard scikit-learn estimator API, so they slot into Pipeline, ColumnTransformer, GridSearchCV, cross-validation utilities, and model serialization without any custom glue. Pandas DataFrames are first-class inputs: pass them directly to fit and predict, declare categorical columns by name, and keep your column metadata throughout training and inference. NumPy arrays are of course also supported. Categorical features are handled natively (no upstream one-hot encoding needed), and sample weights are fully supported through all the statistical significance tests.
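Because the wiring is the standard estimator API, it looks the same as for any scikit-learn model. A sketch with a scikit-learn classifier standing in where a Sigma ClassificationTree would go (the stand-in needs dummy encoding for the categorical column; per the answer above, a Sigma tree would take the DataFrame as-is):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 62, 41, 38] * 10,
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro", "basic", "pro"] * 10,
    "churned": [0, 0, 1, 1, 0, 1, 0, 1] * 10,
})

# Encoding needed only for the stand-in; Sigma handles categoricals natively.
X = pd.get_dummies(df[["age", "plan"]])
y = df["churned"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # standard sklearn utilities apply
print(scores.mean())
```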

What confidence intervals does Sigma report on leaves?

Sigma supports the classic confidence intervals; the right one depends on the coverage guarantees you want. For classification: Clopper-Pearson, Jeffreys, and Wilson. For regression: Bayesian bootstrap, normal, Student-t, log-normal, gamma, Poisson, exponential, and beta. For survival outcomes: Brookmeyer-Crowley for the median survival time, log-log Greenwood for the survival probability at a fixed time, and Klein-Moeschberger integrated Greenwood for the restricted mean survival time (RMST). You pick the one that best fits your target's distribution.
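Of the regression variants listed, the Bayesian bootstrap is the easiest to sketch: draw Dirichlet weights over the leaf's observations and take percentiles of the resulting weighted means. An illustration of the method with NumPy, not Sigma's code:

```python
import numpy as np

rng = np.random.default_rng(0)
leaf_values = rng.exponential(scale=3.0, size=80)   # outcomes in one leaf

# Bayesian bootstrap: Dirichlet(1, ..., 1) weights over the observations,
# one weighted mean per draw
n_draws = 4000
weights = rng.dirichlet(np.ones(leaf_values.size), size=n_draws)
weighted_means = weights @ leaf_values

lo, hi = np.percentile(weighted_means, [2.5, 97.5])  # 95% credible interval
print(leaf_values.mean(), (lo, hi))
```

Unlike the Student-t interval, this makes no normality assumption, which suits the skewed outcomes in this sketch.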

Can Sigma handle very small or very large datasets?

On small datasets, the statistical stopping rule and the confidence intervals on leaves keep the tree honest about what it does and does not know. On larger datasets, fit time scales linearly with the number of observations and with the number of candidate variables, so Sigma's runtimes stay predictable as the data grows.

What Python versions and dependencies are required?

Python 3.10 and above, on every platform supported by NumPy and SciPy. Sigma depends only on NumPy, SciPy, and scikit-learn. Graphviz rendering is available as an optional extra for visualization, but it is not required to fit or use models.

How do I install Sigma?

A single pip install ars-sigma command installs the library. There are no compiled extensions to build, which makes installation simple. Everything else (visualization examples, deployment notes, source, and issue tracker) is on the GitHub repository.

What is Sigma's license, and where can I contribute?

Sigma is source-available: under the Sigma Source License you can freely read, use, modify, and redistribute the library, provided the license file and attribution travel with every copy. This is, however, not open source: ArsChitectura SAS retains all rights and may revoke the license at will. For broader rights and professional support, a commercial license is available on request through the contact page. Contributions, issues, and feedback are welcome on the GitHub repository.

How can I get support?

Reach out through the contact page. Your message goes directly to me.