A method for constructing fully interpretable linear regressions with statistically significant estimates according to the Student’s criterion and insignificant intercorrelation coefficients
Authors
- Mikhail P. Bazilevskiy
Keywords:
regression analysis
fully interpretable linear regression
ordinary least squares
subset selection
multicollinearity
Student’s t-test
generating all subsets
mixed 0-1 integer linear programming
Abstract
The article addresses the pressing problem of constructing interpretable machine learning models, namely multiple linear regression models. Ordinary least squares is used to estimate their unknown parameters. A probabilistic-statistical definition of a fully interpretable linear regression is formulated. Constructing such a regression involves selecting, on the basis of coefficients of determination, the optimal number of the most informative regressors so that the signs of the regression coefficients agree with the substantive meaning of the variables, all coefficient estimates are significant by Student's t-test, and all intercorrelation coefficients are insignificant by the same test. To construct fully interpretable regressions, we propose a method based on mixed 0-1 integer linear programming; both strict and non-strict versions of this method are considered. Computational experiments were carried out, which in most cases showed the effectiveness of the proposed method compared to the generating-all-subsets method.
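To make the selection criteria concrete, the following sketch illustrates the generating-all-subsets baseline mentioned in the abstract: every regressor subset is fitted by OLS, and a subset is accepted only if all slope estimates are significant by Student's t-test and all pairwise intercorrelations are insignificant by the same test. This is an illustrative Python implementation, not the authors' MILP formulation; the critical value `t_crit` is taken as an assumed round figure of 2.0 rather than an exact quantile.

```python
import itertools
import numpy as np

def ols_fit(X, y):
    """OLS with intercept; returns coefficients, t-statistics, and R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = (resid @ resid) / dof                  # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)           # covariance of estimates
    t = beta / np.sqrt(np.diag(cov))                # t-statistics
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1.0 - (resid @ resid) / tss                # coefficient of determination
    return beta, t, r2

def intercorr_insignificant(sub, t_crit):
    """True if every pairwise correlation among regressors is
    insignificant by the t-test for a correlation coefficient."""
    n, size = sub.shape
    corr = np.corrcoef(sub, rowvar=False)
    for i in range(size):
        for j in range(i + 1, size):
            r = corr[i, j]
            t_r = abs(r) * np.sqrt(n - 2) / np.sqrt(1.0 - r * r)
            if t_r > t_crit:
                return False
    return True

def best_interpretable_subset(X, y, t_crit=2.0):
    """Exhaustive search over regressor subsets: keep those whose slope
    t-statistics are all significant and whose intercorrelations are all
    insignificant; return the admissible subset with the largest R^2."""
    best_idx, best_r2 = None, -np.inf
    m = X.shape[1]
    for size in range(1, m + 1):
        for idx in itertools.combinations(range(m), size):
            sub = X[:, idx]
            if not intercorr_insignificant(sub, t_crit):
                continue
            _, t, r2 = ols_fit(sub, y)
            if np.all(np.abs(t[1:]) >= t_crit) and r2 > best_r2:
                best_idx, best_r2 = idx, r2
    return best_idx, best_r2
```

Because the search enumerates all 2^m - 1 nonempty subsets, it is tractable only for small m; the MILP reformulation proposed in the article is precisely what avoids this exhaustive enumeration.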
Section
Methods and algorithms of computational mathematics and their applications
References
- F. Doshi-Velez and B. Kim, “Towards a Rigorous Science of Interpretable Machine Learning,” arXiv preprint. (2017).
doi 10.48550/arXiv.1702.08608
- C. Molnar, Interpretable Machine Learning. A Guide for Making Black Box Models Explainable (2020).
https://christophm.github.io/interpretable-ml-book/ . Cited November 21, 2025.
- S. A. Aivazjan and V. S. Mhitarjan, Applied Statistics and Basics of Econometrics (YUNITI, Moscow, 1998) [in Russian].
- A. Miller, Subset Selection in Regression (Chapman and Hall/CRC, New York, 2002).
doi 10.1201/9781420035933
- V. V. Strizhov and E. A. Krymova, Methods for Selecting Regression Models (Comp. Cent. of RAS, Moscow, 2010) [in Russian].
- T. Koch, T. Berthold, J. Pedersen, and C. Vanaret, “Progress in Mathematical Programming Solvers from 2001 to 2020,” EURO J. Comp. Opt. 10, Article Number 100031 (2022).
doi 10.1016/j.ejco.2022.100031
- H. Konno and R. Yamamoto, “Choosing the Best Set of Variables in Regression Analysis Using Integer Programming,” J. Glob. Opt. 44 (2), 273-282 (2009).
doi 10.1007/s10898-008-9323-9
- R. Miyashiro and Y. Takano, “Mixed Integer Second-Order Cone Programming Formulations for Variable Selection in Linear Regression,” Europ. J. Oper. Res. 247 (3), 721-731 (2015).
doi 10.1016/j.ejor.2015.06.081
- R. Miyashiro and Y. Takano, “Subset Selection by Mallows’ C_p: A Mixed Integer Programming Approach,” Exp. Syst. Appl. 42 (1), 325-331 (2015).
doi 10.1016/j.eswa.2014.07.056
- M. P. Bazilevskiy, “Reduction of the Problem of Selecting Informative Regressors when Estimating a Linear Regression Model by the Method of Least Squares to a Problem of Partial-Boolean Linear Programming,” Mod. Opt. Inf. Tech. 6 (1), 118-127 (2018).
https://moitvivt.ru/ru/journal/pdf?id=434 . Cited November 21, 2025.
- N. Shrestha, “Detecting Multicollinearity in Regression Analysis,” Amer. J. Appl. Math. Stat. 8 (2), 39-42 (2020).
doi 10.12691/ajams-8-2-1
- M. Aslam, “The T-Test of a Regression Coefficient for Imprecise Data,” Hac. J. Math. Stat. 53 (4), 1130-1140 (2024).
doi 10.15672/hujms.1342344
- A. N. Gorbach and N. A. Tseytlin, Buying Behavior: Analysis of Spontaneous Sequences and Regression Models in Marketing Research (Education of Ukraine, Kyiv, 2011) [in Russian].
- S. Chung, Y. W. Park, and T. Cheong, “A Mathematical Programming Approach for Integrated Multiple Linear Regression Subset Selection and Validation,” Pat. Recogn. 108, Article Number 107565 (2020).
doi 10.1016/j.patcog.2020.107565
- D. Bertsimas and M. L. Li, “Scalable Holistic Linear Regression,” Oper. Res. Let. 48 (3), 203-208 (2020).
doi 10.1016/j.orl.2020.02.008
- M. P. Bazilevskiy, “Comparative Analysis of the Effectiveness of Methods for Constructing Quite Interpretable Linear Regression Models,” Mod. D. Anal. 13 (4), 59-83 (2023).
https://psyjournals.ru/journals/mda/archive/2023_n4/mda_2023_n4_Bazilevskiy.pdf . Cited November 21, 2025.
- M. P. Bazilevskiy, “Selection of Informative Regressors Significant by Student’s T-Test in Regression Models Estimated Using OLS as a Partial Boolean Linear Programming Problem,” Proc. VSU. Ser.: Syst. Anal. Inform. Tech. No. 3, 5-16 (2021).
https://journals.vsu.ru/sait/article/view/3731/3801 . Cited November 21, 2025.
- E. Ferster and B. Rentz, Methods of Correlation and Regression Analysis (Finance and Statistics, Moscow, 1983) [in Russian].
- I. I. Eliseeva, S. V. Kurysheva, T. V. Kosteeva, et al., Econometrics (Finance and Statistics, Moscow, 2007) [in Russian].
- M. P. Bazilevskiy, “Optimization Problems of Subsets Selection in Linear Regression with Control of Its Significance Using F-Test,” Izv. RAS SamSC. 26 (6), 200-207 (2024).
https://ssc.smr.ru/media/journals/izvestia/2024/2024_6_200_207.pdf . Cited November 21, 2025.
- D. Ge, Q. Huangfu, Z. Wang, et al., Cardinal Optimizer (COPT) User Guide.
https://guide.coap.online/copt/en-doc . Cited November 21, 2025.
- UCI Machine Learning Repository.
https://doi.org/10.24432/C50K61.
https://archive.ics.uci.edu/dataset/203/yearpredictionmsd . Cited November 21, 2025.