Cubic regression splines are a spline-based approach to regression analysis that fits within the linear (and generalized linear) model framework. The fit is typically constructed from a B-spline basis: a set of locally supported basis functions defined on an interval partitioned by chosen knots. Cubic regression splines are widely used for modeling nonlinear relationships and interactions between variables. Unlike traditional methods such as polynomial regression or broken-stick regression, cubic regression splines take both smoothness and local influence into consideration (Faraway, 2015). According to Julian J. Faraway, each basis function should be a continuous cubic polynomial that is nonzero on an interval defined by the knots and zero everywhere else, which ensures local influence; in addition, the first and second derivatives of the polynomials should be continuous at each knot, which ensures overall smoothness (Faraway, 2015). More details on cubic regression splines can be found at link
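As a quick illustration (not part of the tutorial's own examples), both properties above can be checked numerically with SciPy's `BSpline`; the uniform knot sequence here is an arbitrary choice for demonstration:

```python
import numpy as np
from scipy.interpolate import BSpline

# Uniform knot sequence; with degree 3 this yields 8 cubic B-spline
# basis functions, each supported on only 4 adjacent knot intervals.
knots = np.arange(12, dtype=float)
degree = 3
n_basis = len(knots) - degree - 1          # 8 basis functions

# Pick out the first basis function by setting a single coefficient to 1.
c = np.zeros(n_basis)
c[0] = 1.0
b0 = BSpline(knots, c, degree)             # supported on [0, 4] only

print(b0(3.5))   # inside its support: nonzero (local influence)
print(b0(6.0))   # inside the spline's domain but outside the support: zero

# Smoothness: the second derivative stays continuous across a knot.
c2 = np.zeros(n_basis)
c2[2] = 1.0
b2 = BSpline(knots, c2, degree)            # supported on [2, 6]
d2 = b2.derivative(2)
print(d2(5 - 1e-8), d2(5 + 1e-8))          # nearly equal on both sides of knot 5
```

Because each basis function vanishes outside a few knot intervals, a data point can only affect the fitted curve near its own location, which is the local-influence property the text describes.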
We will provide three analysis examples: one in R using the splines package, one in Stata using the bspline package, and one in Python using statsmodels, sklearn.metrics, and patsy. In each example, we first clean the data and remove outliers, then fit OLS and polynomial regression models as baselines, and finally fit a cubic regression spline model and compare its goodness of fit to that of the baselines.
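A minimal Python sketch of this workflow follows. It uses synthetic data rather than the tutorial's dataset, omits the outlier-removal step, computes R² by hand rather than with sklearn.metrics, and fits the cubic regression spline with SciPy's least-squares spline fitter rather than patsy/statsmodels, so it is only an outline of the comparison, not the tutorial's code:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, size=x.size)   # nonlinear toy data

def r_squared(y, yhat):
    """Coefficient of determination, computed directly."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# 1. OLS straight line (global, no curvature)
yhat_ols = np.polyval(np.polyfit(x, y, 1), x)

# 2. Cubic polynomial (global fit; every point influences the whole curve)
yhat_poly = np.polyval(np.polyfit(x, y, 3), x)

# 3. Cubic regression spline: least-squares fit with 4 interior knots
knots = np.linspace(x.min(), x.max(), 6)[1:-1]
spline = LSQUnivariateSpline(x, y, t=knots, k=3)
yhat_spl = spline(x)

for name, yhat in [("OLS", yhat_ols),
                   ("cubic poly", yhat_poly),
                   ("cubic spline", yhat_spl)]:
    print(f"{name:12s} R^2 = {r_squared(y, yhat):.3f}")
```

On data like this, the spline's locally supported basis lets it track each bend of the curve, so its in-sample R² exceeds that of the two global fits, mirroring the comparison carried out in the examples.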
The results show that the cubic spline achieves the best fit, outperforming both OLS and polynomial regression; as noted above, this benefit comes from its attention to local influence. In addition, the fitted cubic spline is smooth, and its advantages over the other two methods are expected to be even more pronounced on higher-dimensional data. Given the limited scope of this tutorial, however, we focus only on the local-influence advantage of cubic splines.