I’m taking the Coursera Regression
class to keep my skills sharp, and
to get more comfortable using Python for Data Science instead of R. I used to
use R for my data tasks, but it was always a frustrating experience. There are
some things that R definitely does better, such as graphics, but now it’s
straightforward to call R from Python for these one-off tasks, while the same
can’t be said for calling Python from R.
The Coursera class uses R, so I translate all of the code into Python and
usually get the same result, but not always, and here is one example.
The original point of this example is to show that if you have a linear regression in the form
then if you create another regressor that is a linear combination
of any of the other ’s, then you are not adding any new information and
you can throw out the new . The technical way to say this is that
the new regressor has perfect collinearity with existing regressors. See this
for more information on collinearity.
For example, in the swiss dataset I create another regressor z
that is the linear combination of two of the existing regressors.
R gives a NA value for the coefficient of z, which is what we expect.
Python’s statsmodelsols method does not recognize that z has perfect
collinearity with the other regressors, though there is a warning about the
zero eigenvalue and the strong possibility of multicollinearity.
 The smallest eigenvalue is 3.87e-11. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
I originally did this work in
thisIPython notebook. In the IPYthon notebook,
there is no warning about the eigenvalues, so you have to be careful to check
for collinearity yourself. Beacuse Python does not elimiate z, it also gives
it a value, and therefore some of the values for the other
regressors are also different!
If you’re curious for more details, see the
I posted about this topic on StatsExchange.com.
The table below summarizes the differences and you can see how 3 of the coefficients are different.
In the end, the lesson is that be sure you need to understand some of the
details of what your stats package is doing. I like the idea of running models
in Python and R, to make sure you get the same answer, and if not, I know then
to investigate further.
data(swiss)swiss$z <- swiss$Agriculture + swiss$Education
formula <-'Fertility ~ .'print(lm(formula, data=swiss))
lm(formula = formula, data = swiss)Coefficients:
(Intercept) Agriculture Examination Education
66.9152-0.1721-0.2580-0.8709 Catholic Infant.Mortality z
importstatsmodels.formula.apiassm# load swiss dataset from Rimportpandas.rpy.commonascomswiss=com.load_data('swiss')# get rid of periods in column namesswiss.columns=[_.replace('.','_')for_inswiss.columns]# add clearly duplicative dataswiss['z']=swiss['Agriculture']+swiss['Education']y='Fertility'x="+".join(swiss.columns-[y])formula='%s ~ %s'%(y,x)printsm.ols(formula,data=swiss).fit().summary()