.. currentmodule:: patsy
If you prefer to learn by diving in and getting your feet wet, then here are some cut-and-pasteable examples to play with.
First, let's import stuff and get some data to work with:
.. ipython:: python

   import numpy as np
   from patsy import dmatrices, dmatrix, demo_data
   data = demo_data("a", "b", "x1", "x2", "y")
:func:`demo_data` gives us a mix of categorical and numerical variables:
.. ipython:: python

   data
Of course, Patsy doesn't much care what sort of object you store
your data in, so long as it can be indexed like a Python dictionary,
``data[varname]``. You may prefer to store your data in a pandas
DataFrame, or a numpy record array... whatever makes you happy.
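For instance, here is a quick sketch showing that a plain dict of numpy
arrays works just as well (the variable names ``mydata`` and ``mat`` are
only illustrative):

```python
import numpy as np
from patsy import dmatrix

# A plain dict works: Patsy only needs data[varname]-style lookups.
mydata = {"x1": np.array([1.0, 2.0, 3.0]),
          "x2": np.array([4.0, 5.0, 6.0])}
mat = dmatrix("x1 + x2", mydata)
print(mat.design_info.column_names)  # ['Intercept', 'x1', 'x2']
```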
Now, let's generate design matrices suitable for regressing y onto
x1 and x2.
.. ipython:: python

   dmatrices("y ~ x1 + x2", data)
The return value is a Python tuple containing two :class:`DesignMatrix` objects, the first representing the left-hand side of our formula, and the second representing the right-hand side. Notice that an intercept term was automatically added to the right-hand side. These are just ordinary numpy arrays with some extra metadata and a fancy ``__repr__`` method attached, so we can pass them directly to a regression function like :func:`np.linalg.lstsq`:
.. ipython:: python

   outcome, predictors = dmatrices("y ~ x1 + x2", data)
   betas = np.linalg.lstsq(predictors, outcome, rcond=None)[0].ravel()
   for name, beta in zip(predictors.design_info.column_names, betas):
       print("%s: %s" % (name, beta))
Of course the resulting numbers aren't very interesting, since this is just random data.
If you just want the design matrix alone, without the ``y`` values,
use :func:`dmatrix` and leave off the ``y ~`` part at the beginning:
.. ipython:: python

   dmatrix("x1 + x2", data)
We'll use :func:`dmatrix` for the rest of the examples, since seeing the
outcome matrix over and over would get boring. This matrix's metadata
is stored in an extra attribute called ``.design_info``, which is a
:class:`DesignInfo` object you can explore at your leisure:
.. ipython::

   In [0]: d = dmatrix("x1 + x2", data)

   @verbatim
   In [0]: d.design_info.<TAB>
   d.design_info.builder              d.design_info.slice
   d.design_info.column_name_indexes  d.design_info.term_name_slices
   d.design_info.column_names         d.design_info.term_names
   d.design_info.describe             d.design_info.term_slices
   d.design_info.linear_constraint    d.design_info.terms
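As a small sketch of what a few of these attributes do (``slice``,
``term_names``, and ``describe`` are real :class:`DesignInfo` members, but
the exact printed output may vary by Patsy version):

```python
from patsy import dmatrix, demo_data

data = demo_data("a", "b", "x1", "x2", "y")
d = dmatrix("x1 + x2", data)
info = d.design_info

print(info.column_names)   # ['Intercept', 'x1', 'x2']
print(info.term_names)     # term names, one per model term
print(info.slice("x1"))    # the slice selecting x1's column(s)
print(info.describe())     # a compact formula-like description
```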
Usually the intercept is useful, but if we don't want it we can get rid of it:
.. ipython:: python

   dmatrix("x1 + x2 - 1", data)
We can transform variables using arbitrary Python code:
.. ipython:: python

   dmatrix("x1 + np.log(x2 + 10)", data)
Notice that ``np.log`` is being pulled out of the environment where
:func:`dmatrix` was called -- ``np.log`` is accessible because we did
``import numpy as np`` up above. Any functions or variables that you
could reference when calling :func:`dmatrix` can also be used inside
the formula passed to :func:`dmatrix`. For example:
.. ipython:: python

   new_x2 = data["x2"] * 100
   dmatrix("new_x2")
Patsy has some transformation functions "built in" that are automatically accessible to your code:
.. ipython:: python

   dmatrix("center(x1) + standardize(x2)", data)
See :mod:`patsy.builtins` for a complete list of functions made available to formulas. You can also define your own transformation functions in the ordinary Python way:
.. ipython:: python

   def double(x):
       return 2 * x
   dmatrix("x1 + double(x1)", data)
Arithmetic transformations are also possible, but you'll need to
"protect" them by wrapping them in ``I()``, so that Patsy knows
that you really do want ``+`` to mean addition:
.. ipython:: python

   dmatrix("I(x1 + x2)", data)  # compare to "x1 + x2"
Note that while Patsy goes to considerable efforts to take in data represented using different Python data types and convert them into a standard representation, all this work happens *after* any transformations you perform as part of your formula. So, for example, if your data is in the form of numpy arrays, ``+`` will perform element-wise addition, but if it is in standard Python lists, it will perform concatenation:
.. ipython:: python

   dmatrix("I(x1 + x2)", {"x1": np.array([1, 2, 3]), "x2": np.array([4, 5, 6])})
   dmatrix("I(x1 + x2)", {"x1": [1, 2, 3], "x2": [4, 5, 6]})
Patsy becomes particularly useful when you have categorical data. If you use a predictor that has a categorical type (e.g. strings or bools), it will be automatically coded. Patsy automatically chooses an appropriate way to code categorical data to avoid producing a redundant, overdetermined model.
If there is just one categorical variable alone, the default is to dummy code it:
.. ipython:: python

   dmatrix("0 + a", data)
But if you did that and put the intercept back in, you'd get a redundant model. So if the intercept is present, Patsy uses a reduced-rank contrast code (treatment coding by default):
.. ipython:: python

   dmatrix("a", data)
The ``T.`` notation is there to remind you that these columns are
treatment coded.
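You can also request treatment coding explicitly through ``C()`` and
choose which level serves as the reference. ``Treatment(reference=...)``
is a real Patsy feature, but this particular example is a sketch rather
than something shown above:

```python
from patsy import dmatrix, demo_data

data = demo_data("a", "b", "x1", "x2", "y")

# Treatment coding with the second level (index 1, i.e. "a2") as the
# reference, instead of the default first level "a1".
mat = dmatrix("C(a, Treatment(reference=1))", data)
print(mat.design_info.column_names)
```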
Interactions are also easy -- they represent the Cartesian product of
all the factors involved. Here's a dummy coding of each combination
of values taken by ``a`` and ``b``:
.. ipython:: python

   dmatrix("0 + a:b", data)
But interactions also know how to use contrast coding to avoid redundancy. If you have both main effects and interactions in a model, then Patsy goes from lower-order effects to higher-order effects, adding in just enough columns to produce a well-defined model. The result is that each set of columns measures the additional contribution of this effect -- just what you want for a traditional ANOVA:
.. ipython:: python

   dmatrix("a + b + a:b", data)
Since this is so common, there's a convenient short-hand:
.. ipython:: python

   dmatrix("a*b", data)
Of course you can use :ref:`other coding schemes <categorical-coding-ref>` too (or even :ref:`define your own <categorical-coding>`). Here's :class:`orthogonal polynomial coding <Poly>`:
.. ipython:: python

   dmatrix("C(c, Poly)", {"c": ["c1", "c1", "c2", "c2", "c3", "c3"]})
You can even write interactions between categorical and numerical
variables. Here we fit two different slope coefficients for ``x1``;
one for the ``a1`` group, and one for the ``a2`` group:
.. ipython:: python

   dmatrix("a:x1", data)
The same redundancy avoidance code works here, so if you'd rather have
treatment-coded slopes (one slope for the ``a1`` group, and a second
for the difference between the ``a1`` and ``a2`` group slopes), then
you can request it like this:
.. ipython:: python

   # compare to the difference between "0 + a" and "1 + a"
   dmatrix("x1 + a:x1", data)
And more complex expressions work too:
.. ipython:: python

   dmatrix("C(a, Poly):center(x1)", data)