Skip to content

Latest commit

 

History

History
222 lines (149 loc) · 6.83 KB

File metadata and controls

222 lines (149 loc) · 6.83 KB

Quickstart

.. currentmodule:: patsy

If you prefer to learn by diving in and getting your feet wet, then here are some cut-and-pasteable examples to play with.

First, let's import stuff and get some data to work with:

.. ipython:: python

   import numpy as np
   from patsy import dmatrices, dmatrix, demo_data
   data = demo_data("a", "b", "x1", "x2", "y")

:func:`demo_data` gives us a mix of categorical and numerical variables:

.. ipython:: python

   data

Of course Patsy doesn't much care what sort of object you store your data in, so long as it can be indexed like a Python dictionary, data[varname]. You may prefer to store your data in a pandas DataFrame, or a numpy record array... whatever makes you happy.

Now, let's generate design matrices suitable for regressing y onto x1 and x2.

.. ipython:: python

   dmatrices("y ~ x1 + x2", data)

The return value is a Python tuple containing two DesignMatrix objects, the first representing the left-hand side of our formula, and the second representing the right-hand side. Notice that an intercept term was automatically added to the right-hand side. These are just ordinary numpy arrays with some extra metadata and a fancy __repr__ method attached, so we can pass them directly to a regression function like :func:`np.linalg.lstsq`:

.. ipython:: python

   outcome, predictors = dmatrices("y ~ x1 + x2", data)
   betas = np.linalg.lstsq(predictors, outcome)[0].ravel()
   for name, beta in zip(predictors.design_info.column_names, betas):
       print("%s: %s" % (name, beta))

Of course the resulting numbers aren't very interesting, since this is just random data.

If you just want the design matrix alone, without the y values, use :func:`dmatrix` and leave off the y ~ part at the beginning:

.. ipython:: python

   dmatrix("x1 + x2", data)

We'll use dmatrix for the rest of the examples, since seeing the outcome matrix over and over would get boring. This matrix's metadata is stored in an extra attribute called .design_info, which is a :class:`DesignInfo` object you can explore at your leisure:

.. ipython::

   In [0]: d = dmatrix("x1 + x2", data)

   @verbatim
   In [0]: d.design_info.<TAB>
   d.design_info.builder              d.design_info.slice
   d.design_info.column_name_indexes  d.design_info.term_name_slices
   d.design_info.column_names         d.design_info.term_names
   d.design_info.describe             d.design_info.term_slices
   d.design_info.linear_constraint    d.design_info.terms

Usually the intercept is useful, but if we don't want it we can get rid of it:

.. ipython:: python

   dmatrix("x1 + x2 - 1", data)

We can transform variables using arbitrary Python code:

.. ipython:: python

   dmatrix("x1 + np.log(x2 + 10)", data)

Notice that np.log is being pulled out of the environment where :func:`dmatrix` was called -- np.log is accessible because we did import numpy as np up above. Any functions or variables that you could reference when calling :func:`dmatrix` can also be used inside the formula passed to :func:`dmatrix`. For example:

.. ipython:: python

   new_x2 = data["x2"] * 100
   dmatrix("new_x2")

Patsy has some transformation functions "built in", that are automatically accessible to your code:

.. ipython:: python

   dmatrix("center(x1) + standardize(x2)", data)

See :mod:`patsy.builtins` for a complete list of functions made available to formulas. You can also define your own transformation functions in the ordinary Python way:

.. ipython:: python

   def double(x):
       return 2 * x
   dmatrix("x1 + double(x1)", data)

Arithmetic transformations are also possible, but you'll need to "protect" them by wrapping them in I(), so that Patsy knows that you really do want + to mean addition:

.. ipython:: python

   dmatrix("I(x1 + x2)", data)  # compare to "x1 + x2"

Note that while Patsy goes to considerable efforts to take in data represented using different Python data types and convert them into a standard representation, all this work happens after any transformations you perform as part of your formula. So, for example, if your data is in the form of numpy arrays, "+" will perform element-wise addition, but if it is in standard Python lists, it will perform concatentation:

.. ipython:: python

   dmatrix("I(x1 + x2)", {"x1": np.array([1, 2, 3]), "x2": np.array([4, 5, 6])})
   dmatrix("I(x1 + x2)", {"x1": [1, 2, 3], "x2": [4, 5, 6]})

Patsy becomes particularly useful when you have categorical data. If you use a predictor that has a categorical type (e.g. strings or bools), it will be automatically coded. Patsy automatically chooses an appropriate way to code categorical data to avoid producing a redundant, overdetermined model.

If there is just one categorical variable alone, the default is to dummy code it:

.. ipython:: python

   dmatrix("0 + a", data)

But if you did that and put the intercept back in, you'd get a redundant model. So if the intercept is present, Patsy uses a reduced-rank contrast code (treatment coding by default):

.. ipython:: python

   dmatrix("a", data)

The T. notation is there to remind you that these columns are treatment coded.

Interactions are also easy -- they represent the cartesian product of all the factors involved. Here's a dummy coding of each combination of values taken by a and b:

.. ipython:: python

   dmatrix("0 + a:b", data)

But interactions also know how to use contrast coding to avoid redundancy. If you have both main effects and interactions in a model, then Patsy goes from lower-order effects to higher-order effects, adding in just enough columns to produce a well-defined model. The result is that each set of columns measures the additional contribution of this effect -- just what you want for a traditional ANOVA:

.. ipython:: python

   dmatrix("a + b + a:b", data)

Since this is so common, there's a convenient short-hand:

.. ipython:: python

   dmatrix("a*b", data)

Of course you can use :ref:`other coding schemes <categorical-coding-ref>` too (or even :ref:`define your own <categorical-coding>`). Here's :class:`orthogonal polynomial coding <Poly>`:

.. ipython:: python

   dmatrix("C(c, Poly)", {"c": ["c1", "c1", "c2", "c2", "c3", "c3"]})

You can even write interactions between categorical and numerical variables. Here we fit two different slope coefficients for x1; one for the a1 group, and one for the a2 group:

.. ipython:: python

   dmatrix("a:x1", data)

The same redundancy avoidance code works here, so if you'd rather have treatment-coded slopes (one slope for the a1 group, and a second for the difference between the a1 and a2 group slopes), then you can request it like this:

.. ipython:: python

   # compare to the difference between "0 + a" and "1 + a"
   dmatrix("x1 + a:x1", data)

And more complex expressions work too:

.. ipython:: python

   dmatrix("C(a, Poly):center(x1)", data)