LECTURE 9
DATA STRUCTURES
AND ANALYSIS WITH
PANDAS
By convention, pandas is imported under the name pd, so whenever you see
pd. in code, it refers to pandas. You may also find it easier to import
Series and DataFrame into the local namespace since they are so frequently used:
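The import convention described above can be sketched as follows. The variable names `obj` and `same_class` are only illustrative.

```python
import pandas as pd
from pandas import Series, DataFrame

# With the direct import, Series and pd.Series refer to the same class.
obj = Series([1, 2, 3])
same_class = pd.Series is Series
print(same_class)
print(type(obj))
```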
Introduction to pandas Data Structures
Series
■ A Series is a one-dimensional array-like object containing a sequence of
values (of similar types to NumPy types) and an associated array of data
labels, called its index. The simplest Series is formed from only an array
of data:
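A minimal sketch of building a Series, assuming some made-up sample values. When no index is given, pandas supplies default integer labels starting at 0; an explicit index can also be passed.

```python
import pandas as pd

# Series built from only an array of data: the index defaults to 0..N-1.
obj = pd.Series([4, 7, -5, 3])
print(obj)

# The same data with explicit labels (illustrative label names).
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2.index)
print(obj2["a"])
```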
■ Using NumPy functions or NumPy-like operations, such as
filtering with a boolean array, scalar multiplication, or applying
math functions, will preserve the index-value link:
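The following sketch illustrates how the labels stay attached to their values under these operations. The sample data and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Filtering with a boolean array keeps each surviving value's label.
positive = obj2[obj2 > 0]

# Scalar multiplication and NumPy math functions return a Series
# carrying the same index as the input.
doubled = obj2 * 2
exponentials = np.exp(obj2)

print(positive)
print(doubled)
print(exponentials)
```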
■ Another way to think about a Series is as a fixed-length, ordered dict, as it is a
mapping of index values to data values. It can be used in many contexts
where you might use a dict:
■ Should you have data contained in a Python dict, you can create a Series
from it by passing the dict:
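Both points can be sketched in a few lines. The state names and numbers below are made-up sample data. The second construction passes an explicit index, which is one way (an assumption beyond the text above) to control which keys appear and in what order.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Dict-like membership test: `in` checks the index labels.
has_b = "b" in obj2
has_e = "e" in obj2
print(has_b, has_e)

# Creating a Series directly from a dict: keys become the index.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
print(obj3)

# With an explicit index, labels missing from the dict get NaN and
# dict keys not listed in the index are left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)
```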
DataFrame
■ A DataFrame represents a rectangular table of data and
contains an ordered collection of columns, each of which
can be a different value type (numeric, string, boolean, etc.).
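One common way to construct a DataFrame is from a dict of equal-length lists, sketched here with made-up sample data. Each column ends up with its own value type.

```python
import pandas as pd

# Illustrative data: one string column, one integer column, one float column.
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data)

print(frame)
print(frame.dtypes)  # one dtype per column
```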
DataFrame
■ When you are assigning lists or arrays to a column, the value’s
length must match the length of the DataFrame. If you assign a
Series, its labels will be realigned exactly to the DataFrame’s index,
inserting missing values in any holes:
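A small sketch of both cases, using made-up labels and values. A list of the wrong length is rejected, while a Series is aligned by label and leaves NaN where it has no matching label.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A list must have exactly as many elements as the DataFrame has rows.
try:
    frame["debt"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)

# A Series is realigned to the DataFrame's index: "two" matches,
# "four" is not in the frame and is dropped, other rows get NaN.
val = pd.Series([-1.2, -1.5], index=["two", "four"])
frame["debt"] = val
print(frame)
```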
Possible data inputs to DataFrame constructor
Index Objects
■ pandas’s Index objects are responsible for holding the axis labels
and other metadata (like the axis name or names). Any array or
other sequence of labels you use when constructing a Series or
DataFrame is internally converted to an Index:
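The conversion can be observed directly, as in this sketch with illustrative labels. The immutability check at the end goes beyond the text above; it shows that item assignment on an Index raises a TypeError.

```python
import pandas as pd

# A plain Python list of labels is converted to an Index internally.
obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index
is_index = isinstance(index, pd.Index)
print(index)

# An Index can also carry metadata such as a name.
named = pd.Index(["a", "b", "c"], name="letters")
print(named.name)

# Index objects do not support item assignment.
try:
    index[1] = "d"
    immutable = False
except TypeError:
    immutable = True
print(immutable)
```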
Index methods and properties
Reindexing
■ An important method on pandas objects is reindex, which
means to create a new object with the data conformed to a
new index.
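A minimal sketch of reindex on a Series, with made-up values. The result is a new object arranged according to the new index; labels that were not present before receive missing values, and the original is left unchanged.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Rearranges the data to follow the new index; "e" did not exist, so it is NaN.
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

print(obj2)
print(obj.index)  # original object keeps its own index
```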
Reindex function arguments
Dropping Entries from an Axis
■ Dropping one or more entries from an axis is easy if you already have an index array
or list without those entries. As that can require a bit of munging and set logic, the
drop method will return a new object with the indicated value or values deleted from
an axis:
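The drop method can be sketched as follows, using illustrative labels. For a DataFrame, the example drops rows by label by default and uses the `columns` keyword to drop from the other axis.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])
new_obj = obj.drop("c")  # new Series without label "c"
print(new_obj)

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

rows_dropped = data.drop(["Colorado", "Ohio"])  # drop row labels
cols_dropped = data.drop(columns=["two"])       # drop a column label
print(rows_dropped)
print(cols_dropped)
```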
Indexing, Selection and Filtering
■ Series indexing (obj[...]) works analogously to NumPy array
indexing, except you can use the Series’s index values
instead of only integers
■ Slicing with labels behaves differently than normal Python
slicing in that the endpoint is inclusive
■ Setting using these methods modifies the corresponding
section of the Series
■ Indexing into a DataFrame is for retrieving one or more
columns either with a single value or sequence
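The four points above can be sketched together. Labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj["b"]               # single label
by_labels = obj[["b", "a", "d"]]  # list of labels

# Label slices include the endpoint: this selects both "b" and "c".
slice_labels = list(obj["b":"c"].index)
print(slice_labels)

# Assigning through a label slice modifies that section of the Series.
obj["b":"c"] = 5
print(obj)

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

one_col = frame["two"]               # single column, returned as a Series
two_cols = frame[["three", "one"]]   # sequence of columns, returned as a DataFrame
print(two_cols)
```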
Selection with loc and iloc
■ For DataFrame label-indexing on the rows, the special
indexing operators loc and iloc are introduced. They
enable you to select a subset of the rows and columns
from a DataFrame with NumPy-like notation using either
axis labels (loc) or integers (iloc).
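A short sketch contrasting the two operators on the same made-up table. The loc calls use axis labels, and the iloc calls select the same cells by integer position.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# One row and two columns, selected by label.
by_label = data.loc["Colorado", ["two", "three"]]

# The same selection by integer position (row 1, columns 1 and 2).
by_position = data.iloc[1, [1, 2]]

# Label slices include the endpoint; positional slices do not.
label_slice = data.loc[:"Utah", "two"]
position_slice = data.iloc[:2, :3]

print(by_label)
print(label_slice)
print(position_slice)
```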
Indexing options with DataFrame
Arithmetic and Data Alignment
■ An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of
the index pairs. This is similar to an automatic outer join on the index labels.
■ The internal data alignment introduces missing values in the label locations that
don’t overlap. Missing values will then propagate in further arithmetic computations.
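The alignment behavior can be sketched with two Series whose indexes only partly overlap (made-up values). The last step uses the add method with fill_value, one of pandas's flexible arithmetic methods, as a possible way to avoid the missing values; that step goes beyond the text above.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=["a", "c", "e", "f", "g"])

# The result's index is the union of both indexes; labels found in
# only one operand ("d", "f", "g") become NaN.
total = s1 + s2
print(total)

# Missing values propagate through further arithmetic.
scaled = total * 2
print(scaled)

# Treat a label missing from one side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```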
Flexible arithmetic methods
Function Application and Mapping
■ NumPy ufuncs (element-wise array methods) also work with pandas objects.
■ Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply method does exactly this.
■ Many of the most common array statistics (like sum and mean) are DataFrame
methods, so using apply is not necessary.
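The three points above can be sketched on a small made-up table. The helper name `spread` is hypothetical; it computes the difference between the maximum and minimum of a one-dimensional array.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0]],
                     columns=list("bde"), index=["Utah", "Ohio"])

# A NumPy ufunc applied element-wise; labels are preserved.
absolute = np.abs(frame)

def spread(x):
    """Hypothetical helper: max minus min of a 1-D array."""
    return x.max() - x.min()

per_column = frame.apply(spread)              # called once per column
per_row = frame.apply(spread, axis="columns") # called once per row

# Common statistics are already DataFrame methods, no apply needed.
col_sums = frame.sum()

print(absolute)
print(per_column)
print(per_row)
print(col_sums)
```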
Sorting and Ranking
■ Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns a new,
sorted object:
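A minimal sketch of sort_index on a Series and on either axis of a DataFrame, with illustrative labels. In each case a new object is returned and the original keeps its order.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(4), index=["d", "a", "b", "c"])
sorted_obj = obj.sort_index()
print(sorted_obj)

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])

by_rows = frame.sort_index()                # sort by row labels
by_cols = frame.sort_index(axis="columns")  # sort by column labels
print(by_rows)
print(by_cols)
```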
■ Ranking assigns ranks from one through the number of valid data points in an array. The
rank methods for Series and DataFrame are the place to look; by default rank breaks ties by
assigning each group the mean rank:
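The default tie-breaking can be sketched with a Series containing repeated values (made-up data). The second call uses method="first", one alternative tie-breaking option, which ranks tied values in the order they appear.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they would occupy.
# The two 7s occupy ranks 6 and 7, so each gets 6.5.
ranks = obj.rank()
print(ranks)

# Alternative: break ties by order of appearance in the data.
first_ranks = obj.rank(method="first")
print(first_ranks)
```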
Tie-breaking methods with rank
Axis Indexes with Duplicate Labels
Summarizing and Computing
Descriptive Statistics
Descriptive and summary statistics
Unique Values, Value Counts and
Membership
