Python/pandas

From Christoph's Personal Wiki

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets.

Pandas in deep learning / machine learning

Loading in data
  • Deep learning and machine learning models learn from data, so loading data should become second nature
  • Unstructured data => the Internet
  • Semi-structured data => Apache logs
  • Structured data => Kaggle and other datasets (usually in CSV format)
  • Each row is a record
  • Each record's values are separated by commas
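To make the row-per-record layout concrete, here is a minimal sketch that parses a small CSV string with pandas. The values are made up for illustration and do not come from any of the datasets mentioned on this page.

```python
import io
import pandas as pd

# Hypothetical CSV text: three records, three comma-separated values each.
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n7.0,8.0,9.0\n"

# header=None tells pandas that the first line is data, not column names.
records = pd.read_csv(io.StringIO(csv_text), header=None)

print(records.shape)  # (number of rows, number of columns)
print(records)
```

Each line of the text becomes one row of the DataFrame, and each comma-separated value becomes one cell.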
Basics

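Note: The following examples and datasets are taken from the machine_learning_examples collection; the comment at the top of each snippet names the directory it comes from.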

# machine_learning_examples/linear_regression_class/
import numpy as np

X = []
with open("data_2d.csv") as f:
    for line in f:
        row = line.split(',')
        # In Python 3, map() returns a lazy iterator, so wrap it in list()
        sample = list(map(float, row))
        X.append(sample)
# X => a list of lists
X = np.array(X)
X.shape

A better way of doing the above is to use pandas:

import pandas as pd

X = pd.read_csv("data_2d.csv", header=None)
X.head()
type(X) # => pandas.core.frame.DataFrame
X.info()
X.head(10)
M = X.to_numpy() # as_matrix() was removed in pandas 1.0
type(M) # => numpy.ndarray
type(X[0]) # => pandas.core.series.Series
X.iloc[0] # => first row
X.loc[0] # => row labelled 0, which is the first row here (.ix was removed in pandas 1.0)
type(X.loc[0]) # => pandas.core.series.Series (anything 1D in pandas is a Series)
X[[0,2]] # => columns 0 and 2, returned as a DataFrame
# Find all rows where the zeroth column < 5
X[ X[0] < 5 ]
X[0] < 5 # => boolean Series, one True/False per row
type(X[0] < 5) # => pandas.core.series.Series
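The selection idioms above can be tried without the CSV file. The following is a minimal sketch on a small, made-up DataFrame that stands in for data_2d.csv; the values are hypothetical, and the integer column labels mimic what header=None produces.

```python
import pandas as pd

# Hypothetical stand-in for data_2d.csv (integer column labels, as with header=None).
X = pd.DataFrame([[1.0, 10.0, 100.0],
                  [6.0, 20.0, 200.0],
                  [3.0, 30.0, 300.0]])

first_row = X.iloc[0]   # row by position -> Series
col0 = X[0]             # column by label -> Series
subset = X[[0, 2]]      # columns labelled 0 and 2 -> DataFrame
mask = X[0] < 5         # element-wise comparison -> boolean Series
small = X[mask]         # keep only the rows where the mask is True
M = X.to_numpy()        # underlying values as a numpy.ndarray

print(small)
```

Filtering keeps the original row labels, so the result here retains index values 0 and 2.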
  • Working with CSV files with headers and footers:
# machine_learning_examples/airline/
import pandas as pd

df = pd.read_csv("international-airline-passengers.csv", engine="python", skipfooter=3)
df.columns

# rename columns
df.columns = ["month", "passengers"]
df["passengers"]
df.passengers

# add a new column
df['ones'] = 1
df.columns
df.head()
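As a rough illustration of what skipfooter does, the sketch below parses an in-memory CSV string with a header line and two trailing footer lines. The text is made up for the example and is not the actual airline file. skipfooter is combined with engine="python" because the default C parser does not support it.

```python
import io
import pandas as pd

# Hypothetical file contents: one header line, three records, two footer lines.
csv_text = (
    "Month,Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "1949-03,132\n"
    "illustrative footer line 1\n"
    "illustrative footer line 2\n"
)

# Drop the last two lines before parsing the records.
df = pd.read_csv(io.StringIO(csv_text), engine="python", skipfooter=2)

# Rename the columns and add a constant column, as in the snippet above.
df.columns = ["month", "passengers"]
df["ones"] = 1

print(df)
```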
The apply() function
  • What if we want to assign a new column value, where each cell is derived from the values already in its row?
  • Example: Model interaction between X1 and X2 => X1*X2
  • We use the apply function:
df['x1x2'] = df.apply(lambda row: row['x1'] * row['x2'], axis=1)
  • Pass in axis=1 so the function gets applied across each row instead of each column
  • Think of it like Python's map() function.
  • The lambda function is the same as doing the following:
def get_interaction(row):
    return row['x1'] * row['x2']

df['x1x2'] = df.apply(get_interaction, axis=1)
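A self-contained sketch of the interaction-term example follows, using a made-up DataFrame (the x1 and x2 values are hypothetical). It also shows that, for simple arithmetic like this, multiplying the columns directly gives the same result as the row-wise apply.

```python
import pandas as pd

# Hypothetical frame with the x1 and x2 columns used in the example above.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 30.0]})

# Row-wise: with axis=1 the function receives each row as a Series.
df["x1x2_apply"] = df.apply(lambda row: row["x1"] * row["x2"], axis=1)

# Vectorized: multiply the two columns element by element.
df["x1x2_vec"] = df["x1"] * df["x2"]

print(df)
```

The vectorized form is usually faster on large tables; apply remains useful when the per-row logic cannot be expressed as column operations.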

Apply this logic to our dataset:

from datetime import datetime
datetime.strptime("1949-01", "%Y-%m")
df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], "%Y-%m"), axis=1)
df.info()
df.head()
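For date parsing specifically, pandas also provides pd.to_datetime, which converts a whole column at once. The sketch below is an alternative to the apply-based version above, not a replacement for it, and uses a small made-up month column.

```python
import pandas as pd

# Hypothetical month column in the same "YYYY-MM" format as above.
df = pd.DataFrame({"month": ["1949-01", "1949-02", "1949-03"]})

# Parse the whole column in one call; each value becomes a Timestamp.
df["dt"] = pd.to_datetime(df["month"], format="%Y-%m")

print(df.dtypes)
print(df.head())
```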
Joins
# machine_learning_examples/numpy_class
import pandas as pd
t1 = pd.read_csv("table1.csv")
t2 = pd.read_csv("table2.csv")
t1
t2
m = pd.merge(t1, t2, on='user_id')
m
t1.merge(t2, on="user_id") # => same result
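Since table1.csv and table2.csv are not shown here, the following sketch uses two small made-up tables to show what a merge on user_id produces. All column names other than user_id, and all values, are hypothetical. By default pd.merge performs an inner join; how="left" keeps every row of the left table.

```python
import pandas as pd

# Hypothetical tables sharing a user_id key.
t1 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
t2 = pd.DataFrame({
    "user_id": [1, 1, 3, 4],
    "ad_id": [10, 11, 12, 13],
})

# Inner join (the default): only user_ids present in both tables survive.
inner = pd.merge(t1, t2, on="user_id")

# Left join: keep every row of t1; unmatched rows get NaN in t2's columns.
left = pd.merge(t1, t2, on="user_id", how="left")

print(inner)
print(left)
```

User 1 appears twice in the result because it matches two rows of the second table, while user 2 only appears in the left join.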

See also

External links