Difference between revisions of "Python/pandas"

Latest revision as of 23:53, 30 March 2017

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets.

Pandas in deep learning / machine learning

Loading in data

Deep learning / machine learning models learn from data, so loading data should become second nature
Unstructured data => the Internet
Semi-structured data => Apache logs
Structured data => Kaggle and other datasets (usually in CSV format)
Each row is a record
Each record's values are separated by commas
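
As a rough illustration of that CSV layout, the sketch below parses a small in-memory string instead of a file. The values are made up for the example and do not come from any dataset mentioned on this page.

```python
import io

import pandas as pd

# Hypothetical CSV content: three records, each with three comma-separated values.
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n7.0,8.0,9.0\n"

# header=None tells pandas that the first line is data, not column names.
records = pd.read_csv(io.StringIO(csv_text), header=None)

print(records.shape)  # (number of records, number of values per record)
```

Each line of the string becomes one row of the DataFrame, and each comma-separated value becomes one cell.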

Basics

Note: The following examples and datasets are taken from here.

# machine_learning_examples/linear_regression_class/
import numpy as np

X = []
for line in open("data_2d.csv"):
    row = line.split(',')
    sample = list(map(float, row))  # in Python 3, map() is lazy, so build a list
    X.append(sample)
# X => a list of lists
X = np.array(X)
X.shape

A better way of doing the above is to use pandas:

import pandas as pd

X = pd.read_csv("data_2d.csv", header=None)
X.head()
type(X) # => pandas.core.frame.DataFrame
X.info()
X.head(10)
M = X.to_numpy() # as_matrix() was removed in pandas 1.0
type(M) # => numpy.ndarray
type(X[0]) # => pandas.core.series.Series
X.iloc[0] # => first row
X.loc[0] # => first row, selected by index label (.ix was removed in pandas 1.0)
type(X.loc[0]) # => pandas.core.series.Series (anything 1D in pandas is a series)
X[[0,2]] # => columns 0 and 2 in full (the first and third columns)
# Find all rows where the zeroth column < 5
X[ X[0] < 5 ]
X[0] < 5 # => boolean Series with one True/False value per row
type(X[0] < 5) # => pandas.core.series.Series
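
The selection operations above can be tried without the CSV file. The following is a minimal sketch on a hypothetical 4x3 table standing in for data_2d.csv; the numbers and the names first_row, two_cols, mask and small are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_2d.csv: 4 rows, 3 unnamed columns (labels 0, 1, 2).
X = pd.DataFrame(np.array([[1.0, 10.0, 100.0],
                           [6.0, 20.0, 200.0],
                           [3.0, 30.0, 300.0],
                           [9.0, 40.0, 400.0]]))

first_row = X.iloc[0]   # positional row access, returns a Series
two_cols = X[[0, 2]]    # list of column labels, returns a DataFrame
mask = X[0] < 5         # boolean Series, one entry per row
small = X[mask]         # keeps only the rows where mask is True
M = X.to_numpy()        # underlying data as a plain NumPy array
```

Indexing a DataFrame with a single label or a list of labels selects columns, while indexing with a boolean Series filters rows.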

Working with CSV files with headers and footers:

# machine_learning_examples/airline/
import pandas as pd

df = pd.read_csv("international-airline-passengers.csv", engine="python", skipfooter=3)
df.columns

# rename columns
df.columns = ["month", "passengers"]
df["passengers"]
df.passengers

# add a new column
df['ones'] = 1
df.columns
df.head()
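
To see what skipfooter does without downloading the airline file, here is a small sketch on an in-memory string. The header, the passenger counts, and the two footer lines are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical file contents: one header line, three records, two footer lines.
csv_text = (
    "Month,Passengers\n"
    "1949-01,100\n"
    "1949-02,110\n"
    "1949-03,120\n"
    "Total,330\n"
    "Source,example\n"
)

# skipfooter drops the last N lines; it requires the Python parsing engine.
df = pd.read_csv(io.StringIO(csv_text), engine="python", skipfooter=2)

df.columns = ["month", "passengers"]  # rename columns
df["ones"] = 1                        # new column filled with a constant
print(df)
```

The header line is consumed as column names by default, so only the three records remain as rows.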

The apply() function

What if we want to assign a new column value, where each cell is derived from the values already in its row?
Example: Model interaction between X1 and X2 => X1*X2
We use the apply function:

df['x1x2'] = df.apply(lambda row: row['x1'] * row['x2'], axis=1)

Pass in axis=1 so the function is applied to each row rather than to each column.
Think of it like Python's map() function.
The lambda function is the same as doing the following:

def get_interaction(row):
    return row['x1'] * row['x2']

df['x1x2'] = df.apply(get_interaction, axis=1)
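
A self-contained version of the interaction example might look like the sketch below. The DataFrame and its values are made up, and the x1x2_vec column is an addition for comparison: for simple arithmetic like this, multiplying the columns directly gives the same numbers without calling a Python function per row.

```python
import pandas as pd

# Hypothetical data with the two feature columns used in the text.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 30.0]})

# Row-wise apply: the lambda receives each row as a Series.
df["x1x2"] = df.apply(lambda row: row["x1"] * row["x2"], axis=1)

# Column-wise (vectorized) alternative that produces the same values.
df["x1x2_vec"] = df["x1"] * df["x2"]

print(df)
```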

Apply this logic to our dataset:

from datetime import datetime
datetime.strptime("1949-01", "%Y-%m")
df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], "%Y-%m"), axis=1)
df.info()
df.head()
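
The same date-parsing step can be sketched on a tiny hypothetical DataFrame. The dt_vec column is not part of the original example; it shows pd.to_datetime as one possible alternative that parses the whole column at once.

```python
from datetime import datetime

import pandas as pd

# Hypothetical month strings in the same "YYYY-MM" format as the airline data.
df = pd.DataFrame({"month": ["1949-01", "1949-02", "1949-03"]})

# Row-wise parsing with apply, as in the text.
df["dt"] = df.apply(lambda row: datetime.strptime(row["month"], "%Y-%m"), axis=1)

# Alternative: let pandas parse the whole column in one call.
df["dt_vec"] = pd.to_datetime(df["month"], format="%Y-%m")

df.info()
```

Since the strings carry no day, both approaches default to the first day of each month.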

Joins

# machine_learning_examples/numpy_class
import pandas as pd
t1 = pd.read_csv("table1.csv")
t2 = pd.read_csv("table2.csv")
t1
t2
m = pd.merge(t1, t2, on='user_id')
m
t1.merge(t2, on="user_id") # => same result
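
The contents of table1.csv and table2.csv are not shown here, so the sketch below uses two small hypothetical tables; the email and ad_id columns are invented. It relies on the default behaviour of merge, which is an inner join: only user_id values present in both tables are kept.

```python
import pandas as pd

# Hypothetical tables sharing a user_id column.
t1 = pd.DataFrame({"user_id": [1, 2, 3],
                   "email": ["a@example.com", "b@example.com", "c@example.com"]})
t2 = pd.DataFrame({"user_id": [1, 1, 3, 4],
                   "ad_id": [10, 11, 12, 13]})

# Inner join on user_id: user 2 (only in t1) and user 4 (only in t2) are dropped,
# and user 1 appears twice because it matches two rows in t2.
m = pd.merge(t1, t2, on="user_id")
print(m)
```

Passing how="left", how="right" or how="outer" to merge keeps the unmatched rows from one or both tables instead.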

External links

Official website
