Difference between revisions of "Python/pandas"
From Christoph's Personal Wiki
(Created page with "'''pandas''' is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for...") |
(→External links) |
(4 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
'''pandas''' is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets. | '''pandas''' is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets. | ||
+ | |||
+ | ==Pandas in deep learning / machine learning== | ||
+ | ; Loading in data | ||
+ | * Deep learning and machine learning models learn from data, so loading data should become an automatic reflex | ||
+ | * Unstructured data => the Internet | ||
+ | * Semi-structured data => Apache logs | ||
+ | * Structured data => [https://www.kaggle.com/ Kaggle] and other datasets (usually in CSV format) | ||
+ | * Each row is a record | ||
+ | * Each record's values are separated by commas | ||
+ | |||
+ | ;Basics | ||
+ | |||
+ | ''Note: The following examples and datasets are taken from [https://github.com/lazyprogrammer/machine_learning_examples here].'' | ||
+ | |||
+ | <pre> | ||
+ | # machine_learning_examples/linear_regression_class/ | ||
+ | import numpy as np | ||
+ | |||
+ | X = [] | ||
+ | for line in open("data_2d.csv"): | ||
+ |     row = line.split(',') | ||
+ |     sample = list(map(float, row))  # list() is needed on Python 3, where map() is lazy | ||
+ |     X.append(sample) | ||
+ | # X => a list of lists | ||
+ | X = np.array(X) | ||
+ | X.shape | ||
+ | </pre> | ||
+ | |||
+ | A better way of doing the above is to use pandas: | ||
+ | <pre> | ||
+ | import pandas as pd | ||
+ | |||
+ | X = pd.read_csv("data_2d.csv", header=None) | ||
+ | X.head() | ||
+ | type(X) # => pandas.core.frame.DataFrame | ||
+ | X.info() | ||
+ | X.head(10) | ||
+ | M = X.to_numpy() # as_matrix() was removed in pandas 1.0 | ||
+ | type(M) # => numpy.ndarray | ||
+ | type(X[0]) # => pandas.core.series.Series | ||
+ | X.iloc[0] # => first row (by position) | ||
+ | X.loc[0] # => first row (by label; .ix was removed in pandas 1.0) | ||
+ | type(X.loc[0]) # => pandas.core.series.Series (anything 1D in pandas is a series) | ||
+ | X[[0,2]] # => columns 0 and 2 (the first and third columns) | ||
+ | # Find all rows where the zeroth column < 5 | ||
+ | X[ X[0] < 5 ] | ||
+ | X[0] < 5 # => boolean Series (one True/False per row) | ||
+ | type(X[0] < 5) # => pandas.core.series.Series | ||
+ | </pre> | ||
+ | |||
+ | * Working with CSV files with headers and footers: | ||
+ | <pre> | ||
+ | # machine_learning_examples/airline/ | ||
+ | import pandas as pd | ||
+ | |||
+ | df = pd.read_csv("international-airline-passengers.csv", engine="python", skipfooter=3) | ||
+ | df.columns | ||
+ | |||
+ | # rename columns | ||
+ | df.columns = ["month", "passengers"] | ||
+ | df["passengers"] | ||
+ | df.passengers | ||
+ | |||
+ | # add a new column | ||
+ | df['ones'] = 1 | ||
+ | df.columns | ||
+ | df.head() | ||
+ | </pre> | ||
+ | |||
+ | ;The <code>apply()</code> function | ||
+ | * What if we want to assign a new column value, where each cell is derived from the values already in its row? | ||
+ | * Example: Model interaction between ''X1'' and ''X2'' => ''X1''*''X2'' | ||
+ | * We use the apply function: | ||
+ | df['x1x2'] = df.apply(lambda row: row['x1'] * row['x2'], axis=1) | ||
+ | * Pass in <code>axis=1</code> so the function gets applied across each row instead of each column | ||
+ | * Think of it like Python's <code>map()</code> function. | ||
+ | * The <code>lambda</code> function is the same as doing the following: | ||
+ | <pre> | ||
+ | def get_interaction(row): | ||
+ |     return row['x1'] * row['x2'] | ||
+ | |||
+ | df['x1x2'] = df.apply(get_interaction, axis=1) | ||
+ | </pre> | ||
+ | |||
+ | Apply this logic to our dataset: | ||
+ | <pre> | ||
+ | from datetime import datetime | ||
+ | datetime.strptime("1949-01", "%Y-%m") | ||
+ | df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], "%Y-%m"), axis=1) | ||
+ | df.info() | ||
+ | df.head() | ||
+ | </pre> | ||
+ | |||
+ | ;Joins | ||
+ | |||
+ | <pre> | ||
+ | # machine_learning_examples/numpy_class | ||
+ | import pandas as pd | ||
+ | t1 = pd.read_csv("table1.csv") | ||
+ | t2 = pd.read_csv("table2.csv") | ||
+ | t1 | ||
+ | t2 | ||
+ | m = pd.merge(t1, t2, on='user_id') | ||
+ | m | ||
+ | t1.merge(t2, on="user_id") # => same result | ||
+ | </pre> | ||
==See also== | ==See also== | ||
* [[Python/NumPy|NumPy]] | * [[Python/NumPy|NumPy]] | ||
− | * SciPy | + | * [[Python/SciPy|SciPy]] |
− | * matplotlib | + | * [[Python/matplotlib|matplotlib]] |
* scikit-learn | * scikit-learn | ||
* scikit-image | * scikit-image | ||
Line 12: | Line 118: | ||
[[Category:Scripting languages]] | [[Category:Scripting languages]] | ||
+ | [[Category:Machine Learning]] |
Latest revision as of 23:53, 30 March 2017
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "Panel data", an econometrics term for multidimensional structured data sets.
Pandas in deep learning / machine learning
- Loading in data
- Deep learning and machine learning models learn from data, so loading data should become an automatic reflex
- Unstructured data => the Internet
- Semi-structured data => Apache logs
- Structured data => Kaggle and other datasets (usually in CSV format)
- Each row is a record
- Each record's values are separated by commas
- Basics
Note: The following examples and datasets are taken from here.
<pre>
# machine_learning_examples/linear_regression_class/
import numpy as np

X = []
for line in open("data_2d.csv"):
    row = line.split(',')
    sample = list(map(float, row))  # list() is needed on Python 3, where map() is lazy
    X.append(sample)
# X => a list of lists
X = np.array(X)
X.shape
</pre>
A better way of doing the above is to use pandas:
<pre>
import pandas as pd

X = pd.read_csv("data_2d.csv", header=None)
X.head()
type(X) # => pandas.core.frame.DataFrame
X.info()
X.head(10)
M = X.to_numpy() # as_matrix() was removed in pandas 1.0
type(M) # => numpy.ndarray
type(X[0]) # => pandas.core.series.Series
X.iloc[0] # => first row (by position)
X.loc[0] # => first row (by label; .ix was removed in pandas 1.0)
type(X.loc[0]) # => pandas.core.series.Series (anything 1D in pandas is a series)
X[[0,2]] # => columns 0 and 2 (the first and third columns)
# Find all rows where the zeroth column < 5
X[ X[0] < 5 ]
X[0] < 5 # => boolean Series (one True/False per row)
type(X[0] < 5) # => pandas.core.series.Series
</pre>
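As a minimal sketch of the selection idioms above, the following uses a small made-up table in place of data_2d.csv (the values are hypothetical) so that it runs without the dataset. It assumes a recent pandas, where to_numpy() takes the place of the older as_matrix().

```python
import io

import pandas as pd

# Hypothetical three-column data standing in for data_2d.csv
csv_text = "1.0,10.0,100.0\n7.5,20.0,200.0\n3.2,30.0,300.0\n"
X = pd.read_csv(io.StringIO(csv_text), header=None)

mask = X[0] < 5       # boolean Series with one True/False per row
small = X[mask]       # keeps only the rows where the mask is True
pair = X[[0, 2]]      # a list of labels selects columns 0 and 2
M = X.to_numpy()      # plain NumPy array holding the same values

print(mask.tolist())
print(small)
print(M.shape)
```

With header=None the columns are labelled 0, 1, 2, which is why X[0] refers to the first column rather than the first row.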
- Working with CSV files with headers and footers:
<pre>
# machine_learning_examples/airline/
import pandas as pd

df = pd.read_csv("international-airline-passengers.csv", engine="python", skipfooter=3)
df.columns

# rename columns
df.columns = ["month", "passengers"]
df["passengers"]
df.passengers

# add a new column
df['ones'] = 1
df.columns
df.head()
</pre>
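The effect of skipfooter can be checked on a small in-memory file. The sketch below uses made-up contents (a header row, three records, and two trailing footer lines), not the real airline dataset. engine="python" is requested explicitly because the default C parser does not support skipfooter.

```python
import io

import pandas as pd

# Hypothetical file contents: header row, three records, two footer lines
csv_text = (
    '"Month","Passengers"\n'
    '"1949-01",112\n'
    '"1949-02",118\n'
    '"1949-03",132\n'
    'Footer: hypothetical note one\n'
    'Footer: hypothetical note two\n'
)

# skipfooter=2 drops the last two lines before the records are parsed
df = pd.read_csv(io.StringIO(csv_text), engine="python", skipfooter=2)

df.columns = ["month", "passengers"]  # rename both columns at once
df["ones"] = 1                        # a scalar is broadcast to every row

print(df)
```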
- The apply() function
- What if we want to assign a new column value, where each cell is derived from the values already in its row?
- Example: Model interaction between X1 and X2 => X1*X2
- We use the apply function:
df['x1x2'] = df.apply(lambda row: row['x1'] * row['x2'], axis=1)
- Pass in axis=1 so the function gets applied across each row instead of each column
- Think of it like Python's map() function.
- The lambda function is the same as doing the following:
<pre>
def get_interaction(row):
    return row['x1'] * row['x2']

df['x1x2'] = df.apply(get_interaction, axis=1)
</pre>
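For a simple product like this one, the same column can also be computed by multiplying the two columns directly, which avoids calling a Python function once per row. The sketch below assumes a hypothetical DataFrame with columns x1 and x2 (the values are made up) and computes the interaction term both ways.

```python
import pandas as pd

# Hypothetical feature columns named x1 and x2
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Row-wise: the function receives each row as a Series
df["x1x2_apply"] = df.apply(lambda row: row["x1"] * row["x2"], axis=1)

# Column-wise: multiply the two Series element by element
df["x1x2_vec"] = df["x1"] * df["x2"]

print(df)
```

apply() remains the more general tool when the per-row logic cannot be written as arithmetic on whole columns.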
Apply this logic to our dataset:
<pre>
from datetime import datetime
datetime.strptime("1949-01", "%Y-%m")
df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], "%Y-%m"), axis=1)
df.info()
df.head()
</pre>
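pandas also offers pd.to_datetime, which parses a whole column of date strings in one call. The sketch below is an alternative to the apply() version, not the method used in the original course material. It assumes a hypothetical month column holding strings in the same "%Y-%m" format.

```python
from datetime import datetime

import pandas as pd

# Hypothetical month strings in the same format as the airline data
df = pd.DataFrame({"month": ["1949-01", "1949-02", "1949-03"]})

# Row-wise parsing with apply(), as in the example above
df["dt_apply"] = df.apply(
    lambda row: datetime.strptime(row["month"], "%Y-%m"), axis=1
)

# Column-wise parsing with pd.to_datetime and an explicit format
df["dt_vec"] = pd.to_datetime(df["month"], format="%Y-%m")

print(df)
```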
- Joins
<pre>
# machine_learning_examples/numpy_class
import pandas as pd
t1 = pd.read_csv("table1.csv")
t2 = pd.read_csv("table2.csv")
t1
t2
m = pd.merge(t1, t2, on='user_id')
m
t1.merge(t2, on="user_id") # => same result
</pre>
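By default, merge performs an inner join, so rows whose user_id appears in only one table are dropped. The sketch below illustrates this with two small made-up tables (the email and ad_id columns are hypothetical, not the contents of table1.csv and table2.csv) and contrasts it with how="left".

```python
import pandas as pd

# Hypothetical tables; only the user_id column name comes from the example above
t1 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
t2 = pd.DataFrame({
    "user_id": [1, 1, 3, 4],
    "ad_id": [10, 11, 12, 13],
})

# Inner join (the default): only user_ids present in both tables survive
inner = pd.merge(t1, t2, on="user_id")

# Left join: every row of t1 is kept, with NaN where t2 has no match
left = pd.merge(t1, t2, on="user_id", how="left")

print(inner)
print(left)
```

User 1 appears twice in the result because it matches two rows of t2, user 2 survives only in the left join, and user 4 is dropped from both.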
See also
- NumPy
- SciPy
- matplotlib
- scikit-learn
- scikit-image