Difference between revisions of "AWS/Machine Learning"
From Christoph's Personal Wiki
Revision as of 17:23, 15 March 2017
This article covers Amazon Web Services (AWS) Machine Learning (ML).
Machine Learning concepts
- What is Machine Learning (ML)?
- The basic concept of ML is to have computers or machines program themselves.
- Machines can analyze large and complex datasets and identify patterns to create models, which are then used to predict outcomes.
- Over time, these models can take into account new datasets and improve the accuracy of the predictions.
- Examples of where ML is being used
- Recommendations when checking out on an e-commerce site (e.g., purchases on Amazon.com)
- Spam detection in email
- Any kind of image, speech, or text recognition
- Weather forecasts
- Search engines
- What is Amazon ML?
- Amazon ML is a supervised ML service: it learns from examples or historical data.
- An Amazon ML Model requires your dataset to have both the features and the target for each observation/record.
- A feature is an attribute of a record used to identify patterns; typically, there will be multiple features.
- A target is the outcome that the patterns are linked to and is the value the ML algorithm is going to predict.
- This linking is used to predict the outcomes
- Example: {go to the grocery store} {on Monday} (features) => buy milk (target)
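The feature/target split can be illustrated with a minimal Python sketch. The record and its column names (`place`, `day`, `purchase`) are made up for illustration and are not part of any Amazon ML API.

```python
# Hypothetical observation: the column names are illustrative only.
record = {"place": "grocery store", "day": "Monday", "purchase": "milk"}


def split_record(record, target_name):
    """Return (features, target) for a single observation."""
    # Every column except the target is treated as a feature.
    features = {name: value for name, value in record.items() if name != target_name}
    return features, record[target_name]


features, target = split_record(record, "purchase")
print(features)
print(target)
```

In supervised learning, every training record needs both parts: the features are the inputs, and the target is the known answer the model learns to reproduce.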
- Why do ML on AWS?
- Simplifies the whole process
- No coding required for creating models
- Identifies the best ML algorithm to run based on the input data
- Easily integrates into other AWS services for data retrieval
- Deploy within minutes
- Full access via APIs
- Scalable
- Amazon ML pricing (as of March 2017)
- Data Analysis and Model Building fees: $0.42/hour
- Batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000
- Real-time predictions: $0.0001 per prediction, rounded up to the nearest penny (plus hourly capacity reservation charge only when the endpoint is active)
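The rounding rules above can be worked through in a short Python sketch. It uses only the March 2017 prices quoted in this article and my reading of the rounding rules. The hourly capacity reservation charge is deliberately left out, since its rate is not given here. The function names are hypothetical.

```python
from decimal import Decimal, ROUND_CEILING

# Prices quoted in this article (as of March 2017).
BATCH_PRICE_PER_1000 = Decimal("0.10")
REALTIME_PRICE_EACH = Decimal("0.0001")


def batch_cost(predictions: int) -> Decimal:
    """Batch fee, assuming billing in whole blocks of 1,000 predictions."""
    blocks = -(-predictions // 1000)  # integer ceiling division
    return BATCH_PRICE_PER_1000 * blocks


def realtime_cost(predictions: int) -> Decimal:
    """Per-prediction fee rounded up to the nearest cent.

    Excludes the hourly capacity reservation charge.
    """
    raw = REALTIME_PRICE_EACH * predictions
    return raw.quantize(Decimal("0.01"), rounding=ROUND_CEILING)


print(batch_cost(1500))     # 1,500 predictions are billed as 2,000
print(realtime_cost(1234))  # 0.1234 is rounded up to the next cent
```

`Decimal` is used instead of floats so that the cent rounding is exact.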
- What is an AWS ML datasource?
- A datasource is an Amazon ML object that contains information about your input data, including its location, attribute names and types, and descriptive statistics for each attribute. Note that a datasource does not contain your input data; it only points to its location. Operations such as ML model training and evaluation use the datasource ID to locate and access their input data. See "Creating and Using Datasources" and "Data Insights" for details.
- Garbage in, Garbage out
- Amazon ML performs statistical analyses on your training and evaluation datasources to help you identify and fix any data anomalies. In order for Amazon ML to produce the best results, your data must be as clean and consistent as possible. For example, Amazon ML cannot tell that "NY", "ny", "New York", and "new_york" mean the same thing, so the more consistent your datasource, the more accurate your results.
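One way to make such values consistent before uploading the data is sketched below in Python. The alias table and the `normalize_state` helper are hypothetical; a real dataset would need its own mapping.

```python
import re

# Hypothetical alias table: maps cleaned-up spellings to one canonical value.
CANONICAL = {"ny": "NY", "new york": "NY"}


def normalize_state(value: str) -> str:
    """Map known spelling variants to a single canonical value."""
    # Lower-case, trim, and treat underscores and runs of whitespace as one space.
    key = re.sub(r"[\s_]+", " ", value.strip().lower())
    # Unknown values are returned unchanged (apart from trimming).
    return CANONICAL.get(key, value.strip())


raw = ["NY", "ny", "New York", "new_york"]
cleaned = [normalize_state(v) for v in raw]
print(cleaned)
```

After this step all four variants collapse into one category, so the model sees a single value instead of four unrelated ones.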
- AWS ML Workflow
- Create a datasource
- S3 (i.e., upload a CSV file to S3)
- RDS and Redshift (i.e., run a SQL query on a Redshift cluster and get the data back directly into ML)
- Identify the feature and target columns
- Select whether the file has a header row
- Select the correct field data types (possible types: binary, categorical, numeric, text)
- Select the target that needs to be predicted
- Select a Row ID, if the data has one
- Train a model with a part of the dataset (generally 70%)
- By default, AWS ML takes 70% of your data and uses it to train the model
- It also automatically decides the best ML Model algorithm to use, based on the data schema
- Binary target => binary model
- Numeric target => regression model
- Categorical target => multi-class model
- Evaluate the model by running the remaining dataset through it
- AWS ML automatically evaluates the model against the remaining portion of the datasource for you
- If using the API, you would have to do this in a separate step
- Fine-tune the model
- Use the model for predictions
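The 70/30 split and the target-to-model mapping in the workflow above can be sketched in Python as follows. The `splitting` JSON and the model type names (`BINARY`, `REGRESSION`, `MULTICLASS`) follow the Amazon ML API conventions as I understand them, so check them against the official documentation before relying on them. The helper names are hypothetical, and no AWS call is made.

```python
import json

# Target data type -> model type, as listed in the workflow above.
MODEL_TYPE_FOR_TARGET = {
    "BINARY": "BINARY",
    "NUMERIC": "REGRESSION",
    "CATEGORICAL": "MULTICLASS",
}


def model_type_for(target_type: str) -> str:
    """Return the model type implied by the target's data type."""
    try:
        return MODEL_TYPE_FOR_TARGET[target_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported target type: {target_type}") from None


def split_rearrangements(train_percent: int = 70):
    """Build the two data-rearrangement strings for a train/evaluation split."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    train = {"splitting": {"percentBegin": 0, "percentEnd": train_percent}}
    evaluation = {"splitting": {"percentBegin": train_percent, "percentEnd": 100}}
    return json.dumps(train), json.dumps(evaluation)


train_spec, eval_spec = split_rearrangements()
print(model_type_for("categorical"))
print(train_spec)
print(eval_spec)
```

When using the API rather than the console, the training and evaluation datasources would each be created with one of these strings, which is why evaluation is a separate step there.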
- Types of ML models available
- Binary
- The target/prediction value is a 0 or 1
- Best used when the prediction is a Boolean or one of two possible outcomes (e.g., true/false, yes/no, green apple/red apple, etc.)
- Examples:
- Does an email match the spam criteria?
- Will someone respond to a marketing email?
- Does a purchase on a credit card seem fraudulent?
- Multi-class
- The target/prediction is from a set of values
- Best used for predicting categories or types
- Examples:
- What is the next product a user will purchase, based on their purchase history?
- Film recommendations
- Regression
- The target/prediction is a numeric value
- Best used for predicting scores
- Root mean square error (RMSE)
- AWS ML reports a baseline (RMSE Baseline): the RMSE of a hypothetical model that always predicts the mean of the training target data. It compares this baseline with the RMSE of the model's own predictions.
- An RMSE lower than the RMSE Baseline is better
- Examples:
- How many millimetres of rain can we expect?
- Traffic delays
- How many goals will my soccer team score?
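For regression models, the RMSE comparison can be worked through with a small Python sketch. It assumes the baseline is the RMSE of a model that always predicts the mean of the training target. All numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def baseline_rmse(train_targets, actual):
    """RMSE of a model that always predicts the mean of the training target."""
    mean = sum(train_targets) / len(train_targets)
    return rmse(actual, [mean] * len(actual))


# Made-up numbers: goals scored per match.
train_targets = [1, 2, 3]   # training target values, mean = 2
actual = [1, 3]             # true values in the evaluation data
predicted = [1.5, 2.5]      # model predictions for the evaluation data

print(rmse(actual, predicted))               # model RMSE
print(baseline_rmse(train_targets, actual))  # baseline RMSE
```

Here the model's RMSE is lower than the baseline, so the model does better than simply guessing the average.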
Examples
- Scenario
- Given: various attributes of a forested wilderness area, such as its soil type and location.
- Predict the main type of tree growing in the area (the target column is called "Cover_Type").
- The target is the value that we want to predict using AWS ML.
- In the dataset, the field "Cover_Type" will be the target field and it represents the type of tree.
- The dataset we will use is publicly available on Kaggle (https://www.kaggle.com/c/forest-cover-type-prediction/data).
- Use a multi-class classification model
- Generate predictions for multiple classes (predict one of more than two outcomes).
- This problem calls for a multi-class model, since the prediction can be one of the several types of trees.
- A binary model will not work, since it can only predict one of two possibilities.
- A regression (numeric) model will not work, since it only predicts numbers.
- Workflow
- Data preparation
- Create a datasource
- Create and evaluate an AWS ML Model
- Use this AWS ML Model for prediction
- Data preparation
- Download the training data from Kaggle (https://www.kaggle.com/c/forest-cover-type-prediction/data).
- Review the data based on the data schema provided (in the above link).
- Upload the training (train.csv) and prediction (test.csv) data to S3
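Before uploading, the review step can be partly automated. The Python sketch below checks that the target column is present and counts how often each class occurs. The sample rows and the `Elevation` column are illustrative stand-ins for the real train.csv, and `summarize_target` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def summarize_target(csv_file, target="Cover_Type"):
    """Return (column names, count per target value) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or target not in reader.fieldnames:
        raise ValueError(f"target column {target!r} not found in header")
    counts = Counter(row[target] for row in reader)
    return list(reader.fieldnames), counts


# Tiny in-memory sample; with the real file, use open("train.csv", newline="").
sample = io.StringIO("Id,Elevation,Cover_Type\n1,2596,5\n2,2590,5\n3,2804,2\n")
columns, counts = summarize_target(sample)
print(columns)
print(counts)
```

More than two distinct target values confirms that a multi-class model is the right choice, and badly unbalanced counts are worth knowing about before training.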
- Create a datasource
- Note: A datasource is the object that tells AWS ML where to find the data to use in model training, evaluation, and prediction.
- As of March 2017, only S3, RDS, and Redshift services are supported for ML datasources. We will use S3 in this example.
- Select the source of the data. Since we are using S3, select the bucket and filename of the training data (e.g., s3://my-bucket-name/train.csv).
- Create a name for the datasource (e.g., "ML-Trees-Training").
- Create a schema for the data (i.e., data types in each column).
- There are only 4 data types in AWS ML: binary, categorical, numeric, and text.
- Select the target (i.e., the field/column) you want to predict.
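The schema from these steps can also be written out as a JSON document instead of being clicked together in the console. The Python sketch below builds one. The field names (`version`, `targetAttributeName`, `dataFormat`, `dataFileContainsHeader`, `attributes`, `rowId`) follow the Amazon ML schema format as I recall it, so verify them against the official documentation. The column list is a shortened, illustrative subset of the dataset, and `build_schema` is a hypothetical helper.

```python
import json

# The four data types named above.
VALID_TYPES = {"BINARY", "CATEGORICAL", "NUMERIC", "TEXT"}


def build_schema(columns, target, row_id=None):
    """Build a schema dictionary from a {column name: data type} mapping."""
    for name, attr_type in columns.items():
        if attr_type not in VALID_TYPES:
            raise ValueError(f"{name}: unknown data type {attr_type!r}")
    if target not in columns:
        raise ValueError(f"target {target!r} is not one of the columns")
    schema = {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": attr_type}
            for name, attr_type in columns.items()
        ],
    }
    if row_id is not None:
        schema["rowId"] = row_id
    return schema


# Illustrative subset of columns; the real file has many more.
columns = {
    "Id": "CATEGORICAL",
    "Elevation": "NUMERIC",
    "Soil_Type1": "BINARY",
    "Cover_Type": "CATEGORICAL",
}
schema = build_schema(columns, target="Cover_Type", row_id="Id")
print(json.dumps(schema, indent=2))
```

Marking "Cover_Type" as categorical rather than numeric matters: with a categorical target, AWS ML builds a multi-class model, while a numeric target would lead to a regression model.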