AWS/Machine Learning
From Christoph's Personal Wiki
Revision as of 19:08, 15 March 2017

This article will be about Amazon Web Services - Machine Learning (ML).

Machine Learning concepts

What is Machine Learning (ML)?
  • The basic concept of ML is to have computers or machines program themselves.
  • Machines can analyze large and complex datasets and identify patterns to create models, which are then used to predict outcomes.
  • Over time, these models can take into account new datasets and improve the accuracy of the predictions.
Examples of where ML is being used
  • Recommendations when checking out on an e-commerce site (e.g., purchases on Amazon.com)
  • Spam detection in email
  • Any kind of image, speech, or text recognition
  • Weather forecasts
  • Search engines
What is Amazon ML?
  • Amazon ML is supervised ML: it learns from examples or historical data.
  • An Amazon ML Model requires your dataset to have both the features and the target for each observation/record.
  • A feature is an attribute of a record used to identify patterns; typically, there will be multiple features.
  • A target is the outcome that the patterns are linked to and is the value the ML algorithm is going to predict.
  • This linking is used to predict the outcomes.
  • Example: {Go to the grocery store} {on Monday} (attribute {feature}) => Buy milk (target)
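The feature/target structure above can be sketched as a single record; the field names here are hypothetical, loosely following the grocery example:

```python
# A single observation/record: every column except the target is a feature.
record = {
    "day_of_week": "Monday",   # feature
    "destination": "grocery",  # feature
    "purchased_milk": True,    # target: the value the ML algorithm predicts
}

# Split the record into its features and its target.
features = {k: v for k, v in record.items() if k != "purchased_milk"}
target = record["purchased_milk"]
```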
Why do ML on AWS?
  • Simplifies the whole process
  • No coding required for creating models
  • Identifies the best ML algorithm to run based on the input data
  • Easily integrates into other AWS services for data retrieval
  • Deploy within minutes
  • Full access via APIs
  • Scalable
Amazon ML pricing (as of March 2017)
  • Data Analysis and Model Building fees: $0.42/hour
  • Batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000
  • Real-time predictions: $0.0001 per prediction, rounded up to the nearest penny (plus hourly capacity reservation charge only when the endpoint is active)
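The rounding rules above can be sketched as follows (rates copied from the list; the helper names are mine, not an AWS API):

```python
import math

# Amazon ML pricing, March 2017 (figures from the list above).
ANALYSIS_RATE = 0.42  # $/hour of data analysis and model building
BATCH_RATE = 0.10     # $ per 1,000 batch predictions

def batch_prediction_cost(n_predictions):
    """Batch predictions are billed per 1,000, rounded up to the next 1,000."""
    return math.ceil(n_predictions / 1000) * BATCH_RATE

def analysis_cost(hours):
    """Data analysis and model building are billed per hour."""
    return hours * ANALYSIS_RATE
```

For example, a batch of 15,120 predictions bills as 16,000 predictions.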
What is an AWS ML datasource?
  • A datasource is an Amazon ML object that contains information about your input data, including its location, attribute names and types, and descriptive statistics for each attribute. Note that a datasource does not contain your input data, it only points to its location. Operations such as ML model training and evaluation use the datasource ID to locate and access their input data. See "Creating and Using Datasources" and "Data Insights" for details.
Garbage in, Garbage out
  • Amazon ML performs statistical analyses on your training and evaluation datasources to help you identify and fix any data anomalies. In order for Amazon ML to produce the best results, your data must be as clean and consistent as possible. For example, Amazon ML cannot tell that NY, ny, New York, and new_york mean the same thing, so the more consistent your datasource, the more accurate your results.
AWS ML Workflow
  1. Create a data source
    • S3 (i.e., upload a CSV file to S3)
    • RDS and Redshift (i.e., run a SQL query on a Redshift cluster and get the data back directly into ML)
  2. Identify the feature and target columns
    • Select whether the file has a header row
    • Select the correct field data types (possible types: binary, categorical, numeric, text)
    • Select the target that needs to be predicted
    • Select a Row ID, if the data has one
  3. Train a model with a part of the dataset (generally 70%)
    • By default, AWS ML takes 70% of your data and uses it to train the model
    • It also automatically decides the best ML Model algorithm to use, based on the data schema
      • Binary target => binary model
      • Numeric target => regression model
      • Categorical target => multi-class model
  4. Evaluate the model by running the remaining dataset through it
    • AWS ML automatically evaluates the model based on the data source for you
    • If using the API, you would have to do this in a separate step
  5. Fine-tune the model
  6. Use the model for predictions
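Steps 3 and 4 above can be sketched as follows; Amazon ML does both internally, so this only illustrates the selection rule and the default split proportions:

```python
def choose_model_type(target_type):
    """Mirror Amazon ML's automatic model selection (step 3 above)."""
    return {
        "binary": "binary classification",
        "numeric": "regression",
        "categorical": "multi-class classification",
    }[target_type]

def train_eval_split(rows, train_fraction=0.70):
    """Default split: 70% of the rows train the model, the rest evaluate it."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```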
Types of ML models available
  • Binary
    • The target/prediction value is a 0 or 1
    • Best used when the prediction is a Boolean or one of two possible outcomes (e.g., true/false, yes/no, green apple/red apple, etc.)
    • Examples:
      • Does an email match the spam criteria?
      • Will someone respond to a marketing email?
      • Does a purchase on a credit card seem fraudulent?
  • Multi-class
    • The target/prediction is from a set of values
    • Best used for predicting categories or types
    • Examples:
      • What is the next product a user will purchase based on his/her history of purchases?
      • Film recommendations
  • Regression
    • The target/prediction is a numeric value
    • Best used for predicting scores
    • Root mean square error (RMSE)
      • Amazon ML computes a baseline (the RMSE Baseline) by always predicting the mean of the training target data, and compares it to the RMSE of the model's predictions
      • An RMSE lower than the RMSE Baseline is better
    • Examples:
      • How many millimetres of rain can we expect?
      • Traffic delays
      • How many goals will my soccer team score?
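The RMSE comparison described above can be sketched directly (my own helpers, not an Amazon API):

```python
import math

def rmse(predictions, actuals):
    """Root mean square error between predicted and actual values."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

def rmse_baseline(train_targets, eval_actuals):
    """The baseline: always predict the mean of the training targets."""
    mean = sum(train_targets) / len(train_targets)
    return rmse([mean] * len(eval_actuals), eval_actuals)

# A regression model is doing useful work when its RMSE on the evaluation
# data is below the baseline RMSE.
```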

Examples

Multi-Class Model

Scenario
  • Given various attributes about an area of forest, such as the wilderness area, soil type, and location.
  • Predict the main type of tree growing in the area (the target column is called "Cover_Type").
  • The target is the value that we want to predict using Amazon ML.
  • In the dataset, the field "Cover_Type" will be the target field and it represents the type of tree.
  • The dataset we will use is this publicly available Kaggle dataset: https://www.kaggle.com/c/forest-cover-type-prediction/data
Use a multi-class classification model
  • Generate predictions for multiple classes (predict one of more than two outcomes).
  • This problem calls for a multi-class model, since the prediction can be one of the several types of trees.
    • Binary will not work, since it can only predict one of two possibilities.
    • Numeric will not work, since it only predicts numbers.
Workflow
  • Data preparation
  • Create a datasource
  • Create and evaluate an Amazon ML Model
  • Use this Amazon ML Model for prediction
Data preparation
  • Download the training data from the Kaggle dataset page linked above.
  • Review the data based on the data schema provided (in the above link).
  • Upload the training (train.csv) and prediction (test.csv) data to S3
Create a datasource
  • Note: A datasource is the container where Amazon ML can access data to use in model training, evaluation, and prediction.
  • As of March 2017, only S3, RDS, and Redshift services are supported for ML datasources. We will use S3 in this example.
  • Select the source of the data. Since we are using S3, select the bucket and filename of the training data (e.g., s3://my-bucket-name/train.csv).
  • Create a name for the datasource (e.g., "ML-Trees-Training").
  • Create a schema for the data (i.e., data types in each column).
  • There are only 4 data types in Amazon ML: binary, categorical, numerical, and text.
  • Select the target (or field/column) you want to predict.
  • Note: It took AWS ~5 minutes to create my datasource (compute time: ~15 min).
Input schema
{
  "version" : "1.0",
  "rowId" : "Id",
  "rowWeight" : null,
  "targetAttributeName" : "Cover_Type",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "Id",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "Elevation",
    "attributeType" : "NUMERIC"
  },
...
  } ],
  "excludedAttributeNames" : [ ]
}
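The schema above can also be built programmatically before uploading it alongside the data; this is a minimal sketch showing only the first two attributes (the real schema lists every column in train.csv):

```python
import json

# Sketch of the datasource schema shown above; field names follow the
# Amazon ML schema format ("version", "targetAttributeName", etc.).
schema = {
    "version": "1.0",
    "rowId": "Id",
    "rowWeight": None,
    "targetAttributeName": "Cover_Type",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "Id", "attributeType": "CATEGORICAL"},
        {"attributeName": "Elevation", "attributeType": "NUMERIC"},
        # ... one entry per remaining column ...
    ],
    "excludedAttributeNames": [],
}

schema_json = json.dumps(schema, indent=2)
```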
Create and evaluate an Amazon ML Model
  • Select a datasource to train and evaluate our Amazon ML model (use the datasource created above).
  • Train:
    • Find patterns in the dataset that correlate the attributes to the target using a part of the data (70%) from the datasource. That is, we want to set aside 30% of our training data to evaluate the training.
    • These patterns are the Amazon ML model.
  • Evaluate:
    • Using the remaining data, which already has a target, predict the target using the Amazon ML model and compare that to the target already on record.
  • Amazon ML model settings:
    1. Default (recommended setting if you are not a data scientist)
    2. Custom (gives you much more control)
  • Note: It took AWS ~2 minutes to create and evaluate my Amazon ML Model (compute time: ~1 min).
  • Average F1 score: 0.682 (F1 score is used to measure the quality of the ML model. It ranges between 0 and 1. The higher the F1 score, the better the ML model quality. So, this is a pretty good model)
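For a multi-class model, the reported F1 score is an average of per-class F1 scores (the macro average). A minimal sketch of that computation, using my own helper rather than an Amazon API:

```python
def macro_f1(actuals, predictions):
    """Macro-averaged F1: compute F1 per class, then average over classes."""
    classes = set(actuals) | set(predictions)
    f1_scores = []
    for c in classes:
        tp = sum(1 for a, p in zip(actuals, predictions) if a == c and p == c)
        fp = sum(1 for a, p in zip(actuals, predictions) if a != c and p == c)
        fn = sum(1 for a, p in zip(actuals, predictions) if a == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
    return sum(f1_scores) / len(f1_scores)
```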
Use this Amazon ML Model for prediction
  • Normally (or "in the real world"), we will not have the target values a priori, so we will use Amazon ML to create predictions.
  • Once we have the Amazon ML model created, we can use it to make predictions.
  • Run a batch prediction using the datasource we created above.
  • Analyze the results.

Binary Model

Scenario
  • Your company markets products to users based on their salary
  • Users sign up and create a profile
  • The profile combined with the user's annual salary determines what items the user can afford and, thus, should be notified of
  • Once a user agrees to purchase an item, your company goes through a salary verification process
  • The problem is that users embellish their salary and your marketing is not reaching the correct audience
  • You have been given the task of predicting a user's salary based on their profile
  • This will improve the marketing efforts, since the users will see the products that they can afford
  • You have also been given historical income verification results, which have the verified salary on them
  • You will also be getting new verification data all the time and your prediction systems need to take this into account

External links