AWS/Machine Learning

From Christoph's Personal Wiki

This article covers Amazon Web Services - Machine Learning (Amazon ML).

Machine Learning concepts

What is Machine Learning (ML)?
  • The basic concept of ML is to have computers or machines learn from data, rather than being explicitly programmed for each task.
  • Machines can analyze large and complex datasets and identify patterns to create models, which are then used to predict outcomes.
  • Over time, these models can take into account new datasets and improve the accuracy of the predictions.
Examples of where ML is being used
  • Recommendations when checking out on an e-commerce site (e.g., purchases on Amazon.com)
  • Spam detection in email
  • Any kind of image, speech, or text recognition
  • Weather forecasts
  • Search engines
What is Amazon ML?
  • Amazon ML uses supervised ML; it learns from labeled examples in historical data.
  • An Amazon ML Model requires your dataset to have both the features and the target for each observation/record.
  • A feature is an attribute of a record used to identify patterns; typically, there will be multiple features.
  • A target is the outcome that the patterns are linked to and is the value the ML algorithm is going to predict.
  • Once learned, this linking of features to targets is used to predict outcomes for new records
  • Example: {go to the grocery store, on Monday} (features) => buy milk (target); a tiny sample dataset is sketched below
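As a concrete sketch (hypothetical data, not from any real datasource), a training dataset for the grocery example might look like the following CSV, where day and store are the features and bought_milk is the target:

day,store,bought_milk
Monday,grocery,1
Tuesday,hardware,0
Monday,grocery,1

Each row is one observation/record; the model learns how the feature values link to the target value.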
Why do ML on AWS?
  • Simplifies the whole process
  • No coding required for creating models
  • Identifies the best ML algorithm to run based on the input data
  • Easily integrates into other AWS services for data retrieval
  • Deploy within minutes
  • Full access via APIs
  • Scalable
Amazon ML pricing (as of March 2017)
  • Data analysis and model building fees: $0.42/hour
  • Batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000
  • Real-time predictions: $0.0001 per prediction, rounded up to the nearest penny (plus hourly capacity reservation charge only when the endpoint is active)
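As a worked example at the rates above (hypothetical usage numbers): 3 hours of data analysis and model building cost 3 x $0.42 = $1.26; a batch of 4,200 predictions is billed as 5,000 predictions, i.e., 5 x $0.10 = $0.50; and 62,000 real-time predictions cost 62,000 x $0.0001 = $6.20, plus the capacity reservation charge while the endpoint is active.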
What is an AWS ML datasource?
  • A datasource is an Amazon ML object that contains information about your input data, including its location, attribute names and types, and descriptive statistics for each attribute. Note that a datasource does not contain your input data, it only points to its location. Operations such as ML model training and evaluation use the datasource ID to locate and access their input data. See "Creating and Using Datasources" and "Data Insights" for details.
Garbage in, Garbage out
  • Amazon ML performs statistical analyses on your training and evaluation datasources to help you identify and fix any data anomalies. In order for Amazon ML to produce the best results, your data must be as clean and consistent as possible. For example, Amazon ML cannot tell that NY, ny, New York, and new_york mean the same thing, so the more consistent your datasource, the more accurate your results.
AWS ML Workflow
  1. Create a data source
    • S3 (e.g., upload a CSV file to S3)
    • RDS and Redshift (e.g., run a SQL query on a Redshift cluster and get the results back directly into Amazon ML)
  2. Identify the feature and target columns
    • Select whether the file has a header row
    • Select the correct field data types (possible types: binary, categorical, numeric, text)
    • Select the target that needs to be predicted
    • Select a Row ID, if the data has one
  3. Train a model with a part of the dataset (generally 70%)
    • By default, AWS ML takes 70% of your data and uses it to train the model
    • It also automatically decides the best ML Model algorithm to use, based on the data schema
      • Binary target => binary model
      • Numeric target => regression model
      • Categorical target => multi-class model
  4. Evaluate the model by running the remaining dataset through it
    • AWS ML automatically evaluates the model based on the data source for you
    • If using the API, you would have to do this in a separate step
  5. Fine-tune the model
  6. Use the model for predictions
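For reference, the same workflow maps onto the AWS CLI roughly as follows. This is only a sketch: the IDs, bucket, and file names are placeholders, and complete working commands appear in the binary-model example below.

# steps 1-2: create a datasource pointing at the CSV and its schema in S3
$ aws machinelearning create-data-source-from-s3 \
      --data-source-id my-ds --data-source-name my-ds \
      --data-spec "DataLocationS3=s3://my-bucket/data.csv,DataSchemaLocationS3=s3://my-bucket/schema.json" \
      --compute-statistics
# step 3: train a model (Amazon ML picks the algorithm from the model type)
$ aws machinelearning create-ml-model \
      --ml-model-id my-model --ml-model-name my-model \
      --ml-model-type BINARY --training-data-source-id my-ds
# step 4: evaluate the model against a held-out datasource
$ aws machinelearning create-evaluation \
      --evaluation-id my-eval --evaluation-name my-eval \
      --ml-model-id my-model --evaluation-data-source-id my-eval-ds
# step 6: run batch predictions, writing results to S3
$ aws machinelearning create-batch-prediction \
      --batch-prediction-id my-preds --batch-prediction-name my-preds \
      --ml-model-id my-model --batch-prediction-data-source-id my-test-ds \
      --output-uri s3://my-bucket/predictions/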
Types of ML models available
  • Binary
    • The target/prediction value is a 0 or 1
    • Best used when the prediction is a Boolean or one of two possible outcomes (e.g., true/false, yes/no, green apple/red apple, etc.)
    • Examples:
      • Does an email match the spam criteria?
      • Will someone respond to a marketing email?
      • Does a purchase on a credit card seem fraudulent?
  • Multi-class
    • The target/prediction is from a set of values
    • Best used for predicting categories or types
    • Examples:
      • What is the next product a user will purchase based on his/her history of purchases?
      • Film recommendations
  • Regression
    • The target/prediction is a numeric value
    • Best used for predicting scores
    • Root mean square error (RMSE)
      • AWS ML uses the mean of the training target values as a naive baseline predictor; the error of that baseline is reported as the RMSE Baseline, which is compared against the RMSE of the model's own predictions (see the formula after this list)
      • An RMSE lower than the RMSE Baseline means the model beats simply guessing the mean
    • Examples:
      • How many millimetres of rain can we expect?
      • Traffic delays
      • How many goals will my soccer team score?
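For reference, the RMSE reported by AWS ML follows the standard definition: for n evaluation records with predicted values \hat{y}_i and actual values y_i,

    RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2 }

The RMSE Baseline is computed the same way, but with every prediction replaced by the mean of the training targets.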

Examples

Multi-Class Model

Scenario
  • Given various attributes about an area of forest, such as wilderness area, soil type, and location.
  • Predict the main type of tree growing in the area (the target column is called "Cover_Type").
  • The target is the value that we want to predict using Amazon ML.
  • In the dataset, the field "Cover_Type" is the target field and it represents the type of tree.
  • We will use the publicly available Forest Cover Type (Covertype) dataset.
Use a multi-class classification model
  • Generate predictions for multiple classes (predict one of more than two outcomes).
  • This problem calls for a multi-class model, since the prediction can be one of several types of trees.
    • Binary will not work, since it can only predict one of two possibilities.
    • Numeric (regression) will not work, since it only predicts numbers.
Workflow
  • Data preparation
  • Create a datasource
  • Create and evaluate an Amazon ML Model
  • Use this Amazon ML Model for prediction
Data preparation
  • Download the training data (train.csv) and prediction data (test.csv) from the dataset's public page.
  • Review the data against the data schema provided with the dataset.
  • Upload the training (train.csv) and prediction (test.csv) data to S3.
Create a datasource
  • Note: A datasource is the container where Amazon ML can access data to use in model training, evaluation, and prediction.
  • As of March 2017, only S3, RDS, and Redshift services are supported for ML datasources. We will use S3 in this example.
  • Select the source of the data. Since we are using S3, select the bucket and filename of the training data (e.g., s3://my-bucket-name/train.csv).
  • Create a name for the datasource (e.g., "ML-Trees-Training").
  • Create a schema for the data (i.e., data types in each column).
  • There are only 4 data types in Amazon ML: binary, categorical, numeric, and text.
  • Select the target (or field/column) you want to predict.
  • Note: It took AWS ~5 minutes to create my datasource (compute time: ~15 min).
Input schema
{
  "version" : "1.0",
  "rowId" : "Id",
  "rowWeight" : null,
  "targetAttributeName" : "Cover_Type",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "Id",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "Elevation",
    "attributeType" : "NUMERIC"
  },
...
  } ],
  "excludedAttributeNames" : [ ]
}
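If you prefer the AWS CLI to the console for this step, the equivalent datasource creation looks roughly like the following (a sketch; the bucket name, datasource ID, and schema filename are hypothetical):

$ aws machinelearning create-data-source-from-s3 \
      --data-source-id "ml-trees-training" \
      --data-source-name "ML-Trees-Training" \
      --data-spec "DataLocationS3=s3://my-bucket-name/train.csv,DataSchemaLocationS3=s3://my-bucket-name/trees-schema.json" \
      --compute-statistics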
Create and evaluate an Amazon ML Model
  • Select a datasource to train and evaluate our Amazon ML model (use the datasource created above).
  • Train:
    • Find patterns in the dataset that correlate the attributes to the target using a part of the data (70%) from the datasource. That is, we want to set aside 30% of our training data to evaluate the training.
    • These patterns are the Amazon ML model.
  • Evaluate:
    • Using the remaining data, which already has a target, predict the target using the Amazon ML model and compare that to the target already on record.
  • Amazon ML model settings:
    1. Default (recommended setting, if you are not a data scientist)
    2. Custom (gives you much more control)
  • Note: It took AWS ~2 minutes to create and evaluate my Amazon ML Model (compute time: ~1 min).
  • Average F1 score: 0.682 (the F1 score measures the quality of the ML model; it ranges from 0 to 1, and a higher F1 score means a better model, so this is a pretty good model)
Use this Amazon ML Model for prediction
  • Normally (or "in the real world"), we will not know the target values a priori, so we will use Amazon ML to predict them.
  • Once we have the Amazon ML model created, we can use it to make predictions.
  • Run a batch prediction using a datasource for the prediction data (test.csv) uploaded earlier (see the CLI sketch after this list).
  • Analyze the results.
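A hedged CLI sketch of the batch-prediction step (the IDs are placeholders for whatever model and test-data datasource IDs you used):

$ aws machinelearning create-batch-prediction \
      --batch-prediction-id "ml-trees-batch1" \
      --batch-prediction-name "ml-trees-batch1" \
      --ml-model-id "ml-trees-model" \
      --batch-prediction-data-source-id "ml-trees-prediction" \
      --output-uri "s3://my-bucket-name/predictions/"

The results are written to the given S3 output URI as compressed CSV files of predicted Cover_Type values.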

Binary Model

Scenario
  • Your company markets products to users based on their salary
  • Users sign up and create a profile
  • The profile combined with the user's annual salary determines what items the user can afford and, thus, should be notified of
  • Once a user agrees to purchase an item, your company goes through a salary verification process
  • The problem is that users embellish their salary and your marketing is not reaching the correct audience
  • You have been given the task of predicting a user's salary based on their profile
  • This will improve the marketing efforts, since the users will see the products that they can afford
  • You have also been given historical income verification results, which have the verified salary on them
  • You will also be getting new verification data all the time and your prediction systems need to take this into account
Why use a Binary Classification Model for the above scenario?
  • There are only 2 outcomes:
    1. Users with salaries less than or equal to $50k
    2. Users with salaries greater than $50k
  • Categorical (multi-class) could work, but it cannot be tuned with a score threshold; when there are only 2 outcomes, binary is best.
  • Numeric (regression) will not work, since it produces a range of numbers rather than distinct classes.
Workflow
  • Environment setup:
    • Create an AWS IAM role (attach AmazonMachineLearningFullAccess and AmazonS3FullAccess policies)
    • Launch an EC2 instance with:
      • Amazon Linux (t2.micro)
      • Attach the IAM role created above
  • Data preparation and staging
  • Create datasources
    • Training
    • Evaluation
  • Create and train an Amazon ML model
  • Use an Amazon ML model for predictions
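A sketch of the environment setup using the CLI (the role/profile name and file names are placeholders; the trust policy simply allows EC2 to assume the role):

$ cat << EOF > ec2-trust.json
{
    "Version": "2012-10-17",
    "Statement": [ {
        "Effect": "Allow",
        "Principal": { "Service": "ec2.amazonaws.com" },
        "Action": "sts:AssumeRole"
    } ]
}
EOF
$ aws iam create-role --role-name xtof-ml-s3-role \
      --assume-role-policy-document file://ec2-trust.json
$ aws iam attach-role-policy --role-name xtof-ml-s3-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonMachineLearningFullAccess
$ aws iam attach-role-policy --role-name xtof-ml-s3-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
$ aws iam create-instance-profile --instance-profile-name xtof-ml-s3-role
$ aws iam add-role-to-instance-profile --instance-profile-name xtof-ml-s3-role \
      --role-name xtof-ml-s3-role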
Data preparation and staging
  • Create an S3 bucket for the data and upload data files to it
  • Required format for Amazon ML: CSV
  • Machine learning data rule: garbage in, garbage out
  • Normalize the data
  • Decide on the Amazon ML model to be used based on the dataset

We will be using the "Adult" dataset from the UC Irvine Machine Learning Repository (this data was extracted from the US Census Bureau database).

  • SSH into your EC2 instance:
$ ssh -i /path/to/private/key ec2-user@x.x.x.x

Run all of the following commands from within the EC2 instance.

  • Create an S3 bucket to store the data files in:
$ BUCKET_NAME=<name-of-your-bucket>
$ aws s3 mb s3://${BUCKET_NAME}
  • Download the datasets:
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
  • Convert dataset to format needed:
$ header=$( awk '!/^\||^>|^$/{sub(":","");print $1}END{print "target"}' adult.names | tr '\n' ',' )
$ echo ${header::-1} >salary-training.csv
$ cat adult.data >>salary-training.csv

Note: The above commands are just adding a header (i.e., column names) to the new CSV file.
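If everything worked, the first line of salary-training.csv should be the comma-separated attribute names (derived from adult.names; verify against your copy):

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target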

  • Clean up the salary-training.csv dataset file (the following are vim substitution commands; a non-interactive sed alternative is sketched below):
    • Remove the space after each comma:
    • :%s/, /,/g
    • Convert the salary field to binary format:
    • :%s/,<=50K/,0/g
    • :%s/,>50K/,1/g
    • Replace all ? values with blank fields:
    • :%s/?//g
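If you would rather not edit the file interactively, the same cleanup can be done in one pass with sed (a sketch, assuming GNU sed, as found on Amazon Linux):

$ sed -i -e 's/, /,/g' -e 's/,<=50K/,0/g' -e 's/,>50K/,1/g' -e 's/?//g' salary-training.csv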
Creating a schema for a datasource
  • See: "Creating a Data Schema for Amazon ML" for details.
  • We need to tell Amazon ML how to interpret our dataset by telling it the layout of our data. This is done via a schema.
  • A schema is composed of all attributes in the input data and their corresponding data types. That is, the schema tells ML what type of data is in each of the columns in the CSV file.
  • Amazon ML uses the information in the schema to read and interpret the input data, compute statistics, apply the correct attribute transformations, and fine-tune its learning algorithms.
  • Amazon ML requires a schema or a record layout to be submitted with the data when using the AWS CLI.
  • The four valid data types are:
    • Numeric: any numerical value
    • Binary: 0/1, yes/no, y/n, true/false, t/f
    • Categorical: a list of unique string values
    • Text: strings, words, long-text, etc.

You should end up with a schema file that looks like the following:

$ cat salary-schema.json
{
    "version" : "1.0",
    "rowId" : null,
    "rowWeight" : null,
    "targetAttributeName" : "target",
    "dataFormat" : "CSV",
    "dataFileContainsHeader" : false,
    "attributes": [
        {
            "attributeName": "age",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "workclass",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "fnlwgt",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "education",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "education-num",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "marital-status",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "occupation",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "relationship",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "race",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "sex",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "capital-gain",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "capital-loss",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "hours-per-week",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "native-country",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "target",
            "attributeType": "BINARY"
        }
    ],
    "excludedAttributeNames" : [ ]
}
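  • Optionally, sanity-check that the schema file is valid JSON (python is preinstalled on Amazon Linux):
$ python -m json.tool salary-schema.json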
Create datasource
  • Upload data to our S3 bucket
  • Apply policies to the S3 bucket, such that Amazon ML can:
    • View the bucket listing (s3:ListBucket)
    • Get objects from the bucket (s3:GetObject)
  • Create an Amazon ML datasource with all the data and without any splitting
  • Create an Amazon ML datasource for training our model
  • Create an Amazon ML datasource for evaluating our model
  • Upload data to our S3 bucket:
$ aws s3 cp salary-training.csv s3://${BUCKET_NAME}/
$ aws s3 cp salary-schema.json s3://${BUCKET_NAME}/
  • Create S3 bucket policy JSON file:
$ cat << EOF > salary-bucket-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonML_s3:ListBucket",
            "Effect": "Allow",
            "Principal": {
                "Service": "machinelearning.amazonaws.com"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::${BUCKET_NAME}",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "salary*"
                }
            }
        },
        {
            "Sid": "AmazonML_s3:GetObject",
            "Effect": "Allow",
            "Principal": {
                "Service": "machinelearning.amazonaws.com"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::${BUCKET_NAME}/salary*"
        }
    ]
}
EOF
  • Create S3 bucket policy for our bucket:
$ aws s3api put-bucket-policy --bucket ${BUCKET_NAME} --policy file://salary-bucket-policy.json
  • Check the S3 bucket policy:
$ aws s3api get-bucket-policy --bucket ${BUCKET_NAME}
  • Create the datasource:
$ aws machinelearning create-data-source-from-s3 \
      --data-source-id "salary-data-all" \
      --data-source-name "salary-data-all" \
      --data-spec "DataSchemaLocationS3=s3://${BUCKET_NAME}/salary-schema.json,DataLocationS3=s3://${BUCKET_NAME}/salary-training.csv" \
      --compute-statistics
{
    "DataSourceId": "salary-data-all"
}
  • Poll the status of the datasource until complete (useful when creating a datasource via a script):
$ aws machinelearning wait data-source-available \
      --filter-variable Name --eq "salary-data-all"
  • Create a training datasource using a random selection of 70% of the data:
# write the request to a file first, so that ${BUCKET_NAME} expands in the JSON
$ cat << EOF > salary-ds-training.json
{
    "DataSourceId": "salary-data-model-training",
    "DataSourceName": "salary-data-model-training",
    "ComputeStatistics": true,
    "DataSpec": {
        "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv",
        "DataSchemaLocationS3": "s3://${BUCKET_NAME}/salary-schema.json",
        "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\"}}"
    }
}
EOF
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://salary-ds-training.json
$ time aws machinelearning wait data-source-available \
       --filter-variable Name \
       --eq "salary-data-model-training"

real	4m1.248s
user	0m0.456s
sys	0m0.036s
  • Create an evaluation datasource using a random selection of 30% of the data:
$ cat << EOF > salary-ds-evaluation.json
{
    "DataSourceId": "salary-data-model-evaluation",
    "DataSourceName": "salary-data-model-evaluation",
    "ComputeStatistics": true,
    "DataSpec": {
        "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv",
        "DataSchemaLocationS3": "s3://${BUCKET_NAME}/salary-schema.json",
        "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\", \"complement\": true}}"
    }
}
EOF
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://salary-ds-evaluation.json
  • Get details for all datasources in a given region:
$ aws machinelearning describe-data-sources --region us-east-1
  • Get details for a given datasource:
$ aws machinelearning get-data-source --data-source-id "salary-data-model-evaluation"
{
    "Status": "COMPLETED", 
    "ComputeTime": 960000, 
    "NumberOfFiles": 1, 
    "Name": "salary-data-model-evaluation", 
    "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv", 
    "CreatedByIamUser": "arn:aws:sts::012345678987:assumed-role/xtof-ml-s3-role/i-047ed42adc1043571", 
    "DataSizeInBytes": 1017582, 
    "ComputeStatistics": true, 
    "LastUpdatedAt": 1489770198.35, 
    "DataSourceId": "salary-data-model-evaluation", 
    "StartedAt": 1489769884.035, 
    "LogUri": "https://eml-prod-emr.s3.amazonaws.com/012345678987-ds-salary-data-model-evaluation/...", 
    "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\", \"complement\": true}}", 
    "CreatedAt": 1489769882.101, 
    "FinishedAt": 1489770198.35
}
  • Get even more details for a given datasource (output will include the datasource schema):
$ aws machinelearning get-data-source --data-source-id "salary-data-model-evaluation" --verbose
Create and train an Amazon ML model
  • Create an Amazon ML model from the "salary-data-model-training" datasource:
$ aws machinelearning create-ml-model \
      --ml-model-id "salary-model-v1" \
      --ml-model-name "salary-model-v1" \
      --ml-model-type BINARY \
      --training-data-source-id "salary-data-model-training"
$ aws machinelearning wait ml-model-available --filter-variable Name --eq "salary-model-v1" --region us-east-1
  • Get details on all Amazon ML models:
$ aws machinelearning describe-ml-models
  • Get details on a given Amazon ML model:
$ aws machinelearning describe-ml-models --filter-variable Name --eq "salary-model-v1"
#~OR~
$ aws machinelearning get-ml-model --ml-model-id "salary-model-v1"
#~OR~
$ aws machinelearning get-ml-model --ml-model-id "salary-model-v1" --verbose
  • Evaluate the performance of the Amazon ML model created above:
$ aws machinelearning create-evaluation \
      --evaluation-id "salary-model-eval1" \
      --evaluation-name "salary-model-eval1" \
      --ml-model-id "salary-model-v1" \
      --evaluation-data-source-id "salary-data-model-evaluation"
$ aws machinelearning wait evaluation-available --filter-variable Name --eq "salary-model-eval1"
$ aws machinelearning get-evaluation --evaluation-id "salary-model-eval1"
{
    "EvaluationDataSourceId": "salary-data-model-evaluation", 
    "Status": "COMPLETED", 
    "ComputeTime": 108000, 
    "Name": "salary-model-eval1", 
    "InputDataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv", 
    "EvaluationId": "salary-model-eval1", 
    "CreatedByIamUser": "arn:aws:sts::012345678987:assumed-role/xtof-ml-s3-role/i-047ed42adc1043571", 
    "MLModelId": "salary-model-v1", 
    "LastUpdatedAt": 1489772087.318, 
    "StartedAt": 1489771897.375, 
    "LogUri": "https://eml-prod-emr.s3.amazonaws.com/012345678987-ev-salary-model-eval1/...", 
    "PerformanceMetrics": {
        "Properties": {
            "BinaryAUC": "0.9191914499393076"
        }
    }, 
    "CreatedAt": 1489771894.896, 
    "FinishedAt": 1489772087.318
}

The evaluation of my Amazon ML model produced an AUC (model's quality score) of ~0.919, which is considered extremely good for most machine learning applications.

  • If you want, you can adjust the score threshold (the cut-off that converts a raw prediction score into a 0 or 1 label):
$ aws machinelearning update-ml-model --ml-model-id "salary-model-v1" --score-threshold 0.51

However, we will leave the threshold at the default of 0.5. Note that the threshold does not affect the AUC; it only shifts the trade-off between false positives and false negatives.

  • Create a real-time endpoint for making predictions:
$ aws machinelearning create-realtime-endpoint --ml-model-id "salary-model-v1"
{
    "MLModelId": "salary-model-v1", 
    "RealtimeEndpointInfo": {
        "EndpointStatus": "UPDATING", 
        "PeakRequestsPerSecond": 0, 
        "CreatedAt": 1489772906.293, 
        "EndpointUrl": "https://realtime.machinelearning.us-east-1.amazonaws.com"
    }
}
  • Create a prediction based on given characteristics:
$ aws machinelearning predict --ml-model-id "salary-model-v1" --record \
      "age=34,workclass=Private,fnlwgt=338955,education=Bachelors,education-num=13,marital-status=Never-married,occupation=Armed-Forces,relationship=Unmarried,race=Asian-Pacific-Islander,sex=Male,hours-per-week=40,native-country=United-States" \
      --predict-endpoint "https://realtime.machinelearning.us-east-1.amazonaws.com"
{
    "Prediction": {
        "predictedLabel": "0", 
        "predictedScores": {
            "0": 0.45939722657203674
        }, 
        "details": {
            "PredictiveModelType": "BINARY", 
            "Algorithm": "SGD"
        }
    }
}

The prediction is that someone with the above characteristics will make less than $50k/year (i.e., "predictedLabel" = 0).

  • Create another prediction based on given characteristics:
$ aws machinelearning predict --ml-model-id "salary-model-v1" --record \
      "age=64,workclass=Private,education=Doctorate,marital-status=Married-civ-spouse,sex=Male,native-country=United-States" \
       --predict-endpoint "https://realtime.machinelearning.us-east-1.amazonaws.com"
{
    "Prediction": {
        "predictedLabel": "1", 
        "predictedScores": {
            "1": 0.900274932384491
        }, 
        "details": {
            "PredictiveModelType": "BINARY", 
            "Algorithm": "SGD"
        }
    }
}

The prediction is that someone with the above characteristics will make more than $50k/year (i.e., "predictedLabel" = 1).
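  • When you are done experimenting, delete the real-time endpoint so the capacity reservation charge stops accruing:
$ aws machinelearning delete-realtime-endpoint --ml-model-id "salary-model-v1"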

Glossary

AUC 
Area Under the (ROC) Curve. A random baseline model has an AUC of 0.5, so an AUC well above 0.5 generally indicates a good model; the closer the AUC is to 1, the better the model quality.
