AWS/Machine Learning

From Christoph's Personal Wiki

This article covers Amazon Web Services - Machine Learning (Amazon ML).

Machine Learning concepts

What is Machine Learning (ML)?
  • The basic concept of ML is to have computers or machines learn from data, rather than being explicitly programmed for each task.
  • Machines can analyze large and complex datasets and identify patterns to create models, which are then used to predict outcomes.
  • Over time, these models can take into account new datasets and improve the accuracy of the predictions.
Examples of where ML is being used
  • Recommendations when checking out on an e-commerce site (e.g., purchases on Amazon.com)
  • Spam detection in email
  • Any kind of image, speech, or text recognition
  • Weather forecasts
  • Search engines
What is Amazon ML?
  • Amazon ML uses supervised ML; it learns from labeled examples in historical data.
  • An Amazon ML Model requires your dataset to have both the features and the target for each observation/record.
  • A feature is an attribute of a record used to identify patterns; typically, there will be multiple features.
  • A target is the outcome that the patterns are linked to and is the value the ML algorithm is going to predict.
  • Once learned, this linking of features to targets is used to predict outcomes for new records
  • Example: {go to the grocery store, on Monday} (features) => buy milk (target); a tiny sample dataset is sketched below
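As a concrete sketch (hypothetical data, not from any real datasource), a training dataset for the grocery example might look like the following CSV, where day and store are the features and bought_milk is the target:

day,store,bought_milk
Monday,grocery,1
Tuesday,hardware,0
Monday,grocery,1

Each row is one observation/record; the model learns how the feature values link to the target value.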
Why do ML on AWS?
  • Simplifies the whole process
  • No coding required for creating models
  • Identifies the best ML algorithm to run based on the input data
  • Easily integrates into other AWS services for data retrieval
  • Deploy within minutes
  • Full access via APIs
  • Scalable
Amazon ML pricing (as of March 2017)
  • Data analysis and model building fees: $0.42/hour
  • Batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000
  • Real-time predictions: $0.0001 per prediction, rounded up to the nearest penny (plus hourly capacity reservation charge only when the endpoint is active)
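As a worked example at the rates above (hypothetical usage numbers): 3 hours of data analysis and model building cost 3 x $0.42 = $1.26; a batch of 4,200 predictions is billed as 5,000 predictions, i.e., 5 x $0.10 = $0.50; and 62,000 real-time predictions cost 62,000 x $0.0001 = $6.20, plus the capacity reservation charge while the endpoint is active.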
What is an AWS ML datasource?
  • A datasource is an Amazon ML object that contains information about your input data, including its location, attribute names and types, and descriptive statistics for each attribute. Note that a datasource does not contain your input data, it only points to its location. Operations such as ML model training and evaluation use the datasource ID to locate and access their input data. See "Creating and Using Datasources" and "Data Insights" for details.
Garbage in, Garbage out
  • Amazon ML performs statistical analyses on your training and evaluation datasources to help you identify and fix any data anomalies. In order for Amazon ML to produce the best results, your data must be as clean and consistent as possible. For example, Amazon ML cannot tell that NY, ny, New York, and new_york mean the same thing, so the more consistent your datasource, the more accurate your results.
AWS ML Workflow
  1. Create a data source
    • S3 (e.g., upload a CSV file to S3)
    • RDS and Redshift (e.g., run a SQL query on a Redshift cluster and get the results back directly into Amazon ML)
  2. Identify the feature and target columns
    • Select whether the file has a header row
    • Select the correct field data types (possible types: binary, categorical, numeric, text)
    • Select the target that needs to be predicted
    • Select a Row ID, if the data has one
  3. Train a model with a part of the dataset (generally 70%)
    • By default, AWS ML takes 70% of your data and uses it to train the model
    • It also automatically decides the best ML Model algorithm to use, based on the data schema
      • Binary target => binary model
      • Numeric target => regression model
      • Categorical target => multi-class model
  4. Evaluate the model by running the remaining dataset through it
    • AWS ML automatically evaluates the model based on the data source for you
    • If using the API, you would have to do this in a separate step
  5. Fine-tune the model
  6. Use the model for predictions
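For reference, the same workflow maps onto the AWS CLI roughly as follows. This is only a sketch: the IDs, bucket, and file names are placeholders, and complete working commands appear in the binary-model example below.

# steps 1-2: create a datasource pointing at the CSV and its schema in S3
$ aws machinelearning create-data-source-from-s3 \
      --data-source-id my-ds --data-source-name my-ds \
      --data-spec "DataLocationS3=s3://my-bucket/data.csv,DataSchemaLocationS3=s3://my-bucket/schema.json" \
      --compute-statistics
# step 3: train a model (Amazon ML picks the algorithm from the model type)
$ aws machinelearning create-ml-model \
      --ml-model-id my-model --ml-model-name my-model \
      --ml-model-type BINARY --training-data-source-id my-ds
# step 4: evaluate the model against a held-out datasource
$ aws machinelearning create-evaluation \
      --evaluation-id my-eval --evaluation-name my-eval \
      --ml-model-id my-model --evaluation-data-source-id my-eval-ds
# step 6: run batch predictions, writing results to S3
$ aws machinelearning create-batch-prediction \
      --batch-prediction-id my-preds --batch-prediction-name my-preds \
      --ml-model-id my-model --batch-prediction-data-source-id my-test-ds \
      --output-uri s3://my-bucket/predictions/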
Types of ML models available
  • Binary
    • The target/prediction value is a 0 or 1
    • Best used when the prediction is a Boolean or one of two possible outcomes (e.g., true/false, yes/no, green apple/red apple, etc.)
    • Examples:
      • Does an email match the spam criteria?
      • Will someone respond to a marketing email?
      • Does a purchase on a credit card seem fraudulent?
  • Multi-class
    • The target/prediction is from a set of values
    • Best used for predicting categories or types
    • Examples:
      • What is the next product a user will purchase based on his/her history of purchases?
      • Film recommendations
  • Regression
    • The target/prediction is a numeric value
    • Best used for predicting scores
    • Root mean square error (RMSE)
      • AWS ML uses the mean of the training target values as a naive baseline predictor; the error of that baseline is reported as the RMSE Baseline, which is compared against the RMSE of the model's own predictions (see the formula after this list)
      • An RMSE lower than the RMSE Baseline means the model beats simply guessing the mean
    • Examples:
      • How many millimetres of rain can we expect?
      • Traffic delays
      • How many goals will my soccer team score?
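For reference, the RMSE reported by AWS ML follows the standard definition: for n evaluation records with predicted values \hat{y}_i and actual values y_i,

    RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2 }

The RMSE Baseline is computed the same way, but with every prediction replaced by the mean of the training targets.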

Examples

Multi-Class Model

Scenario
  • Given various attributes about an area of forest, such as wilderness area, soil type, and location.
  • Predict the main type of tree growing in the area (the target column is called "Cover_Type").
  • The target is the value that we want to predict using Amazon ML.
  • In the dataset, the field "Cover_Type" is the target field and it represents the type of tree.
  • We will use the publicly available Forest Cover Type (Covertype) dataset.
Use a multi-class classification model
  • Generate predictions for multiple classes (predict one of more than two outcomes).
  • This problem calls for a multi-class model, since the prediction can be one of several types of trees.
    • Binary will not work, since it can only predict one of two possibilities.
    • Numeric (regression) will not work, since it only predicts numbers.
Workflow
  • Data preparation
  • Create a datasource
  • Create and evaluate an Amazon ML Model
  • Use this Amazon ML Model for prediction
Data preparation
  • Download the training data (train.csv) and prediction data (test.csv) from the dataset's public page.
  • Review the data against the data schema provided with the dataset.
  • Upload the training (train.csv) and prediction (test.csv) data to S3.
Create a datasource
  • Note: A datasource is the container where Amazon ML can access data to use in model training, evaluation, and prediction.
  • As of March 2017, only S3, RDS, and Redshift services are supported for ML datasources. We will use S3 in this example.
  • Select the source of the data. Since we are using S3, select the bucket and filename of the training data (e.g., s3://my-bucket-name/train.csv).
  • Create a name for the datasource (e.g., "ML-Trees-Training").
  • Create a schema for the data (i.e., data types in each column).
  • There are only 4 data types in Amazon ML: binary, categorical, numeric, and text.
  • Select the target (or field/column) you want to predict.
  • Note: It took AWS ~5 minutes to create my datasource (compute time: ~15 min).
Input schema
{
  "version" : "1.0",
  "rowId" : "Id",
  "rowWeight" : null,
  "targetAttributeName" : "Cover_Type",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "Id",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "Elevation",
    "attributeType" : "NUMERIC"
  },
...
  } ],
  "excludedAttributeNames" : [ ]
}
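If you prefer the AWS CLI to the console for this step, the equivalent datasource creation looks roughly like the following (a sketch; the bucket name, datasource ID, and schema filename are hypothetical):

$ aws machinelearning create-data-source-from-s3 \
      --data-source-id "ml-trees-training" \
      --data-source-name "ML-Trees-Training" \
      --data-spec "DataLocationS3=s3://my-bucket-name/train.csv,DataSchemaLocationS3=s3://my-bucket-name/trees-schema.json" \
      --compute-statistics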
Create and evaluate an Amazon ML Model
  • Select a datasource to train and evaluate our Amazon ML model (use the datasource created above).
  • Train:
    • Find patterns in the dataset that correlate the attributes to the target using a part of the data (70%) from the datasource. That is, we want to set aside 30% of our training data to evaluate the training.
    • These patterns are the Amazon ML model.
  • Evaluate:
    • Using the remaining data, which already has a target, predict the target using the Amazon ML model and compare that to the target already on record.
  • Amazon ML model settings:
    1. Default (recommended setting, if you are not a data scientist)
    2. Custom (gives you much more control)
  • Note: It took AWS ~2 minutes to create and evaluate my Amazon ML Model (compute time: ~1 min).
  • Average F1 score: 0.682 (the F1 score measures the quality of the ML model; it ranges from 0 to 1, and a higher F1 score means a better model, so this is a pretty good model)
Use this Amazon ML Model for prediction
  • Normally (or "in the real world"), we will not know the target values a priori, so we will use Amazon ML to predict them.
  • Once we have the Amazon ML model created, we can use it to make predictions.
  • Run a batch prediction using a datasource for the prediction data (test.csv) uploaded earlier (see the CLI sketch after this list).
  • Analyze the results.
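A hedged CLI sketch of the batch-prediction step (the IDs are placeholders for whatever model and test-data datasource IDs you used):

$ aws machinelearning create-batch-prediction \
      --batch-prediction-id "ml-trees-batch1" \
      --batch-prediction-name "ml-trees-batch1" \
      --ml-model-id "ml-trees-model" \
      --batch-prediction-data-source-id "ml-trees-prediction" \
      --output-uri "s3://my-bucket-name/predictions/"

The results are written to the given S3 output URI as compressed CSV files of predicted Cover_Type values.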

Binary Model

Scenario
  • Your company markets products to users based on their salary
  • Users sign up and create a profile
  • The profile combined with the user's annual salary determines what items the user can afford and, thus, should be notified of
  • Once a user agrees to purchase an item, your company goes through a salary verification process
  • The problem is that users embellish their salary and your marketing is not reaching the correct audience
  • You have been given the task of predicting a user's salary based on their profile
  • This will improve the marketing efforts, since the users will see the products that they can afford
  • You have also been given historical income verification results, which have the verified salary on them
  • You will also be getting new verification data all the time and your prediction systems need to take this into account
Why use a Binary Classification Model for the above scenario?
  • There are only 2 outcomes:
    1. Users with salaries less than or equal to $50k
    2. Users with salaries greater than $50k
  • Categorical (multi-class) could work, but it cannot be tuned with a score threshold; when there are only 2 outcomes, binary is best.
  • Numeric (regression) will not work, since it produces a range of numbers rather than distinct classes.
Workflow
  • Environment setup:
    • Create an AWS IAM role (attach AmazonMachineLearningFullAccess and AmazonS3FullAccess policies)
    • Launch an EC2 instance with:
      • Amazon Linux (t2.micro)
      • Attach the IAM role created above
  • Data preparation and staging
  • Create datasources
    • Training
    • Evaluation
  • Create and train an Amazon ML model
  • Use an Amazon ML model for predictions
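A sketch of the environment setup using the CLI (the role/profile name and file names are placeholders; the trust policy simply allows EC2 to assume the role):

$ cat << EOF > ec2-trust.json
{
    "Version": "2012-10-17",
    "Statement": [ {
        "Effect": "Allow",
        "Principal": { "Service": "ec2.amazonaws.com" },
        "Action": "sts:AssumeRole"
    } ]
}
EOF
$ aws iam create-role --role-name xtof-ml-s3-role \
      --assume-role-policy-document file://ec2-trust.json
$ aws iam attach-role-policy --role-name xtof-ml-s3-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonMachineLearningFullAccess
$ aws iam attach-role-policy --role-name xtof-ml-s3-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
$ aws iam create-instance-profile --instance-profile-name xtof-ml-s3-role
$ aws iam add-role-to-instance-profile --instance-profile-name xtof-ml-s3-role \
      --role-name xtof-ml-s3-role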
Data preparation and staging
  • Create an S3 bucket for the data and upload data files to it
  • Required format for Amazon ML: CSV
  • Machine learning data rule: garbage in, garbage out
  • Normalize the data
  • Decide on the Amazon ML model to be used based on the dataset

We will be using the "Adult" dataset from the UC Irvine Machine Learning Repository (this data was extracted from the US Census Bureau database).

  • SSH into your EC2 instance:
$ ssh -i /path/to/private/key ec2-user@x.x.x.x

Run all of the following commands from within the EC2 instance.

  • Create an S3 bucket to store the data files in:
$ BUCKET_NAME=<name-of-your-bucket>
$ aws s3 mb s3://${BUCKET_NAME}
  • Download the datasets:
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
  • Convert dataset to format needed:
$ header=$( awk '!/^\||^>|^$/{sub(":","");print $1}END{print "target"}' adult.names | tr '\n' ',' )
$ echo ${header::-1} >salary-training.csv
$ cat adult.data >>salary-training.csv

Note: The above commands are just adding a header (i.e., column names) to the new CSV file.
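If everything worked, the first line of salary-training.csv should be the comma-separated attribute names (derived from adult.names; verify against your copy):

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target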

  • Clean up the salary-training.csv dataset file (the following are vim substitution commands; a non-interactive sed alternative is sketched below):
    • Remove the space after each comma:
    • :%s/, /,/g
    • Convert the salary field to binary format:
    • :%s/,<=50K/,0/g
    • :%s/,>50K/,1/g
    • Replace all ? values with blank fields:
    • :%s/?//g
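If you would rather not edit the file interactively, the same cleanup can be done in one pass with sed (a sketch, assuming GNU sed, as found on Amazon Linux):

$ sed -i -e 's/, /,/g' -e 's/,<=50K/,0/g' -e 's/,>50K/,1/g' -e 's/?//g' salary-training.csv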
Creating a schema for a datasource
  • See: "Creating a Data Schema for Amazon ML" for details.
  • We need to tell Amazon ML how to interpret our dataset by telling it the layout of our data. This is done via a schema.
  • A schema is composed of all attributes in the input data and their corresponding data types. That is, the schema tells ML what type of data is in each of the columns in the CSV file.
  • Amazon ML uses the information in the schema to read and interpret the input data, compute statistics, apply the correct attribute transformations, and fine-tune its learning algorithms.
  • Amazon ML requires a schema or a record layout to be submitted with the data when using the AWS CLI.
  • The four valid data types are:
    • Numeric: any numerical value
    • Binary: 0/1, yes/no, y/n, true/false, t/f
    • Categorical: a list of unique string values
    • Text: strings, words, long-text, etc.

You should end up with a schema file that looks like the following:

$ cat salary-schema.json
{
    "version" : "1.0",
    "rowId" : null,
    "rowWeight" : null,
    "targetAttributeName" : "target",
    "dataFormat" : "CSV",
    "dataFileContainsHeader" : false,
    "attributes": [
        {
            "attributeName": "age",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "workclass",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "fnlwgt",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "education",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "education-num",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "marital-status",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "occupation",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "relationship",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "race",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "sex",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "capital-gain",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "capital-loss",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "hours-per-week",
            "attributeType": "NUMERIC"
        },
        {
            "attributeName": "native-country",
            "attributeType": "CATEGORICAL"
        },
        {
            "attributeName": "target",
            "attributeType": "BINARY"
        }
    ],
    "excludedAttributeNames" : [ ]
}
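  • Optionally, sanity-check that the schema file is valid JSON (python is preinstalled on Amazon Linux):
$ python -m json.tool salary-schema.json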
Create datasource
  • Upload data to our S3 bucket
  • Apply policies to the S3 bucket, such that Amazon ML can:
    • View the bucket listing (s3:ListBucket)
    • Get objects from the bucket (s3:GetObject)
  • Create an Amazon ML datasource with all the data and without any splitting
  • Create an Amazon ML datasource for training our model
  • Create an Amazon ML datasource for evaluating our model
  • Upload data to our S3 bucket:
$ aws s3 cp salary-training.csv s3://${BUCKET_NAME}/
$ aws s3 cp salary-schema.json s3://${BUCKET_NAME}/
  • Create S3 bucket policy JSON file:
$ cat << EOF > salary-bucket-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonML_s3:ListBucket",
            "Effect": "Allow",
            "Principal": {
                "Service": "machinelearning.amazonaws.com"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::${BUCKET_NAME}",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "salary*"
                }
            }
        },
        {
            "Sid": "AmazonML_s3:GetObject",
            "Effect": "Allow",
            "Principal": {
                "Service": "machinelearning.amazonaws.com"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::${BUCKET_NAME}/salary*"
        }
    ]
}
EOF
  • Create S3 bucket policy for our bucket:
$ aws s3api put-bucket-policy --bucket ${BUCKET_NAME} --policy file://salary-bucket-policy.json
  • Check the S3 bucket policy:
$ aws s3api get-bucket-policy --bucket ${BUCKET_NAME}
  • Create the datasource:
$ aws machinelearning create-data-source-from-s3 \
      --data-source-id "salary-data-all" \
      --data-source-name "salary-data-all" \
      --data-spec "DataSchemaLocationS3=s3://${BUCKET_NAME}/salary-schema.json,DataLocationS3=s3://${BUCKET_NAME}/salary-training.csv" \
      --compute-statistics
{
    "DataSourceId": "salary-data-all"
}
  • Poll the status of the datasource until complete (useful when creating a datasource via a script):
$ aws machinelearning wait data-source-available \
      --filter-variable Name --eq "salary-data-all"
  • Create a training datasource using a random selection of 70% of the data:
# write the request to a file first, so that ${BUCKET_NAME} expands in the JSON
$ cat << EOF > salary-ds-training.json
{
    "DataSourceId": "salary-data-model-training",
    "DataSourceName": "salary-data-model-training",
    "ComputeStatistics": true,
    "DataSpec": {
        "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv",
        "DataSchemaLocationS3": "s3://${BUCKET_NAME}/salary-schema.json",
        "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\"}}"
    }
}
EOF
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://salary-ds-training.json
$ time aws machinelearning wait data-source-available \
       --filter-variable Name \
       --eq "salary-data-model-training"

real	4m1.248s
user	0m0.456s
sys	0m0.036s
  • Create an evaluation datasource using a random selection of 30% of the data:
$ cat << EOF > salary-ds-evaluation.json
{
    "DataSourceId": "salary-data-model-evaluation",
    "DataSourceName": "salary-data-model-evaluation",
    "ComputeStatistics": true,
    "DataSpec": {
        "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv",
        "DataSchemaLocationS3": "s3://${BUCKET_NAME}/salary-schema.json",
        "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\", \"complement\": true}}"
    }
}
EOF
$ aws machinelearning create-data-source-from-s3 --cli-input-json file://salary-ds-evaluation.json
  • Get details for all datasources in a given region:
$ aws machinelearning describe-data-sources --region us-east-1
  • Get details for a given datasource:
$ aws machinelearning get-data-source --data-source-id "salary-data-model-evaluation"
{
    "Status": "COMPLETED", 
    "ComputeTime": 960000, 
    "NumberOfFiles": 1, 
    "Name": "salary-data-model-evaluation", 
    "DataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv", 
    "CreatedByIamUser": "arn:aws:sts::012345678987:assumed-role/xtof-ml-s3-role/i-047ed42adc1043571", 
    "DataSizeInBytes": 1017582, 
    "ComputeStatistics": true, 
    "LastUpdatedAt": 1489770198.35, 
    "DataSourceId": "salary-data-model-evaluation", 
    "StartedAt": 1489769884.035, 
    "LogUri": "https://eml-prod-emr.s3.amazonaws.com/012345678987-ds-salary-data-model-evaluation/...", 
    "DataRearrangement": "{\"splitting\": {\"percentBegin\": 0, \"percentEnd\": 70, \"strategy\": \"random\", \"complement\": true}}", 
    "CreatedAt": 1489769882.101, 
    "FinishedAt": 1489770198.35
}
  • Get even more details for a given datasource (output will include the datasource schema):
$ aws machinelearning get-data-source --data-source-id "salary-data-model-evaluation" --verbose
Create and train an Amazon ML model
  • Create an Amazon ML model from the "salary-data-model-training" datasource:
$ aws machinelearning create-ml-model \
      --ml-model-id "salary-model-v1" \
      --ml-model-name "salary-model-v1" \
      --ml-model-type BINARY \
      --training-data-source-id "salary-data-model-training"
$ aws machinelearning wait ml-model-available --filter-variable Name --eq "salary-model-v1" --region us-east-1
  • Get details on all Amazon ML models:
$ aws machinelearning describe-ml-models
  • Get details on a given Amazon ML model:
$ aws machinelearning describe-ml-models --filter-variable Name --eq "salary-model-v1"
#~OR~
$ aws machinelearning get-ml-model --ml-model-id "salary-model-v1"
#~OR~
$ aws machinelearning get-ml-model --ml-model-id "salary-model-v1" --verbose
  • Evaluate the performance of the Amazon ML model created above:
$ aws machinelearning create-evaluation \
      --evaluation-id "salary-model-eval1" \
      --evaluation-name "salary-model-eval1" \
      --ml-model-id "salary-model-v1" \
      --evaluation-data-source-id "salary-data-model-evaluation"
$ aws machinelearning wait evaluation-available --filter-variable Name --eq "salary-model-eval1"
$ aws machinelearning get-evaluation --evaluation-id "salary-model-eval1"
{
    "EvaluationDataSourceId": "salary-data-model-evaluation", 
    "Status": "COMPLETED", 
    "ComputeTime": 108000, 
    "Name": "salary-model-eval1", 
    "InputDataLocationS3": "s3://${BUCKET_NAME}/salary-training.csv", 
    "EvaluationId": "salary-model-eval1", 
    "CreatedByIamUser": "arn:aws:sts::012345678987:assumed-role/xtof-ml-s3-role/i-047ed42adc1043571", 
    "MLModelId": "salary-model-v1", 
    "LastUpdatedAt": 1489772087.318, 
    "StartedAt": 1489771897.375, 
    "LogUri": "https://eml-prod-emr.s3.amazonaws.com/012345678987-ev-salary-model-eval1/...", 
    "PerformanceMetrics": {
        "Properties": {
            "BinaryAUC": "0.9191914499393076"
        }
    }, 
    "CreatedAt": 1489771894.896, 
    "FinishedAt": 1489772087.318
}

The evaluation of my Amazon ML model produced an AUC (model's quality score) of ~0.919, which is considered extremely good for most machine learning applications.

  • If you want, you can adjust the score threshold (the cut-off that converts a raw prediction score into a 0 or 1 label):
$ aws machinelearning update-ml-model --ml-model-id "salary-model-v1" --score-threshold 0.51

However, we will leave the threshold at the default of 0.5. Note that the threshold does not affect the AUC; it only shifts the trade-off between false positives and false negatives.

  • Create a real-time endpoint for making predictions:
$ aws machinelearning create-realtime-endpoint --ml-model-id "salary-model-v1"
{
    "MLModelId": "salary-model-v1", 
    "RealtimeEndpointInfo": {
        "EndpointStatus": "UPDATING", 
        "PeakRequestsPerSecond": 0, 
        "CreatedAt": 1489772906.293, 
        "EndpointUrl": "https://realtime.machinelearning.us-east-1.amazonaws.com"
    }
}
  • Create a prediction based on given characteristics:
$ aws machinelearning predict --ml-model-id "salary-model-v1" --record \
      "age=34,workclass=Private,fnlwgt=338955,education=Bachelors,education-num=13,marital-status=Never-married,occupation=Armed-Forces,relationship=Unmarried,race=Asian-Pacific-Islander,sex=Male,hours-per-week=40,native-country=United-States" \
      --predict-endpoint "https://realtime.machinelearning.us-east-1.amazonaws.com"
{
    "Prediction": {
        "predictedLabel": "0", 
        "predictedScores": {
            "0": 0.45939722657203674
        }, 
        "details": {
            "PredictiveModelType": "BINARY", 
            "Algorithm": "SGD"
        }
    }
}

The prediction is that someone with the above characteristics will make less than $50k/year (i.e., "predictedLabel" = 0).

  • Create another prediction based on given characteristics:
$ aws machinelearning predict --ml-model-id "salary-model-v1" --record \
      "age=64,workclass=Private,education=Doctorate,marital-status=Married-civ-spouse,sex=Male,native-country=United-States" \
       --predict-endpoint "https://realtime.machinelearning.us-east-1.amazonaws.com"
{
    "Prediction": {
        "predictedLabel": "1", 
        "predictedScores": {
            "1": 0.900274932384491
        }, 
        "details": {
            "PredictiveModelType": "BINARY", 
            "Algorithm": "SGD"
        }
    }
}

The prediction is that someone with the above characteristics will make more than $50k/year (i.e., "predictedLabel" = 1).
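  • When you are done experimenting, delete the real-time endpoint so the capacity reservation charge stops accruing:
$ aws machinelearning delete-realtime-endpoint --ml-model-id "salary-model-v1"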

Glossary

AUC 
Area Under the (ROC) Curve. A random baseline model has an AUC of 0.5, so an AUC well above 0.5 generally indicates a good model; the closer the AUC is to 1, the better the model quality.
