AWS/Machine Learning
From Christoph's Personal Wiki
This article is about Amazon Web Services (AWS) Machine Learning (ML).
Machine Learning concepts
- What is Machine Learning (ML)?
- The basic concept of ML is to have computers or machines program themselves.
- Machines can analyze large and complex datasets and identify patterns to create models, which are then used to predict outcomes.
- Over time, these models can take into account new datasets and improve the accuracy of the predictions.
- Examples of where ML is being used
- Recommendations when checking out on an e-commerce site (e.g., purchases on Amazon.com)
- Spam detection in email
- Any kind of image, speech, or text recognition
- Weather forecasts
- Search engines
- What is Amazon ML?
- Amazon ML is a supervised ML service: it learns from examples or historical data.
- An Amazon ML Model requires your dataset to have both the features and the target for each observation/record.
- A feature is an attribute of a record used to identify patterns; typically, there will be multiple features.
- A target is the outcome that the patterns are linked to and is the value the ML algorithm is going to predict.
- This link between features and target is what the model uses to predict outcomes
- Example: goes to the grocery store, on a Monday (features) => buys milk (target)
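The feature/target split described above can be illustrated with a small sketch. The records and the column names (`location`, `day`, `bought_milk`) are made up for illustration and are not part of any Amazon ML schema; the point is only that every observation carries both its features and its known target.

```python
# Hypothetical observations: every record has features AND a known target,
# which is what supervised learning needs.
records = [
    {"location": "grocery store", "day": "Monday", "bought_milk": 1},
    {"location": "grocery store", "day": "Friday", "bought_milk": 0},
    {"location": "hardware store", "day": "Monday", "bought_milk": 0},
]

TARGET = "bought_milk"  # assumed name of the target column


def split_features_target(record, target=TARGET):
    """Return (features, target_value) for a single observation."""
    features = {name: value for name, value in record.items() if name != target}
    return features, record[target]


features, target_value = split_features_target(records[0])
print("features:", features)
print("target:", target_value)
```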
- Why do ML on AWS?
- Simplifies the whole process
- No coding required for creating models
- Identifies the best ML algorithm to run based on the input data
- Easily integrates into other AWS services for data retrieval
- Deploy within minutes
- Full access via APIs
- Scalable
- Amazon ML pricing (as of March 2017)
- Data Analysis and Model Building fees: $0.42/hour
- Batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000
- Real-time predictions: $0.0001 per prediction, rounded up to the nearest penny, plus an hourly capacity reservation charge that applies only while the endpoint is active
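As a worked example of the March 2017 rates listed above, here is a minimal cost sketch. The function names are my own, the real-time rounding is assumed to apply to the total rather than to each prediction, and the capacity reservation charge is left out because its rate is not given here.

```python
from decimal import Decimal, ROUND_CEILING

# Rates copied from the list above (as of March 2017).
COMPUTE_RATE_PER_HOUR = Decimal("0.42")
BATCH_RATE_PER_1000 = Decimal("0.10")
REALTIME_RATE_PER_PREDICTION = Decimal("0.0001")


def compute_cost(hours):
    """Data analysis and model building fee for the given number of hours."""
    return COMPUTE_RATE_PER_HOUR * Decimal(str(hours))


def batch_prediction_cost(num_predictions):
    """Batch fee: the prediction count is rounded up to the next 1,000."""
    blocks = -(-num_predictions // 1000)  # integer ceiling division
    return blocks * BATCH_RATE_PER_1000


def realtime_prediction_cost(num_predictions):
    """Real-time fee, rounded up to the nearest cent.

    Assumption: rounding applies to the total. The hourly capacity
    reservation charge is NOT included.
    """
    total = num_predictions * REALTIME_RATE_PER_PREDICTION
    return total.quantize(Decimal("0.01"), rounding=ROUND_CEILING)


print(compute_cost(2))                 # two hours of model building
print(batch_prediction_cost(2500))     # billed as 3 blocks of 1,000
print(realtime_prediction_cost(1234))  # rounded up to the next cent
```

Decimal is used instead of float so that the round-up steps are not thrown off by binary floating-point error.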
- AWS ML Workflow
- Create a data source
- S3 (i.e., upload a CSV file to S3)
- RDS and Redshift (i.e., run a SQL query on a Redshift cluster and get the data back directly into ML)
- Identify the feature and target columns
- Select whether the file has a header row
- Select the correct field data types (possible types: binary, categorical, numeric, text)
- Select the target that needs to be predicted
- Select a Row ID, if the data has one
- Train a model with a part of the dataset (generally 70%)
- By default, AWS ML takes 70% of your data and uses it to train the model
- It also automatically decides the best ML Model algorithm to use, based on the data schema
- Binary target => binary model
- Numeric target => regression model
- Categorical target => multi-class model
- Evaluate the model by running the remaining dataset through it
- AWS ML automatically evaluates the model against the remaining part of the data source for you
- If using the API, you would have to do this in a separate step
- Fine-tune the model
- Use the model for predictions
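Two steps of the workflow above lend themselves to a small sketch: the 70/30 train/evaluate split and the mapping from target type to model type. This is plain Python, not the Amazon ML API. The helper names are hypothetical, and the split here simply keeps the original row order (whether the service shuffles first is not covered in these notes).

```python
# Target data type => model type, as listed in the workflow above.
MODEL_TYPE_BY_TARGET = {
    "binary": "binary",
    "numeric": "regression",
    "categorical": "multi-class",
}


def choose_model_type(target_type):
    """Return the model type for a given target data type."""
    try:
        return MODEL_TYPE_BY_TARGET[target_type]
    except KeyError:
        raise ValueError(f"unsupported target type: {target_type!r}") from None


def split_dataset(rows, train_percent=70):
    """Split rows into (training, evaluation) parts, keeping their order.

    Assumption: a simple sequential cut; integer arithmetic avoids
    floating-point surprises when computing the cut point.
    """
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


rows = list(range(10))  # stand-in for ten observations
training, evaluation = split_dataset(rows)
print(len(training), "training rows,", len(evaluation), "evaluation rows")
print(choose_model_type("numeric"))
```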
- Types of ML models available
- Binary
- The target/prediction value is a 0 or 1
- Best used when the prediction is a Boolean or one of two possible outcomes (e.g., true/false, yes/no, green apple/red apple, etc.)
- Examples:
- Does an email match the spam criteria?
- Will someone respond to a marketing email?
- Does a purchase on a credit card seem fraudulent?
- Multi-class
- The target/prediction is from a set of values
- Best used for predicting categories or types
- Examples:
- What is the next product a user will purchase, based on their purchase history?
- Film recommendations
- Regression
- The target/prediction is a numeric value
- Best used for predicting scores
- Root mean square error (RMSE)
- RMSE measures how far the model's predictions are, on average, from the actual target values
- AWS ML also reports an RMSE Baseline: the RMSE of a hypothetical model that always predicts the mean of the target
- An RMSE lower than the RMSE Baseline is better
- Examples:
- How many millimetres of rain can we expect?
- Traffic delays
- How many goals will my soccer team score?
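To make the RMSE comparison for regression models concrete, here is a minimal sketch using the standard RMSE formula. The numbers are invented, and the baseline is computed as the RMSE of always predicting the mean of the training targets, which is my reading of how the RMSE Baseline is defined.

```python
import math


def rmse(actuals, predictions):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_baseline(training_targets, actuals):
    """RMSE of a naive model that always predicts the training-target mean."""
    mean_target = sum(training_targets) / len(training_targets)
    return rmse(actuals, [mean_target] * len(actuals))


# Hypothetical rainfall data in millimetres.
training_targets = [3.0, 4.0, 5.0]   # targets seen during training (mean 4.0)
actuals = [2.0, 4.0, 6.0]            # true values in the evaluation set
predictions = [2.5, 3.5, 6.5]        # what the model predicted

model_rmse = rmse(actuals, predictions)
baseline = rmse_baseline(training_targets, actuals)
print(f"RMSE: {model_rmse:.3f}, RMSE Baseline: {baseline:.3f}")
print("better than baseline" if model_rmse < baseline else "no better than baseline")
```

Here every prediction is off by 0.5, so the model's RMSE is 0.5, well below the baseline of always guessing 4.0.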