Big Data

Big Data is a term for data sets that are so large or complex that traditional data-processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

Big Data terms and concepts

  • Doug Laney's "4 V's of Big Data":[1]
Volume 
Extremely large volumes of data (e.g., petabytes or exabytes, as of February 2017)
Variety 
Various forms of data (structured, semi-structured, and unstructured)
Velocity 
Real-time (e.g., IoT, social media, sensors), batch, or streaming data. Usually either human- or machine-generated data.
Veracity or variability 
Inconsistent, sometimes inaccurate, varying, or missing data
  • Format of Big Data:
Structured 
Data that has a defined length and format (aka "schema"). Examples include numbers, words, dates, etc. Easy to store and analyse. Often managed using SQL.
Semi-structured
Between structured and unstructured. Does not conform to a specific format, but is self-describing, often involving simple key-value pairs. Examples include JSON, SWIFT (financial transactions), and EDI (healthcare).
Unstructured
Data that does not follow a specific format. Examples include audio, video, images, text messages, etc.
  • Big Data Analytics:
Basic analytics 
Reporting, dashboards, simple visualizations, slicing and dicing.
Advanced analytics 
Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
Operationalized analytics 
Embedded big data analytics in a business process to streamline and increase efficiency.
Analytics for business decisions 
Implemented for better decision-making, which drives revenue.
  • What is IoT?
    • Internet of Things
    • Physical objects that are connected to the Internet
    • Identified by an IP address (IPv4 now; IPv6 in the future)
    • Devices communicate with each other and other Internet-enabled devices and systems
    • Includes everyday devices that utilize embedded technology to communicate with an external environment by connecting to the Internet
    • IoT data is high volume, high velocity, high variety, and high veracity
  • Examples of IoT:
    • Security systems
    • Thermostats (e.g., Nest)
    • Vehicles
    • Electronic appliances
    • Smart-lighting in households or commercial buildings (e.g., Philips Hue)
    • Fitness devices (e.g., Fitbit)
    • Sensors to measure environmental parameters (e.g., temperature, humidity, wind, etc.)
Cycle of Big Data management
  • Capture data: depending on the problem to be solved, decide on the data sources and the data to be collected.
  • Organize: cleanse, organize, and validate data. If data contains sensitive information, implement sufficient levels of security and governance.
  • Integrate: integrate with business rules and other relevant systems like data warehouses, CRMs, ERPs (Enterprise Resource Planning), etc.
  • Analyze: real-time analysis, batch type analysis, reports, visualizations, advanced analytics, etc.
  • Act: use analysis to solve the business problem.
Components of a Big Data infrastructure
  • Redundant physical infrastructure: hardware, storage servers, network, etc.
  • Security infrastructure: maintaining security and governance on data is critical to protect from misuse of Big Data.
  • Data stores: to capture structured, semi-structured, and unstructured data. Data stores need to be fast, scalable, and durable.
  • Organize and integrate data: stage, clean, organize, normalize, transform, and integrate data.
  • Analytics: traditional analytics, including Business Intelligence, as well as advanced analytics.
  • Data:
    • text, audio, video, etc.
    • social media
    • machine generated
    • human generated
  • Capture:
    • distributed file systems
    • streaming data
    • NoSQL
    • RDBMS
  • Organize and integrate
    • Apache Spark SQL
    • Hadoop MapReduce
    • ETL / ELT
    • data warehouse
  • Analyze
    • predictive analytics
    • advanced analytics
    • social media and text analytics
    • alerts and recommendations
    • visualization, reports, dashboards
Physical infrastructure
  • Your physical infrastructure can make or break your Big Data implementation. It has to support the high volume, velocity, and variety of Big Data and be highly available, resilient, and redundant.
  • Requirements to factor while designing the infrastructure include performance, availability, scalability, flexibility, and costs.
  • Networks must be redundant and resilient and have sufficient capacity to accommodate the anticipated volume and velocity of data in addition to normal business data. Your infrastructure should be elastic.
  • Hardware storage and servers must have sufficient computing power and memory to support analytics requirements.
  • Infrastructure operations: Managing and maintaining data centres to avoid catastrophic failure and thus preserve the integrity of data and continuity of business processes.
  • Cloud-based infrastructures allow organizations to outsource the building and management of their Big Data infrastructure.
Security infrastructure
  • Data access: Same as in non-Big Data implementations; data access is granted only to users who have a legitimate business reason to access the data.
  • Application access: Access to data from applications is defined by the restrictions imposed by an API.
  • Data encryption: Encrypting and decrypting high-volume, high-velocity, high-variability data can be expensive (computationally and to your wallet). An alternative is to encrypt only the elements of the data that are sensitive and critical (a minimal sketch of this approach follows this list).
  • Threat detection: With exposure to social media and mobile data comes increased exposure to threats. Multiple layers of defence for network security are required to protect from security threats.
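
As an illustration of the data-encryption point above, here is a minimal Java sketch of field-level encryption using the standard javax.crypto API: only the sensitive field of a record is encrypted rather than the whole high-volume stream. The field name, sample value, and in-process key generation are assumptions for illustration; in practice the key would come from a key-management service and the IV would be stored alongside the ciphertext.

  import javax.crypto.Cipher;
  import javax.crypto.KeyGenerator;
  import javax.crypto.SecretKey;
  import javax.crypto.spec.GCMParameterSpec;
  import java.nio.charset.StandardCharsets;
  import java.security.SecureRandom;
  import java.util.Base64;

  public class FieldEncryptionSketch {
      public static void main(String[] args) throws Exception {
          // Generate an AES key (hypothetical; a real system would fetch it from a KMS).
          KeyGenerator keyGen = KeyGenerator.getInstance("AES");
          keyGen.init(128);                                // 128-bit to avoid JCE policy issues on older JDKs
          SecretKey key = keyGen.generateKey();

          // Only the sensitive element of the record is encrypted; the rest stays in plain text.
          String sensitiveField = "4111-1111-1111-1111";   // made-up credit-card number

          byte[] iv = new byte[12];                        // 96-bit IV, as recommended for GCM
          new SecureRandom().nextBytes(iv);

          Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
          cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
          byte[] ciphertext = cipher.doFinal(sensitiveField.getBytes(StandardCharsets.UTF_8));
          System.out.println("encrypted: " + Base64.getEncoder().encodeToString(ciphertext));

          // Decryption uses the same key and IV.
          Cipher decipher = Cipher.getInstance("AES/GCM/NoPadding");
          decipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
          System.out.println("decrypted: " + new String(decipher.doFinal(ciphertext), StandardCharsets.UTF_8));
      }
  }
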
Data stores to capture data
  • Data stores are at the core of the Big Data infrastructure and need to be fast, scalable, and highly available (HA).
  • A number of different data stores are available and each is suitable for a set of different requirements.
  • Example data stores include:
    • Distributed file systems (e.g., Hadoop Distributed File System (HDFS))
    • NoSQL databases (e.g., Cassandra, MongoDB)
    • Traditional RDBMSs (e.g., MySQL, Postgres)
  • Real-time streaming data can be ingested using Apache Kafka, Apache Storm, Apache Spark Streaming, etc.
Distributed file systems

A shared file system mounted on several nodes of a cluster that has the following characteristics:

  • Access transparency: Files can be accessed the same way as local files, without knowing their exact location.
  • Concurrency transparency: All clients have the same view of the files.
  • Failure transparency: Clients should have a correct view after a server failure.
  • Scalability transparency: Should work for small loads and scale to bigger loads as nodes are added.
  • Replication transparency: To support fault tolerance and high availability, replication across multiple servers must be transparent to the client.
Relational Database Management Systems (RDBMSs)

Relational databases make it easy to store and query data, and they follow the ACID principles (a minimal transaction sketch follows the list below):

  • Atomicity: A transaction is all-or-nothing. If any part of the transaction fails, the entire transaction fails.
  • Consistency: Any transaction will bring a database from one valid state to another.
  • Isolation: Ensures that concurrently executed transactions leave the system in a state similar to the one obtained if the transactions had been executed serially.
  • Durability: Ensures that once a transaction is committed, it remains committed, irrespective of database crashes, power failures, or other errors.
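
A minimal sketch of how these properties surface to an application, using plain JDBC; the connection string, table, and column names are hypothetical. The two updates either commit together or are rolled back together (atomicity), and once commit() returns the change is durable.

  import java.math.BigDecimal;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class AcidTransferSketch {
      public static void main(String[] args) throws SQLException {
          // Hypothetical database and "accounts" table, for illustration only.
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://localhost:5432/bank", "user", "password")) {
              conn.setAutoCommit(false);                   // start an explicit transaction
              try (PreparedStatement debit = conn.prepareStatement(
                       "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                   PreparedStatement credit = conn.prepareStatement(
                       "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                  debit.setBigDecimal(1, new BigDecimal("100.00"));
                  debit.setInt(2, 1);
                  debit.executeUpdate();

                  credit.setBigDecimal(1, new BigDecimal("100.00"));
                  credit.setInt(2, 2);
                  credit.executeUpdate();

                  conn.commit();                           // both updates become visible and durable together
              } catch (SQLException e) {
                  conn.rollback();                         // atomicity: neither update is applied
                  throw e;
              }
          }
      }
  }
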
NoSQL databases

"Not only SQL" (NoSQL) implies that there is more than one mechanism for storing data, depending on the need. NoSQL databases are mostly open source and built to handle the requirements of fast, scalable processing of unstructured data. The main types of NoSQL databases are:

  • Document-oriented data stores are mainly designed to store and retrieve collections of documents and support complex data forms in several standard formats, such as JSON, XML, and binary forms (e.g., PDFs or MS Word documents). Examples: MongoDB, CouchDB.
  • A column-oriented database stores its content in columns instead of rows, with attribute values belonging to the same column stored contiguously. Examples: Cassandra, HBase.
  • A graph database is designed to store and represent data using a graph model, with nodes, edges, and properties related to one another through relations. Example: Neo4j.
  • A key-value store holds data blobs referenced by a key. This is the simplest form of NoSQL and is high-performing. Examples: Memcached, Couchbase. (A small sketch contrasting these data models follows this list.)
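
A small, store-agnostic Java sketch (no client libraries) contrasting how the same made-up record might be laid out in the key-value, document, and column-oriented models described above:

  import java.util.HashMap;
  import java.util.Map;

  public class NoSqlModelsSketch {
      public static void main(String[] args) {
          // Key-value model: an opaque blob referenced by a key (the store does not interpret the value).
          Map<String, byte[]> keyValueStore = new HashMap<>();
          keyValueStore.put("user:42", "{\"name\":\"Ada\",\"city\":\"London\"}".getBytes());

          // Document model: the store understands the document's structure and can index or query its fields.
          String document = "{ \"_id\": \"user:42\", \"name\": \"Ada\", \"city\": \"London\", "
                          + "\"orders\": [ {\"sku\": \"B-17\", \"qty\": 2} ] }";

          // Column-oriented model (conceptually): values for the same attribute are stored together,
          // addressed by row key and column name.
          Map<String, Map<String, String>> columns = new HashMap<>();
          columns.put("user:42", Map.of("name", "Ada", "city", "London"));

          System.out.println(new String(keyValueStore.get("user:42")));
          System.out.println(document);
          System.out.println(columns);
      }
  }
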
Extract Transform Load (ETL)
  • Extract: Read data from the data source.
  • Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database.
  • Load: Write data to the target database.

Apache Hadoop's MapReduce is a popular choice for batch processing large volumes of data in a scalable and resilient style. Apache Spark is more suitable for applying complex analytics using machine learning models in an interactive approach.
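
To make the three ETL steps above concrete, here is a minimal single-machine sketch in plain Java; the file names and the assumed three-column CSV layout are hypothetical, and a real pipeline would load into a database or data warehouse rather than a second file.

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.util.List;
  import java.util.stream.Collectors;

  public class EtlSketch {
      public static void main(String[] args) throws IOException {
          Path source = Paths.get("sales_raw.csv");        // hypothetical source
          Path target = Paths.get("sales_clean.csv");      // hypothetical target

          // Extract: read records from the data source.
          List<String> rawRecords = Files.readAllLines(source, StandardCharsets.UTF_8);

          // Transform: drop malformed rows and normalize the date separator.
          List<String> transformed = rawRecords.stream()
                  .filter(line -> line.split(",").length == 3)   // keep only well-formed rows
                  .map(line -> line.replace('/', '-'))           // e.g., 2017/03/06 -> 2017-03-06
                  .collect(Collectors.toList());

          // Load: write the conformed records to the target.
          Files.write(target, transformed, StandardCharsets.UTF_8);
      }
  }
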

Data Warehouse

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workloads from transaction workloads and enables an organization to consolidate data from several sources.

Properties of a data warehouse include:

  • Organized by subject area. It contains an abstracted view of the business.
  • Highly transformed and structured.
  • Data loaded in the data warehouse is based on a strictly defined use case.
Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.

Data lakes can be built on Hadoop's HDFS or in the Cloud using Amazon S3, Google Cloud Storage, etc.

Differences between a Data Lake and a Data Warehouse
  • Data Lake:
    • Retains all data.
    • Holds data from diverse sources, such as web server logs, sensor data, social network activity, text and images, etc.
    • Processing of data follows "schema-on-read".
    • Extremely agile and can adapt to changes in business requirements quickly.
    • A data lake is harder to secure, but security is a top priority for data lake providers (as of early 2017).
    • Hardware to support a data lake consists of commodity servers connected in a cluster.
    • Used for advanced analytics, typically by data scientists.
  • Data Warehouse:
    • Retains only the data required for the business use case; that data is highly modelled and structured.
    • Consists of data extracted from transactional systems: quantitative metrics and the attributes that describe them.
    • Processing of data follows "schema-on-write".
    • Due to its structured modelling, it is time-consuming to adapt to changes in business processes.
    • An older technology and mature in security.
    • Hardware consists of enterprise-grade servers and vendor provided data warehousing software.
    • Used for operational analytics to report on KPIs and slices of the data.
Analyzing Big Data

Examples of analytics:

  • Predictive analytics: Using statistical models and machine learning algorithms to predict the outcome of a task.
  • Advanced analytics: Applying deep learning for applications like speech recognition, face recognition, genomics, etc.
  • Social media analytics: Analyze social media to determine market trends, forecast demand, etc.
  • Text analytics: Derive insights from text using Natural Language Processing (NLP).
  • Alerts and recommendations: Fraud detection and recommendation engines embedded within commerce applications.
  • Reports, dashboards, and visualizations: Descriptive analytics providing answers to what, when, and where questions.
Big Data in the Cloud
Comparison of Big Data Cloud platforms
  • Big Data storage: Amazon S3; Google Cloud Storage; Microsoft Azure Storage
  • MapReduce: Amazon Elastic MapReduce (EMR); Google N/A (MapReduce replaced by Cloud Dataflow); Microsoft Hadoop on Azure, HDInsight
  • Big Data analytics: Amazon EMR with Apache Spark and Presto; Google Cloud Dataproc, Cloud Datalab; Microsoft HDInsight
  • Relational database: Amazon RDS; Google Cloud SQL; Microsoft SQL Database as a Service
  • NoSQL database: Amazon DynamoDB or EMR with Apache HBase; Google Cloud Bigtable, Cloud Datastore; Microsoft DocumentDB, Table Storage
  • Data warehouse: Amazon Redshift; Google BigQuery; Microsoft SQL Data Warehouse as a Service
  • Stream processing: Amazon Kinesis; Google Cloud Dataflow; Microsoft Azure Stream Analytics
  • Machine learning: Amazon ML; Google Cloud Machine Learning services, Vision API, Speech API, Natural Language API, Translate API; Microsoft Azure ML, Azure Cognitive Services


Big Data tools and technologies

Apache Hadoop
  • Hadoop is an open-source software framework for distributed storage and the distributed processing of very large data sets on compute clusters built from commodity hardware.
  • Originally built by Yahoo engineer Doug Cutting in 2005 and named after his son's toy elephant.
  • Inspired by Google's MapReduce and Google File System papers.
  • Written in Java to implement the MapReduce programming model for scalable, reliable, and distributed computing.

The Apache Hadoop framework is composed of the following modules:

  • Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop MapReduce: A programming model for large-scale data processing.
  • Hadoop YARN: A resource management platform responsible for managing compute resources in clusters and using them for the scheduling of users' applications.
Hadoop Distributed File System (HDFS)
  • Structured like a regular Unix-like file system with data storage distributed across several machines in a cluster.
  • A data service that sits atop regular file systems, allowing a fault-tolerant, resilient, clustered approach to storing and processing data.
  • Fault-tolerant: Detection of faults and quick automatic recovery is a core architectural goal.
  • Tuned to support large files. Typical file sizes are in the gigabytes to terabytes, and HDFS can support tens of millions of files by scaling to hundreds of nodes in a cluster.
  • Follows the write-once, read-many approach, which simplifies data coherency and enables high-throughput data access. Example: a web crawler application.
  • Optimized for throughput rather than latency. Hence, it is suited to long-running batch operations on large-scale data rather than interactive analysis on streaming data.
  • Moving computation close to the data reduces network congestion and latency and increases throughput. HDFS provides interfaces for applications to move themselves closer to where the data is stored.
HDFS architecture
  • Master-slave architecture, where Namenode is the master and Datanodes are the slaves.
  • Files are split into blocks, and blocks are stored on Datanodes, of which there is generally one per node in the cluster.
  • Datanodes manage storage attached to the nodes that they run on.
  • The Namenode controls all metadata, including what blocks make up a file and which Datanode the blocks are stored on.
  • The Namenode executes file system operations like opening, closing, and renaming files and directories.
  • Datanodes serve read/write requests from the clients.
  • Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode.
  • The Namenode and Datanodes are Java software designed to run on commodity hardware that supports Java.
  • Usually, a cluster contains a single Namenode and multiple Datanodes: one Datanode for each node in the cluster.
File system and replication in HDFS
  • Supports a traditional hierarchical file system: users or applications can create directories, store files inside them, and move, rename, or delete files and directories (a minimal client-side sketch follows this list).
  • Stores each file as a sequence of blocks, where blocks are replicated for fault-tolerance. Block size and replication factors are configurable per file.
  • The Namenode makes all decisions regarding replication of blocks. It periodically receives a heartbeat and a BlockReport from each of the Datanodes in the cluster; receipt of a heartbeat implies that the Datanode is functioning properly.
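
A minimal client-side sketch of these file-system operations using Hadoop's Java FileSystem API; it assumes a cluster whose fs.defaultFS is available on the classpath (core-site.xml), and the paths are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
          // Picks up fs.defaultFS (e.g., hdfs://namenode:8020) from the Hadoop configuration.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // Hierarchical namespace: create a directory and write a file inside it.
          Path dir = new Path("/data/logs");
          fs.mkdirs(dir);

          Path file = new Path(dir, "events.txt");
          try (FSDataOutputStream out = fs.create(file)) {  // write once ...
              out.writeUTF("first event\n");
          }

          // Replication is configurable per file; the Namenode decides where the replicas go.
          fs.setReplication(file, (short) 3);

          // Rename and delete work on files and directories alike.
          fs.rename(file, new Path(dir, "events-renamed.txt"));
          // fs.delete(dir, true);                          // recursive delete, left commented out

          fs.close();
      }
  }
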
Hadoop MapReduce
  • Software framework:
    • For easily writing applications that process vast amounts of data (multi-terabyte data sets)
    • Enables parallel processing on large clusters of thousands of nodes
    • Runs on clusters made of commodity hardware
    • Is reliable and fault-tolerant
  • The Hadoop MapReduce layer consists of two components:
    • An API for writing MapReduce workflows in Java
    • A set of services for managing these workflows, providing scheduling, distribution, and parallelization (a minimal WordCount mapper and reducer are sketched after this list).
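
A minimal sketch of the classic WordCount mapper and reducer written against the org.apache.hadoop.mapreduce API, to make the two components above concrete:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: emits (word, 1) for every word in its input split.
  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              context.write(word, ONE);
          }
      }
  }

  // Reducer: receives (word, [1, 1, ...]) after the shuffle/sort and sums the counts.
  class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }
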
A MapReduce job consists of the following steps
  • The input data set is split into independent chunks.
  • The chunks are processed by map tasks in parallel.
  • The MapReduce framework sorts the outputs of the map tasks and feeds them to the reduce tasks.
  • Both the input and the output of map and reduce tasks are stored on the file system.
  • The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
  • The MapReduce framework and HDFS run on the same set of nodes. Tasks are scheduled on nodes where the data is already present, yielding high aggregate bandwidth across the cluster.
Inputs and outputs in a MapReduce job
  • MapReduce operates exclusively on key-value pairs.
  • Input is a large-scale data set that benefits from parallel processing and does not fit on a single machine.
  • Input is split into multiple independent data sets and the map function produces a key-value pair for each record in the data set.
  • The outputs of the mappers are shuffled, sorted, grouped, and passed to the reducers.
  • The reducer function is applied to the set of key-value pairs that share the same key; it often aggregates the values for the pairs with the same key.
  • Almost all data can be mapped to a key-value pair by using a map function.
  • Keys and values can be of any type: string, numeric, or a custom type. Since they have to be serializable, a custom type must implement the Writable interface provided by Hadoop (a minimal custom Writable is sketched after this list).
  • MapReduce cannot be used if a computation of a value depends on a previously computed value. Recursive functions like Fibonacci cannot be implemented using MapReduce.
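
A minimal sketch of a custom value type implementing Hadoop's Writable interface; the PowerReading type and its fields are made up for illustration (keys would additionally implement WritableComparable so they can be sorted during the shuffle).

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Writable;

  // A custom value type only has to know how to serialize and deserialize itself.
  public class PowerReading implements Writable {
      private long meterId;
      private double kilowattHours;

      public PowerReading() { }                            // Hadoop requires a no-argument constructor

      public PowerReading(long meterId, double kilowattHours) {
          this.meterId = meterId;
          this.kilowattHours = kilowattHours;
      }

      @Override
      public void write(DataOutput out) throws IOException {    // serialize the fields in a fixed order
          out.writeLong(meterId);
          out.writeDouble(kilowattHours);
      }

      @Override
      public void readFields(DataInput in) throws IOException { // deserialize in the same order
          meterId = in.readLong();
          kilowattHours = in.readDouble();
      }

      public long getMeterId() { return meterId; }
      public double getKilowattHours() { return kilowattHours; }
  }
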
Applications of MapReduce
  • Counting votes across the US by processing data from each polling booth.
  • Aggregating electricity consumption from data points collected across a large geographical area.
  • Word count applications used for document classification, page ranking in web searches, sentiment analysis, etc.
  • Used by Google Maps to calculate nearest neighbour (e.g., find the nearest gas station).
  • Performing statistical aggregate type functions on large data sets.
  • Distributed grep: grepping files across a large cluster (a minimal map-only grep mapper is sketched after this list).
  • Counting the number of href links in web log files for clickstream analysis.
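
As an example of the distributed-grep use case above, here is a minimal map-only mapper that emits every input line matching a regular expression; the grep.pattern configuration key is a made-up convention for this sketch, and no reducer is needed (the job can set the number of reduce tasks to zero).

  import java.io.IOException;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
          // The pattern is passed in through the job configuration (default shown for illustration).
          pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          if (pattern.matcher(value.toString()).find()) {
              context.write(value, NullWritable.get());    // matching lines pass straight to the output
          }
      }
  }
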
Writing and running MapReduce jobs
  • MapReduce jobs are typically written in Java, but can also be written using:
    • Hadoop Streaming: A utility that allows users to create and run MapReduce jobs with any executables (e.g., Linux shell utilities) as the mapper and/or the reducer.
    • Hadoop Pipes: A SWIG-compatible C++ API to implement MapReduce applications.
  • A Hadoop job configuration consists of:
    • Input and output locations on HDFS.
    • Map and reduce functions via implementations of interfaces or abstract classes.
    • Other job parameters.
  • A Hadoop job client then submits the job (jar/executable) and configuration to the ResourceManager in YARN, which distributes them to the workers and performs functions like scheduling, monitoring, and providing status and diagnostic information (a minimal driver illustrating this configuration is sketched below).
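
A minimal driver sketch that pulls the configuration items above together and submits the job to YARN; it reuses the hypothetical WordCountMapper and WordCountReducer from the earlier sketch and takes the HDFS input and output paths as command-line arguments.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "word count");

          job.setJarByClass(WordCountDriver.class);               // jar that ships the job code
          job.setMapperClass(WordCountMapper.class);              // map and reduce functions
          job.setReducerClass(WordCountReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));   // input location on HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location on HDFS

          // Submits the jar and configuration to YARN's ResourceManager and waits for completion.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }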

Glossary

Hadoop
A software ecosystem that enables massively parallel computations distributed across thousands of (commodity) servers in a cluster
Data Lake
A storage repository that holds a vast amount of raw data in its native format until it is needed

See also

References

  1. 4 Vs For Big Data Analytics. 2013-06-31.