Big Data

From Christoph's Personal Wiki
Revision as of 23:49, 5 March 2017 by Christoph (Talk | contribs)

Jump to: navigation, search

Big Data is a term for data sets that are so large or complex that traditional data processing application softwares are inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

Big Data terms and concepts

  • Doug Laney's "4 V's of Big Data":[1]
Volume 
Extremely large volumes of data (i.e., peta- or exa-bytes, as of February 2017)
Variety 
Various forms of data (structured, semi-structured, and unstructured)
Velocity 
Real-time (e.g., IoT, social media, sensors, etc.), batch, streams of data. Is usually either human- or machined-generated data.
Veracity or variability 
Inconsistent, sometimes inaccurate, varying, or missing data
  • Format of Big Data:
Structured 
Data that has a defined length and format (aka "schema"). Examples include numbers, words, dates, etc. Easy to store and analyse. Often managed using SQL.
Semi-structured
Between structured and unstructured. Does not conform to a specific format, but is self-describing and involving simple key-value pairs. Examples include JSON, SWIFT (financial transactions), and EDI (healthcare).
Unstructured
Data that does not follow a specific format. Examples include audio, video, images, text messages, etc.
  • Big Data Analytics:
Basic analytics 
Reporting, dashboards, simple visualizations, slicing and dicing.
Advanced analytics 
Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
Operationalized analytics 
Embedded big data analytics in a business process to streamline and increase efficiency.
Analytics for business decisions 
Implemented for better decision-making, which drives revenue.
  • What is IoT?
    • Internet of Things
    • Physical objects that are connected to the Internet
    • Identified by an IP address (IPv4 now; IPv6 in the future)
    • Devices communicate with each other and other Internet-enabled devices and systems
    • Includes everyday devices that utilize embedded technology to communicate with an external environment by connecting to the Internet
    • IoT data is high volume, high velocity, high variety, and high veracity
  • Examples of IoT:
    • Security systems
    • Thermostats (e.g., Nest)
    • Vehicles
    • Electronic appliances
    • Smart-lighting in households or commercial buildings (e.g., Philips Hue)
    • Fitness devices (e.g., Fitbit)
    • Sensors to measure environmental parameters (e.g., temperature, humidity, wind, etc.)
Cycle of Big Data management
  • Capture data: depending on the problem to be solved, decide on the data sources and the data to be collected.
  • Organize: cleanse, organize, and validate data. If data contains sensitive information, implement sufficient levels of security and governance.
  • Integrate: integrate with business rules and other relevant systems like data warehouses, CRMs, ERPs (Enterprise Resource Planning), etc.
  • Analyze: real-time analysis, batch type analysis, reports, visualizations, advanced analytics, etc.
  • Act: use analysis to solve the business problem.
Components of a Big Data infrastructure
  • Redundant physical infrastructure: hardware, storage servers, network, etc.
  • Security infrastructure: maintaining security and governance on data is critical to protect from misuse of Big Data.
  • Data stores: to capture structured, semi-structured, and un-structured data. Data stores that need to be fast, scalable, and durable.
  • Organize and integrate data: stage, clean, organize, normalize, transform, and integrate data.
  • Analytics: traditional and including Business Intelligence and advanced analytics.
  • Data:
    • text, audio, video, etc.
    • social media
    • machine generated
    • human generated
  • Capture:
    • distributed file systems
    • streaming data
    • NoSQL
    • RDBMS
  • Organize and integrate
    • Apache Spark SQL
    • Hadoop MapReduce
    • ETL / ELT
    • data warehouse
  • Analyze
    • predictive analytics
    • advanced analytics
    • social media and text analytics
    • alerts and recommendations
    • visualization, reports, dashboards
Physical infrastructure
  • You physical infrastructure can make or break your Big Data implementations. Has to support high-volume, high-velocity, high-variety of Big Data and be highly available, resilient, and redundant.
  • Requirements to factor while designing the infrastructure include performance, availability, scalability, flexibility, and costs.
  • Networks must be redundant and resilient and have sufficient capacity to accommodate the anticipated volume and velocity of data in addition to normal business data. You infrastructure should be elastic.
  • Hardware storage and servers must have sufficient computing power and memory to support analytics requirements.
  • Infrastructure operations: Managing and maintaining data centres to avoid catastrophic failure and thus preserve the integrity of data and continuity of business processes.
  • Cloud based infrastructures allow outsourcing of building Big Data infrastructure and managing the infrastructure.
Security infrastructure
  • Data access: Same as non-Big Data implementations. Data access is granted only to users who have legitimate business reason to access the data.
  • Application access; Accessing data from applications is defined by restrictions imposed by an API.
  • Data encryption: Encrypting and decrypting data for high-volume, high-velocity, and high-variability can be expensive (computationally and to your wallet). An alternative is to encrypt only certain elements of the data that are sensitive and critical.
  • Threat detection: With exposure to social media and mobile data comes increase exposure to threats. Multiple layers of defence for network security are required to protect from security threats.
Data stores to capture data
  • Data stores are at the core of the Big Data infrastructure and need to be fast, scalable, and highly available (HA).
  • A number of different data stores are available and each is suitable for a set of different requirements.
  • Example data stores include:
    • Distributed file systems (e.g., Hadoop Distributed File System (HDFS))
    • NoSQL databases (e.g., Cassandra, MongoDB)
    • Traditional RDBMs (e.g., MySQL, Postgres)
  • Real-time streaming data can be ingested using Apache Kafka, Apache Storm, Apache Spark streaming, etc.
Distributed file systems

A shared file system mounted on several nodes of a cluster, that has the following characteristics:

  • Access transparency: Can be accessed the same way as local files without knowing its exact location.
  • Concurrency transparency: All clients have the same view of the files.
  • Failure transparency: Clients should have a correct view after a server failure.
  • Scalability transparency: Should work for small loads and scale for bigger loads on addition of nodes.
  • Replication transparency: To support fault tolerance and high availability, replication across multiple servers must be transparent to the client.
Relational Database Management systems (RDBMs)

Relational databases are easy to store and query data and follow the ACID principle:

  • Atomicity: A transaction is all-or-nothing. If any part of the transaction fails, the entire transaction fails.
  • Consistency: Any transaction will bring a database from one valid state to another.
  • Isolation: Ensures that concurrent transactions executed result in a state of the system similar to if transactions were executed serially.
  • Durability: Ensures that once a transaction is committed, it remains so, irrespective of a database crash, power, or other errors.
NoSQL databases

Not only SQL (NoSQL) implies that there is more than one mechanism of storing data depending on the needs. NoSQL databases are mostly open source and built to handle the requirements of fast, scalable processing of unstructured data. Types of NoSQL databases are:

  • Document-oriented data stores are mainly designed to store and retrieve collections of documents and support complex data forms in several standard formats, such as JSON, XML, and binary forms (e.g., PDFs or MS Word documents). Examples: MongoDB, CouchDB.
  • A column-oriented database stores its content in columns instead of rows, with attribute values belonging to the same column stored contiguously. Examples: Cassandra, Hbase.
  • A graph database is designed to store and represent data that utilize a graph model with nodes, edges, and properties related to one another through relations. Examples: Neo4j.
  • Key-value stores data blobs referenced by a key. This is the simplest form of NoSQL and is high-performing. Examples: Memcached, Couchbase.
Extract Transform Load (ETL)
  • Extract: Read data from the data source.
  • Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database.
  • Load: Write data to the target database.

Apache Hadoop's MapReduce is a popular choice for batch processing large volumes of data in a scalable and resilient style. Apache Spark is more suitable for applying complex analytics using machine learning models in an interactive approach.

Data Warehouse

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workloads from transaction workloads and enables an organization to consolidate data from several sources.

Properties of a data warehouse include:

  • Organized by subject are. It contains an abstracted view of the business.
  • Highly transformed and structured.
  • Data loaded in the data warehouse is based on a strictly defined use case.
Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.

Data lakes can be built on Hadoop's HDFS or in the Cloud using Amazon S3, Google Cloud Storage, etc.

Differences between a Data Lake and a Data Warehouse
  • Data Lake:
    • Retains all data.
    • Data types consist of data sources, such as web server logs, sensor data, social network activity, text and images, etc.
    • Processing of data follows "schema-on-read".
    • Extremely agile and can adapt to changes in business requirements quickly.
    • Harder to secure a data lake, but a top priority for data lake providers (as of early 2017).
    • Hardware to support a data lake consists of commodity servers connected in a cluster.
    • Used for advanced analytics, typically by data scientists.
  • Data Warehouse:
    • Only data that is required for the business use case and data that is highly modelled and structured.
    • Consists of data extracted from transactional systems and consist of quantitative metrics and the attributes.
    • Processing of data follows "schema-on-write".
    • Due to its structured modelling, it is time consuming to modify business processes.
    • An older technology and mature in security.
    • Hardware consists of enterprise-grade servers and vendor provided data warehousing software.
    • Used for operational analytics to report on KPIs and slices of the data.
Analyzing Big Data

Examples of analytics:

  • Predictive analytics: Using statistical models and machine learning algorithms to predict the outcome of a task.
  • Advanced analytics: Applying deep learning for applications like speech recognition, face recognition, genomics, etc.
  • Social media analytics: Analyze social media to determine market trends, forecast demand, etc.
  • Text analytics: Derive insights from text using Natural Language Processing (NLP).
  • Alerts and recommendations: Fraud detection and recommendation engines embedded within commerce applications.
  • Reports, dashboards, and visualizations: Description analytics providing answers on what, when, where type questions.
Big Data in the Cloud
Comparison of Big Data Cloud platforms
Amazon Google Microsoft
Big Data storage S3 Cloud Storage Azure Storage
MapReduce Elastic MapReduce (EMR) N/A (MapReduce replaced by Cloud Dataflow) Hadoop on Azure, HDInsight
Big Data analytics EMR with Apache Spark and Presto Cloud Dataproc, Cloud DataLab HDInsight
Relational Database RDS Cloud SQL SQL Database as a Service
NoSQL database DynamoDB or EMR with Apache Hbase Cloud BigTable, Cloud Datastore DocumentDB, TableStorage
Data warehouse Redshift BigQuery SQL Data warehouse as a Service
Stream processing Kinesis Cloud Dataflow Azure Stream analytics
Machine Learning Amazon ML Cloud Machine Learning Services, Vision API, Speech API, Natural Language API, Translate API Azure ML, Azure Cognitive Services


Glossary

Hadoop
A software ecosystem that enables massively parallel computations distributed across thousands of (commodity) servers in a cluster
Data Lake

References

  1. 4 Vs For Big Data Analytics. 2013-06-31.