Difference between revisions of "Big Data"

From Christoph's Personal Wiki

Revision as of 22:27, 5 March 2017

Big Data is a term for data sets that are so large or complex that traditional data-processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

  • The "4 V's of Big Data", which extend Doug Laney's original three V's (volume, velocity, variety):[1]
Volume 
Extremely large volumes of data (e.g., petabytes or exabytes, as of February 2017)
Variety 
Various forms of data (structured, semi-structured, and unstructured)
Velocity 
Real-time data (e.g., IoT, social media, sensors), batch data, or streams of data. Usually either human- or machine-generated.
Veracity or variability 
Inconsistent, sometimes inaccurate, varying, or missing data
  • Format of Big Data:
Structured 
Data that has a defined length and format (aka "schema"). Examples include numbers, words, dates, etc. Easy to store and analyse. Often managed using SQL.
Semi-structured
Between structured and unstructured. Does not conform to a specific format, but is self-describing and typically involves simple key-value pairs. Examples include JSON, SWIFT (financial transactions), and EDI (healthcare).
Unstructured
Data that does not follow a specific format. Examples include audio, video, images, text messages, etc.
  • Big Data Analytics:
Basic analytics 
Reporting, dashboards, simple visualizations, slicing and dicing.
Advanced analytics 
Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
Operationalized analytics 
Big data analytics embedded in a business process to streamline it and increase efficiency.
Analytics for business decisions 
Implemented for better decision-making, which drives revenue.
  • What is IoT?
    • Internet of Things
    • Physical objects that are connected to the Internet
    • Identified by an IP address (IPv4 now; IPv6 in the future)
    • Devices communicate with each other and other Internet-enabled devices and systems
    • Includes everyday devices that utilize embedded technology to communicate with an external environment by connecting to the Internet
    • IoT data is high volume, high velocity, and high variety, and its veracity varies
  • Examples of IoT:
    • Security systems
    • Thermostats (e.g., Nest)
    • Vehicles
    • Electronic appliances
    • Smart-lighting in households or commercial buildings (e.g., Philips Hue)
    • Fitness devices (e.g., Fitbit)
    • Sensors to measure environmental parameters (e.g., temperature, humidity, wind, etc.)
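The difference between semi-structured and structured data described above can be illustrated with a sensor reading. The following is only a minimal sketch: the message layout, the field names (device_id, ts, readings), and the column list are assumptions made up for this example, not a real device format. It takes a self-describing JSON message and flattens it into a row with a fixed schema, which is the form an SQL table would expect.

```python
import json

# Hypothetical reading from an environmental sensor (field names are assumptions).
raw_message = (
    '{"device_id": "sensor-42", "ts": "2017-02-01T12:00:00Z", '
    '"readings": {"temperature": 21.5, "humidity": 40}}'
)

# Fixed schema for the structured form: every row has exactly these columns.
COLUMNS = ("device_id", "ts", "temperature", "humidity", "wind")


def to_row(message: str) -> tuple:
    """Flatten one semi-structured JSON message into a fixed-schema row."""
    doc = json.loads(message)
    readings = doc.get("readings", {})
    flat = {"device_id": doc.get("device_id"), "ts": doc.get("ts"), **readings}
    # Missing measurements become None, so the row always matches COLUMNS.
    return tuple(flat.get(col) for col in COLUMNS)


row = to_row(raw_message)
print(dict(zip(COLUMNS, row)))
```

The JSON message may carry any subset of measurements, while the row always has the same five columns; the sensor in this example reports no wind value, so that column is left empty.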
Cycle of Big Data management
  • Capture data: depending on the problem to be solved, decide on the data sources and the data to be collected.
  • Organize: cleanse, organize, and validate data. If data contains sensitive information, implement sufficient levels of security and governance.
  • Integrate: integrate with business rules and other relevant systems like data warehouses, CRMs, ERPs (Enterprise Resource Planning), etc.
  • Analyze: real-time analysis, batch type analysis, reports, visualizations, advanced analytics, etc.
  • Act: use analysis to solve the business problem.
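The five stages of the cycle above can be sketched as a chain of small functions. This is an illustrative toy, not a real pipeline: the records, the sensor-to-location lookup, and the 22 °C alert threshold are all invented for the example, and each stage would be a separate system in practice.

```python
from statistics import mean


def capture():
    """Capture: collect raw records from the chosen sources (hard-coded here)."""
    return [
        {"sensor": "a", "temp_c": "21.5"},
        {"sensor": "b", "temp_c": "n/a"},  # invalid value, removed in organize()
        {"sensor": "a", "temp_c": "23.5"},
    ]


def organize(records):
    """Organize: cleanse and validate; drop records whose value is not numeric."""
    clean = []
    for rec in records:
        try:
            clean.append({"sensor": rec["sensor"], "temp_c": float(rec["temp_c"])})
        except ValueError:
            continue
    return clean


def integrate(records, locations):
    """Integrate: enrich records with reference data from another system."""
    return [
        {**rec, "location": locations.get(rec["sensor"], "unknown")}
        for rec in records
    ]


def analyze(records):
    """Analyze: a simple batch aggregate, the mean temperature per location."""
    by_location = {}
    for rec in records:
        by_location.setdefault(rec["location"], []).append(rec["temp_c"])
    return {loc: mean(vals) for loc, vals in by_location.items()}


def act(averages, threshold_c=22.0):
    """Act: turn the analysis into a decision, here the locations to alert on."""
    return [loc for loc, avg in averages.items() if avg > threshold_c]


locations = {"a": "server room"}  # stands in for a lookup in another system
averages = analyze(integrate(organize(capture()), locations))
alerts = act(averages)
print(averages, alerts)
```

Keeping each stage as its own function mirrors the cycle: any stage can be replaced (for example, batch analysis by real-time analysis) without touching the others.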
Components of a Big Data infrastructure
  • Redundant physical infrastructure: hardware, storage servers, network, etc.
  • Security infrastructure: maintaining security and governance on data is critical to protect from misuse of Big Data.
  • Data stores: to capture structured, semi-structured, and unstructured data. Data stores need to be fast, scalable, and durable.
  • Organize and integrate data: stage, clean, organize, normalize, transform, and integrate data.
  • Analytics: traditional analytics, including Business Intelligence, and advanced analytics.
  • Data:
    • text, audio, video, etc.
    • social media
    • machine generated
    • human generated
  • Capture:
    • distributed file systems
    • streaming data
    • NoSQL
    • RDBMS
  • Organize and integrate
    • Apache Spark SQL
    • Hadoop MapReduce
    • ETL / ELT
    • data warehouse
  • Analyze
    • predictive analytics
    • advanced analytics
    • social media and text analytics
    • alerts and recommendations
    • visualization, reports, dashboards

Physical infrastructure

  • Your physical infrastructure can make or break your Big Data implementations. It has to support the high volume, high velocity, and high variety of Big Data and be highly available, resilient, and redundant.
  • Requirements to factor while designing the infrastructure include performance, availability, scalability, flexibility, and costs.
  • Networks must be redundant and resilient and have sufficient capacity to accommodate the anticipated volume and velocity of data in addition to normal business data. Your infrastructure should be elastic.
  • Hardware storage and servers must have sufficient computing power and memory to support analytics requirements.
  • Infrastructure operations: Managing and maintaining data centres to avoid catastrophic failure and thus preserve the integrity of data and continuity of business processes.
  • Cloud-based infrastructures allow you to outsource both building and managing the Big Data infrastructure.
Security infrastructure
  • Data access: Same as non-Big Data implementations. Data access is granted only to users who have a legitimate business reason to access the data.
  • Application access: Access to data from applications is governed by restrictions imposed by an API.
  • Data encryption: Encrypting and decrypting high-volume, high-velocity, and high-variability data can be expensive (computationally and to your wallet). An alternative is to encrypt only certain elements of the data that are sensitive and critical.
  • Threat detection: With exposure to social media and mobile data comes increased exposure to threats. Multiple layers of defence for network security are required to protect from security threats.
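The idea of protecting only the sensitive elements of a record, rather than the whole data set, can be sketched as follows. Note the assumptions: the field names and the set of sensitive fields are hypothetical, and because Python's standard library has no encryption primitives, a keyed SHA-256 digest (HMAC) is used here as a stand-in. Unlike real encryption it cannot be reversed, so an actual implementation would swap in a proper encryption library at that point.

```python
import hashlib
import hmac

# Hypothetical field names; which fields count as sensitive is an assumption.
SENSITIVE_FIELDS = {"email", "card_number"}


def protect_value(value: str, key: bytes) -> str:
    """Stand-in for encryption: a keyed SHA-256 digest (one-way, not decryptable)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record: dict, key: bytes) -> dict:
    """Transform only the sensitive fields; copy all other fields unchanged."""
    return {
        name: protect_value(str(value), key) if name in SENSITIVE_FIELDS else value
        for name, value in record.items()
    }


key = b"demo-key-do-not-use-in-production"
record = {"user": "alice", "email": "alice@example.com", "page_views": 17}
protected = protect_record(record, key)
print(protected)
```

Only the email field is transformed; the remaining fields pass through untouched, which is what keeps the cost low for high-volume data.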
Data stores to capture data
  • Data stores are at the core of the Big Data infrastructure and need to be fast, scalable, and highly available (HA).
  • A number of different data stores are available and each is suitable for a set of different requirements.
  • Example data stores include:
    • Distributed file systems (e.g., Hadoop Distributed File System (HDFS))
    • NoSQL databases (e.g., Cassandra, MongoDB)
    • Traditional RDBMSs (e.g., MySQL, Postgres)
  • Real-time streaming data can be ingested using Apache Kafka, Apache Storm, Apache Spark Streaming, etc.
Hadoop
A software ecosystem that enables massively parallel computations distributed across thousands of (commodity) servers in a cluster.
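Hadoop MapReduce, listed earlier under "Organize and integrate", splits such a computation into a map, a shuffle, and a reduce step. The sketch below runs the three steps in a single Python process on two made-up input strings. It does not use any Hadoop API and only illustrates the programming model; on a cluster, each input split would be mapped on a different server.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key, here by summing the counts."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["big data is big", "data stores hold data"]
# On a cluster each split would be mapped on a different node; here we loop.
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)
```

Because each call to map_phase depends only on its own split, the map step can be distributed freely; only the shuffle needs to bring values for the same key together.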

References

  1. 4 Vs For Big Data Analytics. 2013-06-31.