Difference between revisions of "Big Data"

From Christoph's Personal Wiki
(Created page with "'''Big Data''' is a term for data sets that are so large or complex that traditional data processing application softwares are inadequate to deal with them. Challenges include...")
 
Line 4:
  ;Volume : Extremely large volumes of data (i.e., peta- or exa-bytes, as of February 2017)
  ;Variety : Various forms of data (structured, semi-structured, and unstructured)
- ;Velocity : Real-time (e.g., IoT, social media, sensors, etc.), batch, streams of data
+ ;Velocity : Real-time (e.g., IoT, social media, sensors, etc.), batch, streams of data. Is usually either human- or machined-generated data.
- ;Veracity or variability : Inconsistent, sometimes inaccurate, varying data
+ ;Veracity or variability : Inconsistent, sometimes inaccurate, varying, or missing data
  
  * Format of Big Data:
Line 15:
  ; Basic analytics : Reporting, dashboards, simple visualizations, slicing and dicing.
  ; Advanced analytics : Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
- ; Operationalized analytics : Embeded big data analytics in a business process to streamline and increase efficiency.
+ ; Operationalized analytics : Embedded big data analytics in a business process to streamline and increase efficiency.
  ; Analytics for business decisions : Implemented for better decision-making, which drives revenue.
  
- Human- vs. machined-generated data.
+ * What is IoT?
+ ** Internet of Things
+ ** Physical objects that are connected to the Internet
+ ** Identified by an IP address (IPv4 now; IPv6 in the future)
+ ** Devices communicate with each other and other Internet-enabled devices and systems
+ ** Includes everyday devices that utilize embedded technology to communicate with an external environment by connecting to the Internet
+ ** IoT data is high volume, high velocity, high variety, and high veracity
+ 
+ * Examples of IoT:
+ ** Security systems
+ ** Thermostats (e.g., Nest)
+ ** Vehicles
+ ** Electronic appliances
+ ** Smart-lighting in households or commercial buildings (e.g., Philips Hue)
+ ** Fitness devices (e.g., Fitbit)
+ ** Sensors to measure environmental parameters (e.g., temperature, humidity, wind, etc.)
  
  ;Hadoop: A software ecosystem that enables massively parallel computations distributed across thousands of (commodity) servers in a cluster

Revision as of 01:21, 26 February 2017

Big Data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

  • The "4 V's of Big Data" (Doug Laney's three V's of volume, variety, and velocity, plus veracity):[1]
Volume 
Extremely large volumes of data (on the order of petabytes or exabytes, as of February 2017)
Variety 
Various forms of data (structured, semi-structured, and unstructured)
Velocity 
Data arriving in real time (e.g., from IoT devices, social media, or sensors), in batches, or as streams. It is usually either human- or machine-generated.
Veracity or variability 
Inconsistent, sometimes inaccurate, varying, or missing data
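
As a rough illustration of the veracity problem, the following Python sketch sorts a handful of readings into missing, implausible, and usable values. The readings and the plausibility bounds are made up for the example and are not taken from any real data set or standard.

```python
# Hypothetical temperature readings in degrees Celsius; None marks a missing value.
readings = [21.5, None, 22.1, -999.0, 21.9, None, 150.0]

# Assumed plausibility bounds for an indoor sensor (an illustration, not a standard).
LOW, HIGH = -40.0, 60.0

missing = [r for r in readings if r is None]
implausible = [r for r in readings if r is not None and not (LOW <= r <= HIGH)]
usable = [r for r in readings if r is not None and LOW <= r <= HIGH]

print(f"missing: {len(missing)}, implausible: {len(implausible)}, usable: {len(usable)}")
```

In practice the bounds and the handling of rejected values depend on the data source; the point is only that such checks have to happen before analysis.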
  • Format of Big Data:
Structured 
Data that has a defined length and format (aka "schema"). Examples include numbers, words, dates, etc. Easy to store and analyse. Often managed using SQL.
Semi-structured
Between structured and unstructured. Does not conform to a fixed schema, but is self-describing and often organized as simple key-value pairs. Examples include JSON, SWIFT (financial transactions), and EDI (healthcare).
Unstructured
Data that does not follow a specific format. Examples include audio, video, images, text messages, etc.
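
To make the three formats more concrete, here is a minimal Python sketch that holds one hypothetical sensor reading in each form. The table name, field names, and values are assumptions chosen for the example.

```python
import json
import sqlite3

# Structured: a fixed schema, stored and queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL, taken_on TEXT)")
conn.execute("INSERT INTO readings VALUES (?, ?, ?)", ("sensor-1", 21.5, "2017-02-26"))
structured_row = conn.execute("SELECT device_id, temperature FROM readings").fetchone()

# Semi-structured: self-describing key-value pairs, no fixed schema.
semi_structured = json.loads('{"device_id": "sensor-1", "temperature": 21.5, "tags": ["indoor"]}')

# Unstructured: free text with no predefined fields.
unstructured = "Sensor 1 in the hallway felt a bit warm this morning."

print(structured_row)
print(semi_structured["tags"])
print(unstructured)
```

The structured row can only hold the columns the table defines, while the JSON record can carry extra keys such as "tags" without any schema change. The free-text note needs further processing before any field can be extracted from it.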
  • Big Data Analytics:
Basic analytics 
Reporting, dashboards, simple visualizations, slicing and dicing.
Advanced analytics 
Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
Operationalized analytics 
Big data analytics embedded in a business process to streamline it and increase efficiency.
Analytics for business decisions 
Implemented for better decision-making, which drives revenue.
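
The "slicing and dicing" mentioned under basic analytics can be sketched in a few lines of Python. The sales records and the helper name total_by are hypothetical; real reporting tools do the same kind of grouping at a much larger scale.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, revenue).
sales = [
    ("EU", "widget", 120.0),
    ("EU", "gadget", 80.0),
    ("US", "widget", 200.0),
    ("US", "gadget", 50.0),
    ("US", "widget", 30.0),
]

def total_by(records, key_index):
    """Sum revenue grouped by the column at key_index (0 = region, 1 = product)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

# Aggregate the same data along two different dimensions.
by_region = total_by(sales, 0)
by_product = total_by(sales, 1)

# Slice: fix one dimension (region) to a single value, then aggregate the rest.
us_only = [r for r in sales if r[0] == "US"]
us_by_product = total_by(us_only, 1)

print(by_region)
print(by_product)
print(us_by_product)
```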
  • What is IoT?
    • Internet of Things
    • Physical objects that are connected to the Internet
    • Identified by an IP address (IPv4 now; IPv6 in the future)
    • Devices communicate with each other and other Internet-enabled devices and systems
    • Includes everyday devices that utilize embedded technology to communicate with an external environment by connecting to the Internet
    • IoT data is high in volume, velocity, and variety, and varies in veracity
  • Examples of IoT:
    • Security systems
    • Thermostats (e.g., Nest)
    • Vehicles
    • Electronic appliances
    • Smart-lighting in households or commercial buildings (e.g., Philips Hue)
    • Fitness devices (e.g., Fitbit)
    • Sensors to measure environmental parameters (e.g., temperature, humidity, or wind)
Hadoop
A software ecosystem that enables massively parallel computations distributed across up to thousands of commodity servers in a cluster.
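
The idea behind this kind of parallel computation can be sketched with a word count in the map/reduce style that Hadoop's MapReduce model popularized. This is plain Python running on one machine and does not use any Hadoop API. The input chunks and the function names map_chunk and merge_counts are assumptions made for the illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical input, already split into chunks; on a cluster each chunk
# would be stored on, and processed by, a different server.
chunks = [
    "big data is big",
    "data about data",
    "big clusters process data",
]

def map_chunk(text):
    """Map step: count the words in one chunk independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# Each map_chunk call depends only on its own chunk, so the calls could run in parallel.
partial_counts = [map_chunk(chunk) for chunk in chunks]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(2))
```

Because the map step never looks at another chunk, adding servers lets more chunks be processed at the same time; only the much smaller partial results need to be moved and merged.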

References

  1. 4 Vs For Big Data Analytics. 2013-06-31.