Big Data

From Christoph's Personal Wiki
Revision as of 00:01, 26 February 2017 by Christoph


Big Data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

  • The "4 V's of Big Data", which extend Doug Laney's original three V's (volume, velocity, and variety) with veracity:[1]
Volume 
Extremely large volumes of data (on the order of petabytes or exabytes, as of February 2017)
Variety 
Various forms of data (structured, semi-structured, and unstructured)
Velocity 
The speed at which data arrives and must be processed: in real time (e.g., IoT, social media, sensors), in batches, or as continuous streams
Veracity or variability 
Inconsistent, sometimes inaccurate, varying data
  • Format of Big Data:
Structured 
Data that has a defined length and format (aka "schema"), such as numbers, words, and dates. Easy to store and analyse. Often managed using SQL.
Semi-structured
Sits between structured and unstructured data. It does not conform to a fixed schema, but it is self-describing, often through simple key-value pairs. Examples include JSON, SWIFT messages (financial transactions), and EDI (e.g., in healthcare).
Unstructured
Data that does not follow a specific format, such as audio, video, images, and text messages.
  • Big Data Analytics:
Basic analytics 
Reporting, dashboards, simple visualizations, slicing and dicing.
Advanced analytics 
Complex analytics models using machine learning, statistics, text analytics, neural networks, data mining, etc.
Operationalized analytics 
Big data analytics embedded in a business process to streamline it and increase efficiency.
Analytics for business decisions 
Analytics implemented to support better decision-making, which in turn can drive revenue.
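
The three data formats listed above (structured, semi-structured, unstructured) can be illustrated with a minimal Python sketch. The sample records, the field names, and the describe_format helper are hypothetical and chosen only for illustration; the type-based check is a toy heuristic, not a general way to detect data formats.

```python
import json

# Structured: a fixed schema, every record has the same fields in the same order.
# Assumed (hypothetical) schema: (date, device_id, temperature_c)
structured_row = ("2017-02-26", "sensor-17", 21.5)

# Semi-structured: no fixed schema, but self-describing key-value pairs.
# Another record could carry different keys without breaking anything.
semi_structured = json.loads(
    '{"device": "sensor-17", "temp_c": 21.5, "tags": ["lab", "roof"]}'
)

# Unstructured: free text (or audio, video, images) with no inherent fields.
unstructured = "Roof sensor seemed a bit warm this morning, maybe 21 or 22 degrees."


def describe_format(record):
    """Toy heuristic: classify a record by its Python type only."""
    if isinstance(record, tuple):
        return "structured"
    if isinstance(record, dict):
        return "semi-structured"
    return "unstructured"


for record in (structured_row, semi_structured, unstructured):
    print(describe_format(record), "->", record)
```

The structured row can be loaded into an SQL table as-is, the JSON record has to be interpreted key by key, and the free text needs text analytics before any value can be extracted from it.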

Human- vs. machine-generated data.

Hadoop
A software ecosystem that enables massively parallel computations distributed across thousands of (commodity) servers in a cluster.
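
Hadoop's classic programming model is MapReduce. The following is a minimal sketch of that idea in plain Python, run in a single process. It does not use any Hadoop API; the function names and the sample chunks are hypothetical. Each chunk stands in for a block of input that a real cluster would hand to a different server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a cluster, each map_phase call could run on a different server;
    # here the calls simply run one after another.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle_phase(mapped))


chunks = ["big data is big", "data about data"]
counts = word_count(chunks)
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because the map step looks at each chunk independently, the work can be spread over many machines, and only the grouped intermediate pairs need to travel across the network before the reduce step.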

References

  1. 4 Vs For Big Data Analytics. 2013-06-31.