14 Big Data

What is Big Data

Big Data: Data of sizes beyond the ability of traditional software tools to quickly capture, curate, manage, and process

  • unstructured (usually), semi-structured, and structured data (examples below)
  • Data Lake: a repository holding raw data, including unstructured data
    • unstructured data challenges:
      • not easy to store in a database
      • not easy to edit, search, or analyze
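
For concreteness, a minimal sketch of the three shapes data can take; the records below are made up for illustration:

```python
# Structured: fixed schema; fits naturally as a relational table row.
structured_row = {"user_id": 17, "age": 31, "country": "KR"}

# Semi-structured: self-describing but with a flexible schema (e.g., JSON logs).
semi_structured = {
    "user_id": 17,
    "events": [{"type": "click", "ad_id": "a9"},
               {"type": "purchase", "amount": 12.5}],
}

# Unstructured: no schema at all; hard to store in a DB, edit, search, or analyze directly.
unstructured = "Loved the product, but shipping took two weeks and the box was damaged."

print(structured_row, semi_structured, unstructured, sep="\n")
```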

The 3Vs

  • The 3Vs: Velocity, Volume, Variety

Big Data comes from:

  • clicks, ads, server requests, transactions, etc.
  • Facebook's daily logs > 50 TB
  • YouTube's DB cost > ad income
  • User Generated Content (Web & Mobile)
    • Facebook, Instagram, Google…etc
  • Health and Scientific Computing
  • Graph Data (maps, social networks, telecom networks…)
  • Log Files (Apache Web Server Log, Machine Syslog File)
  • Aerospace: 25,000 sensors to optimize plane maintenance
  • Autonomous Vehicles: 1.4–19 TB per hour per car
  • e-Commerce: using consumer data to enhance profit
  • Precision Medicine

  • \(\mathrm{B} \longrightarrow \mathrm{KB} \longrightarrow \mathrm{MB} \longrightarrow \mathrm{GB} \longrightarrow \mathrm{TB} \longrightarrow \mathrm{PB} \longrightarrow \mathrm{EB} \longrightarrow \mathrm{ZB} \longrightarrow \mathrm{YB}\)
    • projected to reach 40 EB/year by 2025

Data in Organizations

  • Data Warehouse: collects and organizes data from multiple sources
    • data is periodically ETL'd into the data warehouse
    • ETL = Extracted, Transformed, Loaded (a minimal sketch follows this list)
  • Optimizing Data:
    • drop columns that are not necessary
    • partition tables
    • Dimension Tables: model the data as a multidimensional ‘cube’

    • Star Schema, Snowflake Schema

      • Snowflake Schema: many small normalized tables, all connected (can be quite challenging to analyze)
  • OLAP: Online Analytical Processing (BI)
    • users interact with multidimensional data using tools such as Tableau or Excel Pivot Tables
  • Dealing with semi-structured or unstructured data
    • enable data consumers to choose how to transform & use the data
    • PROBLEM: the dark side of Data Lakes (the data becomes noisy and not 100% accurate, i.e. dirty data)
    • \(\Rightarrow\) Data Analysts required (data standardization is necessary)
  • Big Data Problems
    1. Data structuring (standardizing the data)
    2. Expensive storage and compute (CPUs, hard drives)
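
A minimal sketch of ETL into a tiny star schema (one fact table, one dimension table) and an OLAP-style aggregation, using Python's built-in sqlite3; the table names, rows, and query are made up for illustration, not taken from any real warehouse.

```python
import sqlite3

# Hypothetical rows "extracted" from an operational system.
raw_orders = [
    # (order_id, date, product, category, amount)
    (1, "2024-01-05", "keyboard", "electronics", 49.0),
    (2, "2024-01-05", "banana",   "grocery",      1.2),
    (3, "2024-01-06", "keyboard", "electronics", 49.0),
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Load" target: a tiny star schema -- one fact table plus one dimension table.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE fact_sales (order_id INTEGER, date TEXT, product_id INTEGER, amount REAL)")

# "Transform": derive the product dimension from the raw rows.
products = {}
for _, _, name, category, _ in raw_orders:
    products.setdefault(name, (len(products) + 1, category))

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(pid, name, cat) for name, (pid, cat) in products.items()])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(oid, date, products[name][0], amount)
                 for oid, date, name, _, amount in raw_orders])

# OLAP-style question a BI tool (Tableau, Excel Pivot) might ask: revenue per category.
for row in cur.execute("""SELECT p.category, SUM(f.amount)
                          FROM fact_sales f JOIN dim_product p USING (product_id)
                          GROUP BY p.category"""):
    print(row)  # ('electronics', 98.0) and ('grocery', 1.2)
```

In a snowflake schema the dimension table itself would be further normalized (e.g., category split into its own table), which is why it tends to be harder to query and analyze.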

Unstructured Datasets

  • Requirements:
    • Handle large files spanning multiple computers
    • Use cheap commodity devices that fail frequently
    • Process distributed data quickly and easily
  • Solutions:
    1. Distributed File Systems
    2. Distributed Computing

Distributed File Systems

\(Q.\): How do we store and access very large files across distributed, cheap commodity devices?

  • split files into blocks and replicate them \(\rightarrow\) safer when a machine fails, since the lost blocks can be restored from replicas (sketched below)
    • utilized by Google (the Google File System)
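
A minimal sketch of the split-and-replicate idea, not any real file system's API; the block size, node names, and hash-based placement below are assumptions chosen only for illustration (real systems use blocks of tens of MB and smarter placement policies).

```python
import hashlib

BLOCK_SIZE = 4          # bytes per block (real systems use ~64-128 MB)
REPLICATION = 3         # copies kept of each block
NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    # Split a large file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id: int, nodes=NODES, replication: int = REPLICATION):
    # Choose `replication` distinct nodes for a block; a simple hash-based
    # assignment stands in for a real placement policy.
    start = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

file_bytes = b"very large file contents"
blocks = split_into_blocks(file_bytes)
placement = {i: place_replicas(i) for i in range(len(blocks))}

for i, nodes_for_block in placement.items():
    print(f"block {i} ({blocks[i]!r}) -> {nodes_for_block}")

# If any single node fails, every block it held still exists on two other
# nodes, so the whole file can be reconstructed.
```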

Distributed Computing

  • Interacting with the data (Request / Response data samples)

  • MapReduce: distributed aggregation
  • Example scenario: counting the number of occurrences of each word in all the books, using a team of people
    • distribute the books among the people
    • each person counts the words in their own books (map)
    • divide and combine: merge the per-person counts into the final totals (reduce)
    • a minimal sketch follows this list
  • used by Hadoop and Spark
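
A minimal single-machine sketch of the word-count map/reduce pattern in Python; the "books" and the sequential loops stand in for data and workers that would really be spread across a cluster (this is not Hadoop or Spark code).

```python
from collections import Counter
from functools import reduce

# Toy "books"; in practice these would be large files on different machines.
books = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks and the fox runs",
]

def map_phase(book):
    # Each worker counts the words in its own book (the "map" step).
    return Counter(book.split())

def reduce_phase(counts_a, counts_b):
    # Partial counts are merged pairwise (the "reduce" step).
    return counts_a + counts_b

partial_counts = [map_phase(book) for book in books]   # would run in parallel on a cluster
total_counts = reduce(reduce_phase, partial_counts, Counter())

print(total_counts.most_common(3))  # e.g. [('the', 4), ('fox', 2), ('dog', 2)]
```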