Big Data

"No formal definition!"

"Big Data is any data that is expensive to manage and hard to extract value from"

"Big" is relative! "Difficult Data" is perhaps more apt!

"Big data have a way of being used for purposes other than originally intended"

"Predates the rise of the Internet era and influences current topics in ethics, privacy, and validation"

Big Data

Where does big data come from?

"the necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences - arguably the key scientific theme of our times"

Big Data

Three challenges

  • Volume
    the size of the data
  • Velocity
    the latency of data processing relative to the growing demand for interactivity: how fast the data arrives compared with how fast it needs to be consumed
  • Variety
    the diversity of sources, formats, quality, structures

A technology-oriented view of Big Data

Data Analysis

  • Business Intelligence (BI) is associated with data warehouses, and with the dashboards and reports that consume data from the warehouse to answer particular questions. It requires upfront effort to design and build, and is therefore not very adaptable when requirements change. BI tools are mostly for business analysts to make decisions with.

  • With Big Data, analysis needs to be more flexible, and data analysts need to make business decisions too.

Diverse Data Sources

  • Beyond relational data models and data management, data may come as video, audio, text, or graphs

Statistics and Visualization

  • Both are very much needed to devise solutions that scale up

Machine Learning

  • Choosing the right model is the heart of Big Data analysis; however, getting and tidying the data is often the biggest challenge

Big Data

What does it look like?

  • Different data formats:
    XML, JSON, fixed-width text records (header and detail), delimited, compressed
  • Media formats: images, video, audio
  • Web or HTML format:
    social media, gov
  • Databases
  • Spreadsheets (rarely organized?)

Getting and cleaning data is often the biggest task, more than the (machine learning) algorithm
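
To make the "getting and cleaning" point concrete, here is a minimal Python sketch that reads the same two records from four of the formats listed above (JSON, delimited, fixed-width, XML) and normalizes them into one common shape. The field names (id, name), the 4-character id width, the <person> element, and all sample data are made up for illustration; none of them come from the slides.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def clean(raw_id, raw_name):
    """Normalize one record: integer id, whitespace-trimmed name."""
    return {"id": int(raw_id), "name": raw_name.strip()}


def from_json(text):
    """Parse a JSON array of objects."""
    return [clean(r["id"], r["name"]) for r in json.loads(text)]


def from_delimited(text, delimiter=","):
    """Parse delimited text whose first row is a header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [clean(r["id"], r["name"]) for r in reader]


def from_fixed_width(text, id_width=4):
    """Parse fixed-width lines: the first id_width characters are the id."""
    return [clean(line[:id_width], line[id_width:])
            for line in text.splitlines() if line.strip()]


def from_xml(text):
    """Parse <person id="...">name</person> elements (hypothetical schema)."""
    root = ET.fromstring(text)
    return [clean(e.get("id"), e.text or "") for e in root.iter("person")]


# Hypothetical sample inputs, each encoding the same two records.
json_text = '[{"id": "1", "name": " Ada "}, {"id": 2, "name": "Grace"}]'
csv_text = "id,name\n1, Ada\n2,Grace \n"
fixed_text = "0001Ada       \n0002Grace     \n"
xml_text = '<people><person id="1">Ada</person><person id="2"> Grace </person></people>'

for parse, text in [(from_json, json_text), (from_delimited, csv_text),
                    (from_fixed_width, fixed_text), (from_xml, xml_text)]:
    print(parse.__name__, parse(text))
```

Even in this toy case most of the code deals with format quirks (string ids, stray whitespace, padding) rather than analysis, which is the point the slide makes.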

Big Data

What is needed to create value


Parallel Distributed Platforms

Scale up & scale out: in-core or out-of-core computing
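
As a rough illustration of the out-of-core idea, the Python sketch below computes a mean by streaming over the input one line at a time, so memory use stays constant however large the input is. The function name running_mean and the in-memory sample stream are assumptions for illustration; in practice the lines would come from a file too large to load at once.

```python
import io


def running_mean(lines):
    """Return (count, mean) for one number per line, holding only two values in memory."""
    count = 0
    mean = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count += 1
        # Incremental update: shift the mean toward the new value by 1/count.
        mean += (float(line) - mean) / count
    return count, mean


# Stand-in for a large file: any iterable of lines works, e.g. open("data.txt").
stream = io.StringIO("\n".join(str(i) for i in range(1, 101)))
count, mean = running_mean(stream)
print(count, mean)
```

An in-core version would read every value into a list first; that is simpler, but it only works while the data fits in memory.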


Big Data

Implementation choices : Scale up & Scale Out

  • In-memory, in-core computing:
  • NoSQL, MapReduce, parallel computing:
  • Information Visualization : Tableau, R Plots, Lumira
  • Regression, Relational Algebra Computing:
  • Desktop Computing:
  • Cloud, distributed:
  • OLTP & OLAP :

Machine learning

Key Steps

Question -> Data -> Characteristics -> Algorithm -> Parameters -> Evaluation
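
The key steps can be walked through on a toy problem. The Python sketch below is only an illustration under stated assumptions: the question, the made-up data near y = 2x + 1, the choice of least-squares line fitting, and the reading of "Characteristics" as input features are all assumptions, not taken from the slides.

```python
def fit_line(xs, ys):
    """Algorithm: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Evaluation: average squared prediction error."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Question: can y be predicted from x?
# Data: made-up points near y = 2x + 1 with a small alternating offset.
data = [(x, 2 * x + 1 + (0.1 if x % 2 else -0.1)) for x in range(20)]

# Characteristics: a single feature, x. Hold out the last 5 points for evaluation.
train, test = data[:15], data[15:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Parameters: slope and intercept estimated from the training data only.
slope, intercept = fit_line(train_x, train_y)

# Evaluation: error on data the model has not seen.
mse = mean_squared_error(test_x, test_y, slope, intercept)
print(slope, intercept, mse)
```

Evaluating on held-out points rather than on the training data is the design choice that matters here: it estimates how the fitted parameters behave on new data.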



Big Data


  • MapReduce was invented at Google [Dean & Ghemawat, OSDI'04]
  • Hadoop = open source implementation
  • Data stored on HDFS distributed file system
  • Programmers write Map and Reduce functions
  • Framework provides automated parallelization and fault tolerance

Source: Huy Vo
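
To show what "write Map and Reduce functions" means, here is a single-process Python sketch of the programming model applied to word counting. It is not the Hadoop API: the names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and the parallelization and fault tolerance that a real framework provides are deliberately left out.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big ideas"]))
```

Because each map call sees one document and each reduce call sees one key, a framework can run many of them on different machines at once; only the shuffle step needs to move data between them.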

Big Data

Project Rules

  1. Don't start a Big Data project without understanding the value - demonstrate ROI

  2. Don't ignore the wider Enterprise story - draw & merge data across an organisation to create insights.

  3. A big data project is not a technology project - treat it as more about Business Change

  4. It pays to have the right type of Data Scientists on board: by qualification or by experience.

  5. Build a support structure from the beginning - building blocks such as Data Quality, Data Governance and Metadata

  6. Build for tomorrow - understand the future of the organisation

  7. Involve the organisation - to contribute to the vision