Big Data
a big hype topic
everything is big data
everyone wants to work with big data
Wikipedia: "... a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing applications ..."
access to many different data sources (Internet)
storage is cheap - store everything
today, CIOs are interested in the power of all their data
a lot of different and complex data has to be stored
NoSQL
NoSQL: databases with less constrained consistency models
schema-less
MongoDB:
  - open-source, cross-platform, document-oriented database system
  - most popular NoSQL database system
  - supported by MongoDB Inc.
  - stores structured data as JSON-like documents with dynamic schemas
  - MongoDB as a German / European service
http://www.mongodb.org http://www.mongosoup.de
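
To illustrate what "JSON-like documents with dynamic schemas" means, here is a minimal sketch in plain Python. It does not talk to a MongoDB server: the list `collection` and the helper `find` are hypothetical stand-ins for illustration, not the MongoDB API. The point is only that two documents in the same collection may carry different fields.

```python
import json

# Hypothetical in-memory "collection": a plain list standing in for a
# MongoDB collection (assumption, no server or driver involved).
collection = []

# Two documents with different fields; no schema is declared up front.
collection.append({"_id": 1, "name": "Ada", "languages": ["R", "Java"]})
collection.append({"_id": 2, "name": "Linus",
                   "employer": {"name": "ACME", "city": "Berlin"}})

def find(coll, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [doc for doc in coll
            if all(doc.get(key) == value for key, value in criteria.items())]

matches = find(collection, name="Ada")
print(json.dumps(matches, indent=2))
```

A real deployment would use a MongoDB driver instead of a list, but the shape of the data, nested and varying from document to document, would be the same.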
Hadoop
open-source software framework designed to support large-scale data
processing
Map Reduce: a computational paradigm
  - the application is divided into many small fragments of work
HDFS: Hadoop Distributed File System
  - a distributed file system that stores data on the compute nodes
the Ecosystem: Hive, Pig, Flume, Mahout, ...
written in Java, opened up to alternatives by its Streaming API
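
The Map Reduce paradigm can be sketched with the classic word count. This is a single-process illustration in Python, not a Hadoop job: the function names `mapper`, `reducer` and `word_count` are assumptions made for this example, and the `sorted` call stands in for the shuffle phase that Hadoop performs between the two steps. With the Streaming API, mapper and reducer would instead be separate programs that read lines from standard input and write key/value lines to standard output.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(sorted_pairs):
    """Reduce step: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def word_count(lines):
    # Sorting stands in for the shuffle phase that brings equal keys together.
    return dict(reducer(sorted(mapper(lines))))

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Each input line can be mapped independently, which is what allows the work to be split into many small fragments and spread over a cluster.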
HDFS and Hadoop cluster
HDFS is a block-structured file system
  - blocks are stored across a cluster of one or more machines with data
    storage capacity: DataNodes
  - data is accessed in a write-once, read-many model
HDFS comes with its own utilities for file management
the HDFS file system stores its metadata reliably: NameNode
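
The division of labour between NameNode and DataNodes can be pictured with a toy model. The sketch below is not HDFS code: the tiny block size, the node names, the round-robin placement and the functions `store_file` and `read_file` are all assumptions chosen for illustration, and block replication is left out entirely.

```python
BLOCK_SIZE = 4  # bytes; illustrative only, real HDFS blocks are far larger
DATA_NODES = ["datanode-1", "datanode-2", "datanode-3"]  # hypothetical names

def store_file(name, data, namenode, datanodes):
    """Split data into fixed-size blocks and record where each block lives."""
    if name in namenode:
        # Write once: an existing file is never overwritten in this model.
        raise ValueError("file already exists: " + name)
    locations = []
    for i, offset in enumerate(range(0, len(data), BLOCK_SIZE)):
        block_id = f"{name}#{i}"
        node = DATA_NODES[i % len(DATA_NODES)]  # round-robin, an assumption
        datanodes.setdefault(node, {})[block_id] = data[offset:offset + BLOCK_SIZE]
        locations.append((block_id, node))
    namenode[name] = locations  # metadata only: block ids and their nodes

def read_file(name, namenode, datanodes):
    """Read many: look up the block list, then fetch each block in order."""
    return b"".join(datanodes[node][block_id]
                    for block_id, node in namenode[name])

namenode, datanodes = {}, {}
store_file("log.txt", b"hello hdfs!", namenode, datanodes)
print(namenode["log.txt"])
print(read_file("log.txt", namenode, datanodes))
```

The `namenode` dictionary holds no file contents, only the mapping from file name to blocks and nodes, which mirrors the role of the NameNode as the keeper of metadata.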
Example: RStudio