Sunday, July 10, 2011

Big Data - Tools

There has been a great deal of innovation in tools for managing and using large amounts of data.

Several storage systems target this market, such as EMC Isilon, IBM SONAS and HP X9000. NetApp recently announced its E-Series range of storage devices.

Storage software includes relational databases like Teradata, which is row-based. There are also column-oriented databases, which store content by column rather than by row. They are advantageous for queries that read a large number of rows but only a small subset of columns. Examples include HP Vertica, EMC Greenplum (which supports column-oriented tables) and Oracle Exadata (which offers hybrid columnar compression). One of the new trends in big data is a move away from traditional RDBMS software. A new class of "NoSQL" (Not Only SQL) alternatives has emerged; these do not require fixed schemas and are highly distributed and horizontally scalable. Examples of such databases are Cassandra, CouchDB and MongoDB.
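To make the row versus column distinction concrete, here is a minimal in-memory sketch in Python. It is not how any of the products above are implemented; the sample records and the helper names (to_columns, sum_row_store, sum_column_store) are made up for illustration. It only shows why an aggregate over one column touches less data when values are stored column by column.

```python
# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"id": 1, "name": "alice", "country": "US", "amount": 10.0},
    {"id": 2, "name": "bob", "country": "IN", "amount": 25.5},
    {"id": 3, "name": "carol", "country": "US", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping column name to a list of values."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_store(records, field):
    """Row layout: every whole record is visited to pick out one field."""
    return sum(record[field] for record in records)


def sum_column_store(columns, field):
    """Column layout: only the list holding the requested field is read."""
    return sum(columns[field])


columns = to_columns(rows)
total_from_rows = sum_row_store(rows, "amount")
total_from_columns = sum_column_store(columns, "amount")
```

Both functions return the same total. The difference is in how much data each one has to walk through, which is what matters once a table has billions of rows and hundreds of columns.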

Big datasets are analyzed and processed not on a single powerful computer but on a network of commodity servers clustered together. The most popular framework in this realm is Hadoop. Inspired by Google's MapReduce and Google File System (GFS), it was created by Doug Cutting and Mike Cafarella, developed in large part at Yahoo, and is now an open source project managed by Apache. Hadoop uses a highly scalable, distributed file system, HDFS, to store data, and MapReduce to process that data in parallel across a cluster of nodes. In addition to HDFS and MapReduce, there are other useful sub-projects associated with Hadoop, such as HBase, Hive and Pig. Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying in a language similar to SQL. Pig is a high-level data-flow language and execution framework for parallel computation.
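The MapReduce model itself is simple enough to sketch in a few lines of Python. The example below is a single-process illustration of the map, shuffle and reduce steps using the classic word count; it does not use the Hadoop API, and the function names and sample documents are assumptions made for the sketch. On a real cluster, the map and reduce calls would run in parallel on different nodes, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence within a single process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; on a cluster each document could live on a different node.
documents = ["big data tools", "big data is big"]
counts = word_count(documents)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across as many machines as are available.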

Hadoop has been the framework of choice for big data implementations at many companies, including Yahoo, Facebook, eBay and Twitter. There is a healthy ecosystem of commercial vendors that provide tools and services based on the Hadoop platform, such as Cloudera, Datameer and Karmasphere.

As we can see, the list of tools for big data is itself big and growing.
