Sunday, July 10, 2011

Big Data - Tools

A great deal of innovation has gone into managing and using large amounts of data.

Several storage systems target this market, such as EMC Isilon, IBM SONAS and HP X9000. NetApp recently announced its E-Series range of storage devices.

Storage software includes relational databases like Teradata, which is row-based. Column-oriented databases store content by column rather than by row. They are advantageous for queries that fetch only a small subset of columns across a large number of rows. Examples include HP Vertica and EMC Greenplum; Oracle Exadata offers a hybrid columnar format. One of the new trends for big data is a move away from traditional RDBMS software. There is a new class of "NoSQL" (Not Only SQL) alternatives that do not require fixed schemas and are highly distributed and horizontally scalable. Examples of such databases are Cassandra, CouchDB and MongoDB.
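To make the row versus column distinction concrete, here is a minimal Python sketch. The table, its field names and the two helper functions are made up for illustration; real column stores add compression and on-disk layouts that this toy ignores.

```python
# A tiny table stored row by row: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Ann", "city": "Austin", "amount": 120.0},
    {"id": 2, "name": "Bob", "city": "Boston", "amount": 75.5},
    {"id": 3, "name": "Cy", "city": "Austin", "amount": 30.0},
]

# The same table stored column by column: each field keeps all of its values together.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def total_amount_row_store(table_rows):
    """Touches every record, even though only one field is needed."""
    return sum(row["amount"] for row in table_rows)


def total_amount_column_store(table_columns):
    """Reads a single contiguous column and ignores the others."""
    return sum(table_columns["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total. The difference is how much data has to be read to get it, which is what matters when the table has billions of rows and hundreds of columns.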

To analyze and process big datasets, the work is not carried out on a single powerful computer but on a network of commodity servers clustered together. The most popular framework in this realm is Hadoop. Inspired by Google's MapReduce and Google File System (GFS), it was created by Doug Cutting and Mike Cafarella, developed heavily at Yahoo, and is now an open source project managed by Apache. Hadoop uses HDFS, a highly scalable distributed file system, to store data and MapReduce to process that data in parallel on a cluster of nodes. In addition to HDFS and MapReduce there are other useful sub-projects associated with Hadoop, like HBase, Hive and Pig. HBase is a distributed, column-oriented data store that runs on top of HDFS. Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying in a language similar to SQL. Pig is a high-level data-flow language and execution framework for parallel computation.
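The MapReduce idea can be sketched in a few lines of plain Python. This is not Hadoop's API, only a single-process illustration of the map, shuffle and reduce steps using the classic word count; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data tools", "big data is big"]

# On a real cluster each document (or file block) would be mapped on a
# different node; here the "nodes" are just loop iterations.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```

Because each map call looks at only one record and each reduce call looks at only one key, both steps can be spread across many machines without the pieces needing to talk to each other.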

Hadoop has been the framework of choice for big data implementations at many companies, including Yahoo, Facebook, eBay and Twitter. There is a healthy ecosystem of commercial vendors that provide tools and services based on the Hadoop platform, such as Cloudera, Datameer and Karmasphere.

As we can see, the list of tools for big data is itself big and growing.

Saturday, July 9, 2011

Big Data - Introduction

The term "Big Data" refers to datasets that are typically on the order of petabytes, exabytes or even larger. The data can be either structured or unstructured. In simple terms, any data that has a definite structure and can be stored easily in a relational database can be considered structured. Unstructured data does not have an identifiable structure; examples are email, images, Word documents and phone conversations.

With the advent of cheap digital storage and advances in computational power, we are seeing explosive growth in digital data. It is estimated that 90% of the data in the world today has been created in the past two years. According to IDC, more than 1.2 zettabytes of information will be created and stored in 2011, and by 2020 that volume will grow to 44 times its 2009 level.

The big question is: what does it mean for us?

Should this scare us? Governments and companies can act like Big Brother, with the ability to collect and retain personal information about us and our behavior through records of our searches, chats, health records and any other form of interaction that can be digitally tracked.

Shelving the privacy concerns for now, let us see how it can benefit businesses and consumers. Businesses can analyze this data to gain efficiencies within their organization and a strategic advantage over their competitors. Consumers will benefit greatly from innovations in core sectors like health care, where effective use of data will help doctors make more informed decisions in treating patients. Real-time traffic and weather data from mobile phones and GPS devices will help drivers avoid traffic congestion.

What opportunities does it bring for engineers and scientists? Big Data is still mostly a buzzword, adopted mainly by larger companies: internet giants like Google, Yahoo and Facebook; retailers like Walmart; and financial services companies. So there are a lot of businesses that in the coming years will be looking for talent to manage and use big data and associated services.

In the next post I will talk about some of the popular tools associated with Big Data.

Wednesday, July 6, 2011

TinyMCE and Selenium

TinyMCE is a platform-independent, web-based JavaScript HTML WYSIWYG editor control that can be embedded in web pages. Selenium is an open source automation framework that can drive tests on most web browsers.

The web page that I wanted to test using Selenium contained multiple TinyMCE editors used for various rich text areas. Internet Explorer 8 was the target browser. Simply using the type command with input text did not work, because TinyMCE renders each editor inside its own iframe rather than in the original textarea. So I googled how other people had achieved this and found the following two solutions:

1. Select frame and then type text

selectFrame id_of_tinymce_iframe
focus tinymce
type tinymce Test_text
selectFrame relative=parent

2. Identify the text area using DOM locator

type dom=document.getElementById('id_of_tinymce_iframe').contentWindow.document.body Test_text

Both of them worked on Firefox, where I develop my tests using Selenium IDE. But on IE 8 the "Test_text" never appeared in the TinyMCE text area, even though the steps ran without errors. What finally worked was focusing the editor body and then setting its innerHTML directly:

focus dom=document.getElementById('id_of_tinymce_iframe').contentWindow.document.body
getEval selenium.browserbot.getCurrentWindow().document.getElementById('id_of_tinymce_iframe').contentWindow.document.body.innerHTML = 'Test_text';

And yes, this did work on IE 8.
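
One thing to watch with the getEval approach is quoting: text containing an apostrophe or a newline would break the hand-written JavaScript string. If the tests are driven from a script rather than Selenium IDE, a small helper can build the expression safely. The sketch below is in Python; the function name tinymce_set_body_js is made up for this example, and it only builds the string to hand to getEval, it does not talk to a browser.

```python
import json


def tinymce_set_body_js(iframe_id, html):
    """Build the JavaScript for getEval that sets a TinyMCE editor's body.

    Both arguments are emitted as JSON string literals, which are also valid
    JavaScript string literals, so quotes and newlines are escaped.
    """
    return (
        "selenium.browserbot.getCurrentWindow().document"
        ".getElementById(%s).contentWindow.document.body.innerHTML = %s;"
        % (json.dumps(iframe_id), json.dumps(html))
    )


script = tinymce_set_body_js("id_of_tinymce_iframe", "It's a <b>test</b>")
print(script)
```

The resulting string is the same expression used above, only with double-quoted and escaped literals, so it can be passed as the argument of getEval.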

Click here to read "Selenium tips for better automation".