Thursday, May 3, 2012

Big Data

A couple of years ago I joked that soon someone would come up with a Chaos Database system where you dumped everything into a big pile and let the CPU sort it out. It appears that day has come. I am attending Enterprise Data World, where data architects, data managers, and various folks interested in DWH, DSS, and BI meet and greet. Here the big buzz is about Big Data.

The best definition I have seen so far is that it is essentially column-designated data. For example, a document would be stored with a unique id tag, a column group, a column group description, a timestamp (actually a linear sequence of some kind), and the data package, which is the document itself. Inside the document it is hoped there are XML tags that parse out the information contained; if not, some kind of index is created on the document. So instead of a table you have a single column with multiple data pieces and a package, where the package contains the information and the rest tells the system how it fits into a predefined data taxonomy or ontology. You also have a bunch of individual index structures for the packages that aren't self-describing. A rough sketch of this layout appears at the end of this post.

A taxonomy is essentially a definition of a specific data set. For example, animal breaks down into species, which break down into subspecies, which break down into sexes. The tag portion of the columnar data puts the data into the particular spot in the taxonomy where it fits.

Once all your individual Big Data pieces have been encoded with their tags and stored, a framework such as Hadoop is used to access them with MapReduce jobs, essentially a map of taxonomy objects and what to do with them when you find them. This is hand coded; the second sketch below gives the flavor.

Of course all of this, Big Data, Hadoop, NoSQL (not only SQL), and all the new hoopla is in its beginnings; it is at about the Oracle 2.0 level. It seems to be the revenge of the data architect and programmer against the relative ease of database programming as it exists today. It would seem that defining a table and using a BLOB or XML object for the package, with the same type of column definitions as in the Big Data paradigm, would give the same benefits while allowing use of existing data query tools; the third sketch below shows what I mean. I pose the question: how much of all this "new paradigm" is re-labeled old technology? Do we really need completely new databases and structures to handle this?

Of course, with each column becoming a tagged data object, the size of our databases will also become Big. What required a table with 5 columns will now require an additional 3-4, for lack of a better term, columns to describe each main column, so 5 columns' worth of data balloons to 20-25 fields once the tags are counted. This seems to indicate that data volumes will increase by a factor of 4 or more. As an employee of a company that makes storage technology, I applaud this; however, part of me asks whether it is really needed.
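To make the layout concrete, here is a minimal sketch in Python of the tagged record described above. The field names and the taxonomy path are my own invention, loosely modeled on wide-column stores in the BigTable/HBase family; the point is simply that everything except the package exists only to describe the package.

    # One "cell" of column-designated data. All names here are hypothetical.
    import time

    record = {
        "row_id":        "doc-00042",                # unique id tag
        "column_group":  "documents",                # column group
        "group_descr":   "scanned field reports",    # column group description
        "taxonomy_path": "animal/canine/wolf/male",  # slot in the taxonomy/ontology
        "timestamp":     time.time(),                # linear sequence; epoch seconds here
        "package":       b"<report><animal>wolf</animal></report>",  # the payload
    }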
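As for the hand coding, here is a toy, in-memory MapReduce pass in Python over records like the one above, counting packages under each taxonomy branch. Real Hadoop jobs are typically written in Java (or via Hadoop Streaming), and the records below are made up, but the map-then-reduce shape is the same.

    from collections import defaultdict

    records = [
        {"taxonomy_path": "animal/canine/wolf/male",   "package": b"..."},
        {"taxonomy_path": "animal/canine/wolf/female", "package": b"..."},
        {"taxonomy_path": "animal/feline/lynx/male",   "package": b"..."},
    ]

    def map_phase(record):
        # Emit a (key, value) pair for the taxonomy object we are looking for.
        branch = record["taxonomy_path"].split("/")[1]  # e.g. "canine"
        yield (branch, 1)

    def reduce_phase(pairs):
        # Fold all values emitted for the same key into one result.
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)

    pairs = [p for rec in records for p in map_phase(rec)]
    print(reduce_phase(pairs))  # {'canine': 2, 'feline': 1}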
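And here is the relational counterpart I have in mind: the same descriptive columns plus a BLOB package in an ordinary table, queryable with plain SQL. A minimal sqlite3 sketch; the table and column names are again my own.

    import sqlite3
    import time

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE big_data (
            row_id        TEXT PRIMARY KEY,
            column_group  TEXT,
            group_descr   TEXT,
            taxonomy_path TEXT,
            ts            REAL,
            package       BLOB  -- the document itself; could also be XML text
        )
    """)
    db.execute(
        "INSERT INTO big_data VALUES (?, ?, ?, ?, ?, ?)",
        ("doc-00042", "documents", "scanned field reports",
         "animal/canine/wolf/male", time.time(),
         b"<report><animal>wolf</animal></report>"),
    )

    # Existing query tools still apply: no hand-coded job required.
    for (row_id,) in db.execute(
            "SELECT row_id FROM big_data "
            "WHERE taxonomy_path LIKE 'animal/canine/%'"):
        print(row_id)  # doc-00042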
