I like to think of myself as someone who lives near the leading, if not the bleeding, edge of technology when it comes to my professional life. I’ve been in this field for a fairly long time now. I got my feet wet working with one of the best technical mentors I’ve ever known (Hey Tex! If you’re reading this, yes, I’m talking about you!) who always challenged me to think outside of the box when it came to solutions to problems. He and I designed and built solutions that seemed way ahead of their time (I think I can honestly lay claim to the first “roaming profiles” in Windows, using Windows 3.11 – later Windows for Workgroups – Novell NetWare login scripts and some seriously custom client-side code, as well as the first wide-area network designed to allow hospitals to communicate with one another and share almost-real-time operating room data). As time progressed, it seemed harder and harder to stay near the front of leading technologies, especially in the data world that I’d chosen as my niche.
For the last several years, my domain has been applying Business Intelligence to IT Configuration Management, and more recently to Storage Resource Management (SRM). I’ve been very fortunate to work with some extremely bright folks, and some very large companies. Over the last 12 months, I’d say that one of the technical challenges near the top of the list for these companies is how to deal with the “Big Data” problem. When I first heard the term “Big Data”, I didn’t really know why it was any different than any other data set, and didn’t really understand why people were so worked up about it. But as I spent more time talking with customers and people about it, I began to realize that “Big Data” really is a new kind of challenge, and a different way of thinking about things.
Big Data Defined
Simply put, when someone talks about “Big Data”, they are referring to datasets that have become too large to work with using standard tooling. This problem encompasses everything from getting data in, to querying data and producing meaningful output (reports, visualizations, etc.). The problem is exacerbated when you add non-typical data structures (many call this “unstructured data”, but I hate that term) and have a need to coalesce all of this together into a dashboard or other typical BI client. Obviously, the idea of “Big Data” goes well beyond the problem space of typical data warehouse implementations, but it does seem that many people assume that “Big Data” and “Data Warehouse” are one and the same. (Even Microsoft gets into this with their concept of the “Parallel Data Warehouse”.) Most techniques today that deal with “Big Data” are more focused on collecting and inserting data into the repository and less focused on what to do with the data once it’s collected. Consider a massively parallel (MPP) database consisting of tables that contain billions of rows of operational data, and then think of using today’s index and query techniques to access the data. Even the idea of constructing cubes of data to pre-aggregate and organize into multi-dimensional structures falls down pretty quickly when you have such a massive repository of data that you’re trying to access.
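To make the MPP idea a bit more concrete, here’s a minimal toy sketch of the scatter/gather pattern those systems use: each “segment” aggregates its own slice of the data locally, and the partial results are merged at the end. The data, partitioning scheme, and function names here are all invented for illustration; a real MPP database does this across many hosts, not threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "fact table" rows: (region, amount). In a real MPP database these
# would live on separate segment hosts; here we just slice a list.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("east", 1)]

def partial_sum(partition):
    """Aggregate one partition locally, as each MPP segment would."""
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0) + amount
    return totals

def parallel_group_by(rows, n_partitions=2):
    """Scatter rows across partitions, aggregate each, then merge."""
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = {}
    for totals in partials:
        for region, amount in totals.items():
            merged[region] = merged.get(region, 0) + amount
    return merged

print(parallel_group_by(ROWS))  # {'east': 18, 'west': 8}
```

The punchline for “Big Data” is that the merge step stays cheap only as long as the aggregated result is small; ad-hoc queries that can’t be decomposed this way are exactly where the standard tooling starts to hurt.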
The Kimball vs Inmon Factor
If I’m to be honest with myself, I’ll have to admit that I follow the thoughts and concepts of Ralph Kimball as opposed to Bill Inmon, and thus tend to think of a Data Warehouse as a collection of Data Marts. When I used to teach Data Warehouse theory, I always used the “bottom up” approach: “Once you have decided what the business needs, break it down into segments and then build a data mart for each segment. Repeat until you’re out of segments, and voilà! You have a Data Warehouse.” Using this approach, it becomes easy to use a Data Mart as the source of all output.
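For readers who haven’t built one, a Kimball-style data mart is typically a star schema: one fact table surrounded by small, denormalized dimension tables, built for a single business segment. Here’s a hedged sketch in Python with SQLite (all table names, columns, and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One data mart for one segment (say, sales): a central fact table
# keyed to denormalized dimension tables -- the "star" in star schema.
cur.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER,
    revenue    REAL
);
""")

cur.executemany("INSERT INTO dim_date VALUES (?,?,?,?)",
                [(1, "01", "Jan", 2011), (2, "02", "Jan", 2011)])
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 10, 100.0), (1, 2, 5, 75.0), (2, 1, 3, 30.0)])

# The typical mart query: aggregate facts, sliced by dimension attributes.
cur.execute("""
    SELECT d.month, p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.month, p.name
    ORDER BY p.name
""")
print(cur.fetchall())  # [('Jan', 'Gadget', 75.0), ('Jan', 'Widget', 130.0)]
```

Because each mart is shaped around one segment’s questions, reporting tools can point straight at it, which is why the bottom-up approach makes the mart the natural source of all output.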
Using Bill Inmon’s approach, the Data Warehouse is more normalized and organized by subject and time. Data Marts are then created from the bigger Data Warehouse and are used to deliver subject-specific output.
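By contrast, an Inmon-style warehouse keeps the data normalized by subject, and the subject-specific marts are derived from it, for example as views. A minimal sketch of that direction, again with invented names and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A fragment of a normalized (3NF-ish) warehouse: entities split out
# by subject rather than flattened into dimensions.
cur.executescript("""
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id),
                         order_date TEXT);
CREATE TABLE order_line (order_id INTEGER REFERENCES orders(order_id),
                         amount REAL);

INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (10, 1, '2011-01-05'), (11, 2, '2011-01-06');
INSERT INTO order_line VALUES (10, 100.0), (10, 50.0), (11, 25.0);

-- A subject-specific data mart derived from the warehouse:
CREATE VIEW mart_customer_revenue AS
    SELECT c.name, SUM(l.amount) AS revenue
    FROM customer c
    JOIN orders o     ON o.customer_id = c.customer_id
    JOIN order_line l ON l.order_id = o.order_id
    GROUP BY c.name;
""")

print(cur.execute(
    "SELECT * FROM mart_customer_revenue ORDER BY name").fetchall())
# [('Acme', 150.0), ('Globex', 25.0)]
```

The trade-off is visible even at this toy scale: the normalized store is a cleaner single source of truth, but every mart query pays for the joins, which matters once the row counts get “Big Data” sized.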
On the surface, it would appear that the Inmon approach to Data Warehousing is more conducive to using BI tools with “Big Data” systems; however, I think even that approach falls down pretty quickly when faced with the massive amount of data that is stored. (By the way, as an employee and shareholder of one of the largest IT storage companies in the world, I’m OK with this problem. It just means people need to buy more storage!)
The Greenplum Community
In July of 2010, EMC acquired Greenplum, which is one of the major players in the “Big Data” game. Greenplum offers some very interesting technology on both the data storage and insert side and the “get data out” side of things. Just recently, EMC announced that the Greenplum Database Community Edition would be available free of charge for those who want to play around with Big Data tools. I think this is a very smart move, as it gives developers access to data storage as well as data mining and visualization tools that support both traditional relational and non-structured data sources.
This is going to be a very interesting year in terms of technology advancements. It will be interesting to see how it all plays out.