Data warehousing used to be simple. If you wanted to stockpile and organize your data for analytics purposes, you spun up an extra database instance of whatever database platform your organization standardized on, connected your ETL tool of choice, modeled the data, wrote a few ETL routines, and set up regular refreshes.
Today, things have changed. The emergence of appliances for purpose-built, high-performance data warehouses, Hadoop for really big data warehouses, and a number of cloud-based elastic data warehouse platforms has certainly made things more complicated.
It’s important to know the key differentiators of various data warehousing technology choices, as well as how to implement them in your organization.
The world of data is increasingly becoming multi-modal, which means an integration of multiple technologies both new and old from multiple vendors and communities. Today, data warehouses can also be on-premise, solely in the cloud, or hybrid of the two.
Data warehousing modalities
Ten to 15 years ago, in order to prepare and store data for analytic purposes, companies really had only one choice: build out a traditional data warehouse environment. Typically, these data warehousing environments would sit on traditional relational database management systems (RDBMSs). To improve the performance of RDBMS technology for analytic workloads, the major database vendors introduced innovations such as columnar-store, columnar indexing, in-memory, and dimension partitioning. As the need for higher performance grew, expensive purpose-built arrays of highly connected and interoperable massively parallel processing (MPP) data warehouse appliances appeared. These systems were designed to scale linearly from a few parallel nodes to several hundred. Unfortunately, those systems had physical upper limits, and when the data got too large, splitting the workload among many appliances would introduce diminishing returns in speed and overall performance. Further, choke-points on data ingestion (INPUT) and data querying (OUTPUT) made these environments less attractive for high-volume, low-latency and high-concurrency querying scenarios. With the development of Hadoop-style technologies, companies could move their data into an infinitely scalable environment based on parallel processing on commodity hardware, backed by vendor-supported distributions of open-source code projects. While the scale problem appeared to be solved, performance—especially user-generated data querying response times—could not match the level businesses had become accustomed to in data warehouse appliances. Finally, through the evolution of virtualized parallel elastic cloud data processing, select vendors are able to provide the ideal balance of scale, complexity, sophistication and performance in a “pay-as-you-play” and “pay-for-performance” on-demand model at scale. While challenges still exist—specifically, how do you get your data into the cloud at scale, and how do you justify cloud storage costs for an ever increasing data footprint—elastic cloud solutions are becoming increasingly popular.
A framework of thinking
To help frame how to think about this new world of data warehousing, it’s worth considering a cue from the age-old discipline of project management. According the Project Management Book of Knowledge (PMBOK), the art of professional project management requires balancing three elements in the interest of creating value: scope, schedule and cost.
In project management, to create the same amount of value, you must increase schedule and cost if you want to increase the scope. Likewise, if the costs go up or the schedule has delays, for the same amount of scope, you will get less value.
Similarly, the discipline of data warehousing or data engineering for analytics also requires the balancing of three elements in the interest in creating performance: scale, complexity and sophistication.
To create the same amount of performance, you should increase the sophistication if you want to increase scale and/or complexity.
Scale is how big the data is in extent, and at what velocity it arrives. In traditional terms, we might think of measuring scale in terms of the number of records flowing in, through and out of your environment per unit of time. Complexity is how varied the data is, how many kinds of sources, what form factors that data arrives in, how much context is delivered with the data (i.e. how structured, regular, clean or consistent it is). Sophistication refers to how skilled a team must be to construct and operate the environment, how evolved the methods of implementation are, and the rigor of processes designed to support it. Performance is a measure of quality of the total system or environment. It can be measured in terms of how quickly major integration and transformation tasks take to complete, how long it takes to return a queried result set, or how long it takes to onboard and integrate a new data source.