This post was originally published at the IBM Data Warehousing blog.
Many years ago in a cartoon dialogue, Dilbert’s boss expressed concern for the theft of their desktop computers, but Dilbert assured him, to his boss’ satisfaction, that if he loaded them with data they would be too heavy to move. Hold that thought.
Co location: Getting durable performance from queries
Many shops will migrate to a new PureData System for Analytics appliance, Powered by Netezza Technology, simply by copying old data structures into the new data warehouse appliance. They then point their BI tools at it and voila, a 10x performance boost just for moving the data. Life is good.
The shop moves on by hooking up the ETL tools, backups and other infrastructure, not noticing that queries that ran in 5 seconds the week before, now run in 5.1 seconds. As the weeks wear on, 5.1 seconds become 6, then 7, then 10 seconds. Nobody is really watching, because 10 seconds is a phenomenal turnaround compared to their prior system’s 10-minute turnaround.
But six months to a year down the line, when the query takes 30 seconds or longer to run, someone may raise a flag of concern. By this time, we’ve built many new applications on these data structures. Far-and-away more data has been added to its storage. In true Dilbert-esque terms, loading more data makes the system go slower.
PureData has many layers of high-performance hardware, each one more powerful than the one above it. Adhering to this leverage over time helps maintain durable performance.
The better part about a PureData machine is that it has the power to address this by adhering to a few simple rules. When simply migrating point-to-point onto a PureData appliance, we’re likely not taking advantage of the core power-centers in Netezza technology. The point-to-point migration starts out in first-gear and never shifts up to access more power. That is, PureData has many layers of high-performance hardware, each one more powerful than the one above it. Adhering to this leverage over time helps maintain durable performance. The system may eventually need an upgrade for storage reasons, but not for performance reasons.
PureData is a physical machine with data stored on its physical “real estate”, but unlike buying a house with “location-location-location!” we want “co-location-co-location-co-location!” Two flavors of data co-location exist: zone maps and data distribution. The use of these (or lack thereof) either enable or constrain performance. These factors are physical, because performance is in the physics. It’s not enough to migrate or maintain a logical representation of the data. Physical trumps logical.
Zone maps, a powerful form of co-location in PureData
The most powerful form of co-location is zone maps, optimized through the Organize-On and Groom functions. Think of transaction_date as an Organize-On optimization key. The objective is to regroup the physical records so that those with like-valued keys are co-located on as few disk pages as possible. Groom will do this for us. Now when a query is issued against the table, filtering the transaction_date on a date value or date range filter, this query will be applied to the zone maps to derive the known physical disk locations and exclude all others. This is Netezza’s principle of using the query to tell it “where-not-to-look”.
The additional caveat is that the physical co-location of records by Organize-On keys is only valuable if they are actually used in the query. They radically reduce data reads, for example from 5 thousand pages down to 5 pages to get the same information. That’s a 1000x boost! The zone maps, enabled by Organize-On and Groom, are what achieve these dramatic performance boosts. If we do not use them, then queries will initiate a full table-scan which naturally takes more time.
The reason why this is so important is that disk-read is the number one penalty of the query, with no close second. A PureData System N200x or N3001 can read over 1100 pages per second on a given data slice. So if the query scans 5000 pages for each, it’s easily a 4-second query. But it won’t stay a 4-second query. As the data grows from 5000 pages to 10,000 pages, it will become a 10-second query. If the query leverages the zone maps and reduces it consistently to say, 100 pages per query, the query will achieve a sub-second duration and remain there for the life of the solution.
Does this sound like too much physical detail to know for certain what to do? That’s why the Organize-On and Groom functions make it easy. Just use the Query History’s column access statistics, locate the largest tables and find the most-often-accessed columns in where-clause filters (just don’t Organize-On join-only columns or distribution keys!). Add them to the Organize-On, Groom the table and watch this single action boost the most common queries into the stratosphere.
Data Distribution, co-location through “data slices”
Data distribution is another form of co-location. On a PureData system, every table is automatically divided across disks, each representing a “data slice”. Basically when a distribution key (e.g. Customer_ID) is used, the machine will hash the key values to guarantee that records with the same key value will always arrive on the same data slice. If several tables are distributed on the same key value, their like-keyed records will also be co-located on the same data slice. This means joining on those keys will initiate a parallel join, or what is called a co-located read.
Another of the most powerful aspects of Netezza technology is the ability to process data in parallel. Using the same distribution key to make an intermediate table, an insert-select styled query will perform a co-located read and a co-located write, effectively performing the operation in massively parallel form and at very fast speeds. Netezza technology can eclipse a mainframe in both its processing speed and ability to move and position large quantities of data for immediate consumption.
A few tweaks to tables and queries however, can yield a 100x or 1000x boost…
The caveat of data distribution is that a good distribution model can preserve capacity for the long-term. A distribution model that does not leverage co-located joining will chew-up the machine’s more limited resources such as memory and the inter-process network fabric. If we have enough of these queries running simultaneously, the degradation becomes extremely pronounced. A few tweaks to tables and queries however, can yield a 100x or 1000x boost; and without them the solution is using 10x or 100x more machine capacity than necessary. This is why some machines appear very stressed even though they are doing and storing so little.
Accessing the machine’s “deep metal”
Back to the notion of a “simple migration”. Does it sound like a simple point-to-point migration will leverage the power of the machine? Do the legacy queries use where-clause filters that can consistently invoke the zone maps? Are the tables configured to be heavily dependent upon indexes to support performance? If so, then the initial solution will be in first-gear.
But wait, maybe the migration happened a year or so ago and now the machine is “under stress” for no apparent reason. Where did all the capacity go? It’s still waiting to be used, in the deep-metal of the machine, the metal that the migrated solution doesn’t regard. It’s easy to fix that and voila, all this “extra” capacity seemingly appears from nowhere, like magic! It was always there. The solution was ignoring it and grinding the engines in first gear.
Enable business users to explore deep data detail
When Stephen Spielberg made Jurassic Park, he mentioned that the first dinosaur scene with the giant Brachiosaurus required over a hundred hours of film and CGI crunched into fifteen seconds of movie magic.
This represents a typical analytic flow model, where tons of data points are summarized into smaller form for fast consumption by business analysts. PureData System changes this because it is fast and easy to expose deep detail to users. Business analysts like to have access to the detail of deep data because summary structures will throw away useful details in an effort to boost performance on other systems.
The performance is built-in to the machine. It’s an appliance after all.
Architects and developers alike can see how the “Co-location, co-location, co-location!” is easy to configure and maintain, offering a durable performance experience that is also adaptable as business needs change over time. Getting there and staying there doesn’t require a high-wall of engineering activities or a gang of administrators on roller-skates to keep it running smoothly. The performance is built-in to the machine. It’s an appliance after all.