This post was originally published at The Expert Integrated Systems Blog and is co-authored by Dave Ebbels, Isaac Moreno Navarro, and Mark Aukeman.
While simplicity is one of the cornerstones of IBM PureData System for Analytics, powered by Netezza technology, there are some easy steps you may take to increase the performance of your system when you are running heavy analytical loads.
As you may already know, in PureData System for Analytics, you do not have to perform administration tasks commonly encountered in other database platforms, such as index creation, partition management and storage optimization. However, there are some simple things related to the way PureData System for Analytics is designed that help to ensure you get the maximum performance available in your system.
The following five points are important tasks to consider for optimizing your PureData System for Analytics data warehouse solution.
1. Distribution optimization
We have two main objectives here, as finding a proper distribution is probably the most important feature for getting optimum performance. You need to ensure that each data slice contains about the same amount of data. Additionally, you want to find the distribution key to verify that most of your joins are “co-located.” Joins are co-located when at least one of the join keys is also the distribution key for both joined tables.
For the distribution key, choose the one column you are most likely to use in a join with other tables, and concentrate on your largest tables. And remember to choose a distribution key with high cardinality, such as “customer_id” or “product_id.” Never use a date field.
2. Zone map optimization, clustered base tables (CBT) and organization
Zone maps help the system eliminate data blocks that it knows in advance will not have data for query. Optimizing zone maps will decrease the time spent scanning data in disks, and thus increase the performance of the system dramatically.
Always use data types that are “zone-mappable” (integer or temporal data types) in your WHERE clause conditions to get the advantage of zone maps.
Proper organization of data will boost the effectiveness of zone maps. For instance, ordering on zone-mapped columns or creating a clustered base table (CBT) will improve zone map effectiveness and thereby accelerate data scanning.
3. Concurrency management and optimization
What really is concurrency? We will define it as the number of queries that are submitted and completed within a given time period. It is all about throughput!
Allocate reasonable resources to longer running queries, while at the same time allow short queries to pass through quickly (increase throughput). You can achieve this objective by reducing the number of queries being executed at the same time and prioritizing the short queries. Setting a JOB MAXIMUM limit on concurrency can actually improve throughput and query response time.
4. Workload management
Workload management approaches optimization from the perspective of workloads that span multiple queries and multiple users, as opposed to individual queries.
By creating different resource sharing groups (RSG), you can manage the resources allocated to those groups to meet service level agreements (SLA) for different populations in your organization. Consider that RSGs should represent a similar number of active users to assure the expected level of performance. Also, use caution when specifying RESOURCE MAXIMUM settings. This could result in queries sitting idle when system resources are indeed available.
5. Structured Query Language (SQL) optimization
The way in which your SQL is structured can have an effect on the performance of the resulting query. While in most cases you may leave your SQL untouched, there are a few considerations you may need to know. For example, the effectiveness of zone maps will depend on the conditions included in the WHERE clause in your SQL. Also, including distribution keys in your join conditions is a prerequisite for co-located joins.
For complex analytical queries, consider breaking them into smaller steps to reduce the number of joins and give the opportunity to distribute the intermediate results in an optimal way. Review your queries and find out if what you are trying to do is already covered by a Netezza analytic function. If that is the case, you may solve your query in a much more efficient manner (in-database execution) and with spectacular performance.
PureData System for Analytics is designed to provide industry-leading performance without the complex administrative tasks normally associated with database tuning. However, by performing a few simple tasks like the ones we have outlined, you can further enhance the performance of your entire system and increase the level of satisfaction in your user community.
We encourage you to use these best practices when seeking to optimize your existing PureData System for Analytics environment, or planning a new implementation. Please comment or tweet about your interesting experiences optimizing PureData System for Analytics queries.