This article was reprinted courtesy of the International DB2 Users Group.
Since the introduction of the Star Join access method in Version 6 of DB2 UDB for z/OS and OS/390, there have been many additional enhancements to improve the detection of star schema queries to determine the most efficient access path, and also to minimize regression of non-star schema queries.
Due to the rapid deployment of these improvements, the challenge has been to ensure that the DB2 community is aware of this enhanced functionality. Thus, it is the objective of this article to summarize the current implementation of Star Schema support in DB2 UDB for z/OS and OS/390 V6 and V7, and provide an overview of the issues and complexity associated with this join methodology.
For a more comprehensive overview of the evolution of the Star Join implementation, from initial implementation to the current enhancements, please refer to the whitepaper entitled: "Evolution of Star Join Optimization" (see bibliography section for location). This whitepaper also details the complexities involved in processing queries against this data model, and provides a full listing of the relevant APARs related to the star join access path.
A star schema is the typical data model used for business intelligence and data warehousing systems, where there exists a number of normalized tables (known as dimensions) surrounding a large centralized table (known as the Fact Table). The Fact Table stores the fact, such as sales transactions, and the dimensions store the attributes about these facts. This model schematically resembles a star, and is named as such due to the dimension tables appearing as points of a star surrounding the central Fact Table.
Star schema join challenges
In an OLTP world, the query optimizer technology has been designed to handle pair-wise joins; joining tables two at a time based on join predicates and determining the join sequence based on the associated cost. A join pair is formed by the existence of join predicates between tables, thus tables without join predicates are considered to be unrelated and therefore not valid join pairs.
The pair-wise join method fails miserably for star schemas. As dictated by the star schema data model, the only table directly related to other tables is the Fact Table. As a consequence, the Fact Table is most likely chosen in the first pair-wise join, with subsequent dimension tables joined based on cost.
Even though the intersection of all dimensions with the Fact Table can produce a small result set, the predicates applied to one single dimension table are typically insufficient to reduce the enormous number of Fact Table rows. As demonstrated in the star schema example in Figure 1, there is no one single dimension table that could be paired with the Fact Table as the first join pair to produce a manageable result set.
If a join based on related join-pairs does not provide adequate performance, then an alternative is to join unrelated tables. Joining of unrelated join-pairs results in a Cartesian product, whereby every row of the first table is joined with every row of the second.
Performing a Cartesian join of all dimension tables before accessing the Fact Table may not be efficient. The optimizer must decide how many dimension tables should be accessed first to provide the greatest level of filtering of Fact Table rows using available indexes. This can be a delicate balance, as further Cartesian products will produce a massive increase in the size of the intermediate result sets. Alternatively, minimal pre-joining of unrelated dimension tables may not provide adequate filtering for the join to the Fact Table.
Further normalization of the dimension tables results in each dimension branch representing a snowflake. The resultant data model can contain large number of tables, which generally provides a challenge for cost based optimizers using dynamic programming (or exhaustive search) algorithms to identify the best possible table join sequence.
To simplify access path selection, DB2 always materializes each snowflake into a single dimension before determining the most cost effective join sequence as shown in Figure 2.
Figure 3 provides the explain output showing the materialization of snowflakes in their separate query blocks (Q_Blk_No 2 and 3), with one snowflake workfile accessed before, and one after the Fact Table.
Similar to snowflake materialization, single dimension tables accessed before the Fact Table are sorted into join column order and are also materialized into their own workfiles (although this does not appear explicitly in the plan table output). Single dimension tables accessed after the Fact Table are only materialized if required by the chosen join method.
Index Key Feedback
One of the most impressive run-time features of the Version 6 star join implementation was the Index Key Feedback loop, designed to minimize join overhead.
The sparseness of data within the Fact Table implies a significant number of values generated by the Cartesian join process are not to be found by a join to the Fact Table. Sparseness refers to impossible data combinations, such as the sale of bikinis in Alaska in December. To reduce the CPU overhead of joining unnecessarily derived rows to the Fact Table, DB2 introduces an index key feedback loop to return the next highest key value whenever a not-found condition is encountered.
A hit on the Fact Table index will return the matching Fact Table row. A miss will return the next valid Fact Table index key so that data manager can reposition itself within the dimension workfiles (materialized dimensions accessed before the Fact Table), thus skipping composite rows with no possibility of obtaining a Fact Table match. Figure 4 demonstrates the index key feedback technique.
To further improve the performance of the join to the Fact Table, the entire join has been pushed down to data manager (stage 1). This only applies for star join access from the composite (materialized dimensions) to the Fact Table. This ensures a reduced path length since rows no longer need to be returned to RDS (stage 2) for the join to occur. The join method used by this process is a nested loop join.
Fact Table Detection
Given that the issues involved with processing a star schema query are based on the sheer size of the central Fact Table, detection of this Fact Table is the major objective to allow the most efficient runtime access path to be identified.
If the Fact Table is incorrectly identified, then the Fact Table will be considered to be a dimension table (within a snowflake) to be materialized and potentially accessed early in the table join sequence. Alternatively, incorrect Fact Table detection may disqualify the query for the star join optimization.
Unique Index Check
The first method used to detect the Fact Table is the Unique Index detection. If this method detects a Fact Table, then the remaining star schema detection rules are applied (listed below).
Beginning outside-in, the optimizer will evaluate each set of join predicates. For each set of join predicates between two tables, the table with a unique index on the join predicates is considered to be the parent in a parent-child relationship. As the optimizer continues outside-in, the table without any further children (and which therefore only has parents) is considered to be the Fact Table.
If the first check fails to identify a Fact Table, or this Fact Table fails against the subsequent rules, then the second Fact Table detection rule is applied. This second method is to evaluate the cardinality of the Fact Table and largest dimension against the STARJOIN DSNZPARM value. If a Fact Table is identified, then the remaining rules are applied.
The third and final method is the topology check. This is applied if the previous two methods have failed to qualify the query for star join processing. Once again, if a Fact Table is detected, then the additional rules are applied.
The topology check determines that the Fact Table has the highest number of join predicates in the query.
Star Schema Detection
Once a Fact Table is identified using any of the three Fact Table detection algorithms, then the following conditions must be met for DB2 to use the star join technique:
- Numbers of tables in the query block must be at least 10 (altered via DSNZPARM SJTABLES).
- All join predicates are between the Fact Table and the dimension tables, or within tables of the same dimension.
- All join predicates between the fact and dimension tables must be equijoin (equal join) predicates.
- All join predicates between the fact and dimension tables must be Boolean term predicates (fact to dimension join predicates cannot be ÒORÓed).
- A local predicate on a dimension table cannot be ÒORÓed with a local predicate of another dimension table.
- A single Fact Table column cannot be joined to columns of different dimension tables in join predicates. For example, Fact Table column ÒF1Ó cannot be joined to column ÒD1Ó of dimension table T1 and also joined to column D2 of dimension table T2.
- No correlated subqueries cross dimensions.
- The data type and length of both sides of a join predicate are the same between the fact and dimension tables.
- Dimensions cannot be a table function.
- After DB2 simplifies join operations, no outer join operations can exist between the fact and dimension tables.
A successful match on all of the above star schema detection rules will immediately qualify the query for star join optimization. A failure of these rules for each of the Fact Table detection rules will result in the query being optimized using standard dynamic programming (exhaustive search) techniques/algorithms.
Access Path Selection
The focus on the Fact Table within a star schema implies that the possible join sequences must aintain this focus. Efficient Fact Table access requires the existence of indexes on this table. Exploiting these indexes requires a certain degree of dimension table access to produce the selective join predicates for accessing the Fact Table.
The standard dynamic programming or exhaustive search algorithm may spend excessive time evaluating a significant number of inefficient join sequences given the potentially large number of tables accessed in a star schema query.
For star schema processing, DB2 instead invokes a set of heuristic rules driven by the available Fact Table indexes to evaluate the possible join permutations. The limited number of join permutations derived from these rules ensures that the bind time, or dynamic SQL preparation time, is not severely impacted as the number of tables increases. This implementation delivers significant bind time reductions over the standard dynamic programming algorithm.
First heuristic - Selective outside-in join
The first heuristic applies the following process to determine the potential join permutations:
For each Fact Table index I(C1, C2, ..., Cm):
Second heuristic Inside-out join
This second heuristic focuses on the Fact Table as the first table in the join sequence.
Beginning with the Fact Table, the following heuristic rules are applied to determine the join permutations for all dimension tables:
- Remaining tables joined based upon the cardinality (lowest cardinality to highest).
- Remaining tables joined based upon the cardinality (lowest cardinality to highest) with dimensions having an index on the join column selected first.
- Remaining tables joined based upon the reduction ratio or selectivity.
- Remaining tables joined based upon the reduction ratio or selectivity with dimensions having an index on the join column selected first.
Cost Estimation for Reduced Join Permutations
The first heuristic (outside-in) will produce a maximum of four access path alternatives for every leading key column combination (C1, C1/C2, C1/C2/C3 etc) of all Fact Table indexes. The cost of each access path will be estimated with and without the index skipping technique. Although index skipping reduces unnecessary index probes, it does carry an overhead with repositioning within the dimension workfiles. Also, the join to the Fact Table is processed entirely within Data Manager for star join (utilizing index key feedback), and therefore cannot utilize list prefetch, which is a RDS process.
The second heuristic will only produce a maximum of four access path alternatives.
The cost of all possible table join permutations derived by the two main heuristic rules will be then estimated using formulae from the standard dynamic programming cost-based optimization. New formulae have been introduced that take into consideration the costs associated with workfile repositioning and index key feedback.
It is anticipated that the minimized join permutations derived from these heuristic rules will determine the most efficient access path in the majority of situations. The goal of reducing the possibility of query regression, while still reducing the bind time/space complexity is therefore achieved.
The star join access logic has been developed to improve the processing of queries against data warehouse and business intelligence systems adopting the star schema data model.
The current implementation has significantly improved star schema detection, and introduced DSNZPARMs that provide DBAs and system administrators with greater control over the application of star join processing for their environment. This showcases IBM's commitment to z-Series machines as a viable and powerful processor for your data warehousing and BI systems. The current status of Star Join processing allows you to leverage your mainframe resources so that BI and OLTP applications can coexist within the one DB2 subsystem or data sharing group, at least from a query optimization perspective.
Whitepaper - Evolution of Star Join Optimization (October 2001)
DB2 UDB Server for OS/390 Version 6 Technical Update (SG24-6108-00)
DB2 for z/Os and OS/390 Version 7 Performance Topics (SG24-6129-00)
About The Author
Terry Purcell is a Senior Consultant with Yevich, Lawson & Associates and a member of the IBM Gold Consultants program for DB2. Terry can be reached at Terry_Purcell@YLAssoc.com.