Reflections on Data Redundancy and the Shades of Denormalization

Database Denormalization Can Help Boost Query Performance

Database normalization and denormalization is one of the most discussed and debated topics ([1], [2], [3]).    In a nutshell, “normalization is eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table).”   Normalization reduces the amount of database storage space.   Denormalization is the opposite: “the process of attempting to optimize the performance of a database by adding redundant data or by grouping data” ([4]).

Obviously, denormalization is a dual-edge sword. On one hand, the database size may grow quite significantly. On the other hand, even with partial denormalization, many analytical queries may be simplified, as it allows for reformulating them as single table queries, reducing multi-table join dimension, and eliminating some correlated subqueries.

For example, let us consider the TPC-H benchmark database schema. Query 10 is formulated as follows:

select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, n_name, c_address, c_phone, c_comment
from customer, orders, lineitem, nation
where c_custkey = o_custkey and l_orderkey = o_orderkey
and o_orderdate >= date ‘:1′ and o_orderdate < date ‘:1′ + interval ‘3′ month
and l_returnflag = ‘R’ and c_nationkey = n_nationkey
group by c_custkey, c_name, c_acctbal, c_phone, n_name,  c_address, c_comment
order by revenue desc;

This is a four-table join. However, the only purpose of joining with table nation is to display n_name in the select list, and table orders is only needed in the joined table list for filtering on o_orderdate.

Suppose we denormalize customer and lineitem tables as follows. Add a new redundant column c_nationname to customer, and new redundant columns l_custkey and l_orderdate to lineitem. The database size will increase by 2%. Such denormalization allows reformulating TPC-H Query 10 as follows:

select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, c_nationname, c_address, c_phone, c_comment
from customer, lineitem
where l_orderdate >= date ‘:1′ and l_orderdate < date ‘:1′ + interval ‘3′ month
and l_returnflag = ‘R’
group by c_custkey, c_name, c_acctbal, c_phone, c_nationname, c_address, c_comment
order by revenue desc;

Let us also add s_nationname to supplier and n_regionname to nation. Those familiar with TPC-H Benchmark will notice that these small denormalization steps will speed up many queries. For example, the 6-table join Query 7 will turn into a 3-table join; the 8-table join Query 8 will turn into a 5-table join; the 6-table join Query 9 – into a 4-table join; the correlated subquery in Query 4 will turn into a much simpler 2-table join.

Urban Legend That Vendors Used Denormalization in Benchmarking

If denormalization of a few fields can provide spectacular query performance improvements, is it possible to do implement some kind of transparent denormalization to get better benchmark results? Imagine what TPC-H results may look like if a database server could invisibly automatically denormalize several tables. All 8 tables will still have to remain in the schema.

Clearly, in case of row-oriented databases, you can’t hand-pick which columns to denormalize during database table creation and data loading. However, once the data is loaded and written to disk, it is prohibitively difficult to implement granular “invisible” denormalization in row-oriented databases. Besides, as no indexes may be created in TPC-H databases, row-oriented DBMS have to perform full table scans.

Can transparent denormalization be implemented on the table level? Not really. Each of the eight TPC-H tables has a very long “comment” column. For example, column orders.o_comment is varchar(79), partsupp.ps_comment is varchar(199). Thus, merging table partsupp into orders would triple the size of table orders, and merging orders into lineitem would triple the size of table lineitem. I remember 15 years ago, when I worked on SybaseIQ, rumors and suspicions of cheating TPC-D (the predecessor of TPC-H) were quite common. TPC-H Benchmark was designed to make implicit denormalization impossible.

Partial Denormalization Rules of Thumb

Unlike in case of TPC-H, analytical BI and DW designers are not bound by strict benchmark specification and rules. They can take full advantage of partial database denormalization to maximize analytical query performance. Which columns to duplicate, depends entirely on the nature of the database application and on the query mix. When evaluating the merits of denormalizing particular database columns, look for two things.

First, you don’t want to increase to risk that your analytical database may become inconsistent. Don’t duplicate columns unless the column values are unlikely to change, as otherwise you will have to cascade value updates manually. In the example above, o_orderdate is a very good candidate for denormalization, because it will not change.

Second, examine poorly-performing queries and tailor your denormalization steps to benefit these queries. For example, if you see a number of queries with GROUP_BY vendor_name, you may want to duplicate vendor_name field. If your BI tool generates a lot of filters involving a dimension table column (e.g., WHERE model_year=2008), this column is a good candidate too, as was the orders.o_orderdate column in the TPC-H Query 10 example.

Partial Database Denormalization is Simpler and Less Risky in Case of Column-Oriented Databases

As I was writing this blog, I was thinking about how partial denormalization can help boost analytical BI and DW query performance, and how challenging it may be for database designers and DBAs to anticipate the query mix accurately enough to make correct schema design decisions. And even if such a perfect database schema is found, it may become suboptimal over time because of changing business models, use cases, workflows, and rules, BI data feeds, and computing infrastructure. When business changes of this kind happen, you may want to reorganize your databases, so that database schemas better meet business needs.

Consider for a minute the impact of such reorganization. Suppose you have a 20 terabyte analytical database. In case of a row-oriented database, entire fact tables will have to be scanned and transformed by an ETL tool and loaded in reformatted/reorganized way. Add to this big memory and storage footprint as result of poor data compression, the need to handle indexes and sharding, - and the big impact of any such database schema reorganization becomes clear. From my prior IS management experience, I know firsthand that datacenters and IS departments at large organizations spend many months and a lot of resources planning such transitions.

If on the other hand your 20-terabyte analytical database is column-oriented, you only need to reload one column, without doing ETL or anything else. All you need is a working ALTER TABLE statement that allows adding and dropping a column. Add to this small storage and memory footprint as a benefit of superior data compression, and database schema redesign task becomes significantly simpler, smaller, and less risky, compared to the row-oriented database case.

Leave a Reply

You must be logged in to post a comment.