19 November 2007

Good Database Design Part 4: Normalization

Ideally, normalize it so hard that an understanding of abnormal is abolished. Unfortunately, we do not live in an ideal world. The database administrator desires normalization while the developer craves denormalization. Good design is simple to represent denormalized.

C.J. Date said that database design is common sense formalized. Expounding upon that, you do not need to know relational theory to be capable of good design. Hopefully, you enjoy what you are doing enough to take an interest, but it is by no means mandatory. Critical thinking will often get you farther than theory alone.

With that said, normalization is great theory. The naysayer will whine about performance costs. You may be reading more tables, but you are doing it more efficiently which just as often results in… wait for it… less I/O and/or quicker retrieval times. The naysayer is generally considering only one side of performance: reading. A well normalized model yields more joins than a flatter model, and joins are expensive. However, more and smaller tables may improve write operations, overall. Additionally, a denormalized model often requires more supporting indices. This further degrades write performance, may increase fragmentation, requires more disk space and--getting overly nitpicky--complicates the optimizer’s job by forcing consideration of additional plans. I am not saying that denormalization does not have a place; just that it is typically in an analytical/reporting environment. Nor am I saying that normalized design does result in poor read performance. Poor implementation results in poor performance; don’t blame normalization (you can blame DBMSs for some of this (SQL should better provide for relational algebra operations)). If someone is using performance as an excuse not to normalize their OLTP model, there is a strong chance that they are an idiot.

There is no need to take everything to the fifth normal form; a good DBA knows when to be a bit forgiving—but you best have some sound reasoning backing you up. For most applications, the third normal form is sufficient. Start with your ideal model and adjust it around your implementation. Your concessions should be small to none for OLTP models. You can present it to your application developers however they want it; don’t let them break your rules unless you cannot accommodate their needs.

A well normalized model results in a database containing tables with an average of six to eight fields. Outside of averages, trust your intuition. If something looks good and feels correct (you kept it simple, right?), it probably is—move on.

There are standard structures that are much easier not to normalize. Names and addresses, for instance. Most of the time, we turn a blind eye and throw some columns into a table because it is much easier than normalizing. This does not mean that names and addresses should be in one table; obviously, if the entity associated with the address can have multiple addresses, that needs to be factored into your design.

Consider the selectivity of your data. If it is low, maybe normalizing saves some space. If not, maybe we can let this slide (like addresses and suite numbers). Also consider your methodology. If you are using an integer as your surrogate key, what will normalizing area codes do? If you store the area code as an integer, it actually adds the domain times two and creates an additional join. If the area code is three characters, that is smaller than the key. Don’t get carried away with normalizing your design, but remain vigilant.

I once implemented a project where normalizing the names and migrating surrogate keys saved a lot of space and did not sacrifice performance (quite the opposite). Look at the Social Security Administrations Death Master File; millions of names with only a couple hundred thousand unique first names and surnames. This is an anomaly (not the names, normalizing the names). The important thing is being aware of all the factors and accounting for them in your design.

No comments: