04 December 2007

Cartesian Products

When a report monkey asks why they are receiving an exponentially greater resultset than expected, we are often presented the opportunity to say, 'Cartesian product.' It is a wonderful phrase. I mean, if you can sling words like that, you must be good at your job. But making complex ideas sound simple is what really shows understanding. So when they follow up with, ‘Is that bad?’ respond with, 'You probably forgot a join.'

So what is a Cartesian product? It is a set of all possible sets. Let’s simplify this and deal with two source sets. Table Product and table Quantity. Note the compound keys. It is easier to avoid Cartesians using surrogate keys.


Table Product

ManufacturerProductSize
CogswellCog5
CogswellSprocket1
OmicronWidget3
MajobberThingey1


Table Quantity

ManufacturerProductQuantity
CogswellCog264
OmicronWidget17


Prior to SQL-92, tables were joined in the where clause. While it keeps the from clause cleaner, it does make it easier to forget joins on more complex queries.

Cartesian Statement:
select
p.Product,
q.Quantity
from
Product p,
Quantity q;

Resulting Cartesian Product:


ProductQuantity
Cog264
Cog17
Sprocket264
Sprocket17
Widget264
Widget17
Thingey264
Thingey17


Since we did not specify how the tables relate to each other, we are relating one set to another set. Meaning, each row in one set is equal to every row in the other set; a Cartesian product.

Specifying joins in the from clause does not eliminate Cartesian products. It may even make them more difficult to spot.

Incorrect Quantity by Manufacturer:
select
Manufacturer,
sum(Quantity) Quantity
from
Product p
inner join Quantity q
on q.Manufacturer = p.Manufacturer;


ManufacturerQuantity
Cogswell528
Omicron17


This is not a Cartesian product. Perhaps ‘Cartesian byproduct’ works, or ‘nested Cartesian?’ The query will equate every manufacturer record in Product to every manufacturer record in Quantity.

In any event, it appears to work because some (ok, in this case just one) of the answers are correct; Omicron does have a total quantity of 17. Cogswell however is wrong. Both product lines for Cogswell were married to one Quantity record, thereby inflating the number. The point is, don’t label every malformed query a Cartesian product; you never know when some math major will give you a verbal beatdown.

Another way of looking at Cartesians is considering them many-to-many joins. If your intent is a Cartesian product, use CROSS JOIN. In my experience, I have found one time that I actually wanted to use a Cartesian product… and I wanted to use it. Ultimately, it was not the correct approach. Not that I have never created Cartesians, just that they have never been intentional.

No comments: