When a report monkey asks why they are receiving an exponentially greater resultset than expected, we are often presented the opportunity to say, 'Cartesian product.' It is a wonderful phrase. I mean, if you can sling words like that, you must be good at your job. But making complex ideas sound simple is what really shows understanding. So when they follow up with, ‘Is that bad?’ respond with, 'You probably forgot a join.'
So what is a Cartesian product? It is a set of all possible sets. Let’s simplify this and deal with two source sets. Table Product and table Quantity. Note the compound keys. It is easier to avoid Cartesians using surrogate keys.
Table Product
Manufacturer | Product | Size |
Cogswell | Cog | 5 |
Cogswell | Sprocket | 1 |
Omicron | Widget | 3 |
Majobber | Thingey | 1 |
Table Quantity
Manufacturer | Product | Quantity |
Cogswell | Cog | 264 |
Omicron | Widget | 17 |
Prior to SQL-92, tables were joined in the where clause. While it keeps the from clause cleaner, it does make it easier to forget joins on more complex queries.
Cartesian Statement:
select
p.Product,
q.Quantity
from
Product p,
Quantity q;
Resulting Cartesian Product:
Product | Quantity |
Cog | 264 |
Cog | 17 |
Sprocket | 264 |
Sprocket | 17 |
Widget | 264 |
Widget | 17 |
Thingey | 264 |
Thingey | 17 |
Since we did not specify how the tables relate to each other, we are relating one set to another set. Meaning, each row in one set is equal to every row in the other set; a Cartesian product.
Specifying joins in the from clause does not eliminate Cartesian products. It may even make them more difficult to spot.
Incorrect Quantity by Manufacturer:
select
Manufacturer,
sum(Quantity) Quantity
from
Product p
inner join Quantity q
on q.Manufacturer = p.Manufacturer;
Manufacturer | Quantity |
Cogswell | 528 |
Omicron | 17 |
This is not a Cartesian product. Perhaps ‘Cartesian byproduct’ works, or ‘nested Cartesian?’ The query will equate every manufacturer record in Product to every manufacturer record in Quantity.
In any event, it appears to work because some (ok, in this case just one) of the answers are correct; Omicron does have a total quantity of 17. Cogswell however is wrong. Both product lines for Cogswell were married to one Quantity record, thereby inflating the number. The point is, don’t label every malformed query a Cartesian product; you never know when some math major will give you a verbal beatdown.
Another way of looking at Cartesians is considering them many-to-many joins. If your intent is a Cartesian product, use CROSS JOIN. In my experience, I have found one time that I actually wanted to use a Cartesian product… and I wanted to use it. Ultimately, it was not the correct approach. Not that I have never created Cartesians, just that they have never been intentional.
No comments:
Post a Comment