SQL Server Thread: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data

A Simple Single-Child Model

Consider the following logical data model:

There are multiple Companies
Each Company has many Projects
Each Project has many Tasks
Each Task has a Status, selected from a global list of pre-defined Statuses.

Status   Companies
      \       |
       \      |
        \ Projects
        \    |
          \   |
           Tasks

Let us say that we decide that the primary key of Companies, Projects, Tasks, and Status are all Identity (auto-number) columns, since we would like to auto-generate primary keys for those tables.

So, we have something like this:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Status      StatusID        n/a
Companies   CompanyID       n/a
Projects    ProjectID       Companies(CompanyID)
Task        TaskID        Projects(ProjectID), Status(StatusID)

At this point, things work fine -- we can enforce full referential integrity with our identity primary keys. We can't necessarily eliminate duplicate data, but suppose that is not an issue and/or we have unique constraints in place. Choosing to follow the "every primary key must be an identity rule" did not have any adverse affects and we have a nice, clean data model.

Parents with Multiple Children

Now, allow me to add one little wrinkle. Suppose that the available Statuses for each Task is not defined globally, but at the Company level. That is, each Company has its own list of Statuses to which a Task can be assigned.

     Companies
      /   \
    /     \
Status   Projects
     \       /
    \     /
       Tasks

This means that the Status table now needs a Foreign Key reference to the Companies table (to indicate which Company each Status belongs to), giving us:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Companies   CompanyID       n/a
Status      StatusID        Companies(CompanyID)
Projects    ProjectID       Companies(CompanyID)
Task        TaskID        Projects(ProjectID), Status(StatusID)

Are there any other changes we need to make to this data model? Or is simply adding a CompanyID column to the Status table enough to facilitate this change? Remember, our goal as always is full referential integrity using primary and foreign key constraints wherever possible.

Well, there is an issue:

Nothing in this data model stops us from assigning a Status to a Task that is not defined for that Task's parent company. We have no way of enforcing this right now with our current constraints. Our physical data model is flawed. (This is also described a bit here.)

Identities Are Part of the Solution

This is very easily fixed, but it can only be done by violating the "all tables just need an identity primary key" rule.

First, remember this: just because an identity column is unique does not mean that that column cannot be part of a primary key. So, even though maybe you decide that you really like integer primary keys for your Status table (when probably a simple CHAR(2) code is all you need), if Status codes are child entities of Companies then we can instead create a composite primary key like this:

Table          Primary Key             Foreign Key References
---------     -----------             ----------------------
Status         CompanyID,StatusID    Companies(CompanyID)

So now, with the two column primary key and one of those columns referencing another table, Statuses are truly a child entity of Companies. Even though it seems that having a two-column primary key is not necessary since the StatusID column in itself is unique, you will see that to physically enforce the very simple logical model we are working with, it's required!

Now, just as Statuses are child entities of Companies, so are Projects. And as with the StatusID column, we have a nice unique identity ProjectID column that is unique enough that it could be our primary key -- but it shouldn't be. Once again, being a child entity of a Company, and with the goal of wishing to enforce integrity throughout our entire model, we find that the true primary key of our Project table needs to be a composite of CompanyID/ProjectID:

Table       Primary Key             Foreign Key References
---------   -----------             ----------------------
Projects    CompanyID,ProjectID    Companies(CompanyID)

None of this seems necessary yet -- why are we messing with our perfectly fine keys? -- but soon it will all make sense.

Looking at Tasks, we know we have a foreign key reference to Projects, and we know the primary key of Projects is now (CompanyID, ProjectID). Also, remember that we changed our Status table as well to have a composite primary key including the CompanyID column. Thus, we now know that we need to add a CompanyID column to our Tasks table to enforce these FK constraints:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Task        TaskID        Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)

OK, so we are making our primary keys bigger when it doesn't seem necessary, and now our foreign key constraints are getting more complicated as well since they now involve multiple columns.

Why did we bother doing all this?

Because now we have full, complete referential integrity, that's why!

Examining the Results

With these changes in place, only Status codes belonging to a Task's Company can be assigned to that Task. By propagating the CompanyID column down to our child tables, we have the column necessary to enforce full referential integrity as dictated by our logical model.

So, in the end, it looks like this:

Table       Primary Key            Foreign Key References
---------   -----------            ----------------------
Companies   CompanyID              n/a
Status      CompanyID,StatusID     Companies(CompanyID)
Projects    CompanyID,ProjectID    Companies(CompanyID)
Task        TaskID                 Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)

Just this little tweak in our logical model -- that each Company has their own list of Statuses -- required all of these changes in the physical database schema! What gives?

Well, remember: if the logical model changes, the physical data model should change. If it doesn't, then you are probably not modeling it correctly or accurately to maintain strict data integrity in the first place.

"All-identity-all-the-time" data modelers often claim that only using identities makes things easier and "more flexible". As we can see from this example, it can work in some cases (back when Statuses were global), but not in every case. Sometimes, if your goal is to enforce strict data integrity, you simply cannot create your physical schema that way. If you want to approximately model your data, then all identities can be nice. If you want to accurately model your data, then it simply is not always possible with only single-column identity keys as clearly demonstrated here.

"Identifying" When Using Only Identities Will Not Work

The Companies table originally had just one child: Projects. But once we determined that Status codes are also a child entity of Companies, we now had multiple children for the Companies table: Projects and Statuses.

Now, that in of itself is fine; at this point, there is no reason to change anything. We can have those entities both references the Companies table via CompanyID, and both of those tables can have single column identity primary keys.

However, also in our data model, we see that at some point a child entity of Projects (Tasks) requires that we relate Projects to Statuses. Now, suddenly the fact that Parents and Statuses have a common parent -- CompanyID -- is very important. This is because we cannot allow a Project from one company to be assigned to a Status from another. Once we identify this situation, we realize that the way to handle this is that the primary keys of both the Projects table and Status table must also include the CompanyID column.

With the composite primary keys to those children with the common parent in place, we can now relate them to each other as needed and we will have full data intregrity between those two entities and any of their children. You can see this in our new data model, as it will not allow you to assign a Status to a Task unless it has been defined for the Task's company. We would not be able to enforce this without ensuring that the Task table had a full normalized, non-redundant CompanyID column, and the way to do that was to ensure that the primary key of Projects was CompanyID/ProjectID and that the primary key of Status was CompanyID/Status. The rest follows naturally when assigning the foreign key constraints.

Another Example

Suppose we have Stores, and each Store has Departments, and each Department has Employees. That is a straight-forward, simple data model where no two entities share the same parent, and we can just use single-column identities all the way throughout and be fine.

But what if each Store has multiple Locations? And each Employee works at a Location. We cannot juse use LocationID as a primary key of Locations, otherwise we can assign any random Employee to any random Location, regardless of the Store they belong to.

So, our Location table needs to have a primary key of StoreID/LocationID. Now we also need the StoreID column in the Employee table to relate Employees to Locations. To get this, we need to also include StoreID in the primary key of the Department table. With that in place, we can now properly relate Employees to Locations and we are guaranteed they will be for the same Store.

Once again, once we saw that two entities had the same parent (Locations, Departments), and that somewhere "down the line" those two entities would be related (via Employees), it becomes clear that we need to incorporate the common parent's primary key (StoreID) into the primary key of those two child entities (Locations and Departments) to enable that relation (in the Employees table) to function properly.

Summary

There are 3 main points I am trying to make here today:

The "all-identities-all-the-time" approach can work well in some cases, but not in all. If you insist on only using all identities for all of your primary keys, you're not always going to have strictly accurate data integrity.
Don't forget that Identity columns can be part of a composite primary key.
When a table has multiple child entities in the database, and there are any relations between any of those child entities, then all of those child entities should have a composite primary key that includes the parent table's primary key columns to facilitate those relations.

This is not to say that the "all-identities-all-the-time" approach is always wrong, just that it is not always right. And it certainly is not always "easiest", unless you find continually scrubbing invalid data to be an easy task.

Source Click Here.

Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data

Blog Archive