When do I use a surrogate key in PostgreSQL?
I have been working on a site where the question of “When do I use a surrogate key?” has come up in conversation quite often. I wrote this out because the answer is bigger than a dinner table conversation. Enjoy.
What is a surrogate key?
A surrogate key represents a row in the database table. The surrogate is internally generated by the system and is invisible to the user or application.
Commonly to PostgreSQL, a surrogate key is an integer, bigint or numeric type column that is populated using a sequence object.
How is a row identified otherwise?
Businesses define the composition of elements that are uniquely identifiable as a customer, policy, product and other business entities. This combination of elements is known as a “natural” or “composite” key.
If we can identify data by business rules, why would we ever use a surrogate key?
- Speed.
- Integer indexes are dramatically faster than the character based indexes that typically make up a composite key. (No matter what Josh says. See below.)
- Integer comparisons can speed up many processes including replication, de-duplication, sorting and other general maintenance operations.
- Using composite keys for foreign keys is a complete PITA and makes for ugly SQL, coding mistakes and poor performance.
- Foreign keys based on integers generally JOIN faster than characters.
- To prevent a service outage or degradation.
- Composite indexes generally become fragmented over time because new values are inserted out of order.
- Eventually REINDEX must be issued to tidy up the index on disk, or PostgreSQL will quit using the index.
- When the REINDEX is issued, the table will still be available, but operations requiring the index will perform terribly or wait for the REINDEX to finish.
- If the composite index is also the primary key, the table will effectively be fully locked for the duration of the REINDEX.
Consider the alternative:
An ascending integer sequence always adds new rows to the end of the index, therefore it will never fragment, and never need to be rebuilt.
You may have both the surrogate key and the composite index at the same time. They are not mutually exclusive. You probably should use both kinds, but you should *always* have a business identifiable key.
If the surrogate key is also the primary key, the composite key can be rebuilt with no service delay.
CREATE UNIQUE INDEX CONCURRENTLY replacement_composite_key ON …
BEGIN
DROP INDEX original_composite_key …
ALTER INDEX replacement_composite_key RENAME TO original_composite_key;
END
In this example, the composite key is refreshed while no service outage occurs at all. - To support external applications and tools.
- Coding frameworks have conflicting requirements.
Some require an integer key, an IDENTITY/AUTOINCREMENT/SERIAL field, a primary key or some combination of those elements. The most compatible denominator is a numeric primary key using a sequence object.
QCubed, CodeIgniter, Symphony, Yii, rails, Cake, etc… - If you want to get really nutty about it, the name should always be “id”.
Drupal < 3, QCodo < 1, Java Hibernate, etc... - pgAdminIII requires a primary key to edit data in a table view.
- Dojo, VB, REALbasic, and other coding languages have support for showing elements in a list or combo box, with a hidden integer array to numerically identify the contents. (ItemData, itemindex, etc.)
- Coding frameworks have conflicting requirements.
- To make ETL easier.
- Comparisons for the unique presence of data are far easier with a single comparison field.
- They are also much faster with an integer than a character.
- To prevent collisions.
- Keys that are used to perform database maintenance should not be values that are based on user inputs.
- Users make mistakes.
- Multiple users can create the same data at the same time.
- Users can create data that would make a sailor blush, and you can’t get rid of it.
- Surrogate keys never have any of those issues, and provide a support mechanism for fixing the composite key when it happens there.
- Things that can’t possibly change, change anyway.
- Social Security numbers
- Driver’s license
- Names
- Genders
- Anything else your mama told you is permanent.
- Cascading updates
- on composite foreign keys are costly and usually not set up right anyway.
- You think you don’t care now. But you will. Oh yes, you will.
So why do I need natural keys?
- Because your data still needs to have meaning, and the natural key means something to the business.
- Because Josh Berkus may make you the butt of a presentation or two or three.
So what do you really want me to do?
CREATE SEQUENCE my_new_table_surrogate;
CREATE TABLE my_new_table (
id numeric(1000,0) primary key
default nextval('my_new_table_surrogate'),
...
)
ALTER TABLE my_new_table_surrogate
OWNER TO my_new_table;
CREATE UNIQUE INDEX my_business_identifier
ON my_new_table(column1, column2, column3);
Make “id” the first column, make it lowercase, populate it with a sequence, make it the primary key.
Then do what makes sense for the business in addition to that. Because, let’s face it, Josh is right, and you haven’t defined your business data very well.