The relational database PostgreSQL (also known as Postgres) has grown increasingly popular, and enterprises and public-sector organizations across the globe rely on it. With this widespread adoption, databases have become larger than ever. At Crunchy Data, we regularly work with databases north of 20TB, and our existing databases continue to grow. My colleague David Christensen and I have gathered some tips about managing a database with huge tables.
Production databases commonly consist of many tables with varying data, sizes, and schemas. It's common to end up with a single huge and unruly database table, far larger than any other table in your database. This table often stores activity logs or time-stamped events and is necessary for your application or users.
Really large tables can cause challenges for many reasons, but a common one is locks. Regular maintenance on a table often requires locks, but locks on your large table can take down your application or cause a traffic jam and many headaches. I have a few tips for doing basic maintenance, like adding columns or indexes, while avoiding long-running locks.
Adding indexes
Problem: A plain CREATE INDEX blocks writes to the table (inserts, updates, and deletes) for the duration of the build. If you have a massive table, this can take hours.
CREATE INDEX ON customers (last_name)
Solution: Use CREATE INDEX CONCURRENTLY. This splits index creation into two phases: a brief first step registers the index so that it starts tracking changes immediately, and a second step builds out the full index while the application keeps reading and writing. Queries can start using the index once the build finishes. The trade-off is that a concurrent build takes longer overall than a regular one.
CREATE INDEX CONCURRENTLY ON customers (last_name)
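Two caveats are worth keeping in mind with concurrent builds: the statement cannot run inside a transaction block, and a build that fails or is cancelled leaves behind an invalid index that still has to be maintained on every write. The sketch below shows one way to check for and clean up such leftovers. The explicit index name customers_last_name_idx is an assumption for illustration; naming the index yourself just makes it easier to find later.

```sql
-- Give the index an explicit name (hypothetical) so it is easy to find later.
CREATE INDEX CONCURRENTLY customers_last_name_idx ON customers (last_name);

-- List any indexes on the table that were left invalid by a failed build.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'customers'::regclass
  AND NOT i.indisvalid;

-- If the index shows up above, drop it without blocking writes, then retry the build.
DROP INDEX CONCURRENTLY customers_last_name_idx;
```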
Adding a column is a common request during the life of a database, but on a huge table it can be tricky, again because of locking.
Problem: When you add a new column with a default that calls a volatile function, Postgres needs to rewrite the entire table. (Since Postgres 11, a constant or other non-volatile default does not force a rewrite.) For big tables, the rewrite can take several hours.
Solution: Split the operation into multiple steps that together have the same effect as the single statement, while you retain control over when locks are taken.
Add the column:
ALTER TABLE all_my_exes ADD COLUMN location text
Add the default:
ALTER TABLE all_my_exes ALTER COLUMN location SET DEFAULT texas()
Use UPDATE to backfill the default into existing rows:
UPDATE all_my_exes SET location = DEFAULT
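On a very large table, a single UPDATE touches every row in one long transaction. One possible variation is to backfill in batches so that each transaction stays short. The following is a minimal sketch, not a definitive recipe: it assumes the table has a primary key column named id, that texas() never returns NULL (so unprocessed rows can be found with location IS NULL), and that a batch size of 10000 suits your workload. All three are assumptions that are not part of the original example.

```sql
-- Backfill one batch of rows that have not been filled in yet.
-- Assumes a primary key column "id" (hypothetical) and a default
-- that never evaluates to NULL.
UPDATE all_my_exes
SET location = DEFAULT
WHERE id IN (
    SELECT id
    FROM all_my_exes
    WHERE location IS NULL
    ORDER BY id
    LIMIT 10000  -- batch size is an arbitrary starting point
);
```

Run the statement repeatedly, each time in its own transaction, until it reports UPDATE 0.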
Problem: You want to add a check constraint for data validation. But if you use the straightforward approach to adding a constraint, it will lock the table while it validates all of the existing data in the table. Also, if there's an error at any point in the validation, it will roll back.
ALTER TABLE favorite_bands ADD CONSTRAINT name_check CHECK (name = 'Led Zeppelin')
Solution: Tell Postgres about the constraint but don't validate it right away. Validate in a second step. The first step takes only a short lock and ensures that all new or modified rows satisfy the constraint. The second step then checks the existing data in a separate pass, under a weaker lock that does not block normal reads and writes.
Tell Postgres about the constraint but skip validating the existing rows:
ALTER TABLE favorite_bands ADD CONSTRAINT name_check CHECK (name = 'Led Zeppelin') NOT VALID
Then VALIDATE it after it's created:
ALTER TABLE favorite_bands VALIDATE CONSTRAINT name_check
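If you want to confirm which state the constraint is in between the two steps, the system catalog records it. This is a small sketch using the pg_constraint catalog; the table and constraint names are the ones from the example above.

```sql
-- convalidated is false after ADD CONSTRAINT ... NOT VALID
-- and becomes true once VALIDATE CONSTRAINT succeeds.
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'favorite_bands'::regclass
  AND conname = 'name_check';
```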
Hungry for more?
David Christensen and I will be in Pasadena, CA, at SCaLE's Postgres Days, March 9-10. Lots of great folks from the Postgres community will be there too. Join us!