(In)efficient Insertions in Postgres
PostgreSQL is a popular relational database used for a wide variety of applications. With the recent surge in popularity of Generative AI (Large Language Models in particular), PostgreSQL is often used to store document “embeddings” (vector representations of documents) to enhance search functionality. Beyond the ability to “search” for similar documents in a query, it is chosen primarily for its reliability, speed, and community support (I know, some folks might say relational databases are slow, but there are plenty of ways to tune the database for all kinds of use cases and access patterns).
Writing to PostgreSQL
The SQL in PostgreSQL is already a spoiler for which language is used to store and retrieve the data. One of the most common ways, which is also part of the SQL standard, is to use the INSERT command:
INSERT INTO customer (id, name, age)
VALUES (123, 'john doe', 29);
In the example above, we insert (store) a single row containing 3 columns. This classic usage of INSERT is quite common when you need to store a few rows (e.g. when you need to insert new information in real time). But does INSERT still work well with larger amounts of data? What if we need to write hundreds of rows, or more?
One way is to loop over your data records and insert them one by one:
import psycopg

customers = [(123, "john doe", 29), (125, "jane doe", 31)]

# Connect to an existing database
with psycopg.connect("dbname=marketplace user=mytestusername") as conn:
    # Open a cursor to perform database operations
    with conn.cursor() as cur:
        # Pass data to fill the query placeholders and let Psycopg perform
        # the correct conversion (no SQL injections!)
        for id, name, age in customers:
            cur.execute(
                "INSERT INTO customer (id, name, age) VALUES (%s, %s, %s)",
                (id, name, age),
            )
    # Make the changes to the database persistent
    conn.commit()
This works fine for a handful of queries. Once you get into thousands of records or more, the code above performs progressively worse, because every row costs a separate statement execution and round trip to the server. It is not scalable.
Another option is to use cur.executemany(), where the same query is executed for a list of tuples instead of a single tuple. It is better than serially executing INSERT statements because the rows are sent to the server in batches rather than one round trip at a time. However, executemany() will still be slow for very large amounts of data (>= 1 million rows), because each row is still processed as an individual INSERT; it is not optimized for bulk loading.
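With the same customers list as above, the whole loop collapses into a single call (a minimal sketch reusing the same hypothetical table and connection string):

import psycopg

customers = [(123, "john doe", 29), (125, "jane doe", 31)]

with psycopg.connect("dbname=marketplace user=mytestusername") as conn:
    with conn.cursor() as cur:
        # One call sends every parameter tuple for the same INSERT statement
        cur.executemany(
            "INSERT INTO customer (id, name, age) VALUES (%s, %s, %s)",
            customers,
        )
    # Make the changes to the database persistent
    conn.commit()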
The best option for “batch” loading (loading a large amount of data at once) is to use COPY. COPY runs as a single statement within one transaction and treats the entire input as a stream. It is specifically optimized for batch loading.
COPY customer FROM 'customer_data.csv' WITH (FORMAT csv);
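Note that COPY ... FROM 'file' reads the file on the database server itself, which requires the appropriate server-side privileges. From a client application you would normally stream the data through COPY ... FROM STDIN instead, which psycopg exposes via cur.copy(). A minimal sketch, reusing the same hypothetical customers data:

import psycopg

customers = [(123, "john doe", 29), (125, "jane doe", 31)]

with psycopg.connect("dbname=marketplace user=mytestusername") as conn:
    with conn.cursor() as cur:
        # Stream all rows to the server through a single COPY operation
        with cur.copy("COPY customer (id, name, age) FROM STDIN") as copy:
            for record in customers:
                copy.write_row(record)
    conn.commit()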
Benchmarking
I created a repository with benchmarking code that shows the performance difference between the three methods (i.e. execute, executemany, and copy). Feel free to fork it and reproduce the results locally. The numbers might differ slightly on your machine, but the speed-up ratios between the approaches should stay about the same.
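Each benchmarked function is timed end to end; a small decorator along these lines could produce output in the format shown below (just a sketch of the idea, not necessarily what the repository uses):

import functools
import time

def timed(func):
    # Hypothetical helper: print the wall-clock time a benchmark function takes
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        secs = time.perf_counter() - start
        print(f"Function '{func.__name__}' executed in {secs:.2f} secs / {secs / 60:.2f} mins")
        return result
    return wrapper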
Here are the results from a single run:
Function 'very_slow_insert' [execute one by one] executed in 4309.07 secs / 71.82 mins
Function 'slow_insert' [executemany] executed in 168.56 secs / 2.81 mins
Function 'fast_copy' [copy] executed in 61.26 secs / 1.02 mins
The COPY command was about 70x faster than running INSERT statements one by one in a loop and about 2.8x faster than the batched executemany() approach.
Conclusion
It is clear that COPY is the best option for loading large amounts of data into Postgres. The batched executemany() approach works too, but still lags behind COPY. Individual INSERT commands are the slowest, and I do not recommend them unless you are only inserting a handful of rows and do not generally have a large load.