How I Accidentally Ran 682 Million Index Scans Without Knowing It

Here is exactly what I found, what I did, and what you should know about indexes.
The Background
I work on a data pipeline that processes financial filings from the US Securities and Exchange Commission. Every day the system crawls thousands of documents, extracts data, and runs them through an LLM for classification.
Each document is a record in a PostgreSQL table called filing_records. Every record moves through stages:
Crawled → MetaData → TextExtracted → LLMProcessed
The pipeline queries this table every hour to fetch records at the current stage, process them, and move them to the next. Simple enough. Running fine. Or so I thought.
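The fetch half of that loop is the query the article examines next; the advance half might look something like this (table, column, and enum names are from the article; the specific record_id is a made-up example):

```sql
-- Advance one record to its next stage after processing.
-- The status check in the WHERE clause guards against
-- advancing the same record twice.
UPDATE filing_records
SET status = 'LLMProcessed'::filing_record_status
WHERE record_id = 42
  AND status = 'Text Extracted'::filing_record_status;
```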
The Moment I Looked Under the Hood
I had been learning about EXPLAIN ANALYZE, a PostgreSQL command that does not just run your query: it shows you how Postgres ran it. The execution plan. The strategy it chose. How long each step actually took.
I ran it on one of our most common pipeline queries:
EXPLAIN ANALYZE
SELECT *
FROM filing_records
WHERE status = 'Text Extracted'::filing_record_status;
This is what came back:
Parallel Seq Scan on filing_records
Filter: (status = 'Text Extracted'::filing_record_status)
Rows Removed by Filter: 641,440
Execution Time: 6324.389 ms
6.3 seconds. To return 5,024 rows.
And that number — Rows Removed by Filter: 641,440 — that is per worker. With 3 parallel workers running, Postgres scanned nearly 1.9 million rows and threw almost all of them away just to find 5,024 matches.
Why This Was Happening
The key thing I had to understand: PostgreSQL stores rows in insertion order, not grouped by column value. The Text Extracted rows were scattered randomly across the entire table, mixed in with Crawled rows, Failed rows, LLMProcessed rows.
Without an index, Postgres has no map. It does not know where the Text Extracted rows live. The only option is to start from row 1 and check every single row until the end. That is what Seq Scan means in the plan — Sequential Scan. Brute force. Every row touched.
On 1.9 million rows, that costs 6 seconds.
My First Instinct Was Wrong
My first thought: create an index on status.
But then I thought about what status actually looks like. It has 5 possible values. On a 1.9 million row table, querying status = 'LLMProcessed' might match 1.5 million rows. An index that leads Postgres to 1.5 million scattered disk locations is actually slower than just reading the table top to bottom because random disk reads are expensive, and a sequential pass is one clean sweep.
This is the core rule about indexes: they only win when they are selective. When your filter returns a tiny fraction of the table, index jumps are fast. When your filter returns most of the table, sequential scan wins.
A normal index on status would be ignored by Postgres for most queries. I needed something smarter.
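Before deciding, it is worth measuring the distribution yourself. A simple aggregate shows how selective each status value really is (the table and column names are from the article):

```sql
-- How many rows does each status value match?
-- A value that covers most of the table will rarely
-- benefit from a plain index.
SELECT status, count(*) AS row_count
FROM filing_records
GROUP BY status
ORDER BY row_count DESC;
```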
The Fix: Partial Index
PostgreSQL has a feature called a partial index: you only index the rows that match a specific condition:
CREATE INDEX idx_text_extracted
ON filing_records(status)
WHERE status = 'Text Extracted'::filing_record_status;
This index contains only the 5,024 rows currently in Text Extracted status. It completely ignores the other 1.9 million rows. As the table grows to 10 million rows, mostly completed LLMProcessed records, this index stays small and stays fast. Those finished records are never added to it.
I ran EXPLAIN ANALYZE again with the same query:
Index Scan using idx_text_extracted on filing_records
Execution Time: 860.936 ms
From 6,324ms to 860ms. Seven times faster. Same query. Same data. Same machine.
It Was Still Not Fast Enough, and Here Is Why
860ms for 5,024 rows felt slower than it should. The problem was SELECT *.
Here is what happens physically with an index scan on SELECT *:
- Postgres finds 5,024 row locations in the index, which is fast
- For each location, it jumps to that physical position on disk to read all columns
- Those 5,024 rows are scattered across hundreds of different 8KB pages on disk
- Each jump is a random disk read
The index gave Postgres a map. But the map pointed to 5,024 different locations scattered across the disk. Following all those pointers, called heap fetches, is the hidden cost of SELECT *.
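You can see this I/O directly by asking EXPLAIN for buffer statistics, a standard option alongside ANALYZE:

```sql
-- BUFFERS adds page-level I/O counts to each plan node,
-- making the scattered heap reads behind SELECT * visible
-- as "shared read" counts in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM filing_records
WHERE status = 'Text Extracted'::filing_record_status;
```

Each page read from disk rather than from cache shows up in those counts, so a query touching hundreds of scattered 8KB pages is immediately obvious.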
The fix: only select the columns the pipeline actually needs.
SELECT record_id, accession_no, form_link, companyname, status, llm_retry_count
FROM filing_records
WHERE status = 'Text Extracted'::filing_record_status;
Or go one step further with a covering index: store the needed columns inside the index itself so Postgres never has to touch the main table at all:
CREATE INDEX idx_text_extracted_covering
ON filing_records(status, record_id, accession_no, form_link)
WHERE status = 'Text Extracted'::filing_record_status;
Zero heap fetches. The lookup and the column data live in one place.
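On PostgreSQL 11 and later there is a more direct way to build this: the INCLUDE clause stores extra columns as index payload without making them part of the sort key (same table and columns as above):

```sql
-- Only status is a key column; the INCLUDE columns are
-- carried as payload for index-only scans.
CREATE INDEX idx_text_extracted_covering
ON filing_records(status)
INCLUDE (record_id, accession_no, form_link)
WHERE status = 'Text Extracted'::filing_record_status;
```

One caveat: Postgres can only skip the heap entirely when the visibility map is current, so a recently vacuumed table gets the full benefit.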
The Other Side of the Coin: Write Cost
Before you start adding indexes everywhere, understand what you are paying for them.
Every index adds a write cost to every INSERT and UPDATE on that table:
Without index → INSERT = 1 write to table
With 1 index → INSERT = 1 table write + 1 index write
With 10 indexes → INSERT = 11 writes total
Our pipeline inserts thousands of records daily and constantly updates status as records move through stages. Every status update on an indexed column means removing the old value from the index and inserting the new sorted value. Extra operations every time.
The question before every index: does the read gain outweigh the write tax? A 7x improvement on a query running every few minutes — yes, absolutely. An index on a column nobody queries — dead weight on every insert.
Auditing What Was Already There
After the win with the partial index, I ran a query to check how often each existing index was actually being used:
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'filing_records';
The result surprised me:
filing_records_pkey → 6,980,795 scans
filing_records_unique → 5,614,386 scans
unique_form_link_accession_no → 682,780,741 scans
idx_text_extracted → 2 scans
682 million scans on one composite index. That is the join key between filing_records and another table — running constantly inside the pipeline for years. Without that index, every join would have been a full table scan on both sides, millions of times over.
I had that index the whole time without truly understanding how much work it was quietly doing.
The two scans on idx_text_extracted were from my own EXPLAIN ANALYZE tests. Too early to judge: give any new index a few weeks of real traffic before deciding whether to keep or drop it.
What to Take From This
Run EXPLAIN ANALYZE on your slow queries. Do not guess. Seq Scan means no index is being used and every row is being checked. Index Scan means Postgres found a shortcut. The actual execution time tells you where the time goes.
Low cardinality does not mean no index. It means use a partial index. Index only the slice of data your queries actually touch.
SELECT * has a hidden cost. Every matched row triggers a random disk read to fetch all columns. Select only what you need.
Audit your indexes regularly. Use pg_stat_user_indexes to check idx_scan. Near zero after weeks of real traffic means the index is costing you writes and giving you nothing back.
Every index is a write tax. Create them deliberately. The right index at the right time is transformative. Indexes everywhere is just overhead.
The Commands to Keep
-- See your query execution plan
EXPLAIN ANALYZE
SELECT ... your query ...;
-- See all indexes on a table
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'your_table';
-- See how often each index is actually used
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'your_table';
-- Create a partial index for a low cardinality column
CREATE INDEX idx_name
ON table_name(column_name)
WHERE column_name = 'specific_value';
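One more worth keeping, to pair the scan counts with what each index costs on disk (pg_relation_size and pg_size_pretty are standard Postgres functions):

```sql
-- Find never-used indexes and how much space they occupy
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```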
How I Accidentally Ran 682 Million Index Scans Without Knowing It was originally published in Javarevisited on Medium, where people are continuing the conversation by highlighting and responding to this story.