Embedded Objects and Other Index Gotchas
In a recent design review, the customer’s application was in production, but performance had taken a nosedive as data volumes grew. It turned out that the issue was down to how they were indexing the embedded objects in their documents. This article explains why their indexes were causing problems, and how they could be fixed.
Note that I’ve changed details for this use case to obfuscate the customer and application. All customer information shared in a design review is kept confidential.
We looked at the schema, and things looked good. They’d correctly split their claim information across two documents:
- One contained a modest amount of queryable data (20 KB per claim). These documents included the _id of the second document in case the application needed to fetch it (which was relatively rare).
- The second contained the bulky raw data that’s immutable, unindexed, and rarely read.
They had 110K queryable documents in the first collection—claims.
With 2.2 GB of documents (before compression, which only reduces on-disk size) and 4 GB of cache, there shouldn’t have been any performance issues. We looked at some of the queries: a pretty wide set of keys was being filtered on, in different combinations, but none of the queries returned massive numbers of documents.
Some queries were taking tens of seconds.
It made no sense. Even a full collection scan should take well under a second for this configuration. And they’d even added indexes for their common queries. So then, we looked at the indexes…
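If you want to check a collection’s indexes on your own cluster, mongosh can list them and report how much space each one consumes. A minimal sketch, using the claim collection from this article:

// List the index definitions on the collection
db.getCollection('claim').getIndexes();

// Per-index storage size, in bytes
db.getCollection('claim').stats().indexSizes;

// Combined size of all indexes on the collection
db.getCollection('claim').totalIndexSize();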
Fifteen indexes on one collection is on the high side and can slow down your writes, but it was read performance that we were troubleshooting.
But those 15 indexes were consuming 85 GB of space. With only 4 GB of cache available on their M30 Atlas nodes, that was a huge problem!
There wasn’t enough RAM in the system for the indexes to fit in cache. The result was that when MongoDB navigated an index, it would repeatedly hit branches that weren’t yet in memory and then have to fetch them from disk. That’s slow.
Taking a look at one of the indexes…
It was a compound index on six fields, but the first five of those fields were objects, and the sixth was an array of objects. That explains why the indexes were so large.
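For illustration, creating an index of that shape looks something like this. The first three field names appear in the sample documents later in this article; vehicle and adjuster are stand-ins for fields I haven’t shown:

// A compound index whose keys are whole embedded objects: every index
// entry stores a complete copy of each object, so entries are huge
db.getCollection('claim').createIndex({
  "policy_holder": 1,   // object
  "policy_details": 1,  // object
  "claim": 1,           // object
  "vehicle": 1,         // object (illustrative field name)
  "adjuster": 1,        // object (illustrative field name)
  "witnesses": 1        // array of objects: one index entry per element
});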
Avoiding indexes on objects
Even ignoring the size of the index, adding objects to an index can be problematic. Querying on embedded objects doesn’t behave the way that many people expect. For an index on an embedded object to be used, the query needs to match the complete embedded object: every field, with values in the same order as they’re stored in the document. E.g., if I execute this query, then it matches exactly one of the documents in the database:
db.getCollection('claim').findOne(
  {
    "policy_holder": {
      "first_name": "Janelle",
      "last_name": "Nienow",
      "dob": new Date("2024-12-16T23:56:49.643Z"),
      "location": {
        "street": "67628 Warren Road",
        "city": "Padbergstead",
        "state": "Minnesota",
        "zip_code": "44832-7187"
      },
      "contact": {
        "email": "[email protected]"
      }
    }
  }
);

It delivers this result:
{
  "_id": {
    "$oid": "67d801b7ad415ad6165ccd5f"
  },
  "region": 12,
  "policy_holder": {
    "first_name": "Janelle",
    "last_name": "Nienow",
    "dob": {
      "$date": "2024-12-16T23:56:49.643Z"
    },
    "location": {
      "street": "67628 Warren Road",
      "city": "Padbergstead",
      "state": "Minnesota",
      "zip_code": "44832-7187"
    },
    "contact": {
      "email": "[email protected]"
    }
  },
  "policy_details": {
    "policy_number": "POL554359100",
    "type": "Home Insurance",
    "coverage": {
      "liability": 849000000,
      "collision": 512000,
      "comprehensive": 699000
    }
  },
  ...
}

The explain plan confirmed that MongoDB was able to use one of the defined indexes.
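You can check this yourself by asking for the query plan. A sketch; look for an IXSCAN stage (index scan) in the winning plan, rather than a COLLSCAN (full collection scan):

// Explain the query rather than running it
db.getCollection('claim').find(
  { /* the same complete policy_holder object as above */ }
).explain("executionStats");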
If just one field from the embedded object isn’t included in the query, then no documents will match:
db.getCollection('claim').findOne(
  {
    "policy_holder": {
      "first_name": "Janelle",
      "last_name": "Nienow",
      "dob": new Date("2024-12-16T23:56:49.643Z"),
      "location": {
        "street": "67628 Warren Road",
        "city": "Padbergstead",
        "state": "Minnesota",
        // "zip_code": "44832-7187"
      },
      "contact": {
        "email": "[email protected]"
      }
    }
  }
);

This resulted in no matches, though the index was at least still used.
If we instead pick out individual fields from the object to query on, then we get the results we expect:
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    "policy_holder.last_name": "Nienow"
  }
);

{
  "_id": {
    "$oid": "67d801b7ad415ad6165ccd5f"
  },
  "region": 12,
  "policy_holder": {
    "first_name": "Janelle",
    "last_name": "Nienow",
    "dob": {
      "$date": "2024-12-16T23:56:49.643Z"
    },
    "location": {
      "street": "67628 Warren Road",
      "city": "Padbergstead",
      "state": "Minnesota",
      "zip_code": "44832-7187"
    },
    "contact": {
      "email": "[email protected]"
    }
  },
  "policy_details": {
    "policy_number": "POL554359100",
    "type": "Home Insurance",
    "coverage": {
      "liability": 849000000,
      "collision": 512000,
      "comprehensive": 699000
    }
  },
  ...
}

Unfortunately, none of the indexes that included policy_holder could be used, as they indexed the value of the complete embedded object rather than the individual fields within it, and so a full collection scan was performed.
Using compound indexes instead
If we instead add a compound index that leads with the fields from the object we need to filter on, then that index will be used:
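Something like this (a minimal sketch, indexing the individual fields rather than the whole object):

// Index the individual fields within the embedded object
db.getCollection('claim').createIndex({
  "policy_holder.first_name": 1,
  "policy_holder.last_name": 1
});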
As a quick refresher on using compound indexes, that index will be used if we query on just first_name:
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    // "policy_holder.last_name": "Nienow"
  }
);

If our query doesn’t include the first key of the compound index, then the index won’t be used:
db.getCollection('claim').findOne(
  {
    // "policy_holder.first_name": "Janelle",
    "policy_holder.last_name": "Nienow"
  }
);

However, you can still use the index if you artificially include the leading keys in the query (though it would be more efficient if last_name were the first key in the index):
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": { $exists: true },
    "policy_holder.last_name": "Nienow"
  }
);

Incompletely indexed queries
While having indexes for your queries is critical, there is a cost to having too many of them, or to having indexes that include too many fields: writes get slower, and pressure on the cache increases.
Sometimes, it’s enough to have an index that does part of the work, and then rely on a scan of the documents found by the index to check the remaining keys.
For example, the policy holder’s home state isn’t included in our compound index, but we can still query on it:
db.getCollection('claim').findOne(
  {
    "policy_holder.first_name": "Janelle",
    "policy_holder.location.state": "Kentucky"
  }
);

The explain plan shows that the index narrowed down the search from 110,000 documents to 111, which were then scanned to find the three matching documents. If it’s rare for the state to be included in the query, then this can be a good solution.
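Those numbers come from running the query with execution statistics. A sketch, with the counters to look at annotated:

// Use find() rather than findOne() so the plan reports all matches
db.getCollection('claim').find({
  "policy_holder.first_name": "Janelle",
  "policy_holder.location.state": "Kentucky"
}).explain("executionStats");

// In the output:
//   totalKeysExamined: 111  (index entries matching first_name)
//   totalDocsExamined: 111  (documents fetched to check the state)
//   nReturned:         3    (documents matching both predicates)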
Partial indexes
The main challenge in this design review was the size of the indexes, so it’s worth looking at another approach to limiting the size of an index.
Imagine that we need to be able to check on the names and email addresses of witnesses to accidents. We can add an index on the relevant fields:
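For example (a sketch, assuming the witness sub-documents use the same first_name, last_name, and email fields as policy_holder):

// A multikey index: one entry per element of each document's witnesses array
db.getCollection('claim').createIndex({
  "witnesses.first_name": 1,
  "witnesses.last_name": 1,
  "witnesses.email": 1
});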
This index consumes 9.8 MB of cache space and must be updated when any document is added, or when any of these three fields are updated.
Even if a document has null values for the indexed fields, or if the fields aren’t even present in the document, the document will still be included in the index.
If we look deeper into the requirements, we might establish that we only need to query this data for fraudulent claims. That means we’re wasting space in the index on entries for all of the other claims.
We can exploit this requirement by creating a partial index, setting the partial filter expression to { "claim.status": "Fraud" }. Only documents that match that filter will be included in the index.
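Recreating the index with a partial filter expression (same assumed field names as before):

// Only documents whose claim.status is "Fraud" get entries in this index
db.getCollection('claim').createIndex(
  {
    "witnesses.first_name": 1,
    "witnesses.last_name": 1,
    "witnesses.email": 1
  },
  { partialFilterExpression: { "claim.status": "Fraud" } }
);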
That reduces the size of the index to 57 KB (a saving of more than 99%).
Note that queries must include { "claim.status": "Fraud" } for this index to be used:
db.getCollection('claim').findOne(
  {
    "witnesses.email": "[email protected]",
    "claim.status": "Fraud"
  }
);

Conclusion
Indexes are critical to database performance, whether you’re using an RDBMS or MongoDB.
MongoDB allows polymorphic documents, arrays, and embedded objects that aren’t available in a traditional RDBMS. This leads to extra indexing opportunities, but also potential pitfalls.
You should have indexes to optimize all of your frequent queries, but use the wrong type, or too many of them, and things can backfire. We saw that in this case, with indexes taking up too much space and not being as general-purpose as the developer believed.
To compound the problem, the database may perform well in development and in the early days of production. Things go wrong over time, as collections grow and extra indexes are added. As soon as the working set (indexes and documents) no longer fits in the cache, performance quickly declines.
Well-informed use of compound and partial indexes will ensure that MongoDB delivers the performance your application needs, even as your database grows.
Learn more about MongoDB design reviews
Design reviews are a chance for a design expert from MongoDB to advise you on how best to use MongoDB for your application. The reviews are focused on making you successful using MongoDB.
It’s never too early to request a review. By engaging us early (perhaps before you’ve even decided to use MongoDB), you’ll have the best opportunity to act on our advice.
This article explained how a MongoDB schema and a set of indexes that match how your application works with data can meet your performance requirements. If you want help coming up with that schema, a design review is how to get it.
Would your application benefit from a review? Schedule your design review today.
Want to read more from Andrew? Head to his website.