I encountered the following scenario: I'm running an aggregation on the MasterCollection collection, "joining" it with 9 other collections.
In the end, I merge everything back into the same MasterCollection. The aggregation takes 30 minutes to execute, which is not acceptable. We have a single MongoDB instance (version 7) with 16 GB of RAM, running in a Docker container.
The MasterCollection has 1015787 documents, with an average document size of 1.8 kB. Stats for the other collections (collection name, number of documents, average document size):
- collection1: 1016878 documents, 40 B
- collection2: 0 documents, 0 B
- collection3: 232 documents, 94 B
- collection4: 10289 documents, 97 B
- collection5: 10289 documents, 97 B
- collection6: 1747 documents, 102 B
- collection7: 1326 documents, 103 B
- collection8: 1016878 documents, 42 B
- collection9: 1016878 documents, 58 B
Compound indexes are created for the fields that are used in the lookups.
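For illustration, indexes of that shape on the _id sub-fields used by the lookups would look roughly like this (a sketch using the anonymized field names from the pipeline below; the real definitions may differ):

```javascript
// Sketch of compound indexes on the _id sub-fields referenced by the lookups
// (field and collection names mirror the anonymized ones used below)
db.collection2.createIndex({ '_id.field1': 1, '_id.field2': 1 });
db.collection6.createIndex({ '_id.field1': 1, '_id.field3': 1 });
```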
My aggregation looks like this:
MasterCollection.aggregate([
  {
    $project: {
      _id: 1,
      field1: 1,
      field2: 1,
      field3: 1,
    },
  },
  {
    $lookup: {
      from: 'collection1',
      localField: '_id',
      foreignField: '_id',
      as: 'collection1',
    },
  },
  {
    $lookup: {
      from: 'collection8',
      localField: '_id',
      foreignField: '_id',
      as: 'collection8',
    },
  },
  {
    $lookup: {
      from: 'collection9',
      localField: '_id',
      foreignField: '_id',
      as: 'collection9',
    },
  },
  {
    $lookup: {
      from: 'collection2',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection2: 1,
          },
        },
      ],
      as: 'collection2',
    },
  },
  {
    $lookup: {
      from: 'collection3',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection3: 1,
          },
        },
      ],
      as: 'collection3',
    },
  },
  {
    $lookup: {
      from: 'collection4',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection4: 1,
          },
        },
      ],
      as: 'collection4',
    },
  },
  {
    $lookup: {
      from: 'collection5',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection5: 1,
          },
        },
      ],
      as: 'collection5',
    },
  },
  {
    $lookup: {
      from: 'collection6',
      let: {
        field1Id: '$field1',
        field3Id: '$field3',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field3', '$$field3Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection6: 1,
          },
        },
      ],
      as: 'collection6',
    },
  },
  {
    $lookup: {
      from: 'collection7',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection7: 1,
          },
        },
      ],
      as: 'collection7',
    },
  },
  {
    $unwind: // one $unwind stage per lookup result array
  },
  {
    $project: {
      _id: 1,
      // projected fields from each joined collection
    },
  },
  {
    $merge: {
      into: 'MasterCollection',
      on: '_id',
      whenMatched: 'merge',
      whenNotMatched: 'discard',
    },
  },
], { allowDiskUse: true })
Do you have any suggestions on how to improve this aggregation?
I have already tried playing with the indexes, using $facet, and splitting the whole aggregation into chunks with $skip and $limit, but none of it improved the aggregation's performance.
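For reference, the chunked attempt was along these lines (a rough sketch; the chunk size and loop bounds are illustrative, not the actual values used):

```javascript
// Hypothetical chunking attempt: run the same pipeline over slices of MasterCollection.
// The chunk size is illustrative; the elided stages are the ones shown above.
const chunkSize = 100000;
for (let skip = 0; skip < 1015787; skip += chunkSize) {
  MasterCollection.aggregate(
    [
      { $skip: skip },
      { $limit: chunkSize },
      // ...same $project, $lookup, $unwind, and $merge stages as above...
    ],
    { allowDiskUse: true },
  );
}
```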
Since the _id has two fields, you can try combining them into a single field to do the match: `let: { collId: { field1: "$field1", field2: "$field2" } }` and then `$match: { _id: "$$collId" }` - this may improve the performance by using the index. Otherwise, I think it's a partial match with `field1` and then a scan for `field2`.
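A minimal sketch of one $lookup rewritten along those lines (the $expr form is used so the $$collId variable can be referenced inside the sub-pipeline's $match; field and collection names follow the anonymized ones from the question):

```javascript
{
  $lookup: {
    from: 'collection2',
    let: {
      // build the whole compound key once, in the same field order as the stored _id
      collId: { field1: '$field1', field2: '$field2' },
    },
    pipeline: [
      // compare the complete _id value against the compound key;
      // $expr is needed to reference the $$collId variable here
      { $match: { $expr: { $eq: ['$_id', '$$collId'] } } },
      { $project: { _id: 0, fieldFromCollection2: 1 } },
    ],
    as: 'collection2',
  },
},
```

Note that equality on an embedded document is field-order sensitive, so collId has to list field1 and field2 in exactly the order they appear in the stored _id values.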