I’m running an aggregation on MasterCollection that “joins” it with 9 other collections using $lookup.

At the end, I merge everything back into the same MasterCollection. The aggregation takes 30 minutes to run, which is not acceptable. We have a single MongoDB instance (version 7) with 16GB RAM, running in a Docker container.

The MasterCollection has 1015787 documents with an average document size of 1.8kB. Stats for the joined collections (collection name, number of documents, average document size):

  • collection1 1016878 40B
  • collection2 0 0B
  • collection3 232 94B
  • collection4 10289 97B
  • collection5 10289 97B
  • collection6 1747 102B
  • collection7 1326 103B
  • collection8 1016878 42B
  • collection9 1016878 58B

Compound indexes are created for the fields that are used in the lookups.

My aggregation looks like this:

MasterCollection.aggregate([
  {
    $project: {
      _id: 1,
      field1: 1,
      field2: 1,
      field3: 1,
    },
  },
  {
    $lookup: {
      from: 'collection1',
      localField: '_id',
      foreignField: '_id',
      as: 'collection1',
    },
  },
  {
    $lookup: {
      from: 'collection8',
      localField: '_id',
      foreignField: '_id',
      as: 'collection8',
    },
  },
  {
    $lookup: {
      from: 'collection9',
      localField: '_id',
      foreignField: '_id',
      as: 'collection9',
    },
  },
  {
    $lookup: {
      from: 'collection2',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection2: 1,
          },
        },
      ],
      as: 'collection2',
    },
  },
  {
    $lookup: {
      from: 'collection3',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection3: 1,
          },
        },
      ],
      as: 'collection3',
    },
  },
  {
    $lookup: {
      from: 'collection4',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection4: 1,
          },
        },
      ],
      as: 'collection4',
    },
  },
  {
    $lookup: {
      from: 'collection5',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection5: 1,
          },
        },
      ],
      as: 'collection5',
    },
  },
  {
    $lookup: {
      from: 'collection6',
      let: {
        field1Id: '$field1',
        field3Id: '$field3',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field3', '$$field3Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection6: 1,
          },
        },
      ],
      as: 'collection6',
    },
  },
  {
    $lookup: {
      from: 'collection7',
      let: {
        field1Id: '$field1',
        field2Id: '$field2',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$_id.field1', '$$field1Id'] },
                { $eq: ['$_id.field2', '$$field2Id'] },
              ],
            },
          },
        },
        {
          $project: {
            _id: 0,
            fieldFromCollection7: 1,
          },
        },
      ],
      as: 'collection7',
    },
  },
  // One $unwind stage for each joined array goes here (omitted in the question).
  {
    $project: {
      _id: 1,
      // project from each collection
    },
  },
  {
    $merge: {
      into: 'MasterCollection',
      on: '_id',
      whenMatched: 'merge',
      whenNotMatched: 'discard',
    },
  },
], { allowDiskUse: true })
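
The six pipeline-style lookups above differ only in the collection name, the key fields, and the projected field. As a minimal sketch, a small helper can generate them so the pipeline is easier to read and to change in one place. The helper name `compoundIdLookup` is hypothetical and not part of the original code; it only builds the stage object and does not change how the server executes it.

```javascript
// Hypothetical helper (not from the original post): builds one pipeline-style
// $lookup stage that matches a compound _id on the joined collection.
function compoundIdLookup(from, keyFields, projectedField) {
  // Bind each local field to a variable, e.g. field1 -> field1Id.
  const letVars = {};
  for (const key of keyFields) {
    letVars[key + 'Id'] = '$' + key;
  }
  return {
    $lookup: {
      from,
      let: letVars,
      pipeline: [
        {
          $match: {
            $expr: {
              // One equality per key field, e.g. $_id.field1 == $$field1Id.
              $and: keyFields.map((key) => ({
                $eq: ['$_id.' + key, '$$' + key + 'Id'],
              })),
            },
          },
        },
        { $project: { _id: 0, [projectedField]: 1 } },
      ],
      as: from,
    },
  };
}

// Usage: rebuilds the collection6 stage from the pipeline above.
const collection6Stage = compoundIdLookup(
  'collection6',
  ['field1', 'field3'],
  'fieldFromCollection6'
);
```

The returned objects can be spread into the array passed to `MasterCollection.aggregate`, between the `_id` lookups and the `$unwind` stages.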

Do you have any suggestions on how to improve this aggregation?

I have already tried changing the indexes, using $facet, and splitting the whole aggregation into chunks with $skip and $limit. None of these improved the aggregation time.
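
For reference, the chunking attempt with $skip and $limit can be sketched roughly as below. The helper names, the chunk size of 100000, and the `$sort` on `_id` (added so that chunk boundaries stay stable between runs) are assumptions for illustration, not details from the original code. One caveat that may explain the lack of improvement: `$skip` still has to walk past every skipped document, so later chunks repeat work done by earlier ones.

```javascript
// Hypothetical helper: stages that select one chunk of MasterCollection.
// These would be placed in front of the $project/$lookup stages.
function chunkPrefix(chunkIndex, chunkSize) {
  return [
    { $sort: { _id: 1 } }, // assumed, for deterministic chunk boundaries
    { $skip: chunkIndex * chunkSize },
    { $limit: chunkSize },
  ];
}

// Number of chunks needed to cover all documents.
function chunkCount(totalDocs, chunkSize) {
  return Math.ceil(totalDocs / chunkSize);
}

const CHUNK_SIZE = 100000; // assumed value, tune for the workload
const TOTAL_DOCS = 1015787; // MasterCollection size quoted above

const prefixes = [];
for (let i = 0; i < chunkCount(TOTAL_DOCS, CHUNK_SIZE); i++) {
  prefixes.push(chunkPrefix(i, CHUNK_SIZE));
}
```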

  • Refactor your schema to merge all the collections into one. Lookup/Join is costly in MongoDB. You should reconsider why the data is scattered across different collections in the first place. Commented Jul 10, 2024 at 14:22
  • For the collections which have the _id as two fields, you can try combining them into a single field to do the match: let: { collId: { field1: "$field1", field2: "$field2"} } and then $match: { _id: "$$collId" } - may improve the performance by using the index. Otherwise, I think it's partial with field1 and then a scan for field2. Commented Jul 10, 2024 at 15:44
  • Review your database design! MongoDB is not a relational database, and some NoSQL databases do not even support joins. Typically a MongoDB application has far fewer collections than the equivalent application in a relational RDBMS has tables. Commented Jul 10, 2024 at 18:22
  • @ray Thanks for the suggestion, I managed to get rid of collection9 and include that field in the MasterCollection. However, the data in the other collections represents different counts. Initially we wrote these counts back to the MasterCollection, but the performance was much worse. Commented Jul 11, 2024 at 7:04
  • @aneroid Thanks for the suggestion, I'll look into this. Commented Jul 11, 2024 at 7:05
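
The single-field match suggested in the comments can be sketched as follows. Two points are worth noting. First, a `let` variable can only be read inside `$expr`, so the shorthand `$match: { _id: "$$collId" }` needs to be written as `$expr: { $eq: [...] }`. Second, equality on embedded documents is sensitive to field order, so this sketch assumes `_id` in the joined collection is stored as `{ field1, field2 }` in exactly that order. The helper name `wholeIdLookup` is hypothetical, and whether the `_id` index is actually used should be checked with `explain`.

```javascript
// Hypothetical helper: $lookup stage that compares the whole compound _id
// in a single $eq instead of one $eq per sub-field.
function wholeIdLookup(from, projectedField) {
  return {
    $lookup: {
      from,
      // Assumption: the joined _id is { field1, field2 } in this exact order.
      let: { collId: { field1: '$field1', field2: '$field2' } },
      pipeline: [
        // Variables from `let` are only visible inside $expr.
        { $match: { $expr: { $eq: ['$_id', '$$collId'] } } },
        { $project: { _id: 0, [projectedField]: 1 } },
      ],
      as: from,
    },
  };
}

// Usage: a replacement candidate for the collection4 stage.
const collection4Stage = wholeIdLookup('collection4', 'fieldFromCollection4');
```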
