
I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested, while the second is flat. I have tried several approaches, but none of them worked. Can anyone with PySpark experience help?

First RDD is:

[(('brand', 1), ('queen', 1), ('elizabeth', 1), ...),
 (('50', 1), ('worst', 1), ('habit', 2), ...),
 (('cost', 1), ('trump', 1), ('aid', 1), ..., ('hole', 1))]

Second RDD is:

[('brand', 1), ('queen', 3), ('elizabeth', 2), ...]
  • You can unpack the first RDD and then do the join. Commented Dec 1 at 5:32
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a minimal reproducible example. Commented Dec 1 at 6:18
  • 2026 is about to start. Who is still using raw RDDs? Please use DataFrames. Commented Dec 1 at 14:51
  • @Steven It appears to be part of an assignment. Commented Dec 1 at 20:56
  • @DerekO Oops, sorry, I did not realise that SO was the right website to get all the answers to your assignments without doing anything yourself. Commented Dec 3 at 8:58

1 Answer


First, flatten the nested RDD with flatMap, then join the result with the second RDD. I also merged duplicate words with reduceByKey by summing their counts; you can skip that step if you don't need it.

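joined = (
    rdd1
        # Flatten: each document (a tuple of (word, count) pairs) becomes individual pairs.
        .flatMap(lambda group: group)
        # Merge duplicate words across documents by summing their counts (optional).
        .reduceByKey(lambda a, b: a + b)
        # Inner join on the word: yields (word, (summed_count, rdd2_count)).
        .join(rdd2)
)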

1 Comment

Your comments are valuable, but my problem is a little different. I need to keep the nested structure, because each nested tuple represents one document, and I want to see how many words of that document overlap with the overall set of words. I tried something like rdd1.join(rdd2.map(lambda x: x[0:-1])), but it gives me an empty list.
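
One possible way to get a per-document overlap count while still flattening for the join is to tag every document with an id first, so the words can be grouped back by document afterwards. The sketch below is only an illustration under some assumptions: the helper names explode_document and overlap_per_document are hypothetical, each word is assumed to appear at most once per document and at most once in the second RDD, and documents with no overlapping words will simply be absent from the result.

```python
def explode_document(indexed_doc):
    """Turn (doc, doc_id) into [(word, (doc_id, count)), ...].

    The word becomes the key, so the pairs can be joined against a flat
    (word, count) RDD while still remembering the source document.
    """
    doc, doc_id = indexed_doc
    return [(word, (doc_id, count)) for word, count in doc]


def overlap_per_document(rdd1, rdd2):
    """Return an RDD of (doc_id, number_of_words_also_present_in_rdd2).

    rdd1: nested RDD, one tuple of (word, count) pairs per document.
    rdd2: flat RDD of (word, count) pairs.
    """
    # zipWithIndex pairs each document with its position: (doc, doc_id).
    by_word = rdd1.zipWithIndex().flatMap(explode_document)

    # The inner join keeps only words found in both RDDs:
    # (word, ((doc_id, doc_count), rdd2_count)).
    joined = by_word.join(rdd2)

    # Re-key by doc_id and count one per matching word.
    return (
        joined
        .map(lambda kv: (kv[1][0][0], 1))
        .reduceByKey(lambda a, b: a + b)
    )
```

Assuming rdd1 and rdd2 are the two RDDs from the question, overlap_per_document(rdd1, rdd2).collect() would give one (doc_id, overlap_count) pair per document that shares at least one word with rdd2, where doc_id is the document's position in rdd1. If you need the matching words themselves rather than a count, you could map to (doc_id, word) and use groupByKey instead of the final map and reduceByKey.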
