
I want to write a function that takes a list of dictionaries and returns, for each dictionary, a Counter of only those items whose key appears in at least min_df of the dictionaries.

example:

prune([{'a': 1, 'b': 10}, {'a': 1}, {'c': 1}], min_df=2)
[Counter({'a': 1}), Counter({'a': 1})]
prune([{'a': 1, 'b': 10}, {'a': 2}, {'c': 1}], min_df=2)
[Counter({'a': 1}), Counter({'a': 2})]

Since 'a' occurs in two of the dictionaries, it is listed in the output, once for each dictionary that contains it.

My approach:

from collections import Counter
def prune(dicto,df=2):
   new = Counter()
   for d in dicto:
       new += Counter(d.keys())
   x = {}
   for key,value in new.items():
       if value >= df:
           x[key] = value
   print Counter(x)

Output:

Counter({'a': 2})

This gives the output as a single combined Counter. The term 'a' appears in 2 dictionaries overall, so it satisfies the df condition and is listed. How can I change this to get the desired output, with one Counter per dictionary?

  • In your expected output you have two counters. What does each counter signify? Why is having just the one counter not useful? Commented Apr 14, 2015 at 22:21
  • @MartijnPieters: I think OP wants to list the key value pairs that appear in every dictionary, such that each printed key appears in at least df many dictionaries Commented Apr 14, 2015 at 22:36
  • These two dictionaries are like two different documents, each holding the word counts for that specific document. Commented Apr 14, 2015 at 22:36
  • @inspectorG4dget: I'd like the OP to make that explicit, rather than have us guess. Commented Apr 14, 2015 at 22:37
  • @MartijnPieters: normally, I'd agree with you (and would have directed my clarification at OP), but I have a feeling that OP's first language is not English and thought this would help Commented Apr 14, 2015 at 22:38

3 Answers


I would suggest:

from collections import Counter
def prune(dicto, min_df=2):
    # Create all counters
    counters = [Counter(d.keys()) for d in dicto]

    # Sum all counters
    total = sum(counters, Counter()) 

    # Create set with keys of high frequency
    keys = set(k for k, v in total.items() if v >= min_df)

    # Reconstruct counters using high frequency keys
    counters = (Counter({k: v for k, v in d.items() if k in keys}) for d in dicto)

    # With filter(None, ...) we take only the non empty counters.
    return filter(None, counters)

Result:

>>> prune([{'a': 1, 'b': 10}, {'a': 1}, {'c': 1}], min_df=2)
[Counter({'a': 1}), Counter({'a': 1})]
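
On Python 3, filter returns a lazy iterator rather than a list, so the call above would not display a list as shown. A minimal sketch of the same approach for Python 3, assuming a list result is wanted (the name prune_py3 is only illustrative and not part of the answer):

```python
from collections import Counter

def prune_py3(dicts, min_df=2):
    # Document frequency: in how many dictionaries does each key appear?
    doc_freq = Counter()
    for d in dicts:
        doc_freq.update(d.keys())

    # Keys that occur in at least min_df dictionaries
    keep = {k for k, n in doc_freq.items() if n >= min_df}

    # Rebuild each dictionary as a Counter restricted to the kept keys,
    # preserving the original values
    pruned = (Counter({k: v for k, v in d.items() if k in keep}) for d in dicts)

    # Drop empty counters and materialize the result as a list
    return [c for c in pruned if c]

print(prune_py3([{'a': 1, 'b': 10}, {'a': 2}, {'c': 1}], min_df=2))
# [Counter({'a': 1}), Counter({'a': 2})]
```

The list comprehension at the end plays the role of filter(None, ...) while returning a concrete list on both Python 2.7 and Python 3.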

1 Comment

You should make counters a generator expression if you are going to filter it.

Chain the keys of all the dicts, count them, and keep the keys from each dict that satisfy the condition.

from collections import Counter
from itertools import chain

def prune(l, min_df=0):
    # count how many times every key appears
    count = Counter(chain.from_iterable(l))
    # create Counter dicts using keys that appear at least min_df times
    return filter(None,(Counter(k for k in d if count.get(k) >= min_df) for d in l))

In [14]: prune([{'a': 1, 'b': 10}, {'a': 1}, {'c': 1}], min_df=2)
Out[14]: [Counter({'a': 1}), Counter({'a': 1})]
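
One caveat: Counter(k for k in d ...) counts each surviving key once, so every value in the result is 1 regardless of the value in the input. That matches the first example in the question but not the second, where {'a': 2} is expected to stay {'a': 2}. A hedged sketch of a variant that keeps the original values (the name prune_keep_values is hypothetical):

```python
from collections import Counter
from itertools import chain

def prune_keep_values(dicts, min_df=0):
    # Iterating a dict yields its keys, so this counts, for every key,
    # the number of dictionaries it appears in
    count = Counter(chain.from_iterable(dicts))
    result = []
    for d in dicts:
        # Keep the original value of every sufficiently frequent key
        kept = Counter({k: v for k, v in d.items() if count[k] >= min_df})
        if kept:  # skip dictionaries that end up empty
            result.append(kept)
    return result

print(prune_keep_values([{'a': 1, 'b': 10}, {'a': 2}, {'c': 1}], min_df=2))
# [Counter({'a': 1}), Counter({'a': 2})]
```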

You can avoid the filter but I am not sure it will be any more efficient:

def prune(l, min_df=0):
    count = Counter(chain.from_iterable(l))
    res = []
    for d in l:
        cn = Counter(k for k in d if count.get(k) >= min_df)
        if cn:
            res.append(cn)
    return res

In a quick timing, the loop version (named chain_prune_loop here) is at least on a par:

In [31]: d = [{'a': 1, 'b': 10}, {'a': 1}, {'c': 1}]    
In [32]: d = [choice(d) for _ in range(1000)]   
In [33]: timeit chain_prune_loop(d, min_df=2)
100 loops, best of 3: 5.49 ms per loop    
In [34]: timeit prune(d, min_df=2)
100 loops, best of 3: 11.5 ms per loop
In [35]: timeit set_prune(d, min_df=2)
100 loops, best of 3: 13.5 ms per loop

1 Comment

@Shashank, yes, I was originally doing something differently, forgot to remove the generator expression

This will print out all the values of each key that appears in at least df dictionaries.

def prune(dicts, df):
    counts = {}
    for d in dicts:  # for each dictionary
        for k,v in d.items():  # for each key,value pair in the dictionary
            if k not in counts:  # if we haven't seen this key before
                counts[k] = []
            counts[k].append(v)  # append this value to this key

    for k,vals in counts.items():
        if len(vals) < df:
            continue  # take only the keys that have at least `df` values (that appear in at least `df` dictionaries)
        for val in vals:
            print(k, ":", val)
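
If the values are needed by the caller rather than printed, the same grouping idea can return a dictionary instead. This is only a sketch under that assumption; collect_values is a made-up name, and defaultdict replaces the explicit "if k not in counts" check:

```python
from collections import defaultdict

def collect_values(dicts, df):
    # Map each key to the list of values it has across all dictionaries
    values_by_key = defaultdict(list)
    for d in dicts:
        for k, v in d.items():
            values_by_key[k].append(v)

    # Keep only the keys that appear in at least df dictionaries
    return {k: vals for k, vals in values_by_key.items() if len(vals) >= df}

print(collect_values([{'a': 1, 'b': 10}, {'a': 2}, {'c': 1}], 2))
# {'a': [1, 2]}
```

Note that this groups values by key, so it no longer records which dictionary each value came from.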
