Bradley February 2016

Dict Deconstruction and Reconstruction in Python

I have multiple dictionaries. There is a great deal of overlap between the dictionaries, but they are not identical.

a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}

I'm trying to figure out how to break these down into their most primitive pieces and then reconstruct the dictionaries in the most efficient manner. In other words, how can I deconstruct and rebuild the dictionaries while typing each key/value pair the minimum number of times (ideally once)? That also means finding the minimum number of building-block sets that can be combined to produce every dictionary that exists.

In the above example, it could be broken down into:

c = {'a': 1, 'c': 3}
a = dict(c, **{'b': 2})
b = dict(c, **{'d': 4})
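
On Python 3.5 and later, the same merge can be written even more concisely with dict unpacking:

a = {**c, 'b': 2}
b = {**c, 'd': 4}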

I'm looking for suggestions on how to approach this in Python.

In reality, I have roughly 60 dictionaries, and many of them have overlapping values. I'm trying to minimize the number of times I have to type each k/v pair, both to reduce the chance of typos and to make it easier to cascade an updated value for a specific key through every dictionary that uses it.

An ideal output would be the most basic dictionaries needed to construct all dictionaries as well as the formula for reconstruction.

Answers


en_Knight February 2016

Here is a solution. It isn't the most efficient by any means, but it might give you an idea of how to proceed.

a = {'a':1,'b':2,'c':3}
b = {'a':1,'c':3, 'd':4}
c = {'a':1,'c':3}

class Cover:
    def __init__(self,*dicts):
        # Internal representation: for each dict, a link to a covering subset (if any), plus a dictionary of the remaining elements
        mtx = [[-1,{}] for d in dicts]
        for i,dct in enumerate(dicts):
            for j,odct in enumerate(dicts):
                if i == j: continue # skip: every dict is trivially a subset of itself
                # if every key of the other dict is also in this one, build this dict from it
                # (keys only; values on shared keys are assumed to agree, as in the example)
                if all( k in dct for k in odct ):
                    mtx[i][0] = j # remember which dict we build from
                    dif = {key:value for key,value in dct.items() if key not in odct}
                    mtx[i][1].update(dif) # keep only the keys the reference doesn't cover
                    break

        for i,m in enumerate(mtx):
            if m[1] == {}: m[1] = dict(dicts[i]) # nothing recorded: store the dict whole

        self.mtx = mtx

    def get(self, i):
        r = { key:val for key, val in self.mtx[i][1].items()} 
        if self.mtx[i][0] >= 0: # if we had found a subset, add that (>= 0, since index 0 is a valid reference)
            r.update(self.mtx[self.mtx[i][0]][1])
        return r


cover = Cover(a,b,c) 

print(a,b,c)
print('representation',cover.mtx)
# prints [[2, {'b': 2}], [2, {'d': 4}], [-1, {'a': 1, 'c': 3}]]
# The "-1" In the third element indicates this is a building block that cannot be reduced; the "2"s indicate that these should build from the 2th element 
print('a',cover.get(0)) 
print('b',cover.get(1))
print('c',cover.get(2))

The idea is very simple: if any of the maps is a complete subset of another, replace the duplication with a reference. The compression could certainly backfire for certain input combinations, and it can easily be improved upon.

Simple improvements

Changing the first line of get, or using the more concise dictionary addition code hinted at in the question, might immediately improve readability.
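
For instance, a minimal sketch of such a rewrite of get (replacing the comprehension with dict(), and recursing through the reference so that chains of references also resolve, which the one-level lookup above does not):

    def get(self, i):
        r = dict(self.mtx[i][1])  # copy the stored remainder
        if self.mtx[i][0] >= 0:   # follow the reference recursively
            r.update(self.get(self.mtx[i][0]))
        return r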


B. M. February 2016

Below is a script that generates a script to reconstruct the dictionaries.

For example consider this dictionary of dictionaries:

>>> dicts
{'d2': {'k4': 'k4', 'k1': 'k1'},
 'd0': {'k2': 'k2', 'k4': 'k4', 'k1': 'k1', 'k3': 'k3'},
 'd4': {'k4': 'k4', 'k0': 'k0', 'k1': 'k1'},
 'd3': {'k0': 'k0', 'k1': 'k1'},
 'd1': {'k2': 'k2', 'k4': 'k4'}}

For clarity, we continue with sets, since the key-to-value association can be handled elsewhere.

sets= {k:set(v.keys()) for k,v in dicts.items()} 

>>> sets
{'d2': {'k1', 'k4'},
 'd0': {'k1', 'k2', 'k3', 'k4'},
 'd4': {'k0', 'k1', 'k4'},
 'd3': {'k0', 'k1'},
 'd1': {'k2', 'k4'}}

Now compute the distances (the number of keys to add and/or remove to go from one dict to another), using numpy and pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(dicts)
charfunc = df.notnull()
distances = pd.DataFrame((charfunc.values.T[..., None] != charfunc.values).sum(1),
                         df.columns, df.columns)

>>> distances
    d0  d1  d2  d3  d4
d0   0   2   2   4   3
d1   2   0   2   4   3
d2   2   2   0   2   1
d3   4   4   2   0   1
d4   3   3   1   1   0
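
Each distance is just the size of the symmetric difference of the two key sets, which is easy to check directly; for example, going from d3 to d2 means removing 'k0' and adding 'k4':

>>> len(sets['d3'] ^ sets['d2'])
2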

Then the script that writes the script. The idea is to begin with the shortest set, and then at each step to construct the nearest set from those already built:

script=open('script.py','w')
dicoto=df.count().idxmin() # the name of the shortest set
script.write('res={}\nres['+repr(dicoto)+']='+str(sets[dicoto])+'\ns=[\n')
done=[]
todo=df.columns.tolist()
while True :
    done.append(dicoto)
    todo.remove(dicoto)
    if not todo : break
    table=distances.loc[todo,done]
    ito,ifrom=np.unravel_index(table.values.argmin(),table.shape)
    dicofrom=table.columns[ifrom]
    setfrom=sets[dicofrom]
    dicoto=table.index[ito]
    setto=sets[dicoto]
    toadd=setto-setfrom
    toremove=setfrom-setto
    script.write(('('+repr(dicoto)+','+str(toadd)+','+str(toremove)+','
    +repr(dicofrom)+'),\n').replace('set',''))
script.write(']\n')
script.close()
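
The generated script.py now defines the starting set res and a list s of (target, keys-to-add, keys-to-remove, source) tuples. A minimal sketch of the replay step, assuming the generated file is executed into the current namespace (the .replace('set','') above prints empty sets as (), hence the set(...) normalisation):

exec(open('script.py').read())  # defines res and the edit list s
for dicoto,toadd,toremove,dicofrom in s:
    # rebuild each set from an already-built one
    res[dicoto]=(res[dicofrom]|set(toadd))-set(toremove)
assert res == sets  # every key set is reconstructed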

Post Status

Asked in February 2016
Viewed 1,278 times
Voted 13
Answered 2 times
