zpyder February 2016

Python 2.7, comparing 3 columns of a CSV

What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?

I am a total beginner and have only completed a few online courses. I have managed to use csv.reader to do some basic stats on the CSV file, but nothing that compares groups against each other.

The data is roughly set up as follows:

Group   sub-group   processed
1           a       y
1           a       y
1           a       y
1           b           
1           b
1           b
1           c       y
1           c       y
1           c
2           d       y
2           d       y
2           d       y
2           e       y
2           e
2           e
2           f       y
2           f       y
2           f       y
3           g
3           g
3           g
3           h       y
3           h
3           h

Everything belongs to a group, but within each group are sub-groups of 3 rows (replicates). As we work through the samples we add to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 of the potential 3 rows are processed.

I'm trying to work towards a statistic showing the % completeness of each group, where a sub-group counts as "complete" if it has at least 1 row processed (it doesn't have to have all 3).
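For example, with the data above, Group 1 would be 66.67% complete (sub-groups a and c each have at least one processed row, b has none), Group 2 would be 100%, and Group 3 would be 50%.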

I've managed to get halfway there by using the following:

import csv

all_groups = {}
processed_groups = {}

with open('data.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
        group, processed = row[0], row[2]
        # count every row belonging to the group
        all_groups[group] = all_groups.get(group, 0) + 1
        # count only the rows that have been processed
        if processed != "":
            processed_groups[group] = processed_groups.get(group, 0) + 1

result = {}
for group in (processed_groups.viewkeys() | all_groups.viewkeys()):
    if group in processed_groups:
        result.setdefault(group, []).append(processed_groups[group])
        result.setdefault(group, []).append(all_groups[group])

for group, v1 in result.items():
    done = float(v1[0])    # processed rows in this group
    total = float(v1[1])   # all rows in this group
    progress = round(100 / total * done, 2)
    print group, progress

This gives me a percentage per group, but it counts processed rows rather than treating a sub-group as complete once any of its rows is processed.

Answers


mhawke February 2016

One way to do this is with a couple of defaultdicts of sets. The first keeps track of all of the sub-groups seen; the second keeps track of those sub-groups that have at least one processed row. Using sets simplifies the code somewhat, as does using a defaultdict rather than a standard dictionary (although a plain dict would still work).

import csv
from collections import defaultdict

subgroups = defaultdict(set)
processed_subgroups = defaultdict(set)

with open('data.csv') as csvfile:
    for group, subgroup, processed in csv.reader(csvfile):
        subgroups[group].add(subgroup)
        # a sub-group counts as complete once any of its rows is processed
        if processed == 'y':
            processed_subgroups[group].add(subgroup)

for group in sorted(processed_subgroups):
    percent = len(processed_subgroups[group]) / float(len(subgroups[group])) * 100
    print("Group {} -- {:.2f}%".format(group, percent))

Output

Group 1 -- 66.67%
Group 2 -- 100.00%
Group 3 -- 50.00%
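
Note that the summary loop only visits groups that appear in processed_subgroups, so a group in which nothing has been processed yet (there is no such group in the sample data) would simply not be printed. A minimal variant, assuming the same subgroups and processed_subgroups dictionaries built above, that also reports such groups as 0%:

# iterate over every group seen, so groups with no processed
# sub-groups still show up as 0.00%
for group in sorted(subgroups):
    done = len(processed_subgroups.get(group, set()))
    total = len(subgroups[group])
    print("Group {} -- {:.2f}%".format(group, done / float(total) * 100))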

Post Status

Asked in February 2016
Viewed 1,103 times
Voted 4
Answered 1 time
