# Developers Planet

zpyder February 2016

### Python 2.7, comparing 3 columns of a CSV

What is the easiest/simplest way to iterate through a large CSV file in Python 2.7, comparing 3 columns?

I am a total beginner and have only completed a few online courses. I have managed to use the CSV reader to do some basic stats on the CSV file, but nothing that compares groups with one another.

The data is roughly set up as follows:

``````
Group   sub-group   processed
1           a       y
1           a       y
1           a       y
1           b
1           b
1           b
1           c       y
1           c       y
1           c
2           d       y
2           d       y
2           d       y
2           e       y
2           e
2           e
2           f       y
2           f       y
2           f       y
3           g
3           g
3           g
3           h       y
3           h
3           h
``````

Everything belongs to a group, but within each group are sub-groups of 3 rows (replicates). As we work through samples, we will be adding to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 of the potential 3 will be processed.

I'm trying to work towards a statistic showing the % completeness of each group, with a sub-group counting as "complete" if it has at least 1 row processed (it doesn't have to have all 3). For example, group 1 above has sub-groups a and c with at least one processed row but b with none, so it would be 2 of 3, or 66.67%, complete.

I've managed to get halfway there, by using the following:

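``````
# Assumes `reader` is a csv.reader over the file with the header row already
# skipped, and that each row has exactly the three columns shown above.
all_groups = {}
processed_groups = {}

for row in reader:
    group, subgroup, processed = row
    # Count every row per group, and separately the rows marked as processed.
    all_groups[group] = all_groups.get(group, 0) + 1
    if processed != "":
        processed_groups[group] = processed_groups.get(group, 0) + 1

# Pair up the counts per group as [processed rows, total rows].
result = {}
for group in all_groups:
    result[group] = [processed_groups.get(group, 0), all_groups[group]]

for group, counts in sorted(result.items()):
    done = float(counts[0])
    total = float(counts[1])
    # Note: this is the share of rows processed, not of sub-groups completed.
    progress = round(100 * done / total, 2)
    print("{} {}".format(group, progress))
``````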
### Answers

mhawke February 2016

One way to do this is with a couple of defaultdict of sets. The first keeps track of all of the subgroups seen, the second keeps track of those subgroups that have been processed. Using a set simplifies the code somewhat, as does using a defaultdict when compared to using a standard dictionary (although it's still possible).

``````
import csv
from collections import defaultdict

subgroups = defaultdict(set)
processed_subgroups = defaultdict(set)

with open('data.csv') as csvfile:
    for group, subgroup, processed in csv.reader(csvfile):
        subgroups[group].add(subgroup)
        if processed == 'y':
            processed_subgroups[group].add(subgroup)

for group in sorted(processed_subgroups):
    print("Group {} -- {:.2f}%".format(group, (len(processed_subgroups[group]) / float(len(subgroups[group])) * 100)))
``````

Output

``````
Group 1 -- 66.67%
Group 2 -- 100.00%
Group 3 -- 50.00%
``````
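The answer mentions that a standard dictionary would also work. A minimal sketch of that variant follows, using `dict.setdefault` in place of `defaultdict`. The names `completeness`, `completeness_from_csv` and `sample` are made up for illustration, and the sketch assumes three columns per row, a header row in the file, and `'y'` as the processed marker. Unlike the loop above, it also reports groups that have no processed sub-groups yet (as 0%).

```python
import csv


def completeness(rows):
    """Return {group: percent of sub-groups with at least one processed row}."""
    seen = {}       # group -> set of every sub-group seen
    processed = {}  # group -> set of sub-groups with a processed row
    for group, subgroup, flag in rows:
        seen.setdefault(group, set()).add(subgroup)
        # Create the entry even when nothing is processed, so the group shows 0%.
        done = processed.setdefault(group, set())
        if flag.strip() == 'y':
            done.add(subgroup)
    return dict((g, 100.0 * len(processed[g]) / len(seen[g])) for g in seen)


def completeness_from_csv(path):
    """Read a three-column CSV (header row assumed) and compute completeness."""
    with open(path) as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        return completeness(reader)


# Hypothetical in-memory rows, shaped like the table in the question.
sample = [
    ('1', 'a', 'y'), ('1', 'b', ''), ('1', 'c', 'y'),
    ('3', 'g', ''), ('3', 'g', ''),
]
for group, pct in sorted(completeness(sample).items()):
    print("Group {} -- {:.2f}%".format(group, pct))
```

Keeping the counting in a function that accepts any iterable of rows means the same logic can be tried on a short list before pointing it at the real file with `completeness_from_csv('data.csv')`.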