David February 2016

Fit the cumulative percentage line to the sorted histogram output with d3 for a pareto chart histogram

This is what I have so far: https://gist.github.com/daluu/fc1cbcab68852ed3c5fa and http://bl.ocks.org/daluu/fc1cbcab68852ed3c5fa. I'm trying to replicate Excel functionality.

The line fits the default histogram just fine, as in the base/original http://bl.ocks.org/daluu/f58884c24ff893186416. I'm also able to sort the histogram in descending frequency, although in doing so I switched x scales (from linear to ordinal). At this point I can't seem to map the line to the sorted histogram correctly. It should look like the following examples in terms of visual representation:

  • the Excel screenshot in a comment in my gist referenced above
  • the pareto chart sorted histogram in this SO post
  • the pareto chart (similar to but not exactly a sorted histogram) made with d3 here

What's the best design approach to get the remaining part working? Should I have started with a single x scale so that I don't need to switch from linear to ordinal? If so, I'm not sure how to apply the histogram layout correctly using an ordinal scale, or how to avoid using a linear x scale as the input to the histogram layout and still get the desired output.

Using the same ordinal scale with the code I have so far, the line looks OK, but it's not the curve I am expecting to see.

Any help appreciated.

Answers


Cyril February 2016

Instead of sorting by y:

data.sort(function (a, b) { return b.y - a.y; });

you should be sorting by x:

data.sort(function (a, b) { return a.x - b.x; });

Working code here


WittyID February 2016

The main issue with the line is that the cumulative distribution needs to be recalculated after the bars are sorted or, if you're going for a static Pareto chart, calculated in the target sort order from the start. For this purpose I've created a small function to do the calculation:

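// Recompute the cumulative fraction in the array's current order.
// `dataset` is not defined in this snippet; it is presumably the raw array of
// values fed to the histogram layout, so dataset.length is the total count.
function calcCDF(data) {
  data.forEach(function (d, i) {
    if (i === 0) {
      d.cum = d.y / dataset.length;
    } else {
      d.cum = d.y / dataset.length + data[i - 1].cum;
    }
  });
  return data;
}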

In my case, I'm toggling the Pareto sort on and off and recalculating the d.cum property each time. One could in theory create two cumulative properties up front, i.e. d.cum for the regular ordered distribution and, say, d.ParetoCum for the sorted one, but I'm using d.cum in a tooltip and decided against that.
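To make the sort-then-recalculate step concrete, here is a minimal sketch that runs without d3. The bins array is made-up sample data in the shape the histogram layout produces (x, dx, y), and withCumulative and paretoOrder are hypothetical helper names, not functions from the fiddle. Unlike calcCDF, it derives the total from the bin counts rather than from a separate raw array.

```javascript
// Hypothetical bins: x = bin start, dx = bin width, y = count in the bin.
var bins = [
  { x: 0, dx: 1, y: 2 },
  { x: 1, dx: 1, y: 5 },
  { x: 2, dx: 1, y: 3 }
];

// Recompute the running fraction in the array's current order.
// The total is taken as the sum of all bin counts.
function withCumulative(data) {
  var total = data.reduce(function (sum, d) { return sum + d.y; }, 0);
  var running = 0;
  data.forEach(function (d) {
    running += d.y;
    d.cum = running / total;
  });
  return data;
}

// Pareto order: sort a copy by descending count, then recompute d.cum.
// slice() is a shallow copy, so the bin objects themselves are shared
// and their d.cum values are overwritten.
function paretoOrder(data) {
  var sorted = data.slice().sort(function (a, b) { return b.y - a.y; });
  return withCumulative(sorted);
}

var sorted = paretoOrder(bins);
```

To toggle back to the regular order, sort by x and call withCumulative again, so the line always matches whatever order the bars are currently in.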

As for the axis, I'm using a single ordinal scale, which I think is cleaner, but it took some work to make the labels meaningful for number ranges, since tick marks and labels no longer delineate the bins the way they do with a linear scale. My solution was to use the number range as the tick label, e.g. "1 - 1.99", and add a function to alternate tick marks (I got that solution a while ago from Alternating tick padding in d3.js).
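A rough sketch of the labelling idea, again without d3 so it can run on its own. binLabel and tickOffset are hypothetical names, and the 0.01 step and 12-pixel offset are assumptions chosen to reproduce the "1 - 1.99" style, not values from the fiddle.

```javascript
// Build a range label such as "1 - 1.99" for a bin with start x and width dx.
// Assumes values have two decimal places, so the upper bound is shown
// as (x + dx - 0.01).
function binLabel(d) {
  return d.x + " - " + (d.x + d.dx - 0.01).toFixed(2);
}

// Alternate the vertical offset of tick labels so neighbours don't overlap:
// even-indexed ticks stay put, odd-indexed ticks drop by 12 pixels (assumed).
function tickOffset(i) {
  return i % 2 === 0 ? 0 : 12;
}

// Hypothetical bins, already in the order the bars are drawn.
var orderedBins = [
  { x: 1, dx: 1, y: 5 },
  { x: 2, dx: 1, y: 3 },
  { x: 0, dx: 1, y: 2 }
];

// These labels would serve as the ordinal scale's domain.
var labels = orderedBins.map(binLabel);
var offsets = labels.map(function (label, i) { return tickOffset(i); });
```

The labels array is what would be passed as the ordinal scale's domain, and the offsets would be applied to the tick text elements after the axis is drawn.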

For the bar sorting, I'm using this d3 example as a reference, in case you need to understand it in the context of a simpler, smaller example.

See this fiddle, which incorporates all of the above. If you want to use it, I would suggest adding a check so the user can't toggle off both the bars and the line (I left a note in the code; it should be trivial).
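One possible shape for that check, as a sketch only: the visibility state object and the toggleSeries name are hypothetical and do not come from the fiddle. The idea is to ignore any toggle that would leave nothing on screen.

```javascript
// Toggle one series ("bars" or "line"), but refuse to hide the last
// visible one. Returns a new state object, or the original state
// unchanged when the toggle is rejected.
function toggleSeries(state, name) {
  var next = { bars: state.bars, line: state.line };
  next[name] = !state[name];
  if (!next.bars && !next.line) {
    return state; // would hide everything, so ignore the request
  }
  return next;
}

var start = { bars: true, line: true };
var barsHidden = toggleSeries(start, "bars");      // bars off, line still on
var rejected = toggleSeries(barsHidden, "line");   // ignored: line is the last one
```

The click handlers for the two toggles could call this and redraw only when the returned state differs from the current one.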

Post Status

Asked in February 2016
Viewed 3,040 times
Voted 4
Answered 2 times
