Matt H February 2016

openrefine extracting a number from a text column using regex

I'm trying to parse out a column of data from the OpenFoodFacts dataset that I found via Kaggle. There is a attribute called "serving_size" that contains whatever serving size information is presented on the package for a food item. Most of the time the serving size is expressed in grams (g), however there is often other text as well. I'd like to be able to search through the string, find the number that corresponds to the number of grams, and extract that value into its own field. The value is not just an integer - it might have a decimal.

I'm new to regular expressions, but it seems like it ought to be possible to search for the "g" character and if it is proceeded by any numeric values to extract them. I've found some recipes that suggest this is possible, but so far nothing I've tried has worked. In the OpenRefine documentation they give the example of extracting decimal data using this regex: /[-+]?[0-9]+(.[0-9]+)?/, but there was no variation of that I could get to work in our scenario. I've also tried commands like "value.match(/(.)?(/d+[g]).?/)". I'm finding that I don't understand how regex is supposed to work - when I tell it "/d" I'm expecting that it will ONLY give me back numeric values, however that does not appear to be the case - it gives whatever is there regardless of the character type.

Any help would be appreciated.

Here are some example text strings from the data:

serving_size  
 - 113.5g
 - 20g
 - 1 cup (227g)
 - 4 cookies (15g)
 - 13 pieces (39g)
 - 1/4 packet (21g) makes 1/2 cup
 - 0.75 oz (21g)
 - 1 can (12 FL OZ) 355g
 - 15.2 fl oz (450g)
 - 1 can (355mL)
 - 1/4 tsp (1.4g)
 - 10 fl oz 1 bottle.
 - 20 fl oz
 - 1 envelope (21g)
 - 1 tbsp (4.5g)
 - 45.2g
 - 1/2 pack 142.5gms
 - 1 carré de chocolat de 20g
 - 4 biscottes (≈ 35g) Ce paquet contient 8.5 portions de 4
   biscottes.
 - 0.33L
 - 2galettes 10.5g
 - 0.041649313g
 - 1 package (79g)

Answers


Owen Stephens February 2016

In OpenRefine GREL (the language used to write the transformations) the 'match' function requires the regular expression to match the entire string in the cell - you can't use a partial match.

The output of the 'match' function is an array of all the capture groups. To get a specific value you have to select this from the array, or convert the array to a string.

So for example you could try:

value.match(/.*?(\d+\.?\d*)g(ram)?(s)?\b?.*/)[0]

This will find all strings where there is a number (with or without a decimal point) in front of the letter 'g', or 'gram' or 'grams', followed by a non-word character (e.g. a space or a bracket) and will capture the number as the first member of the resulting array of capture groups.

The '?' is needed after the first '.*' to make this lazy, so that the capture group gets the whole number, not just the last digit.

Post Status

Asked in February 2016
Viewed 3,878 times
Voted 4
Answered 1 times

Search




Leave an answer