OpenRefine: extracting a number from a text column using regex
I'm trying to parse out a column of data from the OpenFoodFacts dataset that I found via Kaggle. There is an attribute called "serving_size" that contains whatever serving size information is presented on the package for a food item. Most of the time the serving size is expressed in grams (g), but there is often other text as well. I'd like to be able to search through the string, find the number that corresponds to the number of grams, and extract that value into its own field. The value is not just an integer - it might have a decimal.
I'm new to regular expressions, but it seems like it ought to be possible to search for the "g" character and, if it is preceded by any numeric values, to extract them. I've found some recipes that suggest this is possible, but so far nothing I've tried has worked. In the OpenRefine documentation they give the example of extracting decimal data using this regex: /[-+]?[0-9]+(.[0-9]+)?/, but there was no variation of that I could get to work in our scenario. I've also tried commands like "value.match(/(.)?(/d+[g]).?/)". I'm finding that I don't understand how regex is supposed to work - when I tell it "/d" I'm expecting that it will ONLY give me back numeric values, but that does not appear to be the case - it gives whatever is there regardless of the character type.
In OpenRefine GREL (the language used to write the transformations) the 'match' function requires the regular expression to match the entire string in the cell - you can't use a partial match.
The output of the 'match' function is an array of all the capture groups. To get a specific value you have to select this from the array, or convert the array to a string.
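The two points above (whole-string matching, and an array of capture groups as the result) can be tried outside OpenRefine. The following is only a rough Python analogue, not GREL itself: it assumes `re.fullmatch` is a fair stand-in for GREL's 'match', and the helper name `grel_style_match` is made up for illustration.

```python
import re

def grel_style_match(value, pattern):
    """Hypothetical helper imitating GREL's match(): the pattern must
    cover the whole cell value, and the result is the list of capture
    groups (or None when the value does not match)."""
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

cell = "1 cup (30g)"

# A pattern that only describes part of the cell does not match at all.
partial = grel_style_match(cell, r"(\d+)g")             # None

# Padding the pattern with .* on both sides lets it cover the whole cell.
whole = grel_style_match(cell, r".*\((\d+)g\).*")       # ['30']

# The result is a list, so the value itself is picked out by index.
first_group = whole[0]                                  # '30'
```

In GREL the indexing step would look like `value.match(...)[0]`.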
So for example you could try:
This will find all strings where there is a number (with or without a decimal point) in front of the letter 'g', or 'gram' or 'grams', followed by a non-word character (e.g. a space or a bracket) and will capture the number as the first member of the resulting array of capture groups.
The '?' is needed after the first '.*' to make this lazy, so that the capture group gets the whole number, not just the last digit.
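The effect of the lazy '.*?' can be checked with a small Python sketch. The pattern below is written from the description above and is an assumption rather than the exact expression from this answer: the optional whitespace before the 'g' (`\s*`) is an addition so that values like "28 g" are also handled, and the names `LAZY`, `GREEDY` and `grams` are made up for illustration. `re.fullmatch` again stands in for GREL's whole-string 'match'.

```python
import re

# Number (optionally with a decimal part), then g / gram / grams,
# then a non-word character. \s* is an assumption added here so that
# "28 g" works as well as "28g".
LAZY = r".*?(\d+(?:\.\d+)?)\s*g(?:rams?)?\W.*"
# Same pattern, but with a greedy leading .* for comparison.
GREEDY = r".*(\d+(?:\.\d+)?)\s*g(?:rams?)?\W.*"

def grams(value, pattern=LAZY):
    """Return the captured number as a string, or None if no match."""
    m = re.fullmatch(pattern, value)
    return m.group(1) if m else None

samples = ["1 cup (30g)", "2.5g (approx)", "28 g (1 oz)", "1 cup"]

lazy_results = [grams(s) for s in samples]
# ['30', '2.5', '28', None]

greedy_results = [grams(s, GREEDY) for s in samples]
# ['0', '5', '8', None]  (the greedy .* swallows all but the last digit)
```

If the same pattern is carried back into OpenRefine, it would presumably be written as `value.match(/.*?(\d+(?:\.\d+)?)\s*g(?:rams?)?\W.*/)[0]`; that form is untested here. Because the pattern requires a non-word character after the unit, a cell that ends directly in "g" (for example "30g" with nothing after it) will not match.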
Asked in February 2016. Viewed 3,878 times. Voted 4. Answered 1 time.