Neuril February 2016

R apply function on data frame column

I need to efficiently parse one column of my data frame (a URL string) by calling a function (strsplit) on it, e.g.:

url <- c("www.google.com/nir1/nir2/nir3/index.asp")

unlist(strsplit(url,"/"))

My data frame, spark.data.url.clean, looks like this:

                      classes                                   url
[107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3

This data frame has 100k rows, and I don't want to loop over it, parse each url separately and write the results to a new data frame row by row. What I DO want is to create a new data frame with these columns:

df.result <- data.frame(fullurl = character(), baseurl = character(), firstlevel = character(), secondlevel = character(), thirdlevel = character(), classification = character())

then call one of the "apply" family of functions over spark.data.url.clean$url and write the results to df.result, such that the first column (fullurl) is populated with the corresponding spark.data.url.clean$url, and the 2nd to 5th columns are populated with the results of applying

unlist(strsplit(url,"/"))

taking only the first, 2nd, 3rd and 4th elements of the resulting vector and putting them in the 2nd, 3rd, 4th and 5th columns of df.result (baseurl, firstlevel, secondlevel, thirdlevel), and finally putting spark.data.url.clean$classes in the column df.result$classification.

Sorry for the complication, and let me know if anything needs further clarification.

Answers


doker February 2016

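A simple option is apply. With MARGIN = 2 it calls the function once per column of the data frame:

apply(spark.data.url.clean, 2, function(col) {})  # the parsing code goes in the function body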
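Since only the url column needs parsing, an alternative to apply over the whole data frame is to split that column once and pull out each piece with vapply. The following is only a minimal sketch under assumptions: the two sample rows are illustrative, and piece is a hypothetical helper name that does not come from the question.

```r
# Illustrative sample in the shape described in the question
spark.data.url.clean <- data.frame(
  classes = c("[107,662,685]", "[508,111]"),
  url     = c("drudgereport.com/level1/level2/level3",
              "www.google.com/nir1/nir2/nir3/index.asp"),
  stringsAsFactors = FALSE
)

# strsplit is vectorised: one call splits every URL and returns a list
parts <- strsplit(spark.data.url.clean$url, "/", fixed = TRUE)

# Hypothetical helper: the i-th piece of every URL (NA if a URL is too short)
piece <- function(i) vapply(parts, function(p) p[i], character(1))

df.result <- data.frame(
  fullurl        = spark.data.url.clean$url,
  baseurl        = piece(1),
  firstlevel     = piece(2),
  secondlevel    = piece(3),
  thirdlevel     = piece(4),
  classification = spark.data.url.clean$classes,
  stringsAsFactors = FALSE
)
```

Anything after the fourth piece (index.asp in the second row) is simply ignored.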


Heroka February 2016

You could consider using the splitstackshape package for this; its cSplit function does the job. Setting drop to FALSE ensures that the original column is preserved. Note that it returns a data.table, not a data.frame.

library(splitstackshape)
output <- cSplit(dat, 2, sep = "/", drop = FALSE)  # 2 = index of the url column

data used:

dat <- data.frame(classes="[107,662,685,508,111,654,509]",
                  url="drudgereport.com/level1/level2/level3")


K. Rohde February 2016

There is no need for apply, as far as I see it.

Try this:

spark.data.url.clean <- data.frame(classes = c(107,662,685,508,111,654,509), 
  url = c("drudgereport.com/level1/level2/level3", "drudgeddddreport.com/levelfe1/lefvel2/leveel3", 
          "drudgeaasreport2.com/lefvel13/lffvel244/fel223", "otherurl.com/level1/second/level3", 
          "whateversite.com/level13/level244/level223", "esportsnow.com/first/level2/level3", 
          "reeport2.com/level13/level244/third"), stringsAsFactors = FALSE)

df.result <- spark.data.url.clean

names(df.result) <- c("classification", "fullurl")

df.result[c("baseurl", "firstlevel", "secondlevel", "thirdlevel")] <- do.call(rbind, strsplit(df.result$fullurl, "/"))
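One caveat: do.call(rbind, ...) relies on every URL splitting into exactly four pieces. If the real data is ragged, rbind recycles the shorter vectors (with a warning) and the extra pieces no longer fit the four columns. A possible way around this, sketched here with made-up URLs and the assumed names parts4 and url.levels, is to force every split to length 4 first:

```r
# Made-up URLs with differing depths
urls <- c("drudgereport.com/level1/level2/level3",
          "www.google.com/nir1/nir2/nir3/index.asp",  # five pieces
          "example.com/only")                         # two pieces

parts <- strsplit(urls, "/", fixed = TRUE)
lengths(parts)  # 4 5 2: not all URLs have four pieces

# `length<-` truncates longer vectors and pads shorter ones with NA
parts4 <- lapply(parts, `length<-`, 4)

# Every element now has length 4, so rbind gives a clean n x 4 matrix
url.levels <- do.call(rbind, parts4)
colnames(url.levels) <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel")
```

The matrix can then be assigned to the four new columns exactly as in the line above.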


docendo discimus February 2016

Here's an option with data.table which should be pretty fast. If your data looks like this:

> df
#                        classes                                   url
#1 [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3

You can do the following:

library(data.table)
setDT(df)  # convert to data.table 
cols <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel") # define new column names
df[, (cols) := tstrsplit(url, "/", fixed = TRUE)[1:4]]  # assign new columns

Now, the data looks like this:

> df
#                         classes                                   url          baseurl firstlevel secondlevel thirdlevel
#1: [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3 drudgereport.com     level1      level2     level3
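If the column names and order requested in the question are needed, the data.table can be renamed and reordered by reference afterwards. This is only a sketch on the one-row sample above; the target names fullurl and classification are taken from the question's df.result layout.

```r
library(data.table)

# One-row sample matching the data shown above
df <- data.frame(classes = "[107,662,685,508,111,654,509]",
                 url = "drudgereport.com/level1/level2/level3",
                 stringsAsFactors = FALSE)
setDT(df)

cols <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel")
df[, (cols) := tstrsplit(url, "/", fixed = TRUE)[1:4]]

# Rename and reorder in place (no copy of the data is made)
setnames(df, old = c("classes", "url"), new = c("classification", "fullurl"))
setcolorder(df, c("fullurl", cols, "classification"))
```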
