zhyan February 2016

Change a dataset to a binary (observed/missing) dataset

My dataset is:

df=data.frame(x=c(1,4,6,NA,7,NA,9,10,4,NA),
          y=c(10,12,NA,NA,14,18,20,15,12,17),
          z=c(225,198,NA,NA,NA,130,NA,200,NA,99))
df
    x  y   z
1   1 10 225
2   4 12 198
3   6 NA  NA
4  NA NA  NA
5   7 14  NA
6  NA 18 130
7   9 20  NA
8  10 15 200
9   4 12  NA
10 NA 17  99

I want to convert it to a binary dataset, where observed elements become 1 and missing elements become 0, as follows:

   x y z
1  1 1 1
2  1 1 1
3  1 0 0
4  0 0 0
5  1 1 0
6  0 1 1
7  1 1 0
8  1 1 1
9  1 1 0
10 0 1 1

How can I do this in R? My attempt so far is ifelse(df=NA , 0 ,1), which does not work.

Answers


A Handcart And Mohair February 2016

You can just use !is.na, like this:

# df[] <- as.numeric(!is.na(df))  # <- Original answer
df[] <- as.integer(!is.na(df))    # <- Thanks @docendodiscimus
df
#    x y z
# 1  1 1 1
# 2  1 1 1
# 3  1 0 0
# 4  0 0 0
# 5  1 1 0
# 6  0 1 1
# 7  1 1 0
# 8  1 1 1
# 9  1 1 0
# 10 0 1 1
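
For comparison with the ifelse(df=NA , 0 ,1) attempt in the question, here is a minimal sketch of why testing against NA with == cannot work and how is.na() fixes it. Any comparison with NA evaluates to NA, so the test has to go through is.na(). The data frame `demo` and the result names are made up for illustration and are not part of the original post.

```r
# Hypothetical toy data, used only for illustration
demo <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

# Comparing with NA yields NA in every cell, so ifelse() has nothing to pick from
all_na <- ifelse(demo == NA, 0, 1)

# is.na() returns TRUE/FALSE per cell, which ifelse() can map to 0/1
fixed <- ifelse(is.na(demo), 0, 1)
```

Note that ifelse() returns a matrix here rather than a data frame; the df[] <- ... assignment in the answer above is what keeps the data frame structure.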

If efficiency is of concern, you can try using the "data.table" package:

library(data.table)
as.data.table(df)[, lapply(.SD, function(x) as.numeric(!is.na(x)))]
#     x y z
#  1: 1 1 1
#  2: 1 1 1
#  3: 1 0 0
#  4: 0 0 0
#  5: 1 1 0
#  6: 0 1 1
#  7: 1 1 0
#  8: 1 1 1
#  9: 1 1 0
# 10: 0 1 1

Or to assign while replacing:

as.data.table(df)[, (names(df)) := lapply(.SD, function(x) as.numeric(!is.na(x)))][]
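
Since as.data.table(df) works on a converted copy, the statement above leaves df itself unchanged. As a sketch of a variant that modifies the object in place, assuming the "data.table" package is installed, setDT() converts the data frame by reference before := overwrites the columns. The names `dt` and `cols` and the toy values are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
dt <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

setDT(dt)          # convert to a data.table by reference
cols <- names(dt)  # columns to overwrite

# Replace every column with its 0/1 missingness indicator, in place
dt[, (cols) := lapply(.SD, function(col) as.integer(!is.na(col))), .SDcols = cols]
```

After this runs, `dt` itself holds the 0/1 values, so no reassignment is needed.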

Update

If anyone is interested in further benchmarks, you can check out this Gist.

Summary of the benchmarking:

  • If it's sheer speed you're after, go for a "data.table" approach.
  • If you want efficient code in base R, as.integer and + are virtually neck and neck, so I think you know where my recommendation would lie.


akrun February 2016

We can apply unary + to the logical matrix returned by !is.na(df), which coerces TRUE/FALSE to 1/0. It should be very fast as well.

+(!is.na(df))
#      x y z
# [1,] 1 1 1
# [2,] 1 1 1
# [3,] 1 0 0
# [4,] 0 0 0
# [5,] 1 1 0
# [6,] 0 1 1
# [7,] 1 1 0
# [8,] 1 1 1
# [9,] 1 1 0
#[10,] 0 1 1
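
One thing to keep in mind is that +(!is.na(df)) returns an integer matrix, not a data frame, as the [1,] row labels above show. If a data frame is needed, a small sketch like the following converts it back. The names `demo`, `m`, `out`, and `demo2` are made up for illustration.

```r
# Hypothetical toy data, used only for illustration
demo <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

m <- +(!is.na(demo))     # integer matrix of 0/1 values

# Option 1: build a new data frame from the matrix
out <- as.data.frame(m)

# Option 2: overwrite the cells of an existing data frame, keeping its structure
demo2 <- demo
demo2[] <- m
```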

A dplyr option is:

library(dplyr)
df %>%
   mutate_each(funs(+(!is.na(.))) )
#   x y z
#1  1 1 1
#2  1 1 1
#3  1 0 0
#4  0 0 0
#5  1 1 0
#6  0 1 1
#7  1 1 0
#8  1 1 1
#9  1 1 0
#10 0 1 1
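
mutate_each() and funs() have since been deprecated in dplyr. As a sketch of the same idea for dplyr 1.0.0 or later, across() applies the transformation to every column. The data frame `demo` and the result name `out` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
demo <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

# Apply the 0/1 missingness indicator to every column
out <- demo %>%
  mutate(across(everything(), ~ +(!is.na(.x))))
```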

Benchmarks

set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:20), 5000*5000,
       replace=TRUE), ncol=5000))
system.time(as.numeric(!is.na(df)))
#   user  system elapsed 
#  0.64    0.09    0.73 

system.time(+(!is.na(df)))
#  user  system elapsed 
#  0.42    0.11    0.53 
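
The two timed expressions do not return the same shape: as.numeric(!is.na(df)) flattens the result to a plain vector, while +(!is.na(df)) keeps the matrix dimensions. Before comparing timings, it may be worth checking on a smaller input that the values agree. The sketch below does that; the object names and the 200 x 200 size are arbitrary choices for illustration.

```r
set.seed(24)
# Smaller version of the benchmark data (size chosen arbitrarily)
small <- as.data.frame(matrix(sample(c(NA, 1:20), 200 * 200,
                                     replace = TRUE), ncol = 200))

flat <- as.numeric(!is.na(small))  # double vector, dimensions dropped
mat  <- +(!is.na(small))           # integer matrix, dimensions kept

# Same 0/1 values once the matrix is flattened column by column
same_values <- all(flat == as.vector(mat))
```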

Post Status

Asked in February 2016
Viewed 3,587 times
Voted 4
Answered 2 times
