We begin by loading the required packages. If you don’t have the packages installed, use install.packages("package_name")
.
library(DT)
library(dplyr)
library(tidyr)
In many data collection software, multiple response questions’ selections (respondent can choose more than one option) are usually stored in a single cell separated with some characters. This problem is mostly common in ODK and other ODK based tools or even Google Forms. In addition, a more general problem could be creating columns from a single column of a dataframe based on some pattern.
To demonstrate how we go about this, we use social media survey data in which reposndents were asked which social media they use. Let us take a look at the variables.
desc <- read.csv("../datasets/multi_resp_desc.csv")
datatable(desc, rownames = FALSE)
In this example, the respondents could have selected more than one social media platforms (see smedia_used). The selected oprions are seperated using “,” and our aim is to create additional variables to hold each of the selected responses. Let us take a look at the data
smedia_df <- read.csv("../datasets/multi_response.csv")
datatable(head(smedia_df), rownames = FALSE)
Some of the ways to solve thus problem may include:
Split columns in excel
Use R functions
In this post, we will show a step by step guide on how to write a R function which will be able to do the job.
As mentioned above, we are going to demonstrate this using smedia_used variable. We would therefore ask ourselves how much information do we have and best can use such?
What we see | What R will understand | Comment |
---|---|---|
“,” | pattern | This can be any pattern (“-”, “space”,…) dependeing on the plattform/problem. |
Individual entries in smedia_used | Number of new variables to create | Number of new variables to be created equals max(number of entries). |
First we need to understand how pattern matching and regular expressions work in R. You can find some help online but I would recommend Regular Expressions in R or Pattern Matching and Replacement as a starting point.
We need to count the number of selections in smedia_used based on the pattern “,”.
countFunc <- function(var, pattern){
nselect <- sapply(regmatches(var, gregexpr(pattern, var)), length) + 1
return(nselect)
}
var
- input variable (could be multiple response)pattern
- what separates the entries. Could be “,”, “_” ….gregexpr()
and regmatches()
- Searches for the pattern and returns the position if TRUE.sapply
- tells R to ‘repeatedly’ search for the pattern and do something - length.length
- what sapply does. Counts the number of patterns found.+1
- counting only happens if there is more than one entry. This accounts for first selection.Let us test the countFunc
.
test_vec <- c("first, second, third", "fourth", "second, fifth")
pattern <- ","
countFunc(test_vec, pattern)
[1] 3 1 2
We’ve correctly counted the entries in test\(\_\)vec. Now let us apply the function, countFunc()
, to our dataset.
x <- pull(smedia_df, "smedia_used")
pattern <- ","
nselected <- countFunc(x, pattern) # Count the number of entries
# Add this column to the dataset
smedia_df2 <- (smedia_df
%>% mutate(nselected = nselected)
)
datatable(smedia_df2, rownames = FALSE)
The last column in the output above correctly counts the number of selections in smedia_df. We now modify countFunc()
to perform all the tasks above and add a few lines to split the multiple responses into new columns.
countFunc <- function(df, var, pattern, remove = FALSE, drop_ns = FALSE){
nselected <- paste0(var, "_nselected")
df <- (df
%>% rename(temp_multi = var)
%>% mutate(temp_nselected = sapply(
regmatches(temp_multi, gregexpr(pattern, temp_multi))
, length
) + 1
)
)
# Max number of new variables to create
maxselected <- max(pull(df, temp_nselected))
# Create the new variable
df <- (df
%>% separate_("temp_multi"
, c(paste0(rep(var, maxselected), "_", 1:maxselected))
, sep = pattern
, remove = remove
, convert = TRUE
)
%>% rename_(.dots = setNames(c("temp_multi", "temp_nselected"), c(var, nselected)))
)
if(drop_ns){
df <- select(df, -c(grep("_nselected$", colnames(df), value = TRUE)))
}
return(df)
}
This function might look complicated but not at all…!
var
and pattern
- same as above.remove = TRUE
- drops the parent multiple response variable.drop_ns
- whether to create a variable for the number of selections per case. Naming is of the form varname_nselected
.mutate()
- counts the number of selected reponses and then create a temporary variable, temp_nselected
.separate()
- based on the pattern and the maximum number of selections made, this line creates new columns immediately after the multiple variable. The created variables keeps same variable name with “_” number.Let us take a look at the output
smedia_df3 <- (smedia_df
%>% countFunc(var = "smedia_used", pattern = ",", remove = FALSE, drop_ns = FALSE)
%>% datatable(., rownames = FALSE, options = list(scrollX = TRUE))
)
smedia_df3
remove = TRUE
and/or drop_ns = TRUE
.Technically, we’ve achieved our aim; counted the selections and managed to create individual variables. However, we can put everything together and make our function more robust.
Suppose we want to split several variable? Of course we can simply do a for loop
but it would be much better if we implemented that in countFunc
so that we don’t have think about how to and maybe give it a different name like multiSeparate
.
multiSeparate <- function(df, vars, pattern, remove = FALSE, drop_ns = FALSE){
for (var in vars){
nselected <- paste0(var, "_nselected")
df <- (df
%>% rename(temp_multi = var)
%>% mutate(temp_nselected = sapply(
regmatches(temp_multi, gregexpr(pattern, temp_multi))
, length
) + 1
)
)
# Max number of new variables to create
maxselected <- max(pull(df, temp_nselected))
# Create the new variable
df <- (df
%>% separate_("temp_multi"
, c(paste0(rep(var, maxselected), "_", 1:maxselected))
, sep = pattern
, remove = remove
, convert = TRUE
)
%>% rename_(.dots = setNames(c("temp_multi", "temp_nselected"), c(var, nselected)))
)
if(drop_ns){
df <- select(df, -c(grep("_nselected$", colnames(df), value = TRUE)))
}
}
return(df)
}
Test our function we create \(3\) additional dummy multiple variables with different separation patterns.
smedia_df <- (smedia_df
%>% mutate(smedia_used_dummy1 = smedia_used
, smedia_used_dummy2 = gsub(",", "-", smedia_used)
, smedia_used_dummy3 = gsub(",", " ", smedia_used)
)
)
multi_vars <- grep("^smedia_used", colnames(smedia_df), value = TRUE)
print("All multiple respnse variables")
[1] “All multiple respnse variables”
multi_vars
[1] “smedia_used”
[2] “smedia_used_dummy1” [3] “smedia_used_dummy2” [4] “smedia_used_dummy3”
smedia_df <- (smedia_df
%>% multiSeparate(vars = multi_vars, pattern = ",| |-", remove = FALSE, drop_ns = FALSE)
%>% datatable(., rownames = FALSE, options = list(scrollX = TRUE))
)
smedia_df
You can download the function from here or markdown file from here.