1 Setup

We begin by loading the required packages. If you don’t have the packages installed, use install.packages("package_name").

library(DT)
library(dplyr)
library(tidyr)

2 Problem

In many data collection software, multiple response questions’ selections (respondent can choose more than one option) are usually stored in a single cell separated with some characters. This problem is mostly common in ODK and other ODK based tools or even Google Forms. In addition, a more general problem could be creating columns from a single column of a dataframe based on some pattern.

To demonstrate how we go about this, we use social media survey data in which reposndents were asked which social media they use. Let us take a look at the variables.

desc <- read.csv("../datasets/multi_resp_desc.csv")
datatable(desc, rownames = FALSE)

In this example, the respondents could have selected more than one social media platforms (see smedia_used). The selected oprions are seperated using “,” and our aim is to create additional variables to hold each of the selected responses. Let us take a look at the data

smedia_df <- read.csv("../datasets/multi_response.csv")
datatable(head(smedia_df), rownames = FALSE)

Some of the ways to solve thus problem may include:

Split columns in excel
Use R functions

In this post, we will show a step by step guide on how to write a R function which will be able to do the job.

3 R way

As mentioned above, we are going to demonstrate this using smedia_used variable. We would therefore ask ourselves how much information do we have and best can use such?

What we see	What R will understand	Comment
“,”	pattern	This can be any pattern (“-”, “space”,…) dependeing on the plattform/problem.
Individual entries in smedia_used	Number of new variables to create	Number of new variables to be created equals max(number of entries).

First we need to understand how pattern matching and regular expressions work in R. You can find some help online but I would recommend Regular Expressions in R or Pattern Matching and Replacement as a starting point.

We need to count the number of selections in smedia_used based on the pattern “,”.

countFunc <- function(var, pattern){
    nselect <- sapply(regmatches(var, gregexpr(pattern, var)), length) + 1
    return(nselect)
}

var - input variable (could be multiple response)
pattern - what separates the entries. Could be “,”, “_” ….
gregexpr() and regmatches() - Searches for the pattern and returns the position if TRUE.
sapply - tells R to ‘repeatedly’ search for the pattern and do something - length.
length - what sapply does. Counts the number of patterns found.
+1 - counting only happens if there is more than one entry. This accounts for first selection.

Let us test the countFunc.

test_vec <- c("first, second, third", "fourth", "second, fifth")
pattern <- ","
countFunc(test_vec, pattern)

[1] 3 1 2

We’ve correctly counted the entries in test\(\_\)vec. Now let us apply the function, countFunc(), to our dataset.

x <- pull(smedia_df, "smedia_used")
pattern <- ","
nselected <- countFunc(x, pattern) # Count the number of entries

# Add this column to the dataset
smedia_df2 <- (smedia_df
   %>% mutate(nselected = nselected)
)
datatable(smedia_df2, rownames = FALSE)

The last column in the output above correctly counts the number of selections in smedia_df. We now modify countFunc() to perform all the tasks above and add a few lines to split the multiple responses into new columns.

countFunc <- function(df, var, pattern, remove = FALSE, drop_ns = FALSE){
    nselected <- paste0(var, "_nselected")
    df <- (df
        %>% rename(temp_multi = var)
        %>% mutate(temp_nselected = sapply(
                regmatches(temp_multi, gregexpr(pattern, temp_multi))
                , length
            ) + 1
        )
    )
    # Max number of new variables to create
    maxselected <- max(pull(df, temp_nselected))

    # Create the new variable
    df <- (df 
        %>% separate_("temp_multi"
            , c(paste0(rep(var, maxselected), "_", 1:maxselected))
            , sep = pattern
            , remove = remove
            , convert = TRUE
        )
        %>% rename_(.dots = setNames(c("temp_multi", "temp_nselected"), c(var, nselected)))
    )
    if(drop_ns){
        df <- select(df, -c(grep("_nselected$", colnames(df), value = TRUE)))
    }
    return(df)
}

This function might look complicated but not at all…!

var and pattern - same as above.
remove = TRUE - drops the parent multiple response variable.
drop_ns - whether to create a variable for the number of selections per case. Naming is of the form varname_nselected.
mutate() - counts the number of selected reponses and then create a temporary variable, temp_nselected.
separate() - based on the pattern and the maximum number of selections made, this line creates new columns immediately after the multiple variable. The created variables keeps same variable name with “_” number.

Let us take a look at the output

smedia_df3 <- (smedia_df
    %>% countFunc(var = "smedia_used", pattern = ",", remove = FALSE, drop_ns = FALSE)
    %>% datatable(., rownames = FALSE, options = list(scrollX = TRUE))
)
smedia_df3

Try remove = TRUE and/or drop_ns = TRUE.

Technically, we’ve achieved our aim; counted the selections and managed to create individual variables. However, we can put everything together and make our function more robust.

3.1 Everything together

Suppose we want to split several variable? Of course we can simply do a for loop but it would be much better if we implemented that in countFunc so that we don’t have think about how to and maybe give it a different name like multiSeparate.

multiSeparate <- function(df, vars, pattern, remove = FALSE, drop_ns = FALSE){
    for (var in vars){
        nselected <- paste0(var, "_nselected")
        df <- (df
            %>% rename(temp_multi = var)
            %>% mutate(temp_nselected = sapply(
                    regmatches(temp_multi, gregexpr(pattern, temp_multi))
                    , length
                ) + 1
            )
        )
        # Max number of new variables to create
        maxselected <- max(pull(df, temp_nselected))

        # Create the new variable
        df <- (df 
            %>% separate_("temp_multi"
                , c(paste0(rep(var, maxselected), "_", 1:maxselected))
                , sep = pattern
                , remove = remove
                , convert = TRUE
            )
            %>% rename_(.dots = setNames(c("temp_multi", "temp_nselected"), c(var, nselected)))
        )
        if(drop_ns){
            df <- select(df, -c(grep("_nselected$", colnames(df), value = TRUE)))
        }
    }
    return(df)
}

We only added 2 lines and changed one input. Your guess is as good as mine.

Test our function we create \(3\) additional dummy multiple variables with different separation patterns.

smedia_df <- (smedia_df
 %>% mutate(smedia_used_dummy1 = smedia_used
        , smedia_used_dummy2 = gsub(",", "-", smedia_used)
        , smedia_used_dummy3 = gsub(",", " ", smedia_used)
 )
)

multi_vars <- grep("^smedia_used", colnames(smedia_df), value = TRUE)
print("All multiple respnse variables")

[1] “All multiple respnse variables”

multi_vars

[1] “smedia_used”
[2] “smedia_used_dummy1” [3] “smedia_used_dummy2” [4] “smedia_used_dummy3”

smedia_df <- (smedia_df
    %>% multiSeparate(vars = multi_vars, pattern = ",| |-", remove = FALSE, drop_ns = FALSE)
    %>% datatable(., rownames = FALSE, options = list(scrollX = TRUE))
)
smedia_df

You can download the function from here or markdown file from here.

Split multiple response entry column(s) in R

Steve Cygu (cygubicko@gmail.com)

2019 Feb 21 (Thu)

1 Setup

2 Problem

3 R way

3.1 Everything together