1 Setup

library(xtable)
library(DT)
library(dplyr)

2 Problem

Many functions and packages summarize the variables in a data frame. However, they seem to offer little flexibility in how the output can be manipulated or exported, for instance when including it in .tex files.

In this post, we implement a simple function, depending entirely on base R, that generates a more flexible variable summary object.

summarizeDf <- function(df, output = c("simple", "tex"), digits = 1){
   # Validate output; match.arg() falls back to "simple" when it is not supplied
   output <- match.arg(output)
   vars <- colnames(df)
   df_summary <- data.frame(Variable = vars
      , Type = rep(NA_character_, length(vars))
      , Summary = rep(NA_character_, length(vars))
      , stringsAsFactors = FALSE
   )
   for (i in seq_along(vars)){
      vals <- df[[vars[[i]]]]
      if (is.numeric(vals)){
         # Numeric or integer: [min, max]; mean (sd)
         df_summary[["Type"]][[i]] <- "numeric"
         df_summary[["Summary"]][[i]] <- paste0("["
            , round(min(vals), digits), ", "
            , round(max(vals), digits), "]; "
            , round(mean(vals), digits), " ("
            , round(sd(vals), digits), ")"
         )
      } else{
         # Anything else: percentage per category, most frequent first
         df_summary[["Type"]][[i]] <- "categorical"
         perc <- sort(round(prop.table(table(vals))*100, digits)
            , decreasing = TRUE
         )
         if (output == "simple"){
            perc <- paste0(names(perc), " (", perc, "%)")
            df_summary[["Summary"]][[i]] <- paste0(perc
               , collapse = ";\n "
            )
         } else{
            # Escape % and break the cell across table rows for LaTeX
            perc <- paste0(names(perc), " (", perc, "\\%)")
            df_summary[["Summary"]][[i]] <- paste0(perc
               , collapse = "; \\\\ & & "
            )
         }
      }
   }
   return(df_summary)
}

summarizeDf summarizes a data frame. It computes [min, max]; mean (sd) for numeric or integer variables and a frequency distribution (percent) for categorical variables.

Inputs:

df - Input dataframe

output - Specifies the output structure. output = "simple" returns a plain summary similar to regular R output. output = "tex" returns an xtable-ready format.

digits - Number of decimal places to round to.

Details:

For categorical variables with several categories, output = "tex" is preferable. When printing to .tex, add sanitize.text.function = function(x){x} to the xtable print call so that the LaTeX markup in the summary is not escaped.
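To see why the sanitizer matters, here is a minimal sketch. It assumes a hand-built one-row data frame, toy_summary (a hypothetical stand-in that mirrors what output = "tex" would produce for the gender variable), and prints it with and without the identity sanitizer.

```r
library(xtable)

# Hand-built stand-in for one row of a summarizeDf(df, output = "tex") result
toy_summary <- data.frame(Variable = "gender"
   , Type = "categorical"
   , Summary = "Female (63.2\\%); \\\\ & & Male (36.8\\%)"
   , stringsAsFactors = FALSE
)

# Default sanitizer: backslashes and ampersands in the cell are escaped
escaped <- print(xtable(toy_summary), include.rownames = FALSE
   , print.results = FALSE
)

# Identity sanitizer: the LaTeX markup is passed through untouched
raw <- print(xtable(toy_summary), include.rownames = FALSE
   , sanitize.text.function = function(x){x}
   , print.results = FALSE
)

cat(raw)
```

With the default sanitizer the row break and column separators end up as literal text in the cell, so each category would no longer start on its own table row.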

Value:

It returns an object of class data.frame with one row per variable and the columns Variable, Type and Summary. Summary holds [min, max]; mean (sd) for numeric or integer variables and the frequency distribution (percent) for categorical variables.
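To make the two summary formats concrete, the following base R sketch applies the same expressions used inside the function to two toy vectors. The vectors age and gender are made up for illustration and are not taken from the survey data.

```r
# Toy data, invented for illustration only
age <- c(19, 21, 22, 28)
gender <- c("Female", "Male", "Female", "Female")
digits <- 1

# Numeric variables: "[min, max]; mean (sd)"
num_summary <- paste0("["
   , round(min(age), digits), ", "
   , round(max(age), digits), "]; "
   , round(mean(age), digits), " ("
   , round(sd(age), digits), ")"
)

# Categorical variables: percentages sorted from most to least frequent
perc <- sort(round(prop.table(table(gender))*100, digits)
   , decreasing = TRUE
)
cat_summary <- paste0(names(perc), " (", perc, "%)", collapse = "; ")

print(num_summary)
print(cat_summary)
```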

3 Example(s)

To demonstrate this, we use the social media survey data described in this post.

smedia_df <- read.csv("../datasets/multi_response.csv")
datatable(smedia_df, rownames = FALSE)

Output a normal R-like summary. See Value above.

smedia_df1 <- select(smedia_df, -c("doi"))
smedia_summary <-(smedia_df1
    %>% summarizeDf(.)
)
datatable(smedia_summary, rownames = FALSE)

Generate a simple LaTeX-like table with xtable (printed here as HTML).

smedia_summary <-(smedia_df1
    %>% summarizeDf(., output = "simple")
    %>% xtable(., caption = "Simple data summary")
)
summary.tex <- print(smedia_summary, sanitize.text.function = function(x){x}
    , type = "html"
    , scalebox = 0.5
    , include.rownames = FALSE
    , caption.placement = "top"
)

Simple data summary
Variable	Type	Summary
gender	categorical	Female (63.2%); Male (36.8%)
age	numeric	[19, 28]; 22.1 (2)
smedia_used	categorical	Facebook, Whatsapp (26.3%); Twitter, Facebook, Instagram, Whatsapp (18.4%); Twitter, Facebook, Whatsapp (18.4%); Facebook, Instagram, Whatsapp (10.5%); Facebook (5.3%); Twitter, Facebook, Instagram, Pinterest, Whatsapp (5.3%); Twitter, Instagram, Whatsapp (5.3%); Twitter, Facebook, Instagram, Pinterest, Whatsapp, Viber (2.6%); Twitter, Facebook, Pinterest, Whatsapp (2.6%); Twitter, Whatsapp (2.6%); Whatsapp (2.6%)
freq_usage	categorical	Never (34.2%); Multiple times a day (26.3%); A few times a week (18.4%); At least once a day (10.5%); A few times a month (5.3%); Rarely (5.3%)
freq_post	categorical	Never (36.8%); Rarely (23.7%); A few times a week (15.8%); A few times a month (10.5%); At least once a day (7.9%); Multiple times a day (5.3%)

You can also write a .tex file to include in other .tex documents by changing type to "latex" and adding file = "filename.tex". See print.xtable.
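As a rough sketch of that workflow, the snippet below writes a table to a .tex file. It assumes a hand-built data frame, toy_summary (a hypothetical stand-in for a summarizeDf result, with values copied from the table above), and writes to tempdir() rather than the working directory.

```r
library(xtable)

# Hand-built stand-in for a summarizeDf() result
toy_summary <- data.frame(Variable = c("gender", "age")
   , Type = c("categorical", "numeric")
   , Summary = c("Female (63.2\\%); Male (36.8\\%)", "[19, 28]; 22.1 (2)")
   , stringsAsFactors = FALSE
)

# tempdir() keeps the sketch from writing into the working directory
tex_file <- file.path(tempdir(), "filename.tex")

print(xtable(toy_summary, caption = "Simple data summary")
   , type = "latex"
   , file = tex_file
   , sanitize.text.function = function(x){x}
   , include.rownames = FALSE
   , caption.placement = "top"
)

# Inspect what was written
writeLines(readLines(tex_file))
```

The resulting file can then be pulled into a LaTeX document with \input{filename.tex}.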

You can download the function from here or the markdown file from here.

Summariz(s)e all the variables in a dataframe

Steve Cygu (cygubicko@gmail.com)

2019 Feb 28 (Thu)
