1 Setup

library(xtable)
library(DT)
library(dplyr)

2 Problem

There are many functions or packages which summarizes variables in a data frame. However, there seem not to offer so much flexibility on how much one can manipute or export the output, for instance, including outputs in .tex files.

In this post, we implement a simple function which entirely depends on base R to generate a more flexible variable(s) summary object.

summarizeDf <- function(df, output = c("simple", "tex"), digits = 1){
    if (!missing(output) & sum(!output %in% c("simple", "tex")) > 0){
      stop("output can only be 'simple' or 'tex'")
   }
   vars <- colnames(df)
   df_summary <- data.frame(Variable = rep(NA, length(vars))
      , Type = rep(NA, length(vars))
      , Summary = rep(NA, length(vars))
   )
   for (i in 1:length(vars)){
      vals <- df[, vars[[i]]]
      if (class(vals) == "numeric" | class(vals) == "integer"){
         df_summary[["Type"]][[i]] <- "numeric"
         df_summary[["Variable"]][[i]] <- vars[[i]]
         df_summary[["Summary"]][[i]] <- paste0("["
            , round(min(vals), digits), ", "
            , round(max(vals), digits), "]; "
            , round(mean(vals), digits), " ("
            , round(sd(vals), digits), ")"
         )
      } else{
         df_summary[["Type"]][[i]] <- "categorical"
         df_summary[["Variable"]][[i]] <- vars[[i]]
         perc <- sort(round(prop.table(table(vals))*100, digits)
            , decreasing = TRUE
         )
         if (missing(output) | sum(output %in% "simple") > 0){
            perc <- paste0(names(perc), " (", perc, "%)")
            df_summary[["Summary"]][[i]] <- paste0(perc
               , collapse = ";\n "
            )
         } else{
            perc <- paste0(names(perc), " (", perc, "\\%)")
            df_summary[["Summary"]][[i]] <- paste0(perc
               , collapse = "; \\\\ & & "
            )
         }
      }
   }
   return(df_summary)
}

summarizeDf summariz(s)es dataframe. Computes ([min, max]; mean (sd)) for numerical or integer variables and frequency distribution (percent) for categorical variables.

Inputs:

  • df - Input dataframe
  • output - Specifies the output structure. output = "simple" returns R-output-like output. output = "tex" returns xtable ready format.
  • digits - Number of digits to return.

Details:

For categorical variables with several categories, output = "tex" is preferrable. Add sanitize.text.function = function(x){x} to xtable print function for .tex.

Value:

It returns an object of class data.frame. Computes ([min, max]; mean (sd)) for numerical or integer variables and frequency distribution (percent) for categorical variable

3 Example(s)

To demonstrate this, we use social media survey data described in this post.

smedia_df <- read.csv("../datasets/multi_response.csv")
datatable(smedia_df, rownames = FALSE)

Output normal R-like summary. See value above.

smedia_df1 <- select(smedia_df, -c("doi"))
smedia_summary <-(smedia_df1
    %>% summarizeDf(.)
)
datatable(smedia_summary, rownames = FALSE)

Generate simple latex-like table.

smedia_summary <-(smedia_df1
    %>% summarizeDf(., output = "simple")
    %>% xtable(., caption = "Simple data summary")
)
summary.tex <- print(smedia_summary, sanitize.text.function = function(x){x}
    , type = "html"
    , scalebox = 0.5
    , include.rownames = FALSE
    , caption.placement = "top"
)
Simple data summary
Variable Type Summary
gender categorical Female (63.2%); Male (36.8%)
age numeric [19, 28]; 22.1 (2)
smedia_used categorical Facebook, Whatsapp (26.3%); Twitter, Facebook, Instagram, Whatsapp (18.4%); Twitter, Facebook, Whatsapp (18.4%); Facebook, Instagram, Whatsapp (10.5%); Facebook (5.3%); Twitter, Facebook, Instagram, Pinterest, Whatsapp (5.3%); Twitter, Instagram, Whatsapp (5.3%); Twitter, Facebook, Instagram, Pinterest, Whatsapp, Viber (2.6%); Twitter, Facebook, Pinterest, Whatsapp (2.6%); Twitter, Whatsapp (2.6%); Whatsapp (2.6%)
freq_usage categorical Never (34.2%); Multiple times a day (26.3%); A few times a week (18.4%); At least once a day (10.5%); A few times a month (5.3%); Rarely (5.3%)
freq_post categorical Never (36.8%); Rarely (23.7%); A few times a week (15.8%); A few times a month (10.5%); At least once a day (7.9%); Multiple times a day (5.3%)

You can also generate .tex file to include in .tex files by changing type = "latex" and also adding file = "filename.tex". See print.xtable.

You can download the function from here or markdown file from here.