Data Summarization

Author

Dr. Mohammad Nasir Abdullah

Data Summarization

Data summarization is an essential process in statistical programming that involves reducing and simplifying large datasets into more manageable and understandable forms. This process is crucial because it highlights the key aspects of the data by extracting important patterns, trends, and relationships. In the realm of statistical analysis, summarization is not just about making the data smaller or simpler; it is about capturing the essence of the data in a way that is both informative and useful for analysis.

Types of Summarization Techniques

  1. Numerical Summarization: This involves using statistical measures to summarize the key characteristics of numerical data. Techniques include calculating measures of central tendency (like mean, median, and mode) and measures of variability or spread (like range, variance, and standard deviation). These techniques are fundamental in providing a quick snapshot of the data’s overall distribution and central values.

  2. Categorical Summarization: When dealing with categorical (or qualitative) data, summarization often involves understanding the frequency or occurrence of different categories. Techniques include creating frequency tables, cross-tabulations, and using measures like mode. This type of summarization is particularly useful in understanding the distribution of categorical variables, like customer categories or product types.

  3. Visual Summarization: Visual representations of data, such as histograms, bar charts, box plots, and scatter plots, provide an intuitive way to summarize and understand complex datasets. These techniques are invaluable in revealing patterns, trends, outliers, and relationships in data that might not be obvious in textual or numerical summaries.

In R, these summarization techniques are supported by a variety of functions and packages, making it an ideal environment for both basic and advanced data summarization tasks. This chapter will delve into these techniques, providing the reader with the knowledge and tools to effectively summarize and interpret large datasets using R.
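
As a quick preview of what these techniques look like in R, here is a minimal sketch using the palmerpenguins dataset that is used throughout this chapter (each technique is covered in detail below):

library(palmerpenguins)
library(ggplot2)

# Numerical summarization: average body mass (in grams)
mean(penguins$body_mass_g, na.rm = TRUE)

# Categorical summarization: frequency of each species
table(penguins$species)

# Visual summarization: histogram of body mass
ggplot(penguins, aes(x = body_mass_g)) + geom_histogram()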

Summarizing Numerical Data

Introduction to Normality Testing

Normality testing is a fundamental step in statistical analysis, particularly when deciding which statistical methods to apply. Many statistical techniques assume that the data follows a normal (or Gaussian) distribution. However, real-world data often deviates from this idealized distribution. Therefore, assessing the normality of your dataset is crucial before applying techniques that assume normality, such as parametric tests.

In R, there are several methods to test for normality, including graphical methods like Q-Q plots and statistical tests like the Shapiro-Wilk test. Let’s explore how to perform these tests in R with an example dataset.

Graphical Methods to Assess Normality

1) Histogram

library(palmerpenguins)
library(ggplot2)

data <- na.omit(penguins$body_mass_g)  # Remove NA values

num_bins_sturges <- 1 + log2(length(data)) # calculate the number of bins using Sturges' rule

ggplot(data.frame(data), aes(x = data)) +
    geom_histogram(bins = round(num_bins_sturges)) +
    ggtitle("Histogram with Sturges' Rule") + theme_light()

2) Boxplot

ggplot(penguins, aes(body_mass_g)) + geom_boxplot() +
  ggtitle("A Boxplot showing distribution of weight of Penguins in grams") + 
  theme_minimal() + 
  scale_y_continuous(limits=c(-1,1))

Based on the boxplot, we can see that the data are slightly skewed to the right.

3) Kurtosis and Skewness

Kurtosis is a measure of a sample or population that identifies how flat or peaked it is with respect to a normal distribution; it reflects how concentrated the values are in the center of the distribution. Skewness, on the other hand, measures the horizontal (a)symmetry of the distribution with respect to a normal distribution. The coefficient of skewness can be positive (right skew), zero (no skew), or negative (left skew).

Steps for examining sample normality using skewness and kurtosis:

1) Determine sample mean and standard deviation
2) Determine sample kurtosis and skewness

Formula for Skewness (the adjusted coefficient computed by e1071::skewness with type = 2, where m_k denotes the k-th sample moment about the mean):

$$G_1 = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{m_3}{m_2^{3/2}}, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$$

Formula for Standard Error of Skewness:

$$SE_{skewness} = \sqrt{\frac{6\,n\,(n-1)}{(n-2)(n+1)(n+3)}}$$

Formula for Kurtosis (excess kurtosis, type = 2):

$$G_2 = \frac{n-1}{(n-2)(n-3)}\left[(n+1)\left(\frac{m_4}{m_2^{2}} - 3\right) + 6\right]$$

Formula for Standard Error of Kurtosis:

$$SE_{kurtosis} = 2\,SE_{skewness}\sqrt{\frac{n^{2}-1}{(n-3)(n+5)}}$$


3) Calculate the standard error of the kurtosis and the standard error of the skewness
4) Calculate Z-Score for the kurtosis and Z-Score for the skewness

Formula for Z-Score Skewness:

$$Z_{skewness} = \frac{G_1 - 0}{SE_{skewness}}$$

Formula for Z-Score Kurtosis:

$$Z_{kurtosis} = \frac{G_2 - 0}{SE_{kurtosis}}$$


5) Compare the Z-Score to the critical region obtained from the normal distribution

  • The Z-score is used to examine how closely the sample approximates a normal distribution.

  • Z-score values for kurtosis and skewness must fall between -2 and 2 to pass the normality assumption for alpha=0.05.

  • Z-score for Skewness: This score indicates how many standard deviations the skewness of the dataset is from the mean skewness of a normally distributed dataset. A higher absolute value of the z-score indicates a greater degree of asymmetry.

  • Z-score for Kurtosis: Similarly, this score tells us how many standard deviations the kurtosis of the dataset is from the mean kurtosis of a normal distribution. It reflects the “tailedness” or the peak height of the distribution.

library(e1071)
library(dplyr)


penguins %>%
  filter(!is.na(body_mass_g)) %>%
  summarize(skewness = skewness(body_mass_g, type=2, na.rm=T), 
           se_skewness = sqrt((6 * length(body_mass_g) * (length(body_mass_g) - 1)) / ((length(body_mass_g) - 2) * (length(body_mass_g) + 1) * (length(body_mass_g) + 3))), 
           kurtosis =  kurtosis(body_mass_g, type = 2, na.rm=T),
           se_kurtosis = 2 * se_skewness * sqrt((length(body_mass_g)^2 - 1) / ((length(body_mass_g) - 3) * (length(body_mass_g) + 5))),
           
          Z_Skewness = ((skewness-0)/se_skewness),
          Z_Kurtosis = ((kurtosis - 0)/se_kurtosis))
# A tibble: 1 × 6
  skewness se_skewness kurtosis se_kurtosis Z_Skewness Z_Kurtosis
     <dbl>       <dbl>    <dbl>       <dbl>      <dbl>      <dbl>
1    0.470       0.132   -0.719       0.263       3.57      -2.73
Interpretation Guideline
  • Z-score between -2 and 2: This range is generally considered to indicate that the skewness or kurtosis is not significantly different from what would be expected in a normal distribution. The data can be considered approximately normal in this aspect.

  • Z-score less than -2: This suggests a significant deviation from normality in a negative direction. For skewness, it would mean a left skew (tail is on the left side). For kurtosis, it indicates a platykurtic distribution (less peaked than a normal distribution).

  • Z-score greater than 2: This indicates a significant deviation in a positive direction. For skewness, it would imply a right skew (tail is on the right side). For kurtosis, it suggests a leptokurtic distribution (more peaked than a normal distribution).

It is clearly seen that the Z-score for skewness is greater than 2, indicating that the data are skewed to the right, and the Z-score for kurtosis is less than -2, indicating a platykurtic distribution (less peaked than a normal distribution).

4) Q-Q plots

Q-Q plots are scatterplots created by plotting two sets of quantiles against each other: the quantiles from the dataset and the quantiles of a normal distribution. If the data are normally distributed, the points in the Q-Q plot will approximately lie on a straight line.

Interpreting the Q-Q Plot
  • Linearity: In a Q-Q plot, if the data are normally distributed, the points will approximately lie on a straight line. The closer the points are to the line (qqline), the more normal the distribution is.

  • Departures from Linearity: Deviations from the line can indicate departures from normality:

    • Left Tail (lower end of the plot): If the points fall below the line at the lower end, it suggests a left-skewed distribution (a longer, heavier left tail than normal).

    • Right Tail (upper end of the plot): If the points deviate above the line at the upper end, it indicates a right-skewed distribution (heavier right tail than normal).

    • Center of the Plot: If the points deviate from the line in the middle of the plot, it suggests a difference in the median or a distribution with different kurtosis from normal (either more peaked or flatter than a normal distribution).

#Create Q-Q plot with a title and axis labels
qqnorm(penguins$body_mass_g,
       main = 'Q-Q Plot for Normality',
       xlab = 'Theoretical Dist',
       ylab = 'Sample dist')

#Adding a straight diagonal reference line to the plot
qqline(penguins$body_mass_g, col = 'steelblue')

Another way to construct a Q-Q plot is with the ggpubr package:

library(ggpubr)
ggqqplot(penguins$body_mass_g)

By examining how the points in the Q-Q plot deviate from the reference line, we can infer if and how the body_mass_g distribution deviates from a normal distribution. Remember, a Q-Q plot provides a visual assessment and should be used in conjunction with other methods for a comprehensive analysis of normality.

5) Statistical Test

  1. Shapiro-Wilk Test

    • The Shapiro-Wilk test is widely used for testing normality. This test checks the null hypothesis that the data were drawn from a normal distribution.

    • It is generally preferred for small sample sizes (< 50 samples), but can be used for larger ones.

    shapiro.test(penguins$body_mass_g)
    
        Shapiro-Wilk normality test
    
    data:  penguins$body_mass_g
    W = 0.95921, p-value = 3.679e-08
  2. Kolmogorov-Smirnov Test (K-S Test)

    • The K-S test compares the empirical distribution function of the data with the expected distribution function for a normal distribution.

    • This test is more sensitive towards the center of the distribution than the tails.

    #removing missing value
    data1 <- penguins %>%
      filter(!is.na(body_mass_g))
    # Standardizing the data
    standardized_data <- (data1$body_mass_g - mean(data1$body_mass_g)) / sd(data1$body_mass_g)
    
    # Kolmogorov-Smirnov Test
    ks.test(standardized_data, "pnorm")
    
        Asymptotic one-sample Kolmogorov-Smirnov test
    
    data:  standardized_data
    D = 0.10408, p-value = 0.00121
    alternative hypothesis: two-sided
  3. Anderson-Darling Test

    • Similar to the K-S test, the Anderson-Darling test is a test of goodness of fit.

    • This test gives more weight to the tails than the K-S test and is thus more sensitive to outliers.

    library(nortest)
    ad.test(penguins$body_mass_g)
    
        Anderson-Darling normality test
    
    data:  penguins$body_mass_g
    A = 4.543, p-value = 2.757e-11
  4. Lilliefors Test

    • A modification of the K-S test, the Lilliefors test is specifically designed for testing normality.

    • It is particularly useful when the mean and variance of the distribution are unknown.

    library(nortest)
    lillie.test(penguins$body_mass_g)
    
        Lilliefors (Kolmogorov-Smirnov) normality test
    
    data:  penguins$body_mass_g
    D = 0.10408, p-value = 1.544e-09
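
All four tests share the same null hypothesis, namely that the data were drawn from a normal distribution, so a p-value below 0.05 leads us to reject normality. As a rough sketch (reusing the nortest package loaded above), the four p-values for body_mass_g can be collected in one place:

library(nortest)

bm <- na.omit(penguins$body_mass_g)
bm_std <- (bm - mean(bm)) / sd(bm)  # standardized values for the K-S test

# Collect the p-values of the four normality tests in a named vector
p_values <- c(shapiro    = shapiro.test(bm)$p.value,
              ks         = ks.test(bm_std, "pnorm")$p.value,
              anderson   = ad.test(bm)$p.value,
              lilliefors = lillie.test(bm)$p.value)

p_values < 0.05  # TRUE means normality is rejected at alpha = 0.05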

Basic Statistical Summarization

Data summarization is a crucial aspect of statistical analysis, providing a way to describe and understand large datasets through a few summary statistics. Among the key concepts in data summarization are measures of central tendency, position, and dispersion. Each of these measures gives different insights into the nature of the data; the measures in this section apply to continuous (numerical) data only.

1. Measures of Central Tendency

Measures of central tendency describe the center point or typical value of a dataset. The most common measures are:

  • Mean (Arithmetic Average): The sum of all values divided by the number of values. It’s sensitive to outliers and can be skewed by them.

    # calculating mean of numerical variable
    mean(penguins$body_mass_g, na.rm = TRUE)
    [1] 4201.754
  • Median: The middle value when the data is sorted in ascending order. It’s less affected by outliers and skewness and provides a better central value for skewed distributions.

    #calculating median of numerical variable
    median(penguins$body_mass_g, na.rm=TRUE)
    [1] 4050
  • Mode: The most frequently occurring value in the dataset. There can be more than one mode in a dataset (bimodal, multimodal). Useful in understanding the most common value, especially for categorical data.

    as.numeric(names(sort(table(penguins$body_mass_g), decreasing = TRUE)[1]))
    [1] 3800
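
Because base R has no built-in function for the statistical mode, the one-liner above is often wrapped in a small helper. Here is a sketch for numeric vectors (get_mode is our own name, not a base R function):

# Hypothetical helper: returns the most frequent value(s) of a numeric vector
get_mode <- function(x) {
  x <- x[!is.na(x)]                                  # drop missing values
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])   # may return several modes
}

get_mode(penguins$body_mass_g)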

2. Measures of Position

Measures of position describe how data points fall in relation to the distribution or to each other. These include:

  • Percentiles: Values below which a certain percentage of the data falls. For example, the 25th percentile (or 1st quartile) is the value below which 25% of the data lies; a sketch requesting arbitrary percentiles follows at the end of this list.

    quantile(penguins$body_mass_g, na.rm=TRUE)
      0%  25%  50%  75% 100% 
    2700 3550 4050 4750 6300 
  • Quartiles: Special percentiles that divide the dataset into four equal parts. The median is the second quartile.

    #1st Quartile
    quantile(penguins$body_mass_g, 0.25, na.rm=TRUE)
     25% 
    3550 
    #2nd Quartile
    quantile(penguins$body_mass_g, 0.50, na.rm=TRUE) #same as median
     50% 
    4050 
    #3rd Quartile
    quantile(penguins$body_mass_g, 0.75, na.rm=TRUE)
     75% 
    4750 
  • Interquartile Range (IQR): The range between the first and third quartiles (25th and 75th percentiles). It represents the middle 50% of the data and is a measure of variability that’s not influenced by outliers.

    IQR(penguins$body_mass_g, na.rm=TRUE)
    [1] 1200
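
As noted for percentiles above, quantile() is not limited to quartiles; any probability can be supplied through the probs argument. A short sketch requesting the 10th and 90th percentiles:

# 10th and 90th percentiles of body mass
quantile(penguins$body_mass_g, probs = c(0.10, 0.90), na.rm = TRUE)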

3. Measures of Dispersion

Measures of dispersion or variability tell us about the spread of the data points in a dataset. A few helper functions (min, max, and length) are listed first because they feed into these measures:

  • min: To find the minimum value in the variable

    min(penguins$body_mass_g, na.rm = TRUE)
    [1] 2700
  • max: To find the maximum value in the variable

    max(penguins$body_mass_g, na.rm=TRUE)
    [1] 6300
  • length: to find how many observations are in the variable (note that length() also counts missing values)

    length(penguins$body_mass_g)
    [1] 344
  • Range: The difference between the highest and lowest values. It’s simple but can be heavily influenced by outliers.

    max(penguins$body_mass_g, na.rm=TRUE) - min(penguins$body_mass_g, na.rm=TRUE)
    [1] 3600
  • Variance: The average of the squared differences from the mean. It gives a sense of the spread of the data, but it’s not in the same unit as the data.

    var(penguins$body_mass_g, na.rm=TRUE)
    [1] 643131.1
  • Standard Deviation (SD): The square root of the variance. It’s in the same units as the data and describes how far data points tend to deviate from the mean.

    sd(penguins$body_mass_g, na.rm=TRUE)
    [1] 801.9545
  • Coefficient of Variation (CV): The ratio of the standard deviation to the mean. It’s a unitless measure of relative variability, useful for comparing variability across datasets with different units or means.

    sd(penguins$body_mass_g, na.rm=TRUE) / mean(penguins$body_mass_g, na.rm=TRUE) * 100
    [1] 19.08618

The summary() Function

The summary() function in R is a generic function used to produce result summaries of various model and data objects. When applied to a data frame, it provides a quick overview of the statistical properties of each column. The function is particularly useful for getting a rapid sense of the data, especially during the initial stages of data analysis.

Key Features of summary() in R:
  1. Applicability to Different Objects: The summary() function can be used on different types of objects in R, including vectors, data frames, and model objects. The output format varies depending on the type of object.

  2. Default Output for Data Frames: For a data frame, summary() typically returns the following statistics for each column:

    • For numeric variables: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum.

    • For factor variables: Counts for each level, and NA count if there are missing values.

  3. Handling of Missing Values: The function includes NA values in its output, providing a count of missing values, which is crucial for data cleaning and preprocessing.

  4. Customization: The behavior of summary() can be customized for user-defined classes (S3 or S4) in R. This means that when you create a new type of object, you can also define what summary() should return when applied to objects of this type.

  5. Use in Exploratory Data Analysis (EDA): It is often used as a preliminary step in EDA to get a sense of the data distribution, identify possible outliers, and detect missing values.

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Notes:
  • While summary() provides a quick and useful overview, it’s often just a starting point for data analysis. Depending on the results, you might need more detailed analysis, such as specific statistical tests or detailed data visualizations.

  • The function is particularly handy for quickly checking data after importation, allowing for a rapid assessment of data quality, structure, and potential areas that may require further investigation.

Data Summarization on Matrices

  1. rowMeans(x):

    • Purpose: Calculates the mean of each row in a matrix x.

    • Use Case: Useful when you need to find the average across different variables (columns) for each observation (row).

    • Output: Returns a numeric vector containing the mean of each row.

    data <- penguins
    # Selecting only numeric columns for the matrix
    numeric_data <- data[, sapply(data, is.numeric)]
    numeric_matrix <- as.matrix(na.omit(numeric_data))
    
    # Apply summarization functions
    # Means of each row
    row_means <- rowMeans(numeric_matrix)
  2. colMeans(x):

    • Purpose: Computes the mean of each column in a matrix x.

    • Use Case: Used when you want to find the average value of each variable (column) across all observations (rows).

    • Output: Produces a numeric vector with the mean of each column.

    # Means of each column
    colMeans(numeric_matrix)
       bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
             43.92193          17.15117         200.91520        4201.75439 
                 year 
           2008.02924 
  3. rowSums(x):

    • Purpose: Calculates the sum of each row in a matrix x.

    • Use Case: Helpful for aggregating data across multiple variables (columns) for each individual observation (row).

    • Output: Returns a numeric vector containing the sum of each row.

    # Sums of each row
    row_sums <- rowSums(numeric_matrix)
  4. colSums(x):

    • Purpose: Computes the sum of each column in a matrix x.

    • Use Case: Useful for aggregating data for each variable (column) across all observations (rows).

    • Output: Generates a numeric vector with the sum of each column.

    # Sums of each column
    colSums(numeric_matrix)
       bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
              15021.3            5865.7           68713.0         1437000.0 
                 year 
             686746.0 
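
These four functions are fast special cases of the more general apply(), which accepts any summary function (the apply family is discussed in a later lecture). A brief sketch:

# Standard deviation of each column (MARGIN = 2 means "apply over columns")
apply(numeric_matrix, 2, sd)

# Maximum of each row (MARGIN = 1 means "apply over rows")
row_max <- apply(numeric_matrix, 1, max)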

Detecting how many missing values

To detect how many missing values there are in a variable:

table(is.na(penguins$body_mass_g))

FALSE  TRUE 
  342     2 

The TRUE count is the number of missing values for the variable body_mass_g.
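
Because TRUE is counted as 1 when summed, the same count can be obtained directly with sum():

# Number of missing values in body_mass_g
sum(is.na(penguins$body_mass_g))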

Detecting missing values for all variables in the dataset

colSums(is.na(penguins))
          species            island    bill_length_mm     bill_depth_mm 
                0                 0                 2                 2 
flipper_length_mm       body_mass_g               sex              year 
                2                 2                11                 0 

Apply function

We can calculate summary statistics for several variables simultaneously by using sapply(); this function returns the chosen numerical measure for every numeric column. We will discuss the apply family in another lecture.

Mean

sapply(numeric_data, mean, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907 

Standard Deviation

sapply(numeric_data, sd, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
        5.4595837         1.9747932        14.0617137       801.9545357 
             year 
        0.8183559 

Sum / total in a column

sapply(numeric_data, sum, na.rm=T)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
          15021.3            5865.7           68713.0         1437000.0 
             year 
         690762.0 

Basically, we can substitute any numerical summary function as the second argument.
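
For example, a short sketch substituting the median and the interquartile range as the second argument:

# Median of every numeric column
sapply(numeric_data, median, na.rm = TRUE)

# Interquartile range of every numeric column
sapply(numeric_data, IQR, na.rm = TRUE)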

Summary Statistics by group

Usually in data analysis, when there is a categorical variable, we want to compare the numerical measures across its categories. For example, we may want the mean and standard deviation by gender (male and female) instead of summarizing the mean and standard deviation for the whole dataset.
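
As a quick illustration of the idea, here is a minimal sketch using the penguins data loaded earlier (in practice we would first check normality within each group, as done for mtcars below):

library(dplyr)

# Mean and standard deviation of body mass by sex
penguins %>%
  filter(!is.na(sex), !is.na(body_mass_g)) %>%
  group_by(sex) %>%
  summarise(mean_mass = mean(body_mass_g),
            sd_mass = sd(body_mass_g))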

Example: calculating the mean and standard deviation for the mtcars dataset

#Testing the normality of Miles per Gallon variable by Number of cylinders
#Selecting 4 cylinders
cyl4 <- mtcars %>%
  filter(cyl==4)
lillie.test(cyl4$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl4$mpg
D = 0.16784, p-value = 0.5226
#Selecting 6 cylinders
cyl6 <- mtcars%>%
  filter(cyl==6)
lillie.test(cyl6$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl6$mpg
D = 0.23502, p-value = 0.2863
#Selecting 8 cylinders
cyl8 <- mtcars%>%
  filter(cyl==8)
lillie.test(cyl8$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl8$mpg
D = 0.16305, p-value = 0.3983

Since the normality assumption is met in each group, we calculate the mean and standard deviation of mpg by cyl.

mtcars %>%
  group_by(factor(cyl)) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg = sd(mpg))
# A tibble: 3 × 3
  `factor(cyl)` mean_mpg sd_mpg
  <fct>            <dbl>  <dbl>
1 4                 26.7   4.51
2 6                 19.7   1.45
3 8                 15.1   2.56

Several useful packages for numerical data

1) Hmisc package

library(Hmisc)
Hmisc::describe(penguins$bill_length_mm)
penguins$bill_length_mm 
       n  missing distinct     Info     Mean  pMedian      Gmd      .05 
     342        2      164        1    43.92    43.85    6.274    35.70 
     .10      .25      .50      .75      .90      .95 
   36.60    39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34   34.1, highest: 55.1 55.8 55.9 58   59.6

2) pastecs package

library(pastecs)
stat.desc(penguins$bill_length_mm, desc=TRUE, p=0.95)
     nbr.val     nbr.null       nbr.na          min          max        range 
3.420000e+02 0.000000e+00 2.000000e+00 3.210000e+01 5.960000e+01 2.750000e+01 
         sum       median         mean      SE.mean CI.mean.0.95          var 
1.502130e+04 4.445000e+01 4.392193e+01 2.952205e-01 5.806825e-01 2.980705e+01 
     std.dev     coef.var 
5.459584e+00 1.243020e-01 

3) psych package

library(psych)
psych::describe(penguins$bill_length_mm)
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
X1    1 342 43.92 5.46  44.45   43.91 7.04 32.1 59.6  27.5 0.05    -0.89 0.3

We can also describe the whole dataset:

psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04

4) skimr package

library(skimr)
skim(penguins$bill_length_mm)
Data summary
Name penguins$bill_length_mm
Number of rows 344
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁

We can also skim the whole dataset:

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Summarizing Categorical Data

Summarizing categorical data is an essential part of data analysis, especially when dealing with survey results, demographic information, or any data where variables are qualitative rather than quantitative. The goal is to gain insights into the distribution of categories, identify patterns, and make inferences about the population being studied.

Key Concepts in Summarizing Categorical Data:

  1. Frequency Counts:

    • The most basic form of summarization for categorical data is to count the number of occurrences of each category.

    • In R, the table() function is commonly used for this purpose.

    # To tabulate categorical data
    table(penguins$species)
    
       Adelie Chinstrap    Gentoo 
          152        68       124 
  2. Proportions and Percentages:

    • Converting frequency counts into proportions or percentages provides a clearer understanding of the data relative to the whole.

    • This is particularly useful when comparing groups of different sizes.

    #to get relative frequency by group
    prop.table(table(penguins$species))
    
       Adelie Chinstrap    Gentoo 
    0.4418605 0.1976744 0.3604651 

To combine the frequency and percentage values in one table:

freq1 <- as.data.frame(table(penguins$species))
percent1 <- as.data.frame(prop.table(table(penguins$species))*100)
names(percent1)[2] <- "percentage"
cbind(freq1, percent1[2])
       Var1 Freq percentage
1    Adelie  152   44.18605
2 Chinstrap   68   19.76744
3    Gentoo  124   36.04651
combine <- cbind(freq1, round(percent1[2],2))
combine
       Var1 Freq percentage
1    Adelie  152      44.19
2 Chinstrap   68      19.77
3    Gentoo  124      36.05
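
The same table can also be built in a single dplyr pipeline; a sketch using count() and mutate():

library(dplyr)

# Frequency and percentage of each species in one step
penguins %>%
  count(species, name = "Freq") %>%
  mutate(percentage = round(100 * Freq / sum(Freq), 2))
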
  3. Cross-tabulation:

    • Cross-tabulation (or contingency tables) involves summarizing two or more categorical variables simultaneously, making it possible to observe the relationship between them.

    • In R, this can be achieved using the table() function with multiple variables.

    # to tabulate the cross tabulation between species and island
    table(penguins$species, penguins$island)
    
                Biscoe Dream Torgersen
      Adelie        44    56        52
      Chinstrap      0    68         0
      Gentoo       124     0         0
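
Row or column proportions of a cross-tabulation can be obtained by passing the table to prop.table() with a margin argument (1 = rows, 2 = columns). For example:

# Row proportions: each row of the species-by-island table sums to 1
prop.table(table(penguins$species, penguins$island), margin = 1)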

Several Useful packages for categorical data analysis

1) janitor package

library(janitor)
tabyl(penguins$species, sort=TRUE)
 penguins$species   n   percent
           Adelie 152 0.4418605
        Chinstrap  68 0.1976744
           Gentoo 124 0.3604651

2) epiDisplay package

library(epiDisplay)
tab1(penguins$species, sort.group = "decreasing", 
     cum.percent = TRUE)

penguins$species : 
          Frequency Percent Cum. percent
Adelie          152    44.2         44.2
Gentoo          124    36.0         80.2
Chinstrap        68    19.8        100.0
  Total         344   100.0        100.0

It also produces a bar chart of the frequencies.

3) summarytools package

library(summarytools)
summarytools::freq(penguins$species)
Frequencies  
penguins$species  
Type: Factor  

                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------- ------ --------- -------------- --------- --------------
         Adelie    152     44.19          44.19     44.19          44.19
      Chinstrap     68     19.77          63.95     19.77          63.95
         Gentoo    124     36.05         100.00     36.05         100.00
           <NA>      0                               0.00         100.00
          Total    344    100.00         100.00    100.00         100.00

4) gmodels package

library(gmodels)
CrossTable(penguins$species, format="SPSS") #will return as spss format

   Cell Contents
|-------------------------|
|                   Count |
|             Row Percent |
|-------------------------|

Total Observations in Table:  344 

          |    Adelie  | Chinstrap  |    Gentoo  | 
          |-----------|-----------|-----------|
          |      152  |       68  |      124  | 
          |   44.186% |   19.767% |   36.047% | 
          |-----------|-----------|-----------|

 

This package can also be used for cross-tabulation tables, for example species by sex:

library(gmodels)
CrossTable(penguins$species, penguins$sex,
           format="SPSS", 
           expected = T, #expected value
           prop.r = T, #row total
           prop.c = F, #column total
           prop.t = F, #overall total
           prop.chisq = F, #chi-square contribution of each cell
           chisq = T, #the results of a chi-square
           fisher = F, #the result of a Fisher Exact test
           mcnemar = F) #the result of McNemar test

   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
|             Row Percent |
|-------------------------|

Total Observations in Table:  333 

                 | penguins$sex 
penguins$species |   female  |     male  | Row Total | 
-----------------|-----------|-----------|-----------|
          Adelie |       73  |       73  |      146  | 
                 |   72.342  |   73.658  |           | 
                 |   50.000% |   50.000% |   43.844% | 
-----------------|-----------|-----------|-----------|
       Chinstrap |       34  |       34  |       68  | 
                 |   33.694  |   34.306  |           | 
                 |   50.000% |   50.000% |   20.420% | 
-----------------|-----------|-----------|-----------|
          Gentoo |       58  |       61  |      119  | 
                 |   58.964  |   60.036  |           | 
                 |   48.739% |   51.261% |   35.736% | 
-----------------|-----------|-----------|-----------|
    Column Total |      165  |      168  |      333  | 
-----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  0.04860717     d.f. =  2     p =  0.9759894 


 
       Minimum expected frequency: 33.69369 

Let’s try a real example

TB Incidence

We will use the TB incidence dataset, which can be downloaded from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx

library(readxl)

url <- "https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx"
destfile <- "tb_incidence.xlsx"
curl::curl_download(url, destfile)
data1 <- read_excel(destfile)

1) Rename the first column to “country”.

library(dplyr)
data1 <- rename(data1, country = 'TB incidence, all forms (per 100 000 population per year)' )
names(data1[1])
[1] "country"

2) Find the mean for all variables before year 2000.

avg <- dplyr::select(data1, starts_with("1")) 
colnames(avg)
 [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
colMeans(avg, na.rm=T)
    1990     1991     1992     1993     1994     1995     1996     1997 
105.5797 107.6715 108.3140 110.3188 111.9662 114.1981 115.3527 118.8792 
    1998     1999 
121.5169 125.0435 

3) Create a new variable to represent the mean of TB incidence before year 2000 for all observations.

data1$before_2000_avg <- rowMeans(avg, na.rm=T)
names(data1)
 [1] "country"         "1990"            "1991"            "1992"           
 [5] "1993"            "1994"            "1995"            "1996"           
 [9] "1997"            "1998"            "1999"            "2000"           
[13] "2001"            "2002"            "2003"            "2004"           
[17] "2005"            "2006"            "2007"            "before_2000_avg"
head(data1[, c("country", "before_2000_avg")])
# A tibble: 6 × 2
  country        before_2000_avg
  <chr>                    <dbl>
1 Afghanistan              168  
2 Albania                   26.3
3 Algeria                   41.8
4 American Samoa             8.5
5 Andorra                   28.8
6 Angola                   225. 
new <- data1 %>%
  dplyr::select("country", "before_2000_avg")

4) Summarize the data by mean and standard deviation

summary(new)
   country          before_2000_avg 
 Length:208         Min.   :  3.50  
 Class :character   1st Qu.: 26.45  
 Mode  :character   Median : 61.20  
                    Mean   :113.88  
                    3rd Qu.:175.20  
                    Max.   :637.10  
                    NA's   :1       
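
summary() reports the mean but not the standard deviation, so to answer the question directly we can compute both explicitly; a short dplyr sketch:

new %>%
  summarise(mean_tb = mean(before_2000_avg, na.rm = TRUE),
            sd_tb = sd(before_2000_avg, na.rm = TRUE))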

Youth Tobacco

Import the dataset from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv

library(readr)
data2 <- read_csv("https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv")

1) Explore these variables and take note of the frequency of each category (MeasureDesc, Gender, and Response)

table(data2$MeasureDesc)

              Percent of Current Smokers Who Want to Quit 
                                                     1205 
Quit Attempt in Past Year Among Current Cigarette Smokers 
                                                     1041 
                                           Smoking Status 
                                                     3783 
                                              User Status 
                                                     3765 
table(data2$Gender)

 Female    Male Overall 
   3256    3256    3282 
table(data2$Response)

 Current     Ever Frequent 
    2514     2520     2514 
library(summarytools)
summarytools::freq(data2$MeasureDesc)
Frequencies  
data2$MeasureDesc  
Type: Character  

                                                                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------------------------------------------------------- ------ --------- -------------- --------- --------------
                    Percent of Current Smokers Who Want to Quit   1205     12.30          12.30     12.30          12.30
      Quit Attempt in Past Year Among Current Cigarette Smokers   1041     10.63          22.93     10.63          22.93
                                                 Smoking Status   3783     38.63          61.56     38.63          61.56
                                                    User Status   3765     38.44         100.00     38.44         100.00
                                                           <NA>      0                               0.00         100.00
                                                          Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Gender)
Frequencies  
data2$Gender  
Type: Character  

                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------- ------ --------- -------------- --------- --------------
       Female   3256     33.24          33.24     33.24          33.24
         Male   3256     33.24          66.49     33.24          66.49
      Overall   3282     33.51         100.00     33.51         100.00
         <NA>      0                               0.00         100.00
        Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Response)
Frequencies  
data2$Response  
Type: Character  

                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------- ------ --------- -------------- --------- --------------
       Current   2514     33.31          33.31     25.67          25.67
          Ever   2520     33.39          66.69     25.73          51.40
      Frequent   2514     33.31         100.00     25.67          77.07
          <NA>   2246                              22.93         100.00
         Total   9794    100.00         100.00    100.00         100.00

2) Filter rows where MeasureDesc is "Smoking Status", Gender is "Overall", and Response is "Current"

sub_yts <- data2 %>% 
  filter(MeasureDesc == "Smoking Status", 
        Gender == "Overall", 
        Response == "Current") %>% 
  dplyr::select (YEAR, LocationDesc, Data_Value)

head(sub_yts)
# A tibble: 6 × 3
   YEAR LocationDesc Data_Value
  <dbl> <chr>             <dbl>
1  2015 Arizona             3.2
2  2015 Connecticut         0.8
3  2015 Connecticut         5.6
4  2015 Georgia            10.8
5  2015 Hawaii              3  
6  2015 Hawaii              7.4

3) Summarize Data_Value by YEAR

dplyr::summarize(group_by(sub_yts, YEAR), 
          year_avg=mean(Data_Value, na.rm=T))
# A tibble: 17 × 2
    YEAR year_avg
   <dbl>    <dbl>
 1  1999    20.5 
 2  2000    19.9 
 3  2001    15.7 
 4  2002    16.8 
 5  2003    13.2 
 6  2004    13.9 
 7  2005    14.1 
 8  2006    14.1 
 9  2007    13.0 
10  2008    12.2 
11  2009    11.7 
12  2010    12.3 
13  2011    11.8 
14  2012     9.95
15  2013     7.78
16  2014     7.16
17  2015     6.58
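
The same pipeline can be extended with additional measures, for example the standard deviation and the number of records per year; a sketch:

sub_yts %>%
  group_by(YEAR) %>%
  summarise(year_avg = mean(Data_Value, na.rm = TRUE),
            year_sd = sd(Data_Value, na.rm = TRUE),
            n_rows = n())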