In-Class Exercise 1

Author

Yang Jun

Getting Started

Install and Load R Packages

pacman::p_load(tidyverse)

Import the Data

exam_data <- read_csv("data/Exam_data.csv")
Rows: 322 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ID, CLASS, GENDER, RACE
dbl (3): ENGLISH, MATHS, SCIENCE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Working with Themes

Change the plot panel background of theme_minimal() to light blue and the major grid lines to white.

ggplot(exam_data,
       aes(y=RACE)) +
  geom_bar() +
  theme_minimal() + 
  theme(panel.background = element_rect(fill='lightblue', colour='lightblue'),
        panel.grid.major = element_line(color='white'))
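
The order of the calls matters in the code above: theme_minimal() is a complete theme, so it has to come before the theme() tweaks, otherwise it would overwrite them. A minimal sketch of the two orderings, assuming a small made-up data frame (toy_data and custom_panel are hypothetical names, and toy_data only stands in for exam_data):

```r
library(ggplot2)

# Hypothetical toy data standing in for exam_data
toy_data <- data.frame(RACE = c("Chinese", "Chinese", "Malay", "Indian"))

# The same panel tweaks as in the exercise, stored so they can be reused
custom_panel <- theme(
  panel.background = element_rect(fill = "lightblue", colour = "lightblue"),
  panel.grid.major = element_line(colour = "white")
)

base_plot <- ggplot(toy_data, aes(y = RACE)) + geom_bar()

# Complete theme first, tweaks second: the light blue panel is kept
p_kept <- base_plot + theme_minimal() + custom_panel

# Tweaks first, complete theme second: theme_minimal() replaces the tweaks
p_lost <- base_plot + custom_panel + theme_minimal()
```

Printing p_kept and p_lost side by side should show the light blue panel only in the first plot.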

Designing Data-Driven Graphics for Analysis

I. Bar Chart Makeover

Before

The y-axis label is not clear, the bars are not sorted, and frequency values are not shown.

ggplot(exam_data,
       aes(x=RACE)) +
  geom_bar()

After

The y-axis has a clearer label, the bars are sorted by frequency in descending order, and frequency and percentage labels are provided for each bar.

ggplot(exam_data,
       aes(x=fct_infreq(RACE))) +
  geom_bar() +
  geom_text(aes(label=paste0(after_stat(count), sprintf(' (%.1f%%)', after_stat(prop)*100)), group=1),
            stat='count', 
            vjust=-0.5, 
            colour='black') +
  labs(x='Race', y='No. of Pupils') +
  scale_y_continuous(limits=c(0,220))
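
The bar labels above combine a count with its share of all pupils. As a rough cross-check, the same numbers can be tabulated directly with dplyr. This is only a sketch on made-up data: toy_data, race_summary, n_pupils, pct and label are hypothetical names, not part of the exercise.

```r
library(dplyr)

# Hypothetical toy data standing in for exam_data
toy_data <- tibble(
  RACE = c("Chinese", "Chinese", "Chinese", "Malay", "Malay", "Indian")
)

race_summary <- toy_data %>%
  # Frequency per race, largest group first (mirrors fct_infreq())
  count(RACE, name = "n_pupils", sort = TRUE) %>%
  # Share of all pupils and the same label format used in geom_text()
  mutate(pct = n_pupils / sum(n_pupils) * 100,
         label = sprintf("%d (%.1f%%)", n_pupils, pct))

race_summary
```

Running the same pipeline on exam_data should give labels that match the ones drawn above each bar.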

II. Histogram Makeover

Before

The default fill and outline colours make the individual bins hard to tell apart, and there are no mean or median reference lines.

ggplot(exam_data,
       aes(x=MATHS)) +
  geom_histogram(bins=20)

After

The fill and outline colours are changed, and dashed reference lines mark the mean (red) and the median (black).

ggplot(exam_data,
       aes(x=MATHS)) +
  geom_histogram(bins=20,
                 color='black',
                 fill='light blue') +
  geom_vline(xintercept=mean(exam_data$MATHS), color='red', linetype='dashed', linewidth=1) +
  geom_vline(xintercept=median(exam_data$MATHS), color='black', linetype='dashed', linewidth=1)
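
The plot relies on the surrounding text to say which line is the mean and which is the median. One possible alternative, sketched here on made-up scores, is to put both statistics in a small data frame and map them to colour so that a legend identifies them. The names toy_scores, ref_lines, statistic and value are hypothetical, and linewidth assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Hypothetical scores standing in for exam_data$MATHS
toy_scores <- data.frame(MATHS = c(35, 48, 52, 60, 61, 67, 70, 74, 82, 95))

# One row per reference line
ref_lines <- data.frame(
  statistic = c("Mean", "Median"),
  value = c(mean(toy_scores$MATHS), median(toy_scores$MATHS))
)

p <- ggplot(toy_scores, aes(x = MATHS)) +
  geom_histogram(bins = 5, colour = "black", fill = "lightblue") +
  # Mapping colour to the statistic name produces a legend entry per line
  geom_vline(data = ref_lines,
             aes(xintercept = value, colour = statistic),
             linetype = "dashed", linewidth = 1) +
  scale_colour_manual(values = c(Mean = "red", Median = "black"), name = NULL)

p
```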

III. Histogram Makeover 2

Before

The histograms show the distribution of English scores by gender, but give no sense of how each group compares with all pupils.

ggplot(exam_data,
       aes(x=ENGLISH)) +
  geom_histogram(bins=25) +
  facet_wrap(~GENDER)

After

The histogram of all pupils is added as a light background to provide context of how each gender scores compared to the overall performance.

# Drop GENDER so the background histogram is repeated in every facet
exam_data_bg <- select(exam_data, -GENDER)

ggplot(exam_data,
       aes(x=ENGLISH, fill=GENDER)) +
  geom_histogram(data=exam_data_bg, bins=25, fill='grey', alpha=0.5) +
  geom_histogram(bins=25, colour='black') +
  facet_wrap(~GENDER) +
  guides(fill='none') +
  theme_bw()
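
The background trick works because the background data has no GENDER column: when a layer's data lacks the faceting variable, facet_wrap() repeats that layer in every panel. A small sketch on made-up data that checks this with layer_data() (toy_data, toy_bg and bg_counts are hypothetical names):

```r
library(ggplot2)

# Hypothetical toy data standing in for exam_data
toy_data <- data.frame(
  GENDER  = c("Female", "Female", "Female", "Male", "Male"),
  ENGLISH = c(55, 62, 78, 48, 70)
)

# Background data without the faceting variable
toy_bg <- toy_data["ENGLISH"]

p <- ggplot(toy_data, aes(x = ENGLISH)) +
  geom_histogram(data = toy_bg, bins = 5, fill = "grey", alpha = 0.5) +
  geom_histogram(aes(fill = GENDER), bins = 5, colour = "black") +
  facet_wrap(~GENDER)

# Computed bins for the background layer (layer 1), summed per panel
bg_counts <- layer_data(p, 1)
tapply(bg_counts$count, bg_counts$PANEL, sum)
```

Each panel total should equal nrow(toy_data), which indicates that every facet received the full background histogram rather than one gender's share.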

IV. Scatterplot Makeover

Before

Scatterplot of English vs Maths scores. The axes have different scales even though both use the same units, and there are no reference lines marking the passing score of 50%.

ggplot(exam_data,
       aes(x=MATHS, y=ENGLISH)) +
  geom_point()

After

Both axes are standardised to the same scale. Reference lines are added to indicate scores of 50%.

ggplot(exam_data,
       aes(x=MATHS, y=ENGLISH)) +
  geom_vline(xintercept=50, color='grey70', linetype='dashed', linewidth=1) +
  geom_hline(yintercept=50, color='grey70', linetype='dashed', linewidth=1) +
  geom_point() +
  coord_fixed(xlim=c(0,100),ylim=c(0,100))