Data Visualization Portfolio – R Studio

The following work was completed as part of ECON 315: Data Visualization and Analysis and the University of Nebraska – Lincoln. Included are three homework assignments.

Homework Assignments

Homework 4 – What was the Benefit of the National Supported Work Demonstration?

Introduction

This assignment looks at the National Supported Work Demonstration (NSW) – a subsidized work program that operated in 15 locations in the 1970s. The program provided approximately 10,000 persons with 12-18 months of employment and helped them find a job afterwards. The program costed an average of $5,600 per person in 1978 dollars. To qualify for the program, the individual had to be unemployed, must have been employed full-time for less than half of the preceding 6 months, and fall into one of the following categories:

  • Female and long-term recipient of “Aid to Families with Dependent Children” (AFDC)
  • Ex-drug addict
  • Ex-offender
  • Young school dropout

The primary question of this assignment is whether or not the NSW program benefited its recipients and whether the benefits outweighed the cost.

The data for this assignment is located in Canvas under “Files > Data > nsw.rda”. You can load it into R with the following command.

load("nsw.rda")

The data is a sample of males that participated in the program and those who did not and did not receive significant aid in another form. The relevant variables are

  • id – unique identifier representing the person in the sample
  • year – year of the observation. Observations in 1975 took place before participation in NSW; observations in 1978 took place afterwards.
  • treated – 1 if participated in NSW, 0 if not, and NA if from a different control sample.
  • re – Annual real earnings in 1975 dollars
  • re74 – Annual real earnings in 1974
  • Other variables are controls for age, years of education, demographic dummy variables (1 if part of the group, 0 otherwise)

Use the following libraries to help with the analysis.

library(dplyr)
library(ggplot2)

1. Prepping the Data

Replace all NA values in the treated column with 0s. Then, create a dummy variable that takes a value of 0 before the treatment and 1 after the treatment. Then, create a frequency table (row percent) with treated in rows and black in columns. Are black people over- or under-represented in the NSW treatment group?

Black people represent 80% of the treatment group. They are only 11% of the control group, so this is definitely an over-representation.
The distrubution of this historgram in the two years shows that most people in the control group were earning a salary of around 2500. Income for both the control and treatment group increased in the later year. For the treatment group, the distribution changed from heavily left-skewed to more evenly distributed, meaning that more people moved from earning 0 income to earning 10 20 or 30 thousand a year.
r ggplot(nsw, aes(x = re, fill = factor(treated))) + geom_histogram(binwidth = 5000, position = "dodge") + facet_wrap(~year) + xlim(0, 50000) + labs(title = "Distribution of income by treatment group and year", x = "income", y = "Count", fill = "treatment group")
## Warning: Removed 160 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).
## 3. Plot
Calculate and show the average real earnings for each group before and after the treatment. Then, create a line plot with year on the x-axis and real earnings on the y-axis. Lines should be plotted for the average earnings of three groups – the control group, the treatment, and the counterfactual of the treatment that is parallel to the control group. Fully label the graph. Does the average treatment effect appear to be positive, negative, or close to zero?

I am not sure how we would determine which of the treatment group are parallel to the control in this case. When looking at just the control and treatment group in the line graph,we can compare the increase in income from each group. Using a regression, we can see that the slope of the treatment group is nearly double that of the control group. This means that the treatment groups’ income increased twice as much. Therefore, I would conclude that the treatment effect was positive.

library(ggplot2)
avg_re_control <- aggregate(re ~ year, data = subset(nsw, treated == 0), mean)
avg_re_treatment <- aggregate(re ~ year, data = subset(nsw, treated == 1), mean)

# Create a line plot for control and treatment groups
plot(avg_re_control$year, avg_re_control$re, type = "l", col = "blue",
     ylim = c(0, 40000), xlab = "Year", ylab = "re",
     main = "Average re for control and treatment groups")
lines(avg_re_treatment$year, avg_re_treatment$re, type = "l", col = "red")
legend("topright", legend = c("Control", "Treatment"), col = c("blue", "red"), lty = 1)

##calculate slopes
control_slope <- lm(re ~ year, data = subset(nsw, treated == 0))$coefficients[2]
treatment_slope <- lm(re ~ year, data = subset(nsw, treated == 1))$coefficients[2]

# Print the slopes
cat("Control slope:", control_slope, "\n")
## Control slope: 461.9562
cat("Treatment slope:", treatment_slope, "\n")
## Treatment slope: 970.0846

4. Regression 1

Perform a simple linear regression to estimate the average treatment effect using the difference-in-differences approach. Interpret the results. Did the intervention help the NSW recipients? Is the effect statistically significant?

The p-values for both the treatment variable and the ‘diff_in_diff’ variable are much less than 0.05, which suggests that they are both significant predictors of real income. This means that the program definitely helped recipients.
In this new regression, the r^2 value is now .16. This is much better than the original regressions value of .01. The inclusion of more variables definitely helps us better explain the variation in real income.
r lm1 <- lm(re ~ treated + diff_in_diff + age + educ + black + married + hisp, data = nsw) summary(lm1)
## ## Call: ## lm(formula = re ~ treated + diff_in_diff + age + educ + black + ## married + hisp, data = nsw) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23881 -6397 1188 6824 138209 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1791.855 291.298 -6.151 7.76e-10 *** ## treated -4822.374 576.685 -8.362 < 2e-16 *** ## diff_in_diff 970.085 264.802 3.663 0.000249 *** ## age 147.929 5.058 29.247 < 2e-16 *** ## educ 659.489 17.958 36.725 < 2e-16 *** ## black -2093.693 159.575 -13.120 < 2e-16 *** ## married 5665.912 122.539 46.238 < 2e-16 *** ## hisp -896.166 201.385 -4.450 8.61e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9681 on 38400 degrees of freedom ## Multiple R-squared: 0.1658, Adjusted R-squared: 0.1656 ## F-statistic: 1090 on 7 and 38400 DF, p-value: < 2.2e-16
## 6. Cost-Benefit Analysis
Do you believe that the benefits of the program outweighed the costs? Use the estimates from your regressions, the stated cost of the program in the introduction, and any other economic considerations. (Don’t get political!)

I looked up the estimated cost of the program since I wasn’t sure if the 5400/person was just treated individuals or all participants. The ojp.gov site says that all in all the cost was 82.4 million.

From the code below, I calculated that the average increase in income for a treated subject was 1455.12 while the control groups increase was only 2.93. For the near 600 participants in the treatment, the total increase in income was $864,345. This obviously does not compare to the 84 million spent on the project, but I think the more valuable result of this study was the knowledge gained. Also, the study only tracked participants for 3 years. The total realized gain from the initial investment into a person would have to be summed over their entire lifetime, and since almost 1500 in income was realized over the first three years, I believe that that income would be reinvested into the welfare of the person and only further encourage them to make more income than the control group.

ctrl_7578 <- subset(nsw, treated == 0) 
ctrl_re_1975 <- subset(ctrl_7578, year == 1975)$re
ctrl_re_1978 <- subset(ctrl_7578, year == 1978)$re 
ctrl_increase <- sum(ctrl_re_1978 - ctrl_re_1975)


treat_7578 <- subset(nsw, treated == 1) 
treat_re_1975 <- subset(treat_7578, year == 1975)$re 
treat_re_1978 <- subset(treat_7578, year == 1978)$re
treat_increase <- sum(treat_re_1978 - treat_re_1975) 

cat("Total increase in income from 1975 to 1978 for control group :", ctrl_increase, "\n")
## Total increase in income from 1975 to 1978 for control group : 26202619
cat("Total increase in income from 1975 to 1978 for treatment group:", treat_increase, "\n")
## Total increase in income from 1975 to 1978 for treatment group: 864345.4
sum(nsw$treated == 0)
## [1] 37814
sum(nsw$treated == 1)
## [1] 594
cat("acg increase in income from 1975 to 1978 for control group per person:", ctrl_increase/37814, "\n")
## acg increase in income from 1975 to 1978 for control group per person: 692.9343
cat("avg increase in income from 1975 to 1978 for treatment group per person:", treat_increase/594, "\n")
## avg increase in income from 1975 to 1978 for treatment group per person: 1455.127
Homework 5 - Predicting Home Loan Defaults

Introduction

In this assignment, you will predict home loan defaults at a micro and macro level. The first half of the assignment asks you to estimate the probabilities of default associated with characteristics such as income or interest rates. The second half of the assignment focuses on the time series data on default rates.

The data for this assignment is located in Canvas under “Files > Data > defaults.rda”. You can load it into R with the following command.

load("~/defaults.rda")
defaults_copy <- defaults

Each observation in the data set is an individual with a home loan. There are 500 observations in each quarter from 1991 to 2022. The relevant variables are

  • date – year and quarter of observation
  • default – 1 if the individual defaulted on their loan, 0 otherwise
  • income – weekly income at the time of observation
  • price – purchase price of home
  • score – credit score at time of observation
  • interest – nominal interest rate on loan

Use the following libraries to help with the analysis.

library(fpp3)
library(margins)
slide <- function(x, f, .before = 0L, .after = 0L, .complete=FALSE, ...){
  n    <- length(x)
  new  <- rep(NA, n)
  for (t in (1+.before):(n-.after) ){
    new[t] <- f(x[(t-.before):(t+.after)], ...)
  }
  return(new)
}

1. Linear Probability Model

Perform a linear regression with default as the dependent variable and the others as independent variables. Interpret the results. Are there any results that are surprising?

All coefficients are statistically significant (p-values < 0.001) except for interest (p-value = 0.00233). The R-squared value (0.03886) indicates that the model explains 3.89% of the variance in default. Overall, the results show that the variables are statistically significant but do not explain very much of the total variance in the dependent variable ———————

model <- lm(default ~ date + income + price + score + interest, data = defaults)
summary(model)
## 
## Call:
## lm(formula = default ~ date + income + price + score + interest, 
##     data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11789 -0.06895 -0.03729  0.00038  1.04146 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.207e-01  1.496e-02  14.750  < 2e-16 ***
## date         3.036e-06  5.211e-07   5.827 5.67e-09 ***
## income       3.105e-06  5.291e-07   5.868 4.42e-09 ***
## price       -1.584e-07  2.216e-08  -7.147 8.98e-13 ***
## score       -3.168e-04  1.093e-05 -28.979  < 2e-16 ***
## interest    -2.321e-03  7.624e-04  -3.044  0.00233 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1906 on 63994 degrees of freedom
## Multiple R-squared:  0.03886,    Adjusted R-squared:  0.03878 
## F-statistic: 517.4 on 5 and 63994 DF,  p-value: < 2.2e-16

2. Logit Model

Perform the same regression as above, except using a logit regression. Show the results and discuss statistical significance. Then, calculate the average marginal effects using the margins command. Discuss the difference in marginal effects between the logit model and the linear probability model.

All coefficients are statistically significant at the 0.05 level, except for the intercept and interest (p-value = 0.00188). Income, score, date and price all have negative relationships with default while interest has a positive relationship. ———————

model <- glm(default ~ date + income + price + score + interest, data = defaults, family = binomial)
summary(model)
## 
## Call:
## glm(formula = default ~ date + income + price + score + interest, 
##     family = binomial, data = defaults)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7926  -0.3355  -0.1950  -0.0822   5.3166  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.529e-01  4.094e-01   1.106  0.26863    
## date         1.415e-04  1.709e-05   8.281  < 2e-16 ***
## income      -1.097e-04  3.528e-05  -3.109  0.00188 ** 
## price       -1.111e-05  1.523e-06  -7.295 2.98e-13 ***
## score       -6.551e-03  4.581e-04 -14.300  < 2e-16 ***
## interest    -9.001e-02  2.131e-02  -4.224 2.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21229  on 63999  degrees of freedom
## Residual deviance: 18145  on 63994  degrees of freedom
## AIC: 18157
## 
## Number of Fisher Scoring iterations: 8

3. Feature Engineering

Calculate two extra variables – the ratio of income to housing price and the difference between the interest rate and the average interest rate in the time period. Perform the linear regression from Question 1, replacing the appropriate variables with the newly calculated versions. Interpret the results. How well does this model perform?

This model performs very similarly to the first model, having a difference in R^2 value of only .5%. This is statistically signifigant, but not a dramatic difference from the original regression.

defaults$`incomeoverprice` <- defaults$income / defaults$price

defaults <- defaults %>% 
  group_by(date) %>% 
  mutate(netinterest1 = interest - mean(interest))

model <- lm(default ~ date + incomeoverprice + price + score + netinterest1, data = defaults)
summary(model)
## 
## Call:
## lm(formula = default ~ date + incomeoverprice + price + score + 
##     netinterest1, data = defaults)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12201 -0.06927 -0.03823 -0.00225  1.05785 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.002e-01  8.365e-03  11.977  < 2e-16 ***
## date             2.285e-06  3.029e-07   7.544 4.61e-14 ***
## incomeoverprice -1.032e+00  8.465e-02 -12.196  < 2e-16 ***
## price           -7.786e-08  1.119e-08  -6.961 3.41e-12 ***
## score           -7.592e-05  1.429e-05  -5.311 1.09e-07 ***
## netinterest1     4.412e-03  1.076e-03   4.100 4.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1904 on 63994 degrees of freedom
## Multiple R-squared:  0.04129,    Adjusted R-squared:  0.04121 
## F-statistic: 551.2 on 5 and 63994 DF,  p-value: < 2.2e-16

4. Default Rate Across Time

Calculate the default rate in each time period. Then, calculate the trend-cycle component of the default rate (annual centered moving average). Plot the default rate and the trend-cycle component. Fully label the graph and discuss.


library(dplyr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
default_rate <- defaults %>% 
  group_by(date) %>% 
  summarize(default_rate = mean(default))

default_rate_zoo <- zoo(default_rate$default_rate, as.yearqtr(default_rate$date))
trend_cycle <- rollmean(default_rate_zoo, k = 4, na.pad = TRUE)
## Warning in rollmean.zoo(default_rate_zoo, k = 4, na.pad = TRUE): na.pad is
## deprecated. Use fill.
plot(default_rate_zoo, main = "Default Rate vs. Trend-Cycle Component",
     ylab = "Default Rate", xlab = "Date", ylim = c(0, max(default_rate_zoo)))
lines(trend_cycle, col = "red")
legend("topright", c("Default Rate", "Trend-Cycle Component"), 
       col = c("black", "red"), lty = 1)

5. Seasonal Effect

Calculate the seasonal and remainder components of the default rate and the seasonally-adjusted default rate. Which quarters are associated with higher default rates? Plot all components on a single graph and discuss.

Quarters 140-200 have significantly higher default rates.

library(zoo)

library(stlplus)
## Warning: package 'stlplus' was built under R version 4.2.3
default_rate <- defaults %>% 
  group_by(date) %>% 
  summarize(default_rate = mean(default))

default_ts <- ts(default_rate$default_rate, frequency = 4)
decomp <- stl(default_ts, s.window = "periodic")

seasonal <- decomp$time.series[, "seasonal"]
trend <- decomp$time.series[, "trend"]
remainder <- decomp$time.series[, "remainder"]

sa_default_rate <- default_rate$default_rate - seasonal

plot(default_rate$date, default_rate$default_rate, type = "l", xlab = "Date", ylab = "Rate", main = "Default Rate and Components")
lines(default_rate$date, seasonal, col = "red")
lines(default_rate$date, remainder, col = "blue")
lines(default_rate$date, trend, col = "purple")
lines(default_rate$date, sa_default_rate, col = "orange")
legend("topright", c("Default Rate", "Seasonal", "Remainder", "SA default", "trend"), col = c("black", "red", "blue", "orange", "purple"), lty = 1)

6. AR Model

Estimate an \(AR(1)\) model for the default rate. What is the predicted value for Q1 2023? What is the predicted value for Q2 2023? What is the long-run prediction for the default rate?


ar_model <- arima(default_rate$default_rate, order = c(1, 0, 0))

summary(ar_model)
##           Length Class  Mode     
## coef        2    -none- numeric  
## sigma2      1    -none- numeric  
## var.coef    4    -none- numeric  
## mask        2    -none- logical  
## loglik      1    -none- numeric  
## aic         1    -none- numeric  
## arma        7    -none- numeric  
## residuals 128    ts     numeric  
## call        3    -none- call     
## series      1    -none- character
## code        1    -none- numeric  
## n.cond      1    -none- numeric  
## nobs        1    -none- numeric  
## model      10    -none- list
future_dates <- seq(as.Date("2023-12-30"), by = "quarter", length.out = 4)

predict(ar_model, n.ahead = 4, newdata = default_rate[length(default_rate$default_rate)-3:length(default_rate$default_rate)], 
        interval = "prediction")$pred
## Time Series:
## Start = 129 
## End = 132 
## Frequency = 1 
## [1] 0.01815533 0.01830883 0.01846052 0.01861042
Homework 6 - Forecasting COVID Cases

Introduction

In this assignment, you will predict the number of COVID cases in the U.S. for the next 3 months. Be sure to load the following libraries to help with the analysis.

library(fpp3)

The data for this assignment is located in Canvas under “Files > Data > covid.rda”. You can load it into R with the following command.

load("covid.rda")

Each observation in the data set contains the number of Covid cases in the US each week. The data goes from Week 5 in 2020 to Week 15 2023 (April 18). Your goal is to determine the best forecasting model.

1) Time Series Plot

Plot covid cases across time. Be sure to fully label the plot. Describe cases over time and more recently.

Create a seasonal plot. Does there appear to be any relationship between Covid cases and the seasons?

Plot the autocorrelation function of Covid cases. Is there autocorrelation? Discuss.

Cases have been generally low but with massive peaks at the start of 2021 and the start of 2022. This corresponds perfectly with our second graph, which shows a higher rate of cases in the winter months. Winter is typically when cold and flu season occurs, so it makes sense that both graphs show higher numbers in winter months. I do believe that the plot shows autocorrelation since we can see that there are multiple points where the graph reaches or almost reaches the confidence line. Also, since we can clearly observe an oscillating pattern in the autocorrelation plot, this combines with the previous evidence to form a strong case for autocorrelation.


library(ggplot2)
library(lubridate)

covid$week <- as.Date(paste0(covid$week, 1), format = "%Y W%U %u")
ggplot(data = covid, aes(x = week, y = cases)) +
  geom_line() +
  labs(x = "Week", y = "Cases") +
  theme_minimal()

library(dplyr)
get_season <- function(date) {
  month <- month(date)
  if (month %in% c(12, 1, 2)) {
    return("Winter")
  } else if (month %in% c(3, 4, 5)) {
    return("Spring")
  } else if (month %in% c(6, 7, 8)) {
    return("Summer")
  } else {
    return("Fall")
  }
}

covid_seasons <- covid %>%
  mutate(season = sapply(week, get_season)) %>%
  group_by(season) %>%
  summarise(total_cases = sum(cases))

ggplot(data = covid_seasons, aes(x = season, y = total_cases)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Season", y = "Total Cases") +
  theme_minimal()

library(forecast)
ggAcf(covid$cases, lag.max = 52, main = "Autocorrelation of Cases") +
  theme_minimal()

2) Split the Dataset

Split the dataset into a training set and a testing set. The testing set should have the latest 12 observations. The rest should be in the training set.


total_rows <- 168

training_set <- covid[1:(total_rows - 12),]
testing_set <- covid[(total_rows - 11):total_rows,]

3) Simple Forecasts

Use the training set to forecast covid cases for the next 12 periods in the future using 3 methods: the mean, the latest value (naive), and a random walk with drift. Show the first forecast for each model. How much do the forecasts differ?

Plot the forecasts (without error bands) on top of the full data set, with observations before 2022 filtered out. Which model appears to have performed best based on the plot?


None of the models performed very well, but technically the naive model was the closest to actual. Each of the models predicted either steady or increasing cases, but the actual number of cases went down.

h <- 12

mean_forecast <- function(train_data, h) {
  mean(train_data) * rep(1, h)
}

mean_predictions <- mean_forecast(training_set$cases, 12)


naive_forecast <- function(train_data, h) {
  rep(tail(train_data, 1), h)
}

naive_predictions <- naive_forecast(training_set$cases, 12)



random_walk_drift_forecast <- function(train_data, h) {
  n <- length(train_data)
  drift <- (train_data[n] - train_data[1]) / (n - 1)
  train_data[n] + drift * seq(1, h)
}

random_walk_drift_predictions <- random_walk_drift_forecast(training_set$cases, 12)


forecast_dates <- testing_set$week
forecast_df <- data.frame(week = forecast_dates,
                          mean = mean_predictions,
                          naive = naive_predictions,
                          random_walk_drift = random_walk_drift_predictions)


covid_long <- covid %>%
  select(week, cases) %>%
  mutate(method = "Actual") %>%
  rename(value = cases)

forecast_long <- forecast_df %>%
  pivot_longer(-week, names_to = "method", values_to = "value")

combined_data <- rbind(covid_long, forecast_long)



combined_data_filtered <- combined_data %>%
  filter(week >= as.Date("2022-07-01"))


ggplot(data = combined_data_filtered, aes(x = week, y = value, color = method)) +
  geom_line() +
  labs(x = "Week", y = "Cases", title = "Forecasting Methods Comparison") +
  scale_color_manual(values = c("Actual" = "black", "mean" = "red", "naive" = "blue", "random_walk_drift" = "green")) +
  theme_minimal()

4) ARIMA Forecasts

Estimate 4 different ARIMA models, letting the ARIMA command determine the best choice of p, d, and q. The first model is on the number of cases. The second model uses the natural log of the number of cases. The last two models use a Box-Cox transformation with lambdas of 0.1 and 0.5, respectively.

Show the first forecast for each model. How much do the forecasts differ?

Plot the forecasts (without error bands) on top of the full data set, with observations before 2022 filtered out. Which model appears to have performed best based on the plot?

From these four models, we can clearly see that the Box-Cox transformation with a lambda value of 0.1 performed closest to the actual data. While the slope was quite different, the model predicted a fairly accurate number for the final 12th week.


training_set$log_cases <- log(training_set$cases)
training_set$box_cox_0.1 <- forecast::BoxCox(training_set$cases, lambda = 0.1)
training_set$box_cox_0.5 <- forecast::BoxCox(training_set$cases, lambda = 0.5)


# Model 1: Original number of cases
arima_model_cases <- auto.arima(training_set$cases)

# Model 2: Natural log of the number of cases
arima_model_log_cases <- auto.arima(training_set$log_cases)

# Model 3: Box-Cox transformation with lambda = 0.1
arima_model_box_cox_0.1 <- auto.arima(training_set$box_cox_0.1)

# Model 4: Box-Cox transformation with lambda = 0.5
arima_model_box_cox_0.5 <- auto.arima(training_set$box_cox_0.5)



# Forecast the next 12 periods for each model
forecast_cases <- forecast(arima_model_cases, h = 12)
forecast_log_cases <- forecast(arima_model_log_cases, h = 12)
forecast_box_cox_0.1 <- forecast(arima_model_box_cox_0.1, h = 12)
forecast_box_cox_0.5 <- forecast(arima_model_box_cox_0.5, h = 12)

# Re-transform the forecasts to the original scale (if needed)
forecast_log_cases$mean <- exp(forecast_log_cases$mean)
forecast_box_cox_0.1$mean <- forecast::InvBoxCox(forecast_box_cox_0.1$mean, lambda = 0.1)
forecast_box_cox_0.5$mean <- forecast::InvBoxCox(forecast_box_cox_0.5$mean, lambda = 0.5)


forecast_dates <- testing_set$week
forecast_df_arima <- data.frame(week = forecast_dates,
                                arima_cases = forecast_cases$mean,
                                arima_log_cases = forecast_log_cases$mean,
                                arima_box_cox_0.1 = forecast_box_cox_0.1$mean,
                                arima_box_cox_0.5 = forecast_box_cox_0.5$mean)



covid_long_arima <- covid %>%
  select(week, cases) %>%
  mutate(method = "Actual") %>%
  rename(value = cases)

forecast_long_arima <- forecast_df_arima %>%
  pivot_longer(-week, names_to = "method", values_to = "value")

combined_data_arima <- rbind(covid_long_arima, forecast_long_arima)


combined_data_arima_filtered <- combined_data_arima %>%
  filter(week >= as.Date("2022-07-01"))


ggplot(data = combined_data_arima_filtered, aes(x = week, y = value, color = method)) +
  geom_line() +
  labs(x = "Week", y = "Cases", title = "ARIMA Models Comparison (2022 and Later)") +
  scale_color_manual(values = c("Actual" = "black",
                                "arima_cases" = "red",
                                "arima_log_cases" = "blue",
                                "arima_box_cox_0.1" = "green",
                                "arima_box_cox_0.5" = "purple")) +
  theme_minimal()

5) Testing Accuracy

Test the accuracy of the models in questions 4 and 5 against the test dataset. Pick the two best models and discuss why you chose them. Based off of the accuracy summaries and specifically the RMSE values, the Box-Cox 0.1 and the Box-Cox 0.5 are the best performing models.


mean_predictions_test <- rep(mean(training_set$cases), 12)
naive_predictions_test <- rep(training_set$cases[nrow(training_set)], 12)
random_walk_drift_predictions_test <- training_set$cases[nrow(training_set)] + (1:12)

arima_cases_pred <- forecast(arima_model_cases, h = 12)$mean
arima_log_cases_pred <- exp(forecast(arima_model_log_cases, h = 12)$mean)
arima_box_cox_0.1_pred <- forecast::InvBoxCox(forecast(arima_model_box_cox_0.1, h = 12)$mean, lambda = 0.1)
arima_box_cox_0.5_pred <- forecast::InvBoxCox(forecast(arima_model_box_cox_0.5, h = 12)$mean, lambda = 0.5)


# Mean model
mean_accuracy <- accuracy(f = mean_predictions_test, x = testing_set$cases)

# Naive model
naive_accuracy <- accuracy(f = naive_predictions_test, x = testing_set$cases)

# Random walk with drift model
random_walk_drift_accuracy <- accuracy(f = random_walk_drift_predictions_test, x = testing_set$cases)

# ARIMA models
arima_cases_accuracy <- accuracy(f = arima_cases_pred, x = testing_set$cases)
arima_log_cases_accuracy <- accuracy(f = arima_log_cases_pred, x = testing_set$cases)
arima_box_cox_0.1_accuracy <- accuracy(f = arima_box_cox_0.1_pred, x = testing_set$cases)
arima_box_cox_0.5_accuracy <- accuracy(f = arima_box_cox_0.5_pred, x = testing_set$cases)

summary(mean_accuracy)
##        ME               RMSE             MAE              MPE        
##  Min.   :-445970   Min.   :450986   Min.   :445970   Min.   :-256.9  
##  1st Qu.:-445970   1st Qu.:450986   1st Qu.:445970   1st Qu.:-256.9  
##  Median :-445970   Median :450986   Median :445970   Median :-256.9  
##  Mean   :-445970   Mean   :450986   Mean   :445970   Mean   :-256.9  
##  3rd Qu.:-445970   3rd Qu.:450986   3rd Qu.:445970   3rd Qu.:-256.9  
##  Max.   :-445970   Max.   :450986   Max.   :445970   Max.   :-256.9  
##       MAPE      
##  Min.   :256.9  
##  1st Qu.:256.9  
##  Median :256.9  
##  Mean   :256.9  
##  3rd Qu.:256.9  
##  Max.   :256.9
summary(naive_accuracy)
##        ME               RMSE             MAE              MPE        
##  Min.   :-132682   Min.   :148674   Min.   :132682   Min.   :-85.53  
##  1st Qu.:-132682   1st Qu.:148674   1st Qu.:132682   1st Qu.:-85.53  
##  Median :-132682   Median :148674   Median :132682   Median :-85.53  
##  Mean   :-132682   Mean   :148674   Mean   :132682   Mean   :-85.53  
##  3rd Qu.:-132682   3rd Qu.:148674   3rd Qu.:132682   3rd Qu.:-85.53  
##  Max.   :-132682   Max.   :148674   Max.   :132682   Max.   :-85.53  
##       MAPE      
##  Min.   :85.53  
##  1st Qu.:85.53  
##  Median :85.53  
##  Mean   :85.53  
##  3rd Qu.:85.53  
##  Max.   :85.53
summary(random_walk_drift_accuracy)
##        ME               RMSE             MAE              MPE        
##  Min.   :-132689   Min.   :148681   Min.   :132689   Min.   :-85.54  
##  1st Qu.:-132689   1st Qu.:148681   1st Qu.:132689   1st Qu.:-85.54  
##  Median :-132689   Median :148681   Median :132689   Median :-85.54  
##  Mean   :-132689   Mean   :148681   Mean   :132689   Mean   :-85.54  
##  3rd Qu.:-132689   3rd Qu.:148681   3rd Qu.:132689   3rd Qu.:-85.54  
##  Max.   :-132689   Max.   :148681   Max.   :132689   Max.   :-85.54  
##       MAPE      
##  Min.   :85.54  
##  1st Qu.:85.54  
##  Median :85.54  
##  Mean   :85.54  
##  3rd Qu.:85.54  
##  Max.   :85.54
summary(arima_cases_accuracy)
##        ME               RMSE             MAE              MPE        
##  Min.   :-282977   Min.   :357386   Min.   :301880   Min.   :-193.4  
##  1st Qu.:-282977   1st Qu.:357386   1st Qu.:301880   1st Qu.:-193.4  
##  Median :-282977   Median :357386   Median :301880   Median :-193.4  
##  Mean   :-282977   Mean   :357386   Mean   :301880   Mean   :-193.4  
##  3rd Qu.:-282977   3rd Qu.:357386   3rd Qu.:301880   3rd Qu.:-193.4  
##  Max.   :-282977   Max.   :357386   Max.   :301880   Max.   :-193.4  
##       MAPE      
##  Min.   :199.8  
##  1st Qu.:199.8  
##  Median :199.8  
##  Mean   :199.8  
##  3rd Qu.:199.8  
##  Max.   :199.8
summary(arima_log_cases_accuracy)
##        ME              RMSE             MAE              MPE       
##  Min.   :149045   Min.   :153775   Min.   :149045   Min.   :78.29  
##  1st Qu.:149045   1st Qu.:153775   1st Qu.:149045   1st Qu.:78.29  
##  Median :149045   Median :153775   Median :149045   Median :78.29  
##  Mean   :149045   Mean   :153775   Mean   :149045   Mean   :78.29  
##  3rd Qu.:149045   3rd Qu.:153775   3rd Qu.:149045   3rd Qu.:78.29  
##  Max.   :149045   Max.   :153775   Max.   :149045   Max.   :78.29  
##       MAPE      
##  Min.   :78.29  
##  1st Qu.:78.29  
##  Median :78.29  
##  Mean   :78.29  
##  3rd Qu.:78.29  
##  Max.   :78.29
summary(arima_box_cox_0.1_accuracy)
##        ME             RMSE            MAE             MPE             MAPE     
##  Min.   :63928   Min.   :79792   Min.   :66786   Min.   :26.48   Min.   :29.3  
##  1st Qu.:63928   1st Qu.:79792   1st Qu.:66786   1st Qu.:26.48   1st Qu.:29.3  
##  Median :63928   Median :79792   Median :66786   Median :26.48   Median :29.3  
##  Mean   :63928   Mean   :79792   Mean   :66786   Mean   :26.48   Mean   :29.3  
##  3rd Qu.:63928   3rd Qu.:79792   3rd Qu.:66786   3rd Qu.:26.48   3rd Qu.:29.3  
##  Max.   :63928   Max.   :79792   Max.   :66786   Max.   :26.48   Max.   :29.3
summary(arima_box_cox_0.5_accuracy)
##        ME              RMSE             MAE              MPE        
##  Min.   :-66513   Min.   :131891   Min.   :104200   Min.   :-58.77  
##  1st Qu.:-66513   1st Qu.:131891   1st Qu.:104200   1st Qu.:-58.77  
##  Median :-66513   Median :131891   Median :104200   Median :-58.77  
##  Mean   :-66513   Mean   :131891   Mean   :104200   Mean   :-58.77  
##  3rd Qu.:-66513   3rd Qu.:131891   3rd Qu.:104200   3rd Qu.:-58.77  
##  Max.   :-66513   Max.   :131891   Max.   :104200   Max.   :-58.77  
##       MAPE      
##  Min.   :72.35  
##  1st Qu.:72.35  
##  Median :72.35  
##  Mean   :72.35  
##  3rd Qu.:72.35  
##  Max.   :72.35

6) Forecasting the Summer

Use the two choices question 5, forecast Covid cases for the next 12 periods with the full dataset. Show the forecasts in a table. Then plot each model’s forecasts on top of actual Covid cases in 2023. Compare and contrast the two forecasts. Which forecast to think is more likely?


I believe that the 0.1 model is more likely. As the weather warms up, cases have started to trend downwards. However, due to summer vacation, graduation parties, and other factors that the model most likely cannot predict, numbers could definitely go up. I think that the 0.1 model is more accurate, but the 0.5 model may end up being more predictive in this case.

library(forecast)
library(ggplot2)

# Forecast the next 12 weeks
forecast_arima_box_cox_0.1 <- forecast(arima_model_box_cox_0.1, h = 12)
forecast_arima_box_cox_0.5 <- forecast(arima_model_box_cox_0.5, h = 12)

# Invert the Box-Cox transformation for the forecasts
forecast_arima_box_cox_0.1$mean <- InvBoxCox(forecast_arima_box_cox_0.1$mean, lambda = 0.1)
forecast_arima_box_cox_0.5$mean <- InvBoxCox(forecast_arima_box_cox_0.5$mean, lambda = 0.5)

# Prepare the data for ggplot
forecast_df <- data.frame(
  week = seq(as.Date("2023-05-01"), by = "week", length.out = 12),
  box_cox_0.1 = forecast_arima_box_cox_0.1$mean,
  box_cox_0.5 = forecast_arima_box_cox_0.5$mean
)

forecast_df <- pivot_longer(forecast_df, -week, names_to = "method", values_to = "value")

# Plot the forecasts
ggplot() +
  geom_line(data = filter(covid, week >= as.Date("2022-01-01")), aes(x = week, y = cases), color = "black") +
  geom_line(data = forecast_df, aes(x = week, y = value, color = method)) +
  labs(title = "Forecast of COVID cases using ARIMA models with Box-Cox transformations",
       x = "Week",
       y = "Cases",
       color = "Method") +
  theme_minimal()