Take-home exercise3

This article is to discuss the financial health of the city of Engagement, Ohio USA (VAST Challenge 2022).

Li Minqi https://www.linkedin.com/in/minqi-li/ (School of Computing and Information System)https://example.com/spacelysprokets
2022-05-21

1. Overview

1.1 Introduction

In this take-home exercise, we will discuss the financial health of the city of Engagement and identify which businesses appear to be more prosperous or more struggling over the period covered by the data set.

The data used in this exercise is from VAST Challenge 2022, and processed by RStudio. It is assumed that the volunteer participants are representative of the city’s population.

1.2 Design challenges

Challenge 1: How to joint different tables

VAST Challenge 2022 provided a large number of data files. Upon review, out of total files, eleven files can used in this study and they all have similar primary keys, such as building ID and participant ID. To have an overview of all these data files, jointing tables and data cleaning are required to be performed in R for reproducibility.

Challenge 2: How to handle missing value

When we looked at the income data of participants, we found that not all participants have income for the whole period covered by the dataset. It will affect the output if we calculate the average income for specific industry or education level. Therefore, we will need to exclude the participants who do not have complete information.

Challenge 3: How to handle outliers

From the dataset of expenditure, it is observed that some of spending are extremely high. However, outliers affect the effectiveness of distribution graphs, since the majority of the data points becomes visually trivial in the attempt to fit the ourliers. In view of this, we will exclude the outliers or transform the data set into percentiles.

Challenge 4: Raw Data File is too big to upload

The financial journal file used to derive income data is 80.7 MB, larger than the 50 MB size limit set by GitHub. Therefore, the raw data cannot be pushed to GitHub repository directly.

This can be resolved by preparing the dataframe for visualization, and then saving it in RDS format. The raw data is not pushed or commited while only the saved RDS file is referenced for visualization, and pushed to GitHub repository

2. Data Preparation

2.1 R packages installation

The following code chunk installs the required R packages and loads them onto RStudio environment.

packages = c('tidyverse', 'knitr', 'ggdist', 'scales', 'grid', 'gridExtra','formattable', 'patchwork', 'shiny','ggridges','gganimate','gifski','RColorBrewer','zoo','plotly','reprex','gifski','png')

for(p in packages){
  if(!require(p,character.only= T)){
    install.packages(p)
  }
  library(p,character.only=T)
}

2.2 Import Datasets

This article will use the dataset provided by VAST Challenge 2022. The code chunk below import data from the folder into R by using read_csv() of readr and save it as an tibble data frame. In order to compare, we convert all timestemp from files into date formate.

participants <- read_csv('Data/Participants.csv')
financial <- read_csv('Data/FinancialJournal.csv') %>%
  mutate(Dates = as.Date(timestamp,format="%m/%d/%Y")) 
Apartments <- read_csv('Data/Apartments.csv')
Employers <- read_csv('Data/Employers.csv')
Pubs <- read_csv('Data/Pubs.csv')
Restaurants <- read_csv('Data/Restaurants.csv')
Schools <- read_csv('Data/Schools.csv')

2.3 Data Wrangling

Expenditure

According to the financial file, expenditure from residents can be categorized into four areas, which are food, education, recreation and shelter. To find out the expenditure level of the residents, all categories of spending are summed up according to month.

Food <- financial %>%
  filter(category == 'Food') %>%
  select(Dates,amount,participantId) %>%
  group_by(Dates) %>%
  summarise(Daily_food = abs(sum(amount)), group_size= length(participantId))
Education <- financial %>%
  filter(category == 'Education') %>%
  select(Dates,amount,participantId) %>%
  group_by(Dates) %>%
  summarise(Daily_Education = abs(sum(amount)),group_size= length(participantId))
Recreation <- financial %>%
  filter(category == 'Recreation') %>%
  select(Dates,amount,participantId) %>%
  group_by(Dates) %>%
  summarise(Daily_Recreation = abs(sum(amount)),group_size= length(participantId))
Shelter <- financial %>%
  filter(category == 'Shelter') %>%
  select(Dates,amount,participantId) %>%
  group_by(Dates) %>%
  summarise(Daily_Shelter = abs(sum(amount)),group_size= length(participantId))

2.4 Save as and Read RDS

Dataframes of all expenditure are saved and read in RDS format to avoid uploading large files to Git.

saveRDS(Food, 'data/Food.rds')
Food <- readRDS('data/Food.rds')

saveRDS(Education, 'data/Education.rds')
Education <- readRDS('data/Education.rds')

saveRDS(Recreation, 'data/Recreation.rds')
Recreation <- readRDS('data/Recreation.rds')

saveRDS(Shelter, 'data/Shelter.rds')
Shelter <- readRDS('data/Shelter.rds')

3. Visualization and Insights

Movement of residents spending on food

p1<-ggplot(Food, aes(x = Dates, y = Daily_food, 
                      size = group_size, 
                      colour = 'red')) +
  geom_point(alpha = 0.7, 
             show.legend = FALSE) +
  scale_size(range = c(2, 20)) +
  labs(title = 'Date: {frame_time}', 
       x = 'Dates', 
       y = 'Daily_food')+ ylim(8000,12500)+ transition_time(Dates) +
  ease_aes('linear') 
animate(p1, duration = 20)

Movement of residents spending on Education

p2<- ggplot(Education, aes(x = Dates, y = Daily_Education, 
                      size = group_size)) +
  geom_point(alpha = 0.7, 
             show.legend = FALSE, colour = 'Green') +
  scale_size(range = c(2, 20)) +
  labs(title = 'Date: {frame_time}', 
       x = ' Dates', 
       y = ' Daily_Education') + transition_time(Dates) +
  ease_aes('linear') 

animate(p2, duration = 20)

Movement of residents spending on Shelter

p3<- ggplot(Shelter, aes(x = Dates, y = Daily_Shelter, 
                      size = group_size)) +
  geom_point(alpha = 0.7, 
             show.legend = FALSE, colour = 'Blue') +
  scale_size(range = c(2, 20)) +
  labs(title = 'Date: {frame_time}', 
       x = ' Dates', 
       y = ' Daily_Shelter') + transition_time(Dates) +
  ease_aes('linear') 

animate(p3, duration = 20)

4. Conclusion

Education is the most struggling business.

The the animation of people daily expenditure on education, we can infer that less people spend money in education and the amount of money people spend on education is also getting much lower.

People spend a bit less on food, but industry still performs well.

The movement of amount that participants spending in eating remains in a range of 10k to 11k per day. Although there was a slight decrease when we compared with the amount at the beginning of this research, it is stilling in a healthy condition.

There is a significant improvement in real estate market.

According to the research, people spent less on rental in between of April 2022 and May 2022, but the market recovered rapidly in June 2022. It performed well in both revenue and customer size.