Take home exercise 6

Li Minqi https://www.linkedin.com/in/minqi-li/ (School of Computing and Information System)https://example.com/spacelysprokets
2022-06-06

Overview

Introduction

In this exercise, we will work on the bullet point 2 of challenge 1 of VAST Challenge 2022. We will reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.

Methodology

Data set about participants’ social network details will be used to analyse and visualize the network structure. I will create graph object data frames and manipulate them using functions of dplyr, lubridate,and tidygraph.Subsequently, I will build the network graph visualization, advanced graph visualization, and interactive network visualization by using ggraph,tidygraph, and visNetwork respectively.

Data Preparation

Installing and Loading of Packages

The following code chunk is to install and load the required packages for the analysis.

Show
packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'tidyverse', 'graphlayouts','dplyr','patchwork')

for(p in packages){
  if(!require(p,character.only= T)){
    install.packages(p)
  }
  library(p,character.only=T)
}

Loading Raw Data Set

Due to the size of data set, I will use read_csv() to import the data files into R and convert the format into RDS thereafter as chunks showing below.

The dataset has 7 million rows of value which is too much for following analysis. We will reduce the size before we start.According to the data, we can tell that the number of trip from the Jun 2022 onwards was moving up and down within a range. Therefore we will use data from May to Oct 2022 for our following study.

Import data by using read_csv()

Show
socialnetwork <- read_csv('Data/SocialNetwork.csv')%>%
    filter(timestamp < '2022-11-01', timestamp > '2022-05-31')

Convert into RDS by using write_rds()

Show
write_rds(socialnetwork,"data/rds/socialnetwork.rds")

Read RDS file by using read_rds()

Show
socialnetwork<- read_rds('data/rds/socialnetwork.rds')

Data Wrangling

Before we start to clean the data, we will use glimpse() to examine the structure of the data frame.

Show
glimpse(socialnetwork)
Rows: 2,737,738
Columns: 3
$ timestamp         <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo   <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~

The format of the data looks fine for our further usage. By using the code chunk below, we are going to add a new column to indicate the day for each date.

Show
socialnetwork<-socialnetwork %>%
  mutate(Weekday = wday(timestamp,label = TRUE,abbr = FALSE))

The revised data frame structure:

Show
glimpse(socialnetwork)
Rows: 2,737,738
Columns: 4
$ timestamp         <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo   <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~
$ Weekday           <ord> Tuesday, Tuesday, Tuesday, Tuesday, Tuesda~

Read data from Participant file.

Show
participant_nodes<- read_csv('Data/Participants.csv')

Wragling attributes

In order to create network map, we will need nodes data and edges data.

Reduce data size

There are 5 attribute under nodes. Here we generate the distribution of each attributes to find if any significant difference.

Show
Education <- participant_nodes %>%
    mutate(Education= fct_infreq(educationLevel)) %>%
  ggplot(aes(x= Education)) +
  geom_bar(fill= "#69b3a2") +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Education Level") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Age <- ggplot(data= participant_nodes, 
       aes(x= age)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 50)  +
  labs(y= 'No. of\nResidents', x= 'Age Group',
       subtitle = "Distribution of Residents' Age") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Joviality <-ggplot(data= participant_nodes, 
       aes(x= joviality)) +
  geom_histogram (binwidth=0.1, fill='#808de8', color="#e9ecef", alpha=0.9) +
  labs(y= 'No. of\nResidents', x= 'Joviality',
       subtitle = "Distribution of Residents' Joviality") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Hobby <- participant_nodes %>%
    mutate(Hobby= fct_infreq(interestGroup)) %>%
  ggplot(aes(x= Hobby)) +
  geom_bar(fill= "#69b3a2") +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Interest Group") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Household <- ggplot(data= participant_nodes, 
       aes(x= householdSize)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 500)  +
  labs(y= 'No. of\nResidents', x= 'Household',
       subtitle = "Distribution of Residents' household") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

(Education+Age)/(Joviality+Hobby)/Household + plot_annotation(
    title ="Distribution of Weekend Participant's Attribute",
    caption = 'Vast Challenge 2022'
  ) &
    theme(plot.title = element_text(size = 14, face = 'bold'))

To further reduce the data size, we will group them into 8 tiles. 1st tile represents the lowest value and increase thereafter and then we will drop the participant under joviality tile 1. .

Show
participant_nodes <- participant_nodes%>%
  mutate(agegroup = as.character(ntile(age, 8))) %>%
  mutate(jovialitygroup = as.character(ntile(joviality, 8)))
Show
participants_to_keep <- participant_nodes %>%
  filter(jovialitygroup !=1)

Due to too many variable of age and joviality, we will group up the age column and joviality.

Show
participants_to_keep$agebyrange = cut(participants_to_keep$age,c(0,20,40,60,80,100))

participants_to_keep$jovialitybyrange = cut(participants_to_keep$joviality,c(0,0.2,0.4,0.6,0.8,1))

Edges data table:

Show
socialnetwork_reduce <- socialnetwork %>%
  filter(participantIdFrom %in% participants_to_keep$participantId, participantIdTo %in% participants_to_keep$participantId)

Nodes data table:

Show
notes_reduce <- participants_to_keep %>%
  filter(participantId %in% socialnetwork_reduce$participantIdFrom,participantId %in% socialnetwork_reduce$participantIdTo)
Show
socialnetwork_aggregated <- socialnetwork_reduce %>%
  group_by(participantIdFrom, participantIdTo,Weekday) %>%
    summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 10) %>%
  ungroup()

Select Weight>10, so need to reduce the nodes again.

Show
notes_reduce <- notes_reduce %>%
  filter(participantId %in% socialnetwork_aggregated$participantIdFrom,participantId %in% socialnetwork_aggregated$participantIdTo)

To create network graph:

Show
socialnetwork_graph <- graph_from_data_frame(socialnetwork_aggregated,vertices=notes_reduce)%>%
  as_tbl_graph()
socialnetwork_graph
# A tbl_graph: 729 nodes and 37058 edges
#
# A directed multigraph with 11 components
#
# Node Data: 729 x 11 (active)
  name  householdSize haveKids   age educationLevel interestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 1                 3 TRUE        25 HighSchoolOrC~ B            
2 2                 3 TRUE        35 HighSchoolOrC~ A            
3 4                 3 TRUE        43 Bachelors      H            
4 5                 3 TRUE        32 HighSchoolOrC~ D            
5 6                 3 TRUE        26 HighSchoolOrC~ I            
6 7                 3 TRUE        27 Bachelors      A            
# ... with 723 more rows, and 5 more variables: joviality <dbl>,
#   agegroup <chr>, jovialitygroup <chr>, agebyrange <chr>,
#   jovialitybyrange <chr>
#
# Edge Data: 37,058 x 4
   from    to Weekday Weight
  <int> <int> <chr>    <int>
1     1   558 Sunday      12
2     1   558 Monday      12
3     1   558 Tuesday     12
# ... with 37,055 more rows

Visualisation and Analysis

General Network Plot

Show
g <- ggraph(socialnetwork_graph,
       layout = "nicely") +
  geom_edge_link(aes()) +
  geom_node_point(aes())

g + theme_graph()

Education:

According to the map above, we can see that there are still a lot of observation and the link are complex. Learning from peer, I decided to centralize the nodes first until we can see the map clearly.

Show
socialnetwork_graph %>% 
    mutate(centrality = centrality_eigen()) %>%
    filter(centrality > 0.2) %>%
    ggraph(layout = 'nicely') + 
    geom_edge_link(aes()) + 
    geom_node_point(aes(size = centrality, colour = educationLevel)) + 
    labs(title = 'Education Level', 
         colour = "educationLevel") +
    theme_graph()

Analysis of the plot:

From the above plots, we can infer that most of education level in the town are bachelors and high school. Most of participant with Bachelors degree would be most likely within the same network.

Interest group

We would apply the same method on the attribute of interest group to see if there is any insight.

Show
socialnetwork_graph %>% 
    mutate(centrality = centrality_eigen()) %>%
    filter(centrality > 0.2) %>%
    ggraph(layout = 'nicely') + 
    geom_edge_link(aes()) + 
    geom_node_point(aes(size = centrality, colour = interestGroup)) + 
    labs(title = 'Interest Group', 
         colour = "interestGroup") +
    theme_graph()

Analysis of the plots:

Participant with same interest would gather more frequently and have the same network with other participants who have the same interest, especially participants with interst I and G.

Weekend VS Weekday

Under this section, we sould compare participants’ joviality level on different days.

Show
set_graph_style()
g <- socialnetwork_graph %>% 
        mutate(centrality = centrality_eigen()) %>%
        filter(centrality > 0.3) %>%
        ggraph(layout = "nicely") + 
        geom_edge_link(aes(), alpha=0.2) +
        scale_edge_width(range = c(0.1, 5)) +
        geom_node_point(aes(colour = jovialitybyrange), size = 2)
g + facet_edges(~Weekday)