In this exercise, we will work on the bullet point 2 of challenge 1 of VAST Challenge 2022. We will reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.
Data set about participants’ social network details will be used to analyse and visualize the network structure. I will create graph object data frames and manipulate them using functions of dplyr, lubridate,and tidygraph.Subsequently, I will build the network graph visualization, advanced graph visualization, and interactive network visualization by using ggraph,tidygraph, and visNetwork respectively.
The following code chunk is to install and load the required packages for the analysis.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'graphlayouts','dplyr','patchwork')
for(p in packages){
if(!require(p,character.only= T)){
install.packages(p)
}
library(p,character.only=T)
}
Due to the size of data set, I will use read_csv() to import the data files into R and convert the format into RDS thereafter as chunks showing below.
The dataset has 7 million rows of value which is too much for following analysis. We will reduce the size before we start.According to the data, we can tell that the number of trip from the Jun 2022 onwards was moving up and down within a range. Therefore we will use data from May to Oct 2022 for our following study.
Import data by using read_csv()
socialnetwork <- read_csv('Data/SocialNetwork.csv')%>%
filter(timestamp < '2022-11-01', timestamp > '2022-05-31')
Convert into RDS by using write_rds()
write_rds(socialnetwork,"data/rds/socialnetwork.rds")
Read RDS file by using read_rds()
socialnetwork<- read_rds('data/rds/socialnetwork.rds')
Before we start to clean the data, we will use glimpse() to examine the structure of the data frame.
glimpse(socialnetwork)
Rows: 2,737,738
Columns: 3
$ timestamp <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~
The format of the data looks fine for our further usage. By using the code chunk below, we are going to add a new column to indicate the day for each date.
socialnetwork<-socialnetwork %>%
mutate(Weekday = wday(timestamp,label = TRUE,abbr = FALSE))
The revised data frame structure:
glimpse(socialnetwork)
Rows: 2,737,738
Columns: 4
$ timestamp <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~
$ Weekday <ord> Tuesday, Tuesday, Tuesday, Tuesday, Tuesda~
Read data from Participant file.
participant_nodes<- read_csv('Data/Participants.csv')
In order to create network map, we will need nodes data and edges data.
There are 5 attribute under nodes. Here we generate the distribution of each attributes to find if any significant difference.
Education <- participant_nodes %>%
mutate(Education= fct_infreq(educationLevel)) %>%
ggplot(aes(x= Education)) +
geom_bar(fill= "#69b3a2") +
labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Education Level") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))
Age <- ggplot(data= participant_nodes,
aes(x= age)) +
geom_bar(fill= '#6eba6a') +
ylim(0, 50) +
labs(y= 'No. of\nResidents', x= 'Age Group',
subtitle = "Distribution of Residents' Age") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))
Joviality <-ggplot(data= participant_nodes,
aes(x= joviality)) +
geom_histogram (binwidth=0.1, fill='#808de8', color="#e9ecef", alpha=0.9) +
labs(y= 'No. of\nResidents', x= 'Joviality',
subtitle = "Distribution of Residents' Joviality") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))
Hobby <- participant_nodes %>%
mutate(Hobby= fct_infreq(interestGroup)) %>%
ggplot(aes(x= Hobby)) +
geom_bar(fill= "#69b3a2") +
labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Interest Group") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))
Household <- ggplot(data= participant_nodes,
aes(x= householdSize)) +
geom_bar(fill= '#6eba6a') +
ylim(0, 500) +
labs(y= 'No. of\nResidents', x= 'Household',
subtitle = "Distribution of Residents' household") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))
(Education+Age)/(Joviality+Hobby)/Household + plot_annotation(
title ="Distribution of Weekend Participant's Attribute",
caption = 'Vast Challenge 2022'
) &
theme(plot.title = element_text(size = 14, face = 'bold'))
To further reduce the data size, we will group them into 8 tiles. 1st tile represents the lowest value and increase thereafter and then we will drop the participant under joviality tile 1. .
participant_nodes <- participant_nodes%>%
mutate(agegroup = as.character(ntile(age, 8))) %>%
mutate(jovialitygroup = as.character(ntile(joviality, 8)))
participants_to_keep <- participant_nodes %>%
filter(jovialitygroup !=1)
Due to too many variable of age and joviality, we will group up the age column and joviality.
Select Weight>10, so need to reduce the nodes again.
To create network graph:
socialnetwork_graph <- graph_from_data_frame(socialnetwork_aggregated,vertices=notes_reduce)%>%
as_tbl_graph()
socialnetwork_graph
# A tbl_graph: 729 nodes and 37058 edges
#
# A directed multigraph with 11 components
#
# Node Data: 729 x 11 (active)
name householdSize haveKids age educationLevel interestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 1 3 TRUE 25 HighSchoolOrC~ B
2 2 3 TRUE 35 HighSchoolOrC~ A
3 4 3 TRUE 43 Bachelors H
4 5 3 TRUE 32 HighSchoolOrC~ D
5 6 3 TRUE 26 HighSchoolOrC~ I
6 7 3 TRUE 27 Bachelors A
# ... with 723 more rows, and 5 more variables: joviality <dbl>,
# agegroup <chr>, jovialitygroup <chr>, agebyrange <chr>,
# jovialitybyrange <chr>
#
# Edge Data: 37,058 x 4
from to Weekday Weight
<int> <int> <chr> <int>
1 1 558 Sunday 12
2 1 558 Monday 12
3 1 558 Tuesday 12
# ... with 37,055 more rows
g <- ggraph(socialnetwork_graph,
layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes())
g + theme_graph()
According to the map above, we can see that there are still a lot of observation and the link are complex. Learning from peer, I decided to centralize the nodes first until we can see the map clearly.
socialnetwork_graph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.2) %>%
ggraph(layout = 'nicely') +
geom_edge_link(aes()) +
geom_node_point(aes(size = centrality, colour = educationLevel)) +
labs(title = 'Education Level',
colour = "educationLevel") +
theme_graph()
Analysis of the plot:
From the above plots, we can infer that most of education level in the town are bachelors and high school. Most of participant with Bachelors degree would be most likely within the same network.
We would apply the same method on the attribute of interest group to see if there is any insight.
socialnetwork_graph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.2) %>%
ggraph(layout = 'nicely') +
geom_edge_link(aes()) +
geom_node_point(aes(size = centrality, colour = interestGroup)) +
labs(title = 'Interest Group',
colour = "interestGroup") +
theme_graph()
Analysis of the plots:
Participant with same interest would gather more frequently and have the same network with other participants who have the same interest, especially participants with interst I and G.
Under this section, we sould compare participants’ joviality level on different days.
set_graph_style()
g <- socialnetwork_graph %>%
mutate(centrality = centrality_eigen()) %>%
filter(centrality > 0.3) %>%
ggraph(layout = "nicely") +
geom_edge_link(aes(), alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = jovialitybyrange), size = 2)
g + facet_edges(~Weekday)