Overview

Introduction

In this exercise, we will work on the bullet point 2 of challenge 1 of VAST Challenge 2022. We will reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.

Consider the social activities in the community. What patterns do you see in the social networks in the town?

Methodology

Data set about participants’ social network details will be used to analyse and visualize the network structure. I will create graph object data frames and manipulate them using functions of dplyr, lubridate,and tidygraph.Subsequently, I will build the network graph visualization, advanced graph visualization, and interactive network visualization by using ggraph,tidygraph, and visNetwork respectively.

Data Preparation

Installing and Loading of Packages

The following code chunk is to install and load the required packages for the analysis.

Show

packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'tidyverse', 'graphlayouts','dplyr','patchwork')

for(p in packages){
  if(!require(p,character.only= T)){
    install.packages(p)
  }
  library(p,character.only=T)
}

Loading Raw Data Set

Due to the size of data set, I will use read_csv() to import the data files into R and convert the format into RDS thereafter as chunks showing below.

The dataset has 7 million rows of value which is too much for following analysis. We will reduce the size before we start.According to the data, we can tell that the number of trip from the Jun 2022 onwards was moving up and down within a range. Therefore we will use data from May to Oct 2022 for our following study.

Import data by using read_csv()

Show

socialnetwork <- read_csv('Data/SocialNetwork.csv')%>%
    filter(timestamp < '2022-11-01', timestamp > '2022-05-31')

Convert into RDS by using write_rds()

Show

write_rds(socialnetwork,"data/rds/socialnetwork.rds")

Read RDS file by using read_rds()

Show

socialnetwork<- read_rds('data/rds/socialnetwork.rds')

Data Wrangling

Before we start to clean the data, we will use glimpse() to examine the structure of the data frame.

Show

glimpse(socialnetwork)

Rows: 2,737,738
Columns: 3
$ timestamp         <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo   <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~

The format of the data looks fine for our further usage. By using the code chunk below, we are going to add a new column to indicate the day for each date.

Show

socialnetwork<-socialnetwork %>%
  mutate(Weekday = wday(timestamp,label = TRUE,abbr = FALSE))

The revised data frame structure:

Show

glimpse(socialnetwork)

Rows: 2,737,738
Columns: 4
$ timestamp         <dttm> 2022-05-31, 2022-05-31, 2022-05-31, 2022-~
$ participantIdFrom <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
$ participantIdTo   <dbl> 955, 27, 967, 548, 171, 367, 366, 278, 100~
$ Weekday           <ord> Tuesday, Tuesday, Tuesday, Tuesday, Tuesda~

Read data from Participant file.

Show

participant_nodes<- read_csv('Data/Participants.csv')

Wragling attributes

In order to create network map, we will need nodes data and edges data.

Reduce data size

There are 5 attribute under nodes. Here we generate the distribution of each attributes to find if any significant difference.

Show

Education <- participant_nodes %>%
    mutate(Education= fct_infreq(educationLevel)) %>%
  ggplot(aes(x= Education)) +
  geom_bar(fill= "#69b3a2") +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Education Level") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Age <- ggplot(data= participant_nodes, 
       aes(x= age)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 50)  +
  labs(y= 'No. of\nResidents', x= 'Age Group',
       subtitle = "Distribution of Residents' Age") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Joviality <-ggplot(data= participant_nodes, 
       aes(x= joviality)) +
  geom_histogram (binwidth=0.1, fill='#808de8', color="#e9ecef", alpha=0.9) +
  labs(y= 'No. of\nResidents', x= 'Joviality',
       subtitle = "Distribution of Residents' Joviality") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Hobby <- participant_nodes %>%
    mutate(Hobby= fct_infreq(interestGroup)) %>%
  ggplot(aes(x= Hobby)) +
  geom_bar(fill= "#69b3a2") +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Interest Group") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

Household <- ggplot(data= participant_nodes, 
       aes(x= householdSize)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 500)  +
  labs(y= 'No. of\nResidents', x= 'Household',
       subtitle = "Distribution of Residents' household") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

(Education+Age)/(Joviality+Hobby)/Household + plot_annotation(
    title ="Distribution of Weekend Participant's Attribute",
    caption = 'Vast Challenge 2022'
  ) &
    theme(plot.title = element_text(size = 14, face = 'bold'))

To further reduce the data size, we will group them into 8 tiles. 1st tile represents the lowest value and increase thereafter and then we will drop the participant under joviality tile 1. .

Show

participant_nodes <- participant_nodes%>%
  mutate(agegroup = as.character(ntile(age, 8))) %>%
  mutate(jovialitygroup = as.character(ntile(joviality, 8)))

Show

participants_to_keep <- participant_nodes %>%
  filter(jovialitygroup !=1)

Due to too many variable of age and joviality, we will group up the age column and joviality.

Show

participants_to_keep$agebyrange = cut(participants_to_keep$age,c(0,20,40,60,80,100))

participants_to_keep$jovialitybyrange = cut(participants_to_keep$joviality,c(0,0.2,0.4,0.6,0.8,1))

Edges data table:

Show

socialnetwork_reduce <- socialnetwork %>%
  filter(participantIdFrom %in% participants_to_keep$participantId, participantIdTo %in% participants_to_keep$participantId)

Nodes data table:

Show

notes_reduce <- participants_to_keep %>%
  filter(participantId %in% socialnetwork_reduce$participantIdFrom,participantId %in% socialnetwork_reduce$participantIdTo)

Show

socialnetwork_aggregated <- socialnetwork_reduce %>%
  group_by(participantIdFrom, participantIdTo,Weekday) %>%
    summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 10) %>%
  ungroup()

Select Weight>10, so need to reduce the nodes again.

Show

notes_reduce <- notes_reduce %>%
  filter(participantId %in% socialnetwork_aggregated$participantIdFrom,participantId %in% socialnetwork_aggregated$participantIdTo)

To create network graph:

Show

socialnetwork_graph <- graph_from_data_frame(socialnetwork_aggregated,vertices=notes_reduce)%>%
  as_tbl_graph()
socialnetwork_graph

# A tbl_graph: 729 nodes and 37058 edges
#
# A directed multigraph with 11 components
#
# Node Data: 729 x 11 (active)
  name  householdSize haveKids   age educationLevel interestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 1                 3 TRUE        25 HighSchoolOrC~ B            
2 2                 3 TRUE        35 HighSchoolOrC~ A            
3 4                 3 TRUE        43 Bachelors      H            
4 5                 3 TRUE        32 HighSchoolOrC~ D            
5 6                 3 TRUE        26 HighSchoolOrC~ I            
6 7                 3 TRUE        27 Bachelors      A            
# ... with 723 more rows, and 5 more variables: joviality <dbl>,
#   agegroup <chr>, jovialitygroup <chr>, agebyrange <chr>,
#   jovialitybyrange <chr>
#
# Edge Data: 37,058 x 4
   from    to Weekday Weight
  <int> <int> <chr>    <int>
1     1   558 Sunday      12
2     1   558 Monday      12
3     1   558 Tuesday     12
# ... with 37,055 more rows

Visualisation and Analysis

General Network Plot

Show

g <- ggraph(socialnetwork_graph,
       layout = "nicely") +
  geom_edge_link(aes()) +
  geom_node_point(aes())

g + theme_graph()

Education:

According to the map above, we can see that there are still a lot of observation and the link are complex. Learning from peer, I decided to centralize the nodes first until we can see the map clearly.

Show

socialnetwork_graph %>% 
    mutate(centrality = centrality_eigen()) %>%
    filter(centrality > 0.2) %>%
    ggraph(layout = 'nicely') + 
    geom_edge_link(aes()) + 
    geom_node_point(aes(size = centrality, colour = educationLevel)) + 
    labs(title = 'Education Level', 
         colour = "educationLevel") +
    theme_graph()

Analysis of the plot:

From the above plots, we can infer that most of education level in the town are bachelors and high school. Most of participant with Bachelors degree would be most likely within the same network.

Interest group

We would apply the same method on the attribute of interest group to see if there is any insight.

Show

socialnetwork_graph %>% 
    mutate(centrality = centrality_eigen()) %>%
    filter(centrality > 0.2) %>%
    ggraph(layout = 'nicely') + 
    geom_edge_link(aes()) + 
    geom_node_point(aes(size = centrality, colour = interestGroup)) + 
    labs(title = 'Interest Group', 
         colour = "interestGroup") +
    theme_graph()

Analysis of the plots:

Participant with same interest would gather more frequently and have the same network with other participants who have the same interest, especially participants with interst I and G.

Weekend VS Weekday

Under this section, we sould compare participants’ joviality level on different days.

Show

set_graph_style()
g <- socialnetwork_graph %>% 
        mutate(centrality = centrality_eigen()) %>%
        filter(centrality > 0.3) %>%
        ggraph(layout = "nicely") + 
        geom_edge_link(aes(), alpha=0.2) +
        scale_edge_width(range = c(0.1, 5)) +
        geom_node_point(aes(colour = jovialitybyrange), size = 2)
g + facet_edges(~Weekday)

Take home exercise 6

Overview

Introduction

Methodology

Data Preparation

Installing and Loading of Packages

Loading Raw Data Set

Data Wrangling

Wragling attributes

Reduce data size

Edges data table:

Nodes data table:

Visualisation and Analysis

General Network Plot

Education:

Interest group

Weekend VS Weekday