Sankey Chart using plotly library
Parag Verma
Introduction
In this blog, we will look at how to create a simple Sankey chart using the plotly library. Sankey diagrams are often used to represent the flow of a metric through a network. Simple use cases include the flow of water through a cement plant or the flow of assets under management across various broker-dealers. In this blog we explore an econometric example: average education and wages for males and females. We chose this example because the variables are self-explanatory and it is not a typical network problem. In most practical cases, we don't have a well-defined network at our disposal, so the example shows how to plot a Sankey diagram for non-network cases and still benefit from the plot's explanatory value.
Installing libraries
Let's install and load plotly and the other libraries used to create the plot
package.name<-c("dplyr","tidyr","carData","plotly")
for(i in package.name){
  # Install the package only if it is not already available
  if(!require(i,character.only = TRUE)){
    install.packages(i)
  }
  library(i,character.only = TRUE)
}
# carData package has the SLID (Survey of Labour and Income Dynamics) data
data(SLID)
df<-SLID
head(SLID)
wages education age sex language
1 10.56 15.0 40 Male English
2 11.00 13.2 19 Male English
3 NA 16.0 49 Male Other
4 17.76 14.0 46 Male Other
5 NA 8.0 71 Male English
6 14.00 16.0 50 Female English
Step 1: Average education and wages by gender
In this plot, we are trying to study the following things:
- How do Males and Females fare in terms of mean years of education?
- Who earns more on average?
- Taken together: how do years of education relate to mean earnings?
interim.df<-df%>%
  group_by(sex)%>%
  summarise(MeanEducation=mean(education,na.rm=T),
            MeanWages=mean(wages,na.rm=T))
interim.df
# A tibble: 2 x 3
sex MeanEducation MeanWages
<fct> <dbl> <dbl>
1 Female 12.4 13.9
2 Male 12.6 17.2
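Both means are computed with `na.rm=T` because `wages` has missing values, as rows 3 and 5 of the `head(SLID)` output show. As a minimal sketch in base R, using only those six printed rows (the name `six.rows` is made up for illustration), this is what the argument changes:

```r
# The six rows printed by head(SLID) above, retyped by hand for illustration
six.rows<-data.frame(
  wages=c(10.56,11.00,NA,17.76,NA,14.00),
  education=c(15.0,13.2,16.0,14.0,8.0,16.0),
  sex=c("Male","Male","Male","Male","Male","Female")
)

# Without na.rm the missing wages propagate and the mean is NA
mean(six.rows$wages)

# With na.rm=TRUE the two NA rows are dropped before averaging
mean(six.rows$wages,na.rm=TRUE)

# How many wages are missing in this small sample
sum(is.na(six.rows$wages))
```

On the full dataset the same check, `sum(is.na(df$wages))`, shows how many respondents are left out of the wage average.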
Assigning appropriate values to education and salary flows across nodes
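# Pull the group means from interim.df instead of retyping them by hand
male_education<-interim.df$MeanEducation[interim.df$sex=="Male"]
female_education<-interim.df$MeanEducation[interim.df$sex=="Female"]
male_wages<-interim.df$MeanWages[interim.df$sex=="Male"]
female_wages<-interim.df$MeanWages[interim.df$sex=="Female"]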
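Plotly identifies Sankey nodes by their zero-based position in the label vector, so each flow is a (source, target, value) triple of indices. One way to keep this readable, sketched here in base R, is to write the flows with node names first and convert them with `match()`. The names `nodes` and `links` are hypothetical, and the values are the rounded means from the summary table above.

```r
# Node labels in the order they will be passed to plotly
nodes<-c("Education","Male","Female","Salary")

# One row per flow, written with readable node names
links<-data.frame(
  from=c("Education","Education","Male","Female"),
  to=c("Male","Female","Salary","Salary"),
  value=c(12.6,12.4,17.2,13.9) # rounded means from interim.df
)

# plotly expects zero-based node indices, while match() is one-based
links$source<-match(links$from,nodes)-1
links$target<-match(links$to,nodes)-1

links
```

The resulting `source` and `target` columns are 0,0,1,2 and 1,2,3,3, which are the index vectors passed to `link` when the plot is built.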
Step 2: Initialising the plotly object
fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    label = c("Education", "Male", "Female", "Salary"),
    color = c("orange", "orange", "orange", "orange"),
    pad = 15,
    thickness = 15,
    line = list(
      color = "black",
      width = 0.5
    )
  ),
  link = list(
    source = c(0,0,1,2),
    target = c(1,2,3,3),
    value = c(male_education,female_education,male_wages,female_wages)
  )
)
fig <- fig %>% layout(
  title = "Basic Sankey Diagram",
  font = list(
    size = 10
  )
)
fig
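As an optional variation, the links can be coloured by gender so that the two paths are easier to follow. This sketch assumes the `color` and `label` attributes of `link` in plotly's Sankey trace; the colour choices and the names `link.spec` and `fig.coloured` are illustrative, and the values are the rounded means from the summary table.

```r
# Link specification: same indices as above, plus one colour and label per link
link.spec<-list(
  source=c(0,0,1,2),
  target=c(1,2,3,3),
  value=c(12.6,12.4,17.2,13.9), # rounded means from interim.df
  label=c("Male education","Female education","Male wages","Female wages"),
  color=c("rgba(31,119,180,0.4)","rgba(214,39,40,0.4)",
          "rgba(31,119,180,0.4)","rgba(214,39,40,0.4)")
)

# Draw only if plotly is installed, so the snippet is safe to run anywhere
if(requireNamespace("plotly",quietly=TRUE)){
  fig.coloured<-plotly::plot_ly(
    type="sankey",
    orientation="h",
    node=list(label=c("Education","Male","Female","Salary"),pad=15,thickness=15),
    link=link.spec
  )
  fig.coloured
}
```

Every vector in `link.spec` must have one entry per link, in the same order as `source` and `target`.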
Step 3: How to read the graph
Connections to the left of the gender nodes (Male and Female) represent inputs in the form of education. Connections to the right represent output in the form of wages. The width of each link is proportional to its value. We can see that for nearly the same mean years of education (about 12.4 for Females and 12.6 for Males), Males earn higher mean wages than Females (about 17.2 versus 13.9). Obviously several other factors are at play, but the plot gives a first look at how education relates to wages.
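To put rough numbers on this reading, the group means can be compared as ratios. The sketch below uses the rounded values printed in the summary table, so the results are approximate; the variable names are illustrative.

```r
# Rounded group means as printed in the summary table
mean.education<-c(Female=12.4,Male=12.6)
mean.wages<-c(Female=13.9,Male=17.2)

# Male-to-female ratios: education is nearly equal, wages are not
education.ratio<-mean.education[["Male"]]/mean.education[["Female"]]
wage.ratio<-mean.wages[["Male"]]/mean.wages[["Female"]]

round(education.ratio,2) # roughly 1.02
round(wage.ratio,2)      # roughly 1.24
```

In other words, a gap of about 2% in mean years of education sits alongside a gap of about 24% in mean wages. These are raw averages and do not control for age or any other factor.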
List of Datasets for Practice
https://hofmann.public.iastate.edu/data_in_r_sortable.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html