Wednesday, September 9, 2020

Blog 36: Sankey Charts in R

Sankey Chart using plotly library


Introduction

In this blog, we will look at how to create a simple Sankey chart using the plotly library. Sankey diagrams are often used to represent the flow of a metric through a network. Simple use cases include the flow of water through a cement plant, the flow of assets under management across various broker-dealers, etc. In this blog we are going to explore an econometric example in which we look at average education and wages for males and females. We chose this example because the variables are self-explanatory and it is not a typical network problem. In most practical cases, we don't have a well-defined network at our disposal, so this example shows how to plot a Sankey diagram for non-network cases and leverage the great explanatory value of the plot.


Installing libraries

Let's install plotly and the other libraries used to create the plot.

# Install any missing package, then load it
package.name <- c("dplyr", "tidyr", "carData", "plotly")

for (i in package.name) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
  }
  library(i, character.only = TRUE)
}


# carData package has the SLID (Survey of Labour and Income Dynamics) data
data(SLID)
df<-SLID
head(SLID)
  wages education age    sex language
1 10.56      15.0  40   Male  English
2 11.00      13.2  19   Male  English
3    NA      16.0  49   Male    Other
4 17.76      14.0  46   Male    Other
5    NA       8.0  71   Male  English
6 14.00      16.0  50 Female  English


Step 1: Average education and wages by gender


In this plot, we are trying to study the following things:

  • How do males and females fare in terms of mean years of education?
  • Who is earning more on average?
  • Collectively, we are trying to see the impact of years of education on mean earnings.

interim.df<-df%>%
  group_by(sex)%>%
  summarise(MeanEducation=mean(education,na.rm=T),
            MeanWages=mean(wages,na.rm=T))

interim.df
# A tibble: 2 x 3
  sex    MeanEducation MeanWages
  <fct>          <dbl>     <dbl>
1 Female          12.4      13.9
2 Male            12.6      17.2


Assigning appropriate values to education and salary flows across nodes

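# Take the group means from interim.df rather than hard-coding them,
# so that each gender gets its own value
male_education   <- interim.df$MeanEducation[interim.df$sex == "Male"]
female_education <- interim.df$MeanEducation[interim.df$sex == "Female"]

male_wages   <- interim.df$MeanWages[interim.df$sex == "Male"]
female_wages <- interim.df$MeanWages[interim.df$sex == "Female"]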
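
The same flows can also be laid out as a small table of links before they are handed to plotly. The sketch below is only an illustration in base R: it assumes the rounded group means shown in the interim.df printout, and the names means, nodes, node_index and links are hypothetical helpers, not part of plotly. It relies on the fact that the source and target vectors of a Sankey trace refer to nodes by their zero-based position in the label vector.

```r
# Rounded group means copied from the printed interim.df output (assumption:
# used here only so the sketch runs without the carData package).
means <- data.frame(
  sex           = c("Female", "Male"),
  MeanEducation = c(12.4, 12.6),
  MeanWages     = c(13.9, 17.2)
)

# Node labels in the order they will be passed to plotly.
nodes <- c("Education", "Male", "Female", "Salary")

# Hypothetical helper: zero-based position of a label in `nodes`.
node_index <- function(label) match(label, nodes) - 1

# One link into each gender node (education) and one out of it (wages).
links <- rbind(
  data.frame(source = node_index("Education"),
             target = node_index(means$sex),
             value  = means$MeanEducation),
  data.frame(source = node_index(means$sex),
             target = node_index("Salary"),
             value  = means$MeanWages)
)

links
```

The columns links$source, links$target and links$value could then be passed to the link list of the Sankey trace in place of hand-typed vectors, which keeps each value attached to the right pair of nodes.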

Step 2: Initialising the plotly object

fig <- plot_ly(
    type = "sankey",
    orientation = "h",

    node = list(
      label = c("Education", "Male", "Female", "Salary"),
      color = c("orange", "orange", "orange", "orange"),
      pad = 15,
      thickness = 15,
      line = list(
        color = "black",
        width = 0.5
      )
    ),

    link = list(
      source = c(0,0,1,2),
      target = c(1,2,3,3),
      value =  c(male_education,female_education,male_wages,female_wages)
    )
  )
fig <- fig %>% layout(
    title = "Basic Sankey Diagram",
    font = list(
      size = 10
    )
)

fig
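
Because the links refer to nodes only by position, a mistyped index is easy to miss. As a hedged sketch, the check below validates the link vectors against the node labels before plotting. The function check_sankey_links is a hypothetical helper written for this illustration, not a plotly function, and the values are the rounded means from the summary table.

```r
# Hypothetical helper (not part of plotly): basic sanity checks on link vectors.
check_sankey_links <- function(labels, source, target, value) {
  n <- length(labels)
  stopifnot(
    length(source) == length(target),      # one target per source
    length(source) == length(value),       # one value per link
    all(source >= 0 & source <= n - 1),    # zero-based node positions
    all(target >= 0 & target <= n - 1),
    all(is.finite(value) & value >= 0)     # flows are non-negative numbers
  )
  invisible(TRUE)
}

labels <- c("Education", "Male", "Female", "Salary")

check_sankey_links(
  labels,
  source = c(0, 0, 1, 2),
  target = c(1, 2, 3, 3),
  value  = c(12.6, 12.4, 17.2, 13.9)  # rounded means from the summary table
)
```

If an index falls outside the range of the label vector, stopifnot() raises an error naming the failed condition, which is usually easier to debug than a chart with a missing or misplaced link.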


Step 3: How to read the graph

Connections to the left of the gender nodes (Male and Female) represent inputs in the form of education. Connections to the right indicate output in the form of wages. We can see that for nearly the same level of education (about 12.6 years for males and 12.4 for females), males earn higher wages than females (about 17.2 versus 13.9). Obviously there are several other factors at play, but the plot helps us see how education and wages relate across the two groups.
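
To put rough numbers on this reading, the comparison can be sketched as simple ratios. This is only an illustration: it uses the rounded means from the summary table, and the names means, WagePerEduYear, wage_ratio and edu_ratio are hypothetical. It is not a model of the relationship between education and wages.

```r
# Rounded group means copied from the printed summary table (not recomputed here).
means <- data.frame(
  sex           = c("Female", "Male"),
  MeanEducation = c(12.4, 12.6),
  MeanWages     = c(13.9, 17.2)
)

# Illustrative ratio: mean wage per mean year of education for each group.
means$WagePerEduYear <- means$MeanWages / means$MeanEducation

# Male-to-female ratios of mean wages and of mean education.
is_male    <- means$sex == "Male"
wage_ratio <- means$MeanWages[is_male] / means$MeanWages[!is_male]
edu_ratio  <- means$MeanEducation[is_male] / means$MeanEducation[!is_male]

means
c(wage_ratio = wage_ratio, edu_ratio = edu_ratio)
```

The education ratio is close to 1 while the wage ratio is noticeably larger, which matches what the widths of the links show in the chart.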

