Researchers’ Blog Explorations with Gephi

Hi everyone! Sorry to miss our team update on Tuesday morning, but hopefully this post will give a sense of what I've been up to. I welcome suggestions on priorities for next steps.

I have been working with Gephi to visualize the family-chain network in Saskatchewan. I produced the last visual I posted with the help of Elijah Meeks, the Digital Humanities Specialist at Stanford, and since then I have been working with the same dataset but trying new variations on the visualization. The one I posted is based on a network attraction/repulsion model rather than a geographical model (I'm still working out kinks in the geographical visualization), meaning that linked nodes are drawn closer together and unlinked nodes farther apart. Family chains are blue nodes and destinations in Saskatchewan are red nodes.

Gephi offers a set of powerful calculation tools that allow us to further explore the qualities of the Saskatchewan network. One basic and useful calculation is the in-degree and out-degree of the network nodes. In a directed graph (one-way relationships), which is what we have in an immigrant-destination model, in-degree refers to the number of incoming edges and out-degree refers to the number of outgoing edges. In classic social network analysis a high in-degree is typically called popularity and a high out-degree is referred to as gregariousness. For our purposes, Saskatchewan towns receiving a high number of immigrants could likewise be called popular (as an immigrant destination) and families going to a high number of different destinations could be considered diffused or prevalent, depending on how you look at it.
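For anyone who wants to see the counting itself, here is a minimal Python sketch of in-degree and out-degree on a directed edge list. The family and town names are made up for illustration and are not from the Saskatchewan dataset, and `degree_counts` is my own helper, not a Gephi function.

```python
from collections import Counter

# Hypothetical (family, destination) pairs; NOT taken from the Saskatchewan dataset.
edges = [
    ("Family A", "Town X"),
    ("Family A", "Town Y"),
    ("Family B", "Town X"),
    ("Family C", "Town X"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    out_degree = Counter(source for source, _ in edges)   # edges leaving each node
    in_degree = Counter(target for _, target in edges)    # edges arriving at each node
    return in_degree, out_degree

in_degree, out_degree = degree_counts(edges)
print(in_degree.most_common(1))   # the most "popular" destination
print(out_degree.most_common(1))  # the most widely dispersed family
```

Because every edge runs from a family to a destination, families only ever have out-degree and destinations only ever have in-degree in this setup.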

To give you an idea of our network, here are the degree measures for the 9 nodes with the highest degrees:
Regina 75
Moose Jaw 74
Saskatoon 60
Swift Current 36
Weyburn 10
Yu-Bak Yim-Hoi Ping 7
Li-Chin Sun-Hock San 7
Gull Lake 5
Yorktown 5

Regina receives the largest number of distinct immigrant families, and the Yu from Bak Yim immigrate to the largest number of different places in Saskatchewan. It is pretty interesting that some families sneak into the highest-degree list ahead of most places. The majority of the nodes in the network have a degree of 1, so the top 9 are unusual.

The other Gephi tools I started exploring include Modularity, Connected Components, PageRank, and Graph Density. I am still trying to figure out completely what these numbers mean in absolute terms, but in relative terms, the measures offer a basic understanding of the network.

Modularity measures how well a network decomposes into modular communities. A high modularity score indicates sophisticated internal structure. This structure, often called a community structure, describes how the network is compartmentalized into sub-networks. These sub-networks (or communities) have been shown to have significant real-world meaning. The network modularity score is 0.843 and there are 80 distinct communities in a network with 486 nodes. A community can be loosely defined as a set of nodes with dense connections between them. While it's unclear to me what Gephi considers dense and how this network may compare to other networks, it is definitely worthwhile to note that there are "community" relationships within the network, and we're not just looking at random migration. At the same time, Saskatchewan obviously has a diverse set of families and destinations--there is no single draw in the province.
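To make the score a little less of a black box, here is a rough Python sketch of the standard modularity formula for an undirected, unweighted network: for each community, take the fraction of edges that fall inside it and subtract the fraction you would expect if edges were placed at random. This only scores a partition you hand it; it does not find the communities the way Gephi does. The function name and the toy data are my own assumptions.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a given partition, treating edges as undirected and unweighted."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Observed share of internal edges minus the share expected by chance.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy network: two families go to Town X, two others go to Town Y.
toy_edges = [("A", "X"), ("B", "X"), ("C", "Y"), ("D", "Y")]
two_groups = {"A": 0, "B": 0, "X": 0, "C": 1, "D": 1, "Y": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(toy_edges, two_groups))  # clean split into two communities
print(modularity(toy_edges, one_group))   # everything lumped together
```

In the toy case the clean two-way split scores well above the single lump, which is the sense in which a high score signals real community structure.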

The Connected Components feature calculates the number of strongly and weakly connected components in the network. The Saskatchewan network has 486 strongly connected components and 70 weakly connected components. (I believe there are as many strongly connected components as nodes because every edge runs one way, from a family to a destination, so no two nodes can reach each other in both directions and each node ends up as its own component.) The 70 weakly connected components are the separate clusters that remain when edge direction is ignored. Again, I'm not sure how that compares to other immigrant networks in absolute terms. My best guess is that many of the smaller weakly connected components can be found at the periphery of the graph that I attached.
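Here is a small Python sketch of the two counts, assuming (as in our model) that every edge runs from a family to a destination. Weak components ignore edge direction; strong components require that each node can reach the other along directed edges. The function names and toy data are hypothetical, and the strong-component routine is a brute-force version meant only for tiny examples.

```python
from collections import defaultdict

def weak_components(nodes, edges):
    """Count components when edge direction is ignored (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps lookups short
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

def strong_components(nodes, edges):
    """Count groups of nodes that can all reach each other along directed edges."""
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    # Two nodes share a strong component only if each can reach the other.
    return len({frozenset(v for v in reach[n] if n in reach[v]) for n in nodes})

# Hypothetical toy network: families A and B go to Town X, family C goes to Town Y.
toy_nodes = ["A", "B", "C", "X", "Y"]
toy_edges = [("A", "X"), ("B", "X"), ("C", "Y")]
print(weak_components(toy_nodes, toy_edges))
print(strong_components(toy_nodes, toy_edges))
```

In the toy network there are two weak components (the cluster around Town X and the cluster around Town Y), while every node is its own strong component because nothing ever points back from a town to a family.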

Graph Density is not entirely useful in our case. It refers to how close the network is to "complete." A complete network has an edge between every pair of nodes and a density of 1. Our graph is set up such that blue nodes cannot be connected to blue nodes and red nodes cannot be connected to red nodes, so it could never be complete. Thus, our Graph Density is a meager 0.004 (undirected), very far from being complete. Still, interesting to know.
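The arithmetic behind density is simple enough to sketch in Python: the number of edges divided by the number of edges a complete network would have. The second function is my own variation, not something Gephi reports: it divides by the number of possible family-to-destination pairs instead, which may be a fairer yardstick for a network like ours. All names and numbers below are illustrative only.

```python
def graph_density(node_count, edge_count, directed=False):
    """Edges present divided by edges possible between all pairs of nodes."""
    possible = node_count * (node_count - 1)  # ordered pairs of distinct nodes
    if not directed:
        possible //= 2                        # unordered pairs
    return edge_count / possible

def bipartite_density(family_count, destination_count, edge_count):
    """Edges present divided by the possible family-to-destination pairs only."""
    return edge_count / (family_count * destination_count)

# Hypothetical toy network: 3 families, 2 destinations, 3 edges.
print(graph_density(5, 3))          # measured against all 10 possible pairs
print(bipartite_density(3, 2, 3))   # measured against the 6 family-town pairs
```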

PageRank is a measure originally designed for ranking web pages. It measures the importance of each node within the network. The metric assigns each node a probability: the probability of being at that page after many clicks on randomly chosen links. To compute it, the standard adjacency matrix is normalized so that the columns of the matrix sum to 1. We're not looking at webpages, but PageRank could still be a helpful way to think about the network. How important is an individual node? We can measure this not only by how many other nodes are connected to it, but also by how many nodes THOSE nodes are connected to. If we look at the top 9 PageRank nodes for the Saskatchewan network, we'll see similarities but also differences from the degree measure list.

Regina 0.066
Moose Jaw 0.066
Saskatoon 0.053
Swift Current 0.032
Weyburn 0.01
Li-Chin Sun-Hock San 0.007
Yu-Bak Yim-Hoi Ping 0.006
Assinobia 0.005
Maple Creek 0.005

Apparently, even though the Li of Chin Sun and Yu of Bak Yim have the same number of connections (go to the same number of destinations), the Li go to better connected places somewhat more than the Yu do. Likewise, although Gull Lake and Yorktown receive more distinct immigrant families than Assinobia or Maple Creek, Assinobia and Maple Creek receive better connected immigrant families.
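For the curious, the PageRank idea described above can be sketched in a few lines of Python as a repeated "random click" update. This is a generic textbook version, not necessarily what Gephi does internally: the damping value of 0.85, the handling of nodes with no outgoing edges, and the toy data are all assumptions on my part.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed edge list; scores sum to 1."""
    out_links = {node: [] for node in nodes}
    for u, v in edges:
        out_links[u].append(v)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank sitting on nodes with no outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                # Each node splits its rank evenly among the nodes it points to.
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank

# Hypothetical toy network: families A and B go to Town X, family C goes to Town Y.
toy_nodes = ["A", "B", "C", "X", "Y"]
toy_edges = [("A", "X"), ("B", "X"), ("C", "Y")]
ranks = pagerank(toy_nodes, toy_edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

In this toy run Town X outranks Town Y because it receives two families rather than one, and both towns outrank the families, which receive no incoming edges at all.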

Looking forward to your comments!

Gephi Exploration Image

Head Tax Record SunWoy RC PowerPoint