Graph Analyses with Python and NetworkX


Graph Theory
The mathematical study of the properties and
applications of graphs, originally motivated by
recreational puzzles about routes and connections.
Traced back to Euler’s work on the Königsberg Bridges problem (1735), leading to the
concept of Eulerian graphs.

A graph G consists of a finite set of vertices (nodes), denoted by V or V(G), and a
collection of edges (links), denoted by E or E(G), of ordered or unordered pairs {u,v}
where u and v ∈ V.

Graphs can be directed or undirected.
In directed graphs (DiGraphs), the edges are ordered pairs: (u,v)
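
As a minimal sketch of the difference, assuming NetworkX is installed (the node names "u" and "v" are placeholders): an undirected Graph treats (u,v) and (v,u) as the same edge, while a DiGraph keeps only the direction that was added.

```python
import networkx as nx

# Undirected: the pair {u, v} is unordered, so both orientations match.
G = nx.Graph()
G.add_edge("u", "v")
undirected_both_ways = G.has_edge("u", "v") and G.has_edge("v", "u")

# Directed: the pair (u, v) is ordered, so only u -> v exists.
D = nx.DiGraph()
D.add_edge("u", "v")
directed_forward = D.has_edge("u", "v")
directed_backward = D.has_edge("v", "u")
```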

Describing Graphs
Network Definitions
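Cardinality: the number of elements in a set, written |V| or |E|
Order: the number of vertices, |V(G)|
Size: the number of edges, |E(G)|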
Graphs as sets
Local Cyclicity

Representing Graphs
Adjacency Matrix
[[0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0]]

Undirected graphs have symmetric adjacency matrices.
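
A small sketch in plain Python, using the matrix from the slide above, shows how the symmetry check, the edge list, and the vertex degrees can be read off an adjacency matrix. The variable names are illustrative only.

```python
# The 10 x 10 adjacency matrix shown above.
A = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
]

n = len(A)  # number of vertices

# An undirected graph has A[i][j] == A[j][i] for every pair.
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

# Each undirected edge appears twice, so read only the upper triangle.
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i][j]]

# The degree of a vertex is the sum of its row.
degrees = [sum(row) for row in A]
```

Vertex 5 has an all-zero row, so it is isolated; vertex 3 has the largest row sum.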

Describing Vertices
Directed Networks Undirected Networks
Node Neighbors
Degree

Describing Graphs by Structure

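Paths in a Network
Traversal is about following paths between vertices.
A path p from i to k is a sequence of vertices in which each consecutive pair is joined by an edge.
Length(p) = # of edges in path (one less than the # of nodes)
Paths(i,j) = set of paths from i to j
Shortest (unweighted) path length = minimum Length(p) over all p in Paths(i,j)
Diameter(G) = the longest shortest path length between any pair of vertices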
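
The following sketch illustrates shortest unweighted path lengths and the diameter with a breadth-first search. The graph and the function names are hypothetical, and the diameter helper assumes the graph is connected.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list (not from the slides).
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    """Longest shortest-path length over all pairs (assumes a connected graph)."""
    return max(max(shortest_path_lengths(graph, s).values()) for s in graph)

from_a = shortest_path_lengths(graph, "a")
graph_diameter = diameter(graph)
```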

Classes of Graph Algorithms
Generally speaking, CS algorithms are
designed to solve classes of math
problems, but we can further categorize them
into the following:
1. Traversal (flow, shortest distance)
2. Search (optimal node location)
3. Subgraphing (find minimum weighted
spanning tree)
4. Clustering (group neighbors of nodes)

For Reference
Bellman-Ford Algorithm | Dijkstra's Algorithm
Ford-Fulkerson Algorithm | Kruskal's Algorithm
Nearest neighbor | Depth-First and Breadth-First
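
As one illustration of the subgraphing class, here is a minimal sketch of Kruskal's algorithm for a minimum weighted spanning tree. The edge list and the helper names are made up for the example, and the input graph is assumed to be connected.

```python
def kruskal(nodes, weighted_edges):
    """Return the edges of a minimum spanning tree (assumes a connected graph)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the set representative, shortening the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it cannot close a cycle
            parent[ru] = rv
            tree.append((u, v, weight))
    return tree

# Hypothetical weighted graph given as (weight, u, v) triples.
edges = [(1, "a", "b"), (4, "a", "c"), (2, "b", "c"), (7, "b", "d"), (3, "c", "d")]
mst = kruskal(["a", "b", "c", "d"], edges)
total_weight = sum(w for _, _, w in mst)
```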

Ryan vs. Biden Debate (Twitter Reaction)
http://thecaucus.blogs.nytimes.com/2012/10/16/who-won-presidential-debate-on-twitter/?_r=0

Topics shifting over time
http://informationandvisualization.de/blog/graphbased-visualization-topic-shifts

Pearson OpenClass Graph
http://thinkaurelius.com/2013/05/13/educating-the-planet-with-pearson/

Graph Analysis for Big Data (Uber Trips in San Francisco)
http://radar.oreilly.com/2014/07/there-are-many-use-cases-for-graph-databases-and-analytics.html

Information Flows
http://web.math.princeton.edu/math_alive/5/Lab1/Networks.html

Why Graphs?
1. Graphs are abstractions of real life
2. Represent information flows that exist
3. Explicitly demonstrate relationships
4. Enable computations across large datasets
5. Allow us to compute locally to areas of
interest with small traversals
6. Because everyone else is doing it
(PageRank, SocialGraph)

Machine Learning using Graphs
- Machine Learning is iterative but iteration
can also be seen as traversal.

- Machine Learning requires many instances
with which to fit a model to make predictions.

- Important analyses are graph algorithms:
clusters, influence propagation, centrality.

- Performance benefits on sparse data

- Many domains have structures already
modeled as graphs (health records, finance)
- More understandable implementation

Iterative PageRank in Python
import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    n = G.shape[0]
    # transform G into markov matrix M (each non-sink row sums to 1)
    M = csc_matrix(G, dtype=float)
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]
    sink = rsums == 0  # bool array of sink states
    # Compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        for i in range(0, n):
            Ii = np.array(M[:, i].todense())[:, 0]  # inlinks of state i
            Si = sink / float(n)  # account for sink states
            Ti = np.ones(n) / float(n)  # account for teleportation
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))
    return r / sum(r)  # return normalized pagerank

Graph-Based PageRank in Gremlin
pagerank = [:].withDefault{0}
size = uris.size();
uris.each{
    count = it.outE.count();
    if(count == 0 || rand.nextDouble() > 0.85) {
        rank = pagerank[it]
        uris.each {
            pagerank[it] = pagerank[it] / uris.size()
        }
    }
    rank = pagerank[it] / it.outE.count();
    it.out.each{
        pagerank[it] = pagerank[it] + rank;
    }
}

Classes of Graph Analyses
● Existence: Does there exist [a path, a
vertex, a set] within [constraints]?
● Construction: Given a set of [paths,
vertices], is a [constraint] graph construction
possible?
● Enumeration: How many [vertices, edges]
exist with [constraints]? Is it possible to list
them?
● Optimization: Given several [paths, etc.], is
one the best?

… are graphs!
http://randomwire.com/linkedin-inmaps-visualises-professional-connections/

A social network is a data structure whose nodes are composed of
actors (proper nouns except places) that transmit information to each
other according to their relationships (links) with other actors.
Semantic definitions of both actors and relationships are illustrative:
Actor: person, organization, place, role
Relationship: friends, acquaintance, penpal, correspondent
Social networks are complex - they have non-trivial topological features
that do not occur in simple networks.
Almost any system humans participate and communicate in can be
modeled as a social network (hence rich semantic relevance)

Attributes of a Social Network
- Scale Free Networks
  http://en.wikipedia.org/wiki/Scale-free_network
  - Degree distribution follows a power law
  - Significant topological features (not random)
  - Commonness of vertices with a degree that greatly exceeds
    the average degree (“hubs”) which serve some purpose
- Small World Networks
  http://en.wikipedia.org/wiki/Small-world_network
  - Most nodes aren’t neighbors but can be reached quickly
  - Typical distance between two nodes grows proportionally to the
    logarithm of the order of the network.
  - Exhibits many specific clusters

Degree Distribution
- We’ve looked so far at per node properties
(degree, etc) and averaging them gives us
some information.
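- Instead, let’s look at the entire distribution
- Degree Distribution: P(k) = fraction of nodes in the network with degree k
- Power Law distribution: P(k) ∝ k^(-γ) for some constant exponent γ > 0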
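
A short sketch of an empirical degree distribution, where P(k) is taken to mean the fraction of nodes with degree k. The degree sequence is invented for illustration and is far too small to show a power law.

```python
from collections import Counter

# Hypothetical degree sequence: one entry per node.
degrees = [4, 4, 3, 2, 1, 1, 1]

n = len(degrees)
counts = Counter(degrees)

# P(k): fraction of nodes whose degree is exactly k.
P = {k: counts[k] / n for k in sorted(counts)}

# The average degree is a single-number summary of the same data.
mean_degree = sum(degrees) / n
```

Comparing `P` with `mean_degree` shows what averaging hides: most nodes here sit below the mean while a few sit well above it.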

Network Topology
http://filmword.blogspot.com/2010/04/emerging-brain.html

Graphs contain semantically relevant information - “Property Graph”
https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model

Property Graph Model
The primary data model for Graphs, containing these elements:
1. a set of vertices
○ each vertex has a unique identifier.
○ each vertex has a set of outgoing edges.
○ each vertex has a set of incoming edges.
○ each vertex has a collection of properties defined by a map
from key to value.
2. a set of edges
○ each edge has a unique identifier.
○ each edge has an outgoing tail vertex.
○ each edge has an incoming head vertex.
○ each edge has a label that denotes the type of relationship
between its two vertices.
○ each edge has a collection of properties defined by a map from
key to value.

Graph: An object that contains vertices and edges.
Element: An object that can have any number of key/value pairs
associated with it (i.e. properties)
Vertex: An object that has incoming and outgoing edges.
Edge: An object that has a tail and head vertex.
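
The four objects above can be sketched with plain Python dataclasses. This is an illustration of the model only, not the Blueprints API; the class layout, method names, and sample properties are assumptions.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Element:
    """Anything that carries an identifier and key/value properties."""
    id: str
    properties: dict = field(default_factory=dict)

@dataclass(eq=False)
class Vertex(Element):
    out_edges: list = field(default_factory=list)  # edges leaving this vertex
    in_edges: list = field(default_factory=list)   # edges entering this vertex

@dataclass(eq=False)
class Edge(Element):
    label: str = ""
    tail: Vertex = None  # vertex the edge leaves
    head: Vertex = None  # vertex the edge enters

class Graph:
    """Container for vertices and edges, keyed by identifier."""
    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, id, **properties):
        vertex = Vertex(id=id, properties=properties)
        self.vertices[id] = vertex
        return vertex

    def add_edge(self, id, tail, head, label, **properties):
        edge = Edge(id=id, properties=properties, label=label, tail=tail, head=head)
        tail.out_edges.append(edge)
        head.in_edges.append(edge)
        self.edges[id] = edge
        return edge

g = Graph()
alice = g.add_vertex("v1", name="Alice")
bob = g.add_vertex("v2", name="Bob")
knows = g.add_edge("e1", alice, bob, "knows", since=2014)
```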

Modeling property graphs with labels, relationships, and properties.
http://neo4j.com/developer/guide-data-modeling/

File Based Serialization
GraphML (XML)
GraphSON (JSON)
NetworkX
Gephi
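
As a hedged sketch of a GraphML round trip, assuming NetworkX is installed: the graph below is made up, and the example stays in memory rather than writing a file.

```python
import networkx as nx

# Hypothetical graph with one node property and one edge property.
G = nx.Graph()
G.add_node("a", name="Alice")
G.add_node("b", name="Bob")
G.add_edge("a", "b", weight=2.5)

# Serialize to a GraphML (XML) string, then parse it back into a graph.
graphml_text = "\n".join(nx.generate_graphml(G))
H = nx.parse_graphml(graphml_text)
```

The same data can be written to disk with `nx.write_graphml(G, path)`, which produces a file that Gephi can open.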

Graph Databases
http://graphdatabases.com/

Neo4j: Querying with Cypher and a visual interface.

Relational Data: Use GraphGen
http://konstantinosx.github.io/graphgen-project/

Sample graph - note the shortest paths from D to F and A to E.

What is the most important vertex?

Identification of vertices that play the most
important role in a particular network (e.g. how
close to the center of the core is the vertex?)

A measure of popularity, determines nodes that
can quickly spread information to a localized
area.

Degree centrality simply ranks nodes by their degree.
(Figure: sample graph with nodes labeled by degree: k=4, k=4, k=3, k=2, k=1, k=1, k=1)
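
A minimal sketch of ranking by degree, assuming NetworkX is installed. The graph is hypothetical and is not the one pictured on the slide.

```python
import networkx as nx

# Hypothetical graph: "a" touches three edges, "e" only one.
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")])

# Raw degree k for each node, then the nodes ordered from most to least connected.
degree = dict(G.degree())
ranked = sorted(degree, key=degree.get, reverse=True)

# NetworkX's degree centrality divides each degree by (n - 1).
centrality = nx.degree_centrality(G)
```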

Shows which nodes are likely pathways of
information and can be used to determine how
a graph will break apart if nodes are removed.

Betweenness: for each pair of other nodes, the fraction of shortest paths between them
that pass through that particular node, summed over all pairs. Can also be normalized
by the number of node pairs, and can take edge weights into account.
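
To make betweenness concrete, here is a sketch assuming NetworkX is installed. The graph is hypothetical: two triangles joined through a single bridge node "x", so every shortest path between the two sides passes through "x".

```python
import networkx as nx

G = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),  # left triangle
    ("c", "x"), ("x", "d"),              # x bridges the two sides
    ("d", "e"), ("e", "f"), ("d", "f"),  # right triangle
])

# Raw counts: number of node pairs whose shortest path runs through each node.
raw = nx.betweenness_centrality(G, normalized=False)

# Default: the same scores divided by the number of possible node pairs.
scaled = nx.betweenness_centrality(G)

top = max(raw, key=raw.get)
```

Removing `top` from this graph would split it into two components, which is the "break apart" behavior described above.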

A measure of reach; how fast information will
spread to all other nodes from a single node.

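Closeness: based on the average number of hops to reach any other node in the network.
The reciprocal of the mean distance: (n-1) / (sum of distances to the n-1 reachable nodes),
scaled by (n-1) / (order(G)-1) when the node reaches only a neighborhood of n nodes.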
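
A sketch of closeness as the reciprocal of the mean hop distance, written in plain Python. The path graph and the function name are illustrative, and the extra scaling for nodes that reach only part of the graph is left out because this example is connected.

```python
from collections import deque

def closeness(graph, u):
    """(n - 1) / (sum of hop distances from u), over the n nodes u can reach."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical path graph: a - b - c - d - e
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
scores = {v: closeness(graph, v) for v in graph}
```

The middle node "c" scores highest because it is, on average, the fewest hops from everyone else.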

A measure of related influence, who is closest
to the most important people in the Graph?
Kind of like “power behind the scenes” or
influence beyond popularity.

Eigenvector: proportional to the sum of centrality scores of the neighborhood.
(PageRank is a stochastic eigenvector scoring)
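
The "proportional to the sum of neighbor scores" idea can be sketched with power iteration in plain Python. The graph is hypothetical, and iterating on A + I instead of A is a choice made here so the iteration also settles on bipartite graphs.

```python
import math

def eigenvector_centrality(graph, iterations=100):
    """Power iteration on (A + I), normalized to unit length at each step."""
    x = {v: 1.0 for v in graph}
    for _ in range(iterations):
        # New score = own score + sum of the neighbors' scores.
        nxt = {v: x[v] + sum(x[w] for w in graph[v]) for v in graph}
        norm = math.sqrt(sum(s * s for s in nxt.values()))
        x = {v: s / norm for v, s in nxt.items()}
    return x

# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
scores = eigenvector_centrality(graph)
```

Node "d" has only one neighbor, but that neighbor is the best-connected node, so its score is not negligible.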

Detection of communities or groups that exist in
a network by counting triangles.
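Measures “transitivity” - three-way (triadic) relationships
that indicate clusters
T(i) = # of triangles with i as a vertex
Local Clustering Coefficient: C_i = 2*T(i) / (k_i*(k_i-1)), where k_i is the degree of i
Graph Clustering Coefficient: C = (1/n) * sum of C_i over all n vertices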
Counting the number of triangles is a start towards
making inferences about “transitive closures” - e.g.
predictions or inferences about relationships

Green lines are connections from the node, black are the other connections

k_i = 6
T(i) = 4
C_i = (2*4) / (6*(6-1)) ≈ 0.267
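
The arithmetic above can be checked with a small plain-Python sketch. The graph is constructed to match the numbers (a node with six neighbors and four connected neighbor pairs); it is not necessarily the graph in the figure.

```python
from itertools import combinations

def local_clustering(graph, i):
    """C_i = 2 * T(i) / (k_i * (k_i - 1)), with T(i) the triangles through i."""
    neighbors = graph[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # A triangle through i is a pair of i's neighbors that are also adjacent.
    triangles = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2 * triangles / (k * (k - 1))

# Node 0 has neighbors 1..6; four pairs of those neighbors are connected.
edges = [(0, n) for n in range(1, 7)] + [(1, 2), (2, 3), (3, 4), (4, 5)]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

c0 = local_clustering(graph, 0)
average = sum(local_clustering(graph, v) for v in graph) / len(graph)
```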

Partitive classification utilizes subgraphing techniques
to find the minimum number of splits required to divide
a graph into two classes. Laplacian Matrices are often
used to count the number of spanning trees.
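
The spanning tree count mentioned above follows from Kirchhoff's matrix tree theorem: delete one row and column of the Laplacian and take the determinant. A minimal sketch, using exact fractions to avoid rounding; the function name and test graphs are illustrative.

```python
from fractions import Fraction

def count_spanning_trees(n, edges):
    """Determinant of the Laplacian (degree minus adjacency) with one row/column removed."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    # Drop the last row and column, then eliminate to upper-triangular form.
    M = [row[:-1] for row in L[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if M[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= M[col][col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size):
                M[r][c] -= factor * M[col][c]
    return int(det)

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
complete4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
trees_in_square = count_spanning_trees(4, square)
trees_in_k4 = count_spanning_trees(4, complete4)
```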

Distance based techniques like k-Nearest Neighbors
embed distances in graphical links, allowing for very
fast computation and blocking of pairwise distance
computations.

Just add probability! Bayesian Networks are directed,
acyclic graphs that encode conditional dependencies
and can be trained from data, then used to make
inferences.

Layouts
- Open Ord
- http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=731088
- Draws large scale undirected graphs with visual clusters
- Yifan Hu
- http://yifanhu.net/PUB/graph_draw_small.pdf
- Force Directed Layout with multiple levels and quadtree
- Force Atlas
- http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0098679
- A continuous force directed layout (default of Gephi)
- Fruchterman Reingold
- http://cs.brown.edu/~rt/gdhandbook/chapters/force-directed.pdf
- Graph as a system of mass particles (nodes are particles, edges are
springs) This is the basis for force directed layouts
Others: circular, shell, neato, spectral, dot, twopi …

Force Directed
http://en.wikipedia.org/wiki/Force-directed_graph_drawing

Hierarchical Graph Layout
https://seeingcomplexity.files.wordpress.com/2011/02/tree_graph_example.gif

Lane Harrison, The Links that Bind Us: Network Visualizations
http://blog.visual.ly/network-visualizations

The Hairball
http://www.slideshare.net/OReillyStrata/visualizing-networks-beyond-the-hairball

Edge Bundling
https://seeingcomplexity.wordpress.com/2011/02/05/hierarchical-edge-bundles/

Region Bundling
http://infosthetics.com/archives/2007/03/hierarchical_edge_bundles.html

Plan of Study
Extraction of Network from Email
Introduction to NetworkX
Analyzing our Email Networks
Visualizing our Email Network
Relief from Gephi

Graph Extraction from an email MBox
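
One possible sketch of the extraction step, using only Python's standard library. It treats each sender and recipient pair in the From, To, and Cc headers as a directed edge and counts how often it occurs. The function name and this choice of headers are assumptions, not the approach used in the workshop.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses

def email_edges(mbox_path):
    """Count (sender, recipient) address pairs across all messages in an mbox file."""
    edges = Counter()
    for msg in mailbox.mbox(mbox_path):
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    # Lowercase so the same address always maps to one node.
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges
```

Each key of the returned counter can become a directed edge, with the count as its weight.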


More from Benjamin Bengfort
