Minimum Spanning Trees

In this guide, we’ll solve problem 107 in Project Euler, a popular series of programming problems. Please take a few moments to familiarize yourself with the problem and download the text file containing the adjacency matrix.

Project Euler generally discourages sharing problem solutions, but most problems also aren’t such straight-forward applications of fundamental algorithms.

At some point, I’ll make another post for the other Project Euler problems, giving detailed hints and reading assignments instead of explicit solutions.

But here, we’re going to focus on implementation. You should own and follow along with a good algorithms textbook. I recommend Dasgupta’s text.

To follow along, create a new directory and place the adjacency matrix file in it. Then download Scala, open your command prompt, navigate to the directory, type scala to open your REPL, and enter the code line by line (resist the urge to copy and paste).

Why is this important?

Some businesses depend on network effects. You’d probably only create a Facebook account, for example, if a large number of your friends and family were already on Facebook. For these companies, the goal is to be first to the market, then grab as much market share as possible. That’s the only way to survive.

And before Facebook, there was Friendster.

Because the development team at Friendster insisted on analyzing every connection between all tens of millions of users instead of strictly focusing on minimum spanning trees of users and their friends, memory requirements ballooned. As they were forced to upgrade to larger and larger servers, ensuing technical problems drove away much of their userbase, and they ended up losing out on the business that made Mark Zuckerberg a billionaire.

Poorly crafted algorithms can lose you billions. Know your stuff.

Before we begin

In this implementation, I assume the reader has some basic familiarity with graphs (the kind with vertices and edges, not the sort that track stocks). You should know what trees, vertices, edges, and adjacency matrices are.

I also assume the reader understands object-oriented programming and knows a few basics of functional programming. You should know what a map does, for example. If you know Java, you should be able to grok Scala easily enough.

What language are we using?

Scala! (Pronounced Scah-laa)

Like Python and Ruby, it allows imperative and functional programming, as well as exposes a great REPL. In addition, unlike those languages, it’s compiled, statically-typed, runs on the JVM, and interacts seamlessly with Java code. It’s basically the perfect language.

If you’d like to learn Scala, I recommend Scala creator Martin Odersky’s Scala By Example. For a reference to quickly glance into if you just want to know something specific, I recommend Alvin Alexander’s Scala Cookbook.

If not, translating these algorithms into a language like Python should be trivial. I’ve chosen Scala for its expressiveness, and am reasonably confident that code of the same clarity would take more lines in any other language.

Kruskal’s algorithm

Intro

In this problem, given a graph, we’d like to find a way to link all vertices of the graph without any redundant edges. When those edges have weights, we’d like to link the graph while minimizing the total weight of the resulting tree as much as possible. In short, we’d like to build a minimum spanning tree.

The key insight here is that we want to avoid creating cycles in our final graph (otherwise the network would no longer be minimal, as an edge could be removed without compromising connectivity). We can do this by building up our final tree through joining smaller, intermediate trees. If we join two minimum spanning trees for two subsets of our initial vertices, we get the minimum spanning tree for the subsets combined. The minimum spanning tree for two vertices is simply the lowest-weight edge connecting them.

Kruskal’s algorithm is easy to grasp. First store every vertex of the original graph as a one-node tree. Then, sort the edges in the original graph by weight and iterate through them. If two vertices aren’t already connected in some tree, connect their trees in a tree. If they’re already connected, the existing edge between them is already the cheapest, so we ignore the new edge and keep iterating.

Since we just keep grabbing the smallest edge not already in any tree and adding it to our final tree, this is a greedy algorithm. By the end of this, we’ll have our minimum spanning tree!

Disjoint Sets

How can we check if two vertices are in the same tree? We need some kind of data structure to keep track of which tree each vertex is in, and the means to combine two trees.

This is best accomplished using disjoint sets.

In Kruskal’s algorithm, we’ll have each vertex start out as its own set. When we first add two vertices to the same tree, we set one’s parent vertex to the other vertex, and the parent becomes the shared root of the new set. This will be our set union function.

To check if two vertices are part of the same set, we’ll simply see if their root vertices are the same. If so, they’re part of the same tree. This will be our set find function.

Implementation

First, let’s create two classes to represent vertices and edges in our graph. Feel free to type the following into your REPL.

class Vertex(val id: Int, var dsParent: Vertex = null)
class Edge(val weight: Int, val start: Vertex, val end: Vertex)

Each vertex has a unique id and a disjoint set parent. In Scala, val represents immutable values. After a vertex is created, its id can never be changed. However, its disjoint set parent is declared as var, meaning it’s mutable. This will be important to union two sets.

Each edge has a weight, start vertex, and end vertex. These are all immutable, since we never want to change the initial graph.

Let’s now define the DisjointSet object.

object DisjointSet {

In Scala, an object is a class that only has one object. It’s equivalent to a class in Java that has only static methods.

Let’s go ahead and implement our find function. This function will ride a vertex’s parent reference up to the root (the root’s parent is null), then return the root’s id.

    def find(v: Vertex): Int = {
        var current = v
        
        while (current.dsParent != null) {
            current = current.dsParent
        }
        
        current.id
    }

Now let’s implement union. Given an array of vertices indexed by id and two vertices, it finds the root of both vertices, then sets the parent of one root to the other root.

    def union(vertices: Vector[Vertex], v1: Vertex, v2: Vertex) = {
        vertices(find(v1)).dsParent = vertices(find(v2))
    }
}

We’re now ready to solve the problem! Let’s start by loading the adjacency matrix file into memory as a String array.

val adjMatrix = io.Source.fromFile("p107_network.txt").getLines.toIndexedSeq

Now, let’s create our vertices. Each row of the matrix contains the edges for one Vertex, so we create a Vertex array indexed by id, setting each entry’s parent to null (each vertex is presently a one-node tree’s root).

val vertices = Vector.range(0, adjMatrix.size).map(new Vertex(_))

Now we loop through the String array to build an ArrayBuffer[Edge] and populate each Edge with a start vertex, end vertex, and weight.

var edges = collection.mutable.ArrayBuffer[Edge]()
for ((adjMatrixRow, i) <- adjMatrix.zipWithIndex) {
    val rowEdgeWeights = adjMatrixRow.split(",").map(_ match { case "-" => -1 
                                                               case x => x.toInt })
    for ((weight, j) <- rowEdgeWeights.zipWithIndex) {
        if (weight != -1) {
            edges += new Edge(weight, vertices(i), vertices(j))
        }
    }
}

Now, since we want to find the final weight saving gained by creating a minimum spanning tree, we start by finding the total weight of the initial graph. Since the graph is undirected, the adjacency matrix counts every edge twice, so we need to divide our final result by two.

val totalWeight = (edges.map(_.weight).sum)/2

Now we sort the ArrayBuffer[Edge] by weight, then apply Kruskal’s algorithm. mst stores our minimum spanning tree, being built edge by edge.

val sortedEdges = edges.sortWith(_.weight < _.weight)
var mst = collection.mutable.ArrayBuffer[Edge]()
for (edge <- sortedEdges) {
    if (DisjointSet.find(edge.start) != DisjointSet.find(edge.end)) {
        mst += edge
        DisjointSet.union(vertices, edge.start, edge.end)
    }
}
val mstWeight = mst.map(_.weight).sum

Finally, we arrive at our solution!

totalWeight - mstWeight

Wow. That was quick, wasn’t it? In fifty lines of code, we’ve solved a Project Euler problem with real-world applications and implemented Kruskal’s algorithm!

Do you understand everything so far? Great! Now prove it in the language of your choice!

I hope you’ve found this helpful, and wish you luck on your algorithmic journey!

Created on November 21 2016, last modified on October 1 2017.


Twitter
Facebook