Lab 2: Union Find (8 Points)
Chris Tralie
Overview / Logistics
In this lab you will implement an efficient data structure for the disjoint set data structure known as union find. You will implement and test several accelerations to union find that make the "amortized cost" (average cost) of an operation near constant time(!)
Click here to download the starter code for this lab, which provides a reference implementation of the slow disjoint set data structure, as well as code to compare this to your implementations in this lab in the file test.py
Learning Objectives
- Convert a problem into a format required by an abstract data type
- Implement union find with rank-based merging and path compression in python
- Experimentally measure and plot the amortized cost of operations as the input size scales
What To Submit
When you are finished, submit your files unionfind.py
, unionrank.py
, and unionrankpc.py
to canvas, as well as your amortized cost plot and an answer to the complexity question as a comment on canvas.
Background
Review the pseudocode and example picture below of the basic union find algorithm. If needed, click here to rewatch the video I made that explains this
Basic Union Find Implementation (2 Points)
Your Task: Following the pseudocode in the above picture, create a file called unionfind.py
which contains a class called UnionFind
with methods for root
, find
, and union
. For simplicity in this lab, you can assume that you start off with individual bubbles for the numbers 0, 1, ..., N-1, and you can pass N
as a parameter to the constructor, just as in the naive implementation you explored in the module.
Create some simple tests to convince yourself that this is working before you proceed to the next section.
Rank-Based Merging (2 Points)
As we discussed in the pre-lab, it is possible for union find to devolve to a linked list in the worst case, making the worst case find
operation O(N). But there's an extra step we can do to maintain a better tree when the union
happens. In addition to storing the parent references, we'll store something called the rank
of each node n, which, for now, we can think of as the "height," or the length of the longest path from any node which follows a path through our node n. In particular, the rules for the rank are as follows:
- The rank of every node starts off as 0
-
If we do a
union(i, j)
andrank(i) > rank(j)
, theni
becomesj
's parent -
If we do a
union(i, j)
andrank(i) < rank(j)
, thenj
becomesi
's parent -
If we do a
union(i, j)
andrank(i) = rank(j)
, theni
becomesj
's parent, and we increment the rank ofi
by 1
Let's look at an example of this. I've bolded the ranks that are currently at roots, since only the roots are involved in a union.
|
Now let's suppose we call union(4, 1)
. This will merge the roots 0 and 3.
To accomplish this, one possibility would be make 0's parent be 3, as shown below:
|
The other option is to make 3's parent be 0, as shown below:
|
This is definitely the better option, and it's precisely what the rank-based merging algorithm suggests we do! If we look at the length from 8 to its root in the first case, it increased by 1, but in the second case, it remained the same. In other words, the second case maintains shorter trees.
Your task: Copy and paste your code from unionfind.py
into a file named unionrank.py
. Modify your union
method in unionrank.py
so that it implements rank-based merging.
Path Compression (2 Points)
There's another on-the-fly optimization we can make to the union find structure known as path compression. At a high level, we want to try to avoid very long branches in our forest, because this will make it slower to find the root. So what we can do is some bookkeeping to make future calls of root are more efficient. It's easiest to see this with an example. Suppose we start with the forest below:
|
Then, let's suppose we called root(9)
. All of the highlighted nodes (9, 6, 1, 7) below will be visited on the path on the way to 9's root:
This is a long path relative to the number of elements here, but we can cut it down for future iterations by backing up and creating a shortcut from all of the highlighted nodes to the root that was found. The image below shows this, with parent pointers bolded to indicate what was changed
|
Notice how much shorter the tree is now! This is one of the ways we get towards near constant time performance on future queries.
Your Task: Create a copy of your file unionrank.py
called unionrankpc.py
. In this new file, modify your root
method to implement path compression, and make sure your union find still works properly. Leave the rank code in union
untouched! Surprisingly, even though the heights are changing, it turns out to be a good idea to follow the exact same rank algorithm in tandem with path compression.
Amortized Cost Plots (2 Points)
Now we will get a sense for the efficiency of the tree-based union find compared to one of our original ideas empirically by testing it out and timing it on inputs of varying size. Path compression may take a long time on some calls to root
, but we're banking on it saving us time in future operations. Hence, we are interested not in the cost of an individual operation, but in the amortized cost, or the average cost of unions and finds over many operations. Furthermore, whenever we're thinking about the performance of algorithms, we want to know how the algorithms scale in performance as the input size scales up.
Your task: Modify all three of your union find classes to have two local variables:
-
_operations
: A running count of how many times a parent arrow was followed inroot
and how many times a parent was reassigned inunion
-
_calls
: The total number of calls made to union and find.
I've done a similar thing in djsetslow.py
. Then, at the end of many operations, the amortized cost can be computed as _operations/_calls
.
Assuming you've done all of this properly, I've created a program in test.py
to call your code and the slow disjoint set code on many random tests of different sizes, and to plot the results. The program will also let you know if your find operations don't agree with those in djsetslow.py
, which you can consider a reference answer.
Question: What does the amortized cost of the slow implementation appear to be in big-O notation, as a function of N? What does the amortized cost of the optimized union find with both rank-based merging and path compression appear to be as a function of N?
For the bored...
Testing only once per input size is not reliable, because the operations chosen are random, and we may have gotten lucky or unlucky. To fix this, you should run multiple different random tests for a single input size and average their results. This will smooth things out a bit