Scott Logic Ltd

Sorted Lists in Java

Mark Rhodes, December 22nd, 2010

This post goes through an implementation a SortedList in Java which ensures its elements are kept in order and guarantees O(log(n)) run time for all the basic operations: get, contains, remove and add. The source code, javadoc, unit tests and class files for this post are available here: scottlogic-utils-mr-1.4.zip.

Sorting is one of the most common operations applied to lists and as such Java has built in mechanisms for doing it, like the Comparable and Comparator interfaces and the Collections.sort methods. These are great when you have a static list that needs to be ordered, but sometimes you want the list to remain sorted after some altering operations have been applied to it (e.g. if you add an element to an ArrayList which you’ve sorted, chances are that it’s no longer in the right order). For some reason, the java.util package is lacking a proper SortedList, and since they’re quite handy, I thought I write my own.

Alternatives
As with all data structures, whether a SortedList the right tool for the job depends on how the data is going to be used. The java.util package contains a host of different data structures, all of which have their place, but unfortunately it (at least at the moment) is missing the SortedList. A comparison between some of the the Java’s built in data structures and a SortedList is given below:

add(Object elem) remove(Object elem) get(int index) contains(Object elem) multiple equal elements
ArrayList O(1)* O(n) O(1) O(n) YES
LinkedList O(1) O(n) O(n) O(n) YES
TreeSet O(log(n)) O(log(n)) N/A O(log(n)) NO
PriorityQueue O(log(n)) O(n) N/A O(n) YES
SortedList O(log(n)) O(log(n)) O(log(n)) O(log(n)) YES
* – amortized constant time (inserting n objects takes O(n) time).

If you’re not likely to change the data structure much, you might want to just use an ArrayList or a regular array and sort it when you need, which can be done relatively quickly (O(n*log(n))), using Java’s Collections.sort or Arrays.sort methods respectively. So long as you don’t need to sort it too often, there’s no problem.

If you’re not sure how many times you’ll need to sort it, it’s quicker to ensure that the data is always kept in order. If you don’t need to store multiple equal elements and don’t need random access to the data (i.e. you don’t need to run get(int index)) then it’s probably best to use Java’s TreeSet. If you need to store multiple equal elements but don’t need random access to the data, or quick contains/remove methods then Java’s PriorityQueue might be the way to go. If these cases don’t apply, then a SortedList might be what you want.

Implementing a custom List in Java
All Lists in Java should implement the java.util.List interface, however, rather than implement this directly, the easiest way to start off writing your own List class is have it extend the AbstractList class instead. This class implements the List interface and provides default implementations of the methods, reducing the amount of code you need to write. Most of these default implementations just throws an UnsupportedOperationException, but some are useful. For example, the default listIterator and iterator methods provide you with working iterators once you’ve provided an implementation for the get(int index) method.

It’s also easy to configure the iterators provided by the AbstractList to exhibit fail-fast behaviour. This means that if the list is modified after the iterator has been created, other than through the iterator itself, then calling any method other than hasNext() on the iterator will (hopefully!) cause the cause a ConcurrentModificationException to be thrown, as is standard for all of Java’s non-synchronized Collection classes. To do this, you just need to increment the modCount variable whenever you add or remove an element from the list. For example, the add method in the given implementation has the following structure:

public boolean add(T object){
    boolean treeAltered;
    if(object != null){
         //add the object to the list..
         ...
         treeAltered = true;
         modCount++;
    }
    return treeAltered;
}

Here I took the decision not to allow null values to be added in the list, just to simplify the other methods and since on the vast majority of applications this is what you want. If you really need to add a null value you can always just add a special singleton instance of type T which you know represents null instead of null itself. The effect of the calling modCount++; in the add method is that the following code will now throw a ConcurrentModificationException.

SortedList<Integer> list = new SortedList<Integer>();
Iterator<Integer> itr = list.iterator();
list.add(1);
itr.next();

Another thing to consider when writing a custom List class (as with any class!) is how you are going to define its type and the constructors. Since a SortedList needs a way of comparing the elements it stores, I’ve decided to leave the type definition simple but only supply a constructor which takes a Comparator, this constructor has the following signature:

public SortedList(Comparator<? super T> comparator)

If you’re not used to Java generics then this line might look a bit odd! The type definition, “<? super T>>” just says that the given comparator must be capable of comparing elements of type T, which is the type of element that the list is able to store.

This might seem like a weird decision, since Java’s built in TreeSet doesn’t enforce the same restriction; it also allows a no-argument constructor. The reason is that this no-argument constructor is used to create a TreeSet which orders elements by their natural order; i.e. the order you get if you use the compareTo method on the set’s elements. The draw back of this design decision is that there is no way to say that either the elements must be Comparable, or you need to supply a Comparator, so you need to add a runtime check for this (i.e. note the ClassCastException thrown by the TreeSet‘s add method).

If you need this behaviour with this SortedList, you can you implement a very simple subclass of it which passes in a Comparator providing this natural ordering. An example implementation of this is given in this file: SortedListNatural.java, since it’s so short I’ve also included all the code in it below too:

public class SortedListNatural<T extends Comparable<? super T>> extends SortedList<T> {
	public SortedListNatural(){
		super(new Comparator<T>(){
			public int compare(T one, T two){
				return one.compareTo(two);
			}
		}); 
	}
}

Note that the type definition restricts the class so that only comparable objects can be stored in it; removing the need for any runtime check.

AVL Trees as Sorted Lists
In order to obtain the logarithmic run times for the standard operations, the SortedList needs to be based on some kind of balanced tree structure. I’ve used an AVL tree, which is pretty much the simplest form of balanced tree you can have (so less chance of mistakes!) and since it ensures the tree remains very balanced (more so than say a Red-Black Tree), it means that that the get and contains methods always run efficiently.

An AVL tree is a binary search tree, which rebalances itself whenever the height of any node’s subtree becomes at least two larger than it’s other subtree. The rebalancing requires implementing just two methods – the left and the right tree rotations. I won’t go into all the details here, but if you’re interested, a pretty easy to follow introduction to AVL trees is available at: lecture1, lecture2 and lecture3.

When it comes to actually implementing an AVL tree, the most obvious way to do it in Java is to have an inner Node class to represent the individual positions in the tree, and then have the main class hold a reference to the root Node of the tree. The Node class needs to be defined somewhat recursively, as each Node needs to maintain references to both the children nodes and their parent node. In order to use an AVL tree as a SortedList, you need this Node class to be slightly different than in a standard implementation, the changes required are that:

  1. Allow more than one node to store values that are equal in terms of the given comparator.
  2. Nodes need to remember the total number of children they have

The first alteration is required so that the tree can be used as a List rather than as a Set and the second in order to implement the get(int index) method efficiently.

The start of the Node class in the implementation looks like this:

private int NEXT_NODE_ID = Integer.MIN_VALUE;
 
private class Node implements Comparable<Node> {
    final int id = NEXT_NODE_ID++; //get the id and increment it..
    T value; //the data value being stored at this node 
 
    Node leftChild;
    Node rightChild;
    Node parent;
 
    //The "cached" values used to speed up methods..
    int height;
    int numChildren; 
 
    //Compares the t values using the comparator, if they are equal it uses the
    //node it - older nodes considered to be smaller..
    public int compareTo(Node other){
        int comparison = comparator.compare(value, other.value);
        return (comparison == 0) ? (id-other.id) : comparison;
    }
    ...

As the Node class is an inner class there is no need to “parameterise” the type definition, it automatically inherits the definition of the T from the SortedList parent class. The list allows multiple values to be stored by giving each node a unique id, which is incremented as each element is added. The Node‘s compareTo method then uses this when comparing values, so that nodes with the same value according to the comparator, are distinguished by their unique id. The height and numChildren fields are really just cached values, since their values could be obtained by examining the child nodes. It’s up to the implementation to ensure that these values are maintained as changes are made to the tree. In the given implementation, this is all done in the updateCachedValues method of the Node class:

private void updateCachedValues(){
	Node current = this;
	while(current != null){
		if(current.isLeaf()){
			current.height = 0;
			current.numChildren = 0;
		} else {
			//deal with the height..
			int leftTreeHeight = (current.leftChild == null) ? 0 : current.leftChild.height;
			int rightTreeHeight = (current.rightChild == null) ? 0 : current.rightChild.height;
			current.height = 1 + Math.max(leftTreeHeight, rightTreeHeight);
 
			//deal with the number of children..
			int leftTreeSize = (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
			int rightTreeSize = (current.rightChild == null) ? 0 : current.rightChild.sizeOfSubTree();                   
			current.numChildren = leftTreeSize + rightTreeSize;
		}
	   //propagate up the tree.. 
	   current = current.parent;
	}
}

So long as this method is called on the appropriate node each time the tree is structurally altered, the values will remain correct. It’s not always obvious which node constitutes the “appropriate one”, but it should always be the node which was altered with the lowest height in the resulting tree (it’s always the case that there is one such node).

The only key method that is missing from a standard AVL tree implementation that is required to make it work as a List is the get(int index) method. As I mentioned before, this method is going to make use of the numChildren field of the Node class to so that it can be implemented efficiently. Once this field is in place, it’s not difficult to implement – the method just needs to traverse the tree, making sure that it remembers how many smaller values there are than those at the current node; this effectively tells you the index of the first value stored at the current node. In the provided implementation, the code look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
@Override
public T get(int index){
	return findNodeAtIndex(index).value;
}
 
private Node findNodeAtIndex(int index){
	if(index < 0 || index >= size()){ 
		throw new IllegalArgumentException(index + " is not valid index.");
	}
	Node current = root;
	//the the number of smaller elements of the current node as we traverse the tree..
	int totalSmallerElements = (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
	while(current!= null){  //should always break, due to constraint above..
		if(totalSmallerElements == index){
			break;
		}
		if(totalSmallerElements > index){ //go left..
			current = current.leftChild;
			totalSmallerElements--;
			totalSmallerElements -= (current.rightChild == null) ? 0 : current.rightChild.sizeOfSubTree();
		} else { //go right.. 
			totalSmallerElements++;
			current = current.rightChild;
			totalSmallerElements += (current.leftChild == null) ? 0 : current.leftChild.sizeOfSubTree();
		}
	}
	return current;
}

Here the sizeOfSubTree method just returns one plus the number of children values of the node. The totalSmallerElements variable effectively stores the index of the current node is maintained in lines as the tree is traversed.

Doing without recursion
You might have noticed that the code so far has been iterative rather than recursive. Generally, most operations involving trees are written using recursion, but since iterative solutions tend to be quicker, I’ve stuck to using iteration throughout the implementation (the only exception is with the structurallyEqualTo method which is just there for testing). For methods where you just need to traverse the tree, like the get or contains methods, turning it from a recursive method to a iterative one is just a case of adding a while loop and keeping a reference to the current Node that you’re looking at. For example, you go from something like:

void method(){
    if(/*some condition holds for this node*/){
       return this;
    } else if(/*need to traverse left*/){
        return leftChild.method();
    } else {  //need to traverse right..
        return rightChild.method();
    }
}

to something like:

void method(){
    Node currentPosition = this; 
    while(currentPosition != null){
        if(/*some condition holds for currentPosition*/){
            return currentPosition;
        } else if(/*need to traverse left*/){
            currentPosition = leftChild;
        } else {  //need to traverse right..
            currentPosition = rightChild;
        }
}

The only difficulty comes when the method needs to go back to nodes that have previously been visited, (i.e. those that can’t be written with just simple tail recursion). For instance, if you want to print all the elements in the tree in order; with recursion this is just a few lines:

void printAll(){
    if(leftChild != null){
        leftChild.printAll();
    }
    printValues(); //prints the values at this node..
    if(rightChild != null){
        rightChild.printAll();
    }
}

This could then be invoked on the root node to print the whole tree. It’s really not obvious how to do this without using recursion! To overcome this, the Node class in the implementation defines a couple of handy iterative methods – the smallestNodeInSubTree, which finds the smallest node in the sub-tree rooted at the node and successor, which find the next largest node in the tree (so returns null for the node storing the largest values in the tree). They are defined like this:

public Node smallestNodeInSubTree(){ 
     Node current = this;
     while(current != null){
         if(current.leftChild == null){
             break;
         } else {
            current = current.leftChild;
        }
    }
    return current;
}
 
public Node successor(){
   Node successor = null;
   if(rightChild != null){
	   successor = rightChild.smallestNodeInSubTree();
   } else if(parent != null){
	   Node current = this;
	   while(current != null && current.isRightChildOfParent()){
		   current = current.parent;
	   }
	   successor = current.parent;
   }
   return successor;
}

With these in place, you could write an iterative version of the printAll method like this:

void printAll(){
    Node current = this.smallestNodeInSubTree();
    while(current != null){
        current.printValues(); //prints the values at the current node..
        current = current.successor();
    }
}

Unit Tests
I always find that when writing code like this, it’s really easy to make a mistake, so to build up some confidence in it I wrote some junit tests for the SortedList class which are included in the download. If you find a problem with it that’s not covered by them please let me know.

Mark Rhodes


This entry was posted on Wednesday, December 22nd, 2010 and is filed under Blog.

You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site .


11 Responses to “Sorted Lists in Java”

  1. Leandro says:

    The link to download the source is broken….. can you fix it???

  2. Mark Rhodes says:

    Cheers for letting me know – it should be fixed now. I’ve developed a few other data structures since I wrote this post, refactored the code and packaged them up in a single jar now so it should be a bit easier to use.

  3. Jakub Herkel says:

    Nice code but there is a small problem with method removeAll. When list is empty it returns exception.
    if(index = size()){
    throw new IllegalArgumentException(index + ” is not valid index.”);
    }

    • Mark Rhodes says:

      Cheers. Can you be more specific about when you get the error – when I run:

      SortedList sl = new SortedList(comp);
      boolean altered = sl.removeAll(Arrays.asList(1, 2, 3));

      I get what I expect – that there is no exception thrown and “altered” is false.

  4. Chih-Chiang says:

    Hi, Thanks for the great code.
    I am using the sortedlist to store my object. and I ran the profiler to inspect my memory usage. Does the sortedlist replicate each stored object twice in the list as “Node”? Because the profiler showed the number of instances of “Node” is equal to the numbers of the objects I inserted to the list, but the size of the all instance of Node are twice of the size of all objects.

    • Mark Rhodes says:

      Hi Chih-Chiang,

      No that’s not the case, a single Node object wraps a single data item. There is no way to deep clone an arbitrary Java object in this way. Each node class stores a reference to the element it’s storing, a reference to each of it’s two children, a reference to it’s parent node and three other integers; effectively, each node stores 6, 32 bit integers. The amount of memory required to store an individual object that is in the list has no effect on the amount of memory required by the data structure itself. Try using the list to store Objects which contain large arrays and you should see that the memory overhead of using the list is minimal in comparison to the size of the actual objects being stored.

  5. cpc says:

    Can I use it in open souce (GPL V3) program ?

Leave a Reply

© 2013 Scott Logic Ltd. All Rights Reserved.