Sorted list implementation
Implementation Details¶The Sorted Containers internal implementation is based on a couple observations. The first is that Pythons list is fast, really fast. Lists have great characteristics for memory management and random access. The second is that bisect.insort is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software. But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of Sorted List uses a list of lists to store elements. In this way, inserting or deleting is most often performed on a short list. Only rarely does a new list need to be added or deleted. Sorted List maintains three internal variables: _lists, _maxes, and _index. The first is simply the list of lists, each member is a sorted sublist of elements. The second contains the maximum element in each of the sublists. This is used for fast binary-search. The last maintains a tree of pair-wise sums of the lengths of the lists. Lists are kept balanced using the load factor. If a sublists length exceeds double the load then it is split in two. Likewise at half the load it is combined with its neighbor. By default this factor is 1,000 which seems to work well for lengths up to ten million. Lengths above that are recommended a load factor that is the square root to cube root of the average length. (Although you will probably exhaust the memory of your machine before that point.) Experimentation is also recommended. A load factor performance comparison is also provided. For more in-depth analysis, read Performance at Scale which benchmarks Sorted Containers with ten billion elements. Finding an element is a two step process. First the _maxes list, also known as the maxes index, is bisected which yields the position of a sorted sublist. Then that sublist is bisected for the location of the element. Compared to tree-based implementations, using lists of lists has a few advantages based on memory usage.
Traditional tree-based designs have better big-O notation but that ignores the realities of todays software and hardware. For a more in-depth analysis, read Performance at Scale. Indexing uses the _index list which operates as a tree of pair-wise sums of the lengths of the lists. The tree is maintained as a dense binary tree. Its easiest to explain with an example. Suppose _lists contains sublists with these lengths (in this example, we assume the load factor is 4): list(map(len, _lists)) -> [3, 5, 4, 5, 6]
Given these lengths, the first row in the index is the pair-wise sums: [8, 9, 6, 0]
We pad the first row with zeros to make its length a power of 2. The next rows of sums work similarly: [17, 6]
[23]
Then all the rows are concatenated in reverse order so that the index is finally: [23, 17, 6, 8, 9, 6, 0, 3, 5, 4, 5, 6]
With this list, we can efficiently compute the index of an item in a sublist and, vice-versa, find an item given an index. Details of the algorithms to do so are contained in the docstring for SortedList._loc and SortedList._pos. For example, indexing requires traversing the tree to a leaf node. Each node has two children which are easily computable. Given an index, pos, the left-child is at pos * 2 + 1 and the right-child is at pos * 2 + 2. When the index is less than the left-child, traversal moves to the left sub-tree. Otherwise, the index is decremented by the left-child and traversal moves to the right sub-tree. At a leaf node, the indexing pair is computed from the relative position of the node as compared with the offset and the remaining index. For example, given the following index: _index = 14 5 9 3 2 4 5
_offset = 3
Tree:
14
5 9
3 2 4 5
Indexing position 8 involves iterating like so:
The final index pair from our example is (2, 3) which corresponds to index 8 in the sorted list. Maintaining the position index in this way has several advantages:
The construction and maintenance of the positional index is unusual compared to other traditional designs. Whether the design is novel, I (Grant Jenks) do not know. Until shown otherwise, I would like to refer to it as the Jenks index. Each sorted container has a function named _check for verifying consistency. This function details the data-type invariants. |