The suffix array algorithm
— competitive-programming, suffix-array — 19 min read
Introduction: searching for elements in a list
Searching for an element in a list is one of the classical problems in Computer Science. The simple solution consists of looking one by one at each element, starting from the beginning of the list. This is a linear algorithm, but we can't do much better for a general list, since in the worst case we'll have to check all the elements of the list to see if the element we're looking for is there. This algorithm is called linear search.
def linear_search(ls: list[Elem], target: Elem) -> Optional[int]: for i, elem in enumerate(ls): if elem == target: return i return None
If you know that your list is sorted in some kind of way, then we can exploit this to craft a more efficient algorithm, binary search. With binary search, we look for the element the same way we look for a word in the dictionary. We first look at the element in the middle, decide if it's our element, and if not, whether to look at the first half of the list or the second half. We repeat this process until we find our element in the list, or we run out of places to look for. Since at each iteration we're discarding half of the elements (instead of only one like we did in the linear search) the runtime of this algorithm is logarithmic. Therefore, it's preferable to search for items in an ordered list rather than an unordered one.
def binary_search(ls: list[Elem], target: Elem) -> Optional[int]: beg = 0 end = len(ls) - 1 while beg <= end: # Search in ls[beg:end] while not empty mid = (beg + end) // 2 # Get middle element if ls[mid] == target: return mid elif ls[mid] < target: # Target after middle beg = mid + 1 elif target < ls[mid]: # Target before middle end = mid - 1 return None
If we want to search for an element in a list, then we can only use linear search, since we don't have the guarantee that the list will be sorted. However, if we know that this list will be queried more than once, that is, we'll need to search for more than one element, then we could improve our algorithm by first sorting the list and then performing the search in the sorted version. We could even prevent modifying the original list by sorting a list of indices instead of the list itself. That way, we could search in this second list of indices to get the index of the element in our original list.
Actually, the same list can be ordered in different ways depending on how we decide to compare its elements. We can compare the elements directly, or use different criteria. For example, the list would be sorted as if we use by criteria the number of prime factors each number has. has no prime factors, , and are primes themselves and features twice since . When our criteria involves comparing something other than the element, we call that which we compare the key. Therefore, when we talked about sorting the list of indices instead of the original list we meant sorting the list of indices where the keys are the elements which they point to in the original list.
def sort_indices(ls: list[Elem]) -> list[int]: indices = list(range(len(ls))) return sorted(indices, key = lambda idx : ls[idx])
However ordering an unordered list is no trivial matter. This is also one of those classical problems in Computer Science, most often taught to Computer Science students in their first algorithms class. There are many different sorting algorithms: insertion sort, selection sort, bubble sort, merge sort, quick sort, etc. These algorithms make use of comparisons to sort the list, and the most efficient ones have a complexity bound of . This bound is actually tight, which means that you can't have a more efficient algorithm relying on comparisons.
Why is this bound tight?
Imagine you have a list of length . Then, the sorted list is a permutation of the elements of the list since the first element will have possible places to go, but the second one will only have , since it can't take the place occupied by the first element. Similarly, the -the element can't take the spots taken by the previous elements, until the final element which only has one final spot to take. We can therefore count the number of possible orders as .
In our algorithm, each time we make a comparison is to make a decision. We can draw a binary tree from our algorithm, where each node is a comparison it makes and its two children are the continuation of the algorithm, depending on whether the comparison is true or not. The leafs of this tree would be the permutation of the original list which makes it be sorted.
We've calculated that there are such permutations, and we know from algorithms class that the minimum height of a binary tree with leaves is . In this case, the height of the tree represents the worst case performance of our algorithm, that is the maximum number of comparisons you'll have to make to sort a list. Therefore, the number of comparisons needed for an algorithm is . Knowing that