Algorithms and Data Structures
Algorithm Complixity
description | order of growth | description | example |
---|---|---|---|
constant |
1 | statement | add two numbers |
logarithmic |
logN | divide in half | binary search |
linear |
N | loop | find the maximum |
linearithmic |
NlogN | divide and conquer | mergesort |
quadratic |
N^2 | double loop | check all pairs |
cubic |
N^3 | triple loop | check all triples |
exponential |
2^N | exhasutive search | chech all subsets |
Sorting
Sorting is the process of rearranging a sequence of objects so as to put them in some logical order.
Selection Sort
One of the simplest sorting algorithms works as follows: First, find the smallest item in the array and exchange it with the first entry (itself if the first entry is already the smallest). Then, find the next smallest item and exchange it with the second entry. Continue in this way until the entire array is sorted. This method is called selection sort
because it works by repeatedly selecting the smallest remaining item.
Data movement is minimal
. Each of the N
exchanges changes the value of two array entries, so selection sort uses N
exchanges—the number of array accesses is a linear function of the array size. None of the other sorting algorithms that we consider have this property (most involve linearithmic
or quadratic
growth).
public class SelectionSort {
public static void sort(Comparable[] a) {
int n = a.length;
for (int i = 0; i < n; i++) {
// Exchange a[i] with smallest entry in a[i+1...N).
int min = i; // index of minimal entr.
for (int j = i+1; j < n; j++) {
if (less(a[j], a[min])) {
min = j;
}
}
exch(a, i, min);
}
}
}
Insertion Sort
In a computer implementation, we need to make space to insert the current item by moving larger items one position to the right, before inserting the current item into the vacated position.
Unlike that of selection sort
, the running time of insertion sort depends on the initial order of the items in the input. For example, if the array is large and its entries are already in order (or nearly in order), then insertion sort is much, much faster than if the entries are randomly ordered or in reverse order.
public class InsertionSort {
public static void sort(Comparable[] a) {
int n = a.length;
for (int i = 1; i < n; i++) {
// Insert a[i] among a[i-1], a[i-2], a[i-3]... ..
for (int j = i; j > 0 && less(a[j], a[j-1]); j--) {
exch(a, j, j-1);
}
}
}
}
Insertion sort
is an excellent method for partially sorted arrays and is also a fine method for tiny arrays. These facts are important not just because such arrays frequently arise in practice, but also because both types of arrays arise in intermediate stages of advanced sorting algorithms.
Mergesort
Mergesort
: to sort an array, divide it into two halves, sort the two halves (recursively), and then merge the results. As you will see, one of mergesort’s most attractive properties is that it guarantees to sort any array of N
items in time proportional to N log N
. Its prime disadvantage is that it uses extra space proportional to N
.
public class MergeSort {
private static Comparable[] aux;
public static void merge(Comparable[] a, int lo, int mid, int hi) {
int i = lo, j = mid + 1;
for (int k = lo; k <= hi; k++) {
aux[k] = a[k];
}
for (int k = lo; k <= hi; k++) {
if (i > mid) a[k] = aux[j++];
else if (j > hi) a[k] = aux[i++];
else if (less(aux[j], aux[i])) a[k] = aux[j++];
else a[k] = aux[i++];
}
}
public static void sort(Comparable[] a) {
aux = new Comparable[a.length];
sort(a, 0, a.length - 1);
}
private static void sort(Comparable[] a, int lo, int hi) {
if (hi <= lo) return;
int mid = lo + (hi - lo) / 2;
sort(a, lo, mid); // Sort left half
sort(a, mid + 1, hi); // Sort right half
merge(a, lo, mid, hi);
}
}
Searching
data sctructure | implementation | pros | cons |
---|---|---|---|
linked list |
SequentialSearch | best for tiny ST | slow for large ST |
ordered array |
BinarySearch | optimal search and space, order based ops | slow insert |
binary search tree |
BST | easy to implement, order-based ops | no guarantees spaces for links |
balanced BST |
RedBlack | optimal search and insert, order-based ops | space for links |
hash tables |
SeparateChainingHash LinearProbingHash | fast search/insert for common types of data | need hash for each type no order-based ops space for links/empty |
Binary Search Tree
A binary search tree (BST)
is a binary tree where each node has a Comparable key (and an associated value) and satisfies the restriction that the key in any node is larger than the keys in all nodes in that node’s left subtree and smaller than the keys in all nodes in that node’s right subtree.
Order-based methods and deletion. An important reason that BSTs
are widely used is that they allow us to keep the keys in order. As such, they can serve as the basis for implementing the numerous methods in our ordered symbol-table API
that allow clients to access key-value pairs not just by providing the key, but also by relative key order.
Minimum and maximum. If the left
link of the root
is null
, the smallest key in a BST
is the key
at the root
; if the left
link is not null
, the smallest key in the BST
is the smallest key in the subtree
rooted
at the node
referenced by the left link.
Floor and ceiling. If a given key key
is less than the key at the root of a BST, then the floor of key
(the largest key in the BST less than or equal to key ) must be in the left subtree. If key is greater than the key at the root, then the floor of key
could be in the right subtree, but only if there is a key smaller than or equal to key in the right subtree; if not (or if key is equal to the key at the root), then the key at the root is the floor of key
.
Delete the minimum/maximum. The most difficult BST
operation to implement is the delete()
method that removes a key-value
pair from the symbol table. As a warmup, consider deleteMin()
(remove the key-value
pair with the smallest key). As with put()
we write a recursive method that takes a link to a Node
as argument and returns a link to a Node
, so that we can reflect changes to the tree by assigning the result to the link used as argument. For deleteMin()
we go left until finding a Node
that has a null
left link and then replace the link to that node by its right link (simply by returning the right link in the recursive method). The deleted node, with no link now pointing to it, is available for garbage collection. Our standard recursive setup accomplishes, after the deletion, the task of setting the appropriate link in the parent and updating the counts in all nodes in the path to the root. The symmetric method works for deleteMax()
.
In summary, BSTs
are not difficult to implement and can provide fast search
and insert
for practical applications of all kinds if the key
insertions are well-approximated by the random-key
model.
Hash Tables
Search algorithms that use hashing consist of two separate parts. The first part is to compute a hash function that transforms the search key into an array index. Ideally, different keys would map to different indices. Thus, the second part of a hashing search is a collision-resolution process that deals with this situation. After describing ways to compute hash functions, we shall consider two different approaches to collision resolution: separate chaining and linear probing.
Hashing is a classic example of a time-space tradeoff. If there were no memory limitation, then we could do any search with only one memory access by simply using the key as an index in a (potentially huge) array. This ideal often cannot be achieved, however, because the amount of memory required is prohibitive when the number of possible key values is huge. On the other hand, if there were no time limitation, then we can get by with only a minimum amount of memory by using sequential search in an unordered array. Hashing provides a way to use a reasonable amount of both memory and time to strike a balance between these two extremes.
With hashing, you can implement search and insert for symbol tables that require constant (amortized) time per operation in typical applications, making it the method of choice for implementing basic symbol tables in many situations.
In summary, we have three primary requirements in implementing a good hash function for a given data type:
- It should be consistent—equal keys must produce the same hash value.
- It should be efficient to compute.
- It should uniformly distribute the keys.
Summary Searching
It is clear from the table that, for typical applications, your decision comes down to a choice between hash tables and binary search trees.
The advantages of hashing over BST implementations are that the code is simpler and search times are optimal (constant), if the keys are of a standard type or are sufficiently simple that we can be confident of developing an efficient hash function for them that (approximately) satisfies the uniform hashing assumption. The advantages of BSTs over hashing are that they are based on a simpler abstract interface (no hash function need be designed); red-black BSTs can provide guaranteed worst-case performance; and they support a wider range of operations (such as rank, select, sort, and range search). As a rule of thumb, most programmers will use hashing except when one or more of these factors is important, when red-black BSTs are called for.
Java’s java.util.TreeMap
and java.util.HashMap
libraries are symbol-table implementations based on red-black BSTs
and hashing
with separate chaining
respectively.
Graphs
Maps. A person who is planning a trip may need to answer questions such as “What is the shortest route from Providence to Princeton?”
Web content. When we browse the web, we encounter pages that contain references (links) to other pages and we move from page to page by clicking on the links.
Schedules. A manufacturing process requires a variety of jobs to be performed, under a set of constraints that specify that certain tasks cannot be started until certain other tasks have been completed. How do we schedule the tasks such that we both respect the given constraints and complete the whole process in the least amount of time?
Software. A compiler builds graphs to represent relationships among modules in a large software system. The items are the various classes or modules that comprise the system; connections are associated either with the possibility that a method in one class might call another (static analysis) or with actual calls while the system is in operation (dynamic analysis). We need to analyze the
To organize the presentation, we progress through the four most important types of graph models: undirected graphs (with simple connections), digraphs (where the direction of each connection is significant), edge-weighted graphs (where each connection has an associated weight), and edge-weighted digraphs (where each connection has both a direction and a weight).
Undirected Graphs
A graph
is a set of vertices
and a collection of edges
that each connect a pair of vertices.
When there is an edge connecting two vertices, we say that the vertices are adjacent
to one another and that the edge is incident
to both vertices. The degree
of a vertex is the number of edges incident to it. A subgraph
is a subset of a graph’s edges (and associated vertices) that constitutes a graph. Many computational tasks involve identifying subgraphs of various types. Of particular interest are edges that take us through a sequence
of vertices in a graph.
A path
in a graph is a sequence of vertices connected by edges. A simple path
is one with no repeated vertices. A cycle
is a path with at least one edge whose first and last vertices are the same. A simple cycle
is a cycle with no repeated edges or vertices (except the requisite repetition of the first and last vertices). The length
of a path or a cycle is its number of edges.
A graph is connected
if there is a path from every vertex to every other vertex in the graph. A graph that is not connected
consists of a set of connected components
, which are maximal connected subgraphs.
A tree
is an acyclic
connected graph. A disjoint set of trees is called a forest
. A spanning tree
of a connected graph is a subgraph that contains all of that graph’s vertices and is a single tree. A spanning forest
of a graph is the union of spanning trees of its connected components.
The three data structures that immediately suggest themselves for representing graphs:
- An
adjacency matrix
, where we maintain a V-by-V boolean array, with the entry in rowv
and columnw
defined to betrue
if there is an edge adjacent to both vertexv
and vertexw
in the graph, and to befalse
otherwise. This representation fails on the first count— graphs with millions of vertices are common and the space cost for the V^2 boolean values needed is prohibitive. - An
array of edges
, using an Edge class with two instance variables of type int. This direct representation is simple, but it fails on the second count implementingadj()
would involve examining all the edges in the graph. - An
array of adjacency lists
, where we maintain a vertex-indexed array of lists of the vertices adjacent to each vertex. This data structure satisfies both requirements for typical applications and is the one that we will use throughout this chapter.
Adjacency-lists data structure. The standard graph representation for graphs that are not dense is called the adjacency-lists data structure
, where we keep track of all the vertices adjacent to each vertex on a linked list that is associated with that vertex. We maintain an array of lists so that, given a vertex, we can immediately access its list. This Graph
implementation achieves the following performance characteristics:
- Space usage proportional to
V
+E
- Constant time to add an edge
- Time proportional to the degree of
v
to iterate through vertices adjacent tov
(constant time per adjacent vertex processed)
Depth-first search to find paths in a graph
public class DepthFirstPaths {
private boolean[] marked; // Has dfs() been called for this vertex?
private int[] edgeTo; // last vertex on known path to this vertex
private final int s; // source
public DepthFirstPaths(Graph G, int s) {
marked = new boolean[G.V()];
edgeTo = new int[G.V()];
this.s = s;
dfs(G, s);
}
private void dfs(Graph G, int v) {
marked[v] = true;
for (int w : G.adj(v)) {
if (!marked[w]) {
edgeTo[w] = v;
dfs(G, w);
}
}
}
public boolean hasPathTo(int v) {
return marked[v];
}
public Iterable<Integer> pathTo(int v) {
if (!hasPathTo(v)) {
return null;
}
Stack<Integer> path = new Stack<Integer>();
for (int x = v; x != s; x = edgeTo[x]) {
path.push(x);
}
path.push(s);
return path;
}
}
Breadth-first search to find paths in a graph
public class BreadthFirstPaths {
private boolean[] marked; // Is a shortest path to this vertex known?
private int[] edgeTo; // last vertex on known path to this vertex
private final int s; // source
public BreadthFirstPaths(Graph G, int s) {
marked = new boolean[G.V()];
edgeTo = new int[G.V()];
this.s = s;
bfs(G, s);
}
private void bfs(Graph G, int s) {
Queue<Integer> queue = new Queue<Integer>();
marked[s] = true; // Mark the source
queue.enqueue(s); // and put it on the queue.
while (!q.isEmpty()) {
int v = queue.dequeue(); // Remove next vertex from the queue.
for (int w : G.adj(v))
if (!marked[w]) // For every unmarked adjacent vertex,
{
edgeTo[w] = v; // save last edge on a shortest path,
marked[w] = true; // mark it because path is known,
queue.enqueue(w); // and add it to the queue.
}
}
}
public boolean hasPathTo(int v) {
return marked[v];
}
public Iterable<Integer> pathTo(int v)
}
Directed Graphs
In directed graphs
, edges are one-way: the pair of vertices that defines each edge is an ordered pair that specifies a one-way adjacency. Many applications (for example, graphs that represent the web, scheduling constraints, or telephone calls) are naturally expressed in terms of directed graphs. The one-way restriction is natural, easy to enforce in our implementations, and seems innocuous; but it implies added combinatorial structure that has profound implications for our algorithms and makes working with directed graphs quite different from working with undirected graphs.
Minimum Spanning Trees
An edge-weighted graph
is a graph model where we associate weights or costs with each edge. Such graphs are natural models for many applications. In an airline map where edges represent flight routes, these weights might represent distances or fares.
Recall that a spanning tree
of a graph is a connected subgraph with no cycles that includes all the vertices. A minimum spanning tree
(MST ) of an edge-weighted graph is a spanning tree whose weight(the sum of the weights of its edges) is no larger than the weight of any other spanning tree.
Shortest Path
A shortest path
from vertex s
to vertex t
in an edge-weighted digraph is a directed path from s
to t
with the property that no other such path has a lower weight.
Strings
Characters. A String
is a sequence of characters. Characters are of type char
and can have one of 2^16 possible values.
Immutability. String objects are immutable, so that we can use them in assignment statements and as arguments and return values from methods without having to worry about their values changing.
Indexing. The operation that we perform most often is extract a specified character from a string that the charAt()
method in Java’s String class provides. We expect charAt()
to complete its work in constant time, as if the string were stored in a char[]
array.
Length. In Java, the find the length of a string operation is implemented in the length()
method in String. Again, we expect length()
to complete its work in constant time.
Substring. Java’s substring()
method implements the extract a specified substring operation. Again, we expect a constant-time implementation of this method, as in Java’s standard implementation.
Concatenation. In Java, the create a new string formed by appending one string to another operation is a built-in operation (using the +
operator) that takes time proportional to the length of the result. For example, we avoid forming a string by appending one character at a time because that is a quadratic process in Java. (Java has a StringBuilder
class for that use.)
String Sorts
LSD string sort. MSD string sort.
Tries
Tries it is a search tree known as a trie
, a data structure built from the characters of the string keys that allows us to use the characters of the search key to guide the search.
As with search trees, tries are data structures composed of nodes that contain links that are either null or references to other nodes. Each node is pointed to by just one other node, which is called its parent (except for one node, the root, which has no nodes pointing to it), and each node has R links, where R is the alphabet size. Often, tries have a substantial number of null links, so when we draw a trie, we typically omit null links. Although links point to nodes, we can view each link as pointing to a trie, the trie whose root is the referenced node. Each link corresponds to a character value—since each link points to exactly one node, we label each node with the character value corresponding to the link that points to it (except for the root, which has no link pointing to it). Each node also has a corresponding value, which may be null or the value associated with one of the string keys in the symbol table. Specifically, we store the value associated with each key in the node corresponding to its last character.
Substring Search
Brute-force substring search
An obvious method for substring search is to check, for each possible position in the text at which the pattern could match, whether it does in fact match.
public static int search(String pat, String txt) {
int M = pat.length();
int N = txt.length();
for (int i = 0; i <= N - M; i++) {
int j;
for (j = 0; j < M; j++)
if (txt.charAt(i + j) != pat.charAt(j))
break;
if (j == M)
return i; // found
}
return N; // not found
}
Knuth-Morris-Pratt substring search
The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: whenever we detect a mismatch, we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We can take advantage of this information to avoid backing up the text pointer over all those known characters.
Rabin-Karp fingerprint search
The method developed by M. O. Rabin and R. A. Karp is a completely different approach to substring search that is based on hashing. We compute a hash function for the pattern and then look for a match by using the same hash function for each possible M
-character substring of the text. If we find a text substring with the same hash value as the pattern, we can check for a match. This process is equivalent to storing the pattern in a hash table, then doing a search for each substring of the text, but we do not need to reserve the memory for the hash table because it would have just one entry. A straightforward implementation based on this description would be much slower than a brute-force search (since computing a hash function that involves every character is likely to be much more expensive than just comparing characters), but Rabin and Karp showed that it is easy to compute hash functions for M
-character substrings in constant
time (after some preprocessing), which leads to a linear
-time substring search in practical situations.