Searching Algorithms in Python
Searching is the technique of selecting specific data from a collection of data based on some condition. We are familiar with this concept from our experience in performing web searches to locate pages containing certain words or phrases or when looking up a phone number in the telephone book. In this lesson, we will restrict the term searching to refer to the process of finding a particular item in a collection.
We will look at two algorithms that perform this task:
- linear search, which simply checks the items in sequence until the desired item is found
- binary search, which requires a sorted sequence of list, and checks for the value in the middle of the list, repeatedly discarding the half of the list which contains values which are definitely either all larger or all smaller than the desired value
There are numerous other searching mechanisms. They depend on the construction of more complex data structures to facilitate repeated searching. Examples of such structures are hash tables (such as Python’s dictionaries) and prefix trees. Inexact searches that find elements similar to the one being searched for are also an important topic.
The Linear Search
The simplest solution to the sequence search problem is the sequential or linear search algorithm. This technique iterates over the sequence, one item at a time, until the specific item is found or all items have been examined. In Python, a target item can be found in a sequence using the in operator:
if key in theArray : print( "The key is in the array." ) else : print( "The key is not in the array.")
The use of the in operator makes our code simple and easy to read but it hides the inner workings. Underneath, the in operator is implemented as a linear search. Consider the unsorted 1-D array of integer values shown in Figure 5.1(a). To determine if value 31 is in the array, the search begins with the value in the first element. Since the first element does not contain the target value, the next element in sequential order is compared to value 31. This process is repeated until the item is found in the sixth position. What if the item is not in the array? For example, suppose we want to search for value 8 in the sample array. The search begins at the first entry as before, but this time every item in the array is compared to the target value. It cannot be determined that the value is not in the sequence until the entire array has been traversed.
SEARCHING A SORTED SEQUENCE
A linear search can also be performed on a sorted sequence, which is a sequence containing values in a specific order.
def sortedLinearSearch( theValues, item ) : n = len( theValues ) for i in range( n ) : # If the target is found in the ith element, return True if theValues[i] == item : return True # If target is larger than the ith element, it's not in the sequence. elif theValues[i] > item : return False return False # The item is not in the sequence.
FINDING THE SMALLEST VALUE
Instead of searching for a specific value in an unsorted sequence, suppose we wanted to search for the smallest value, which is equivalent to applying Python's min() function to the sequence. A linear search is performed as before, but this time we must keep track of the smallest value found for each iteration through the loop as shown in the sample code:
def findSmallest( theValues ): n = len( theValues ) # Assume the first item is the smallest value. smallest = theValues # Determine if any other item in the sequence is smaller. for i in range( 1, n ) : if theList[i] < smallest : smallest = theValues[i] return smallest # Return the smallest found.
The Binary Search
The linear search algorithm for a sorted sequence produced a slight improvement over the linear search with an unsorted sequence, but both have a linear time-complexity in the worst case. To improve the search time for a sorted sequence, we can modify the search technique itself.
def binarySearch( theValues, target ) : # Start with the entire sequence of elements. low = 0 high = len(theValues) - 1 # Repeatedly subdivide the sequence in half until the target is found. while low <= high : # Find the midpoint of the sequence. mid = (high + low) // 2 # Does the midpoint contain the target? if theValues[mid] == target : return True # Or does the target precede the midpoint? elif target < theValues[mid] : high = mid - 1 # Or does it follow the midpoint? else : low = mid + 1 # If the sequence cannot be subdivided further, we're done. return False
After computing the midpoint, the corresponding element in that position is examined. If the midpoint contains the target, we immediately return True. Otherwise, we determine if the target is less than the item at the midpoint or greater. If it is less, we adjust the high marker to be one less than the mid-point and if it is greater, we adjust the low marker to be one greater than the midpoint. In the next iteration of the loop, the only portion of the sequence considered are those elements between the low and high markers, as adjusted. This process is repeated until the item is found or the low marker becomes greater than the high marker. This condition occurs when there are no items left to be processed, indicating the target is not in the sorted sequence.
RUN TIME ANALYSIS
To evaluate the efficiency of the binary search algorithm, assume the sorted sequence contains n items. We need to determine the maximum number of times the while loop is executed. The worst-case occurs when the target value is not in the sequence, the same as for the linear search.
The difference with the binary search is that not every item in the sequence has to be examined before determining that the target is not in the sequence, even in the worst case. Since the sequence is sorted, each iteration of the loop can eliminate from consideration half of the remaining values. When the input size is repeatedly reduced by half during each iteration of a loop, there will be log n iterations in the worst case. Thus, the binary search algorithm has a worst-case time-complexity of O(log n ), which is more efficient than the linear search.