CS3851 -- Lab 1: Searching

Outcomes Addressed

  • apply asymptotic time complexity analysis to choose among competing algorithms.

Individual Assignment

Students should work individual on this assignment.

Overview

The purpose of this lab assignment is to review concepts from previous courses. This is accomplished through implementing two substring search algorithms and comparing their performance.

Time Logging Tip

If you found Process Dashboard cumbersome, you are encouraged to take a look at SlimTimer.com. SlimTimer is an easy to use web application that allows you to "clock-in" and "clock-out" of various tasks. You may find it helpful in keeping track of the time you spend on this course.

Problem Statement

You should write a Java program that will search through text files for a given sequence of words and display all of the line numbers (of the file) on which the sequence of words was found. You will implement two substring search algorithms (described below) and compare the run times of these two algorithms and the String.indexOf() method on a variety of test cases.

You should test your program on the files in searches.zip (taken from Project Gutenburg). Upon unzipping you will find a file called searches. This file contains a set of substrings (each on a separate line) that should be searched for in the other files that are part of the archive. You should compare all of the searches on all of these files using your implementation of the two algorithms described below and the String.indexOf() method.

One required feature that is not addressed in the algorithm pseudocode below is that your program should assume that all whitespace is created equal. This will allow your program to find My clothes are on fire in all three of the following examples:

Example 1                  Example 2                  Example 3
=========                  =========                  =========
Bill exclaimed,            Bill exclaimed, "My        Bill exclaimed, "My
"My clothes are on fire"   clothes are on fire"       clothes are
as he stopped, dropped,    as he stopped, dropped,
and rolled.                and rolled.                on    fire" as he
                                                      stopped, dropped, ...

Expected output:

Example 1                  Example 2                  Example 3
=========                  =========                  =========
2                          1-2                        1-4

Algorithms

Both algorithms described below have the same parameters passed in and returned. file is the file (with n characters in it) and substr is the substring (with m characters in it) to be found in the file. The value returned is either -1 or the index of file where substr begins. Please note that pseudocode is given to help explain the algorithm. Your code may not look like the pseudocode. For example, your algorithms may not take in a string representing the whole file. Also, your program should display the line number(s) on which the substring was found in the file.

Algorithm 1

The first algorithm is a simple brute-force technique. Here is pseudocode for the first algorithm:

Algorithm1(file, substr) {
  for i=0 to n-m step 1 do {
    j=0
    while j<m and file[i+j]==substr[j] do {
      j = j+1
    }
    if j==m then {
      return i
    }
  }
  return -1
}

Algorithm 2

The second algorithm attempts to improve the search time by making use of the following techniques: 1) when testing a potential placement for a match of substr in file, start comparing at the end of substr and move towards the front. 2) when a test for a match fails (say at file[i]) remember the character c = file[i] which caused the failure. If c is not contained in substr we can avoid checking any potential placement of substr that overlaps with file[i]. Otherwise, we can shift substr until an occurrence of c in substr is aligned with file[i]. Here is pseudocode for the second algorithm:

Algorithm2(file, substr) {
  i=m-1
  j=m-1
  do {
    if substr[j]==file[i] then {
      if j==0 then {
        return i
      } else {
        i=i-1
        j=j-1
      }
    } else {
      i=i + m - min(j, 1+last(file[i]))
      j=m-1
    }
  } until i>n-1
  return -1
}

Where last(c) is the index of the right most occurrence of c in substr (unless c is not in substr in which case, last(c) we set it to -1).

Enhancements (optional)

Students interested in exploring this topic further may wish to consider this algorithm as well.

Other students may be interested in implementing a command line interface so that the program functions much like the UNIX grep command. Be sure to implement the requirements of the lab (returning the line number on which matches are found as a command line option).

Intermediate Deliverable (due one hour prior to week 2 lab)

You must be prepared to demonstrate benchmarking the String.indexOf() method and your implementation of Algorithm 1. In addition, log all the time spent on this assignment (for week 1) in the FAST system.

Lab report (due 11:00pm, the day prior to week 3 lab)

Here is a template file to use as a starting point for this report.

The lab report should be self-contained. That is, it should be possible for someone to understand what you did and why without seeing anything other than your report. Your report should be in your own words and include:

  • Educational Objective
  • Problem Statement
  • Pattern focus: what patterns did you consider using when designing your solution, what did you use/not use, why?
  • Discussion: convince me that you were thinking when you did this assignment.
  • Conclusions (what you learned, suggestions of how the lab assignment could have been better, things you would have done differently, etc.)
  • Program results
  • Documented source code

As with any report you submit, correct spelling and grammar are required. In addition, your report should be submitted electronically following the Electronic Submission Guidelines. (You may wish to consult the sample report before submitting your report.) Be sure to keep copies of all your files, in case something gets lost.

If you have any questions, consult the instructor.

Acknowledgment

This laboratory was developed by Dr. Chris Taylor.

  • © 2001-2009 Dr. Christopher C. Taylor •
  • Office: L-343 •
  • Phone: 277-7339 •
  • npǝ˙ǝosɯ@ɹolʎɐʇ