.po +0.75i .EQ delim $$ .EN .so /home/u8/skiena/local/macros .SZ 10 .nf .ce 4 .ft B CSE 214 - Data Structures Fall 1997 Program #4 .sp Due: Monday, November 17, 1997 .ft R .sp .pp This semester we have seen a variety of different data structures for retrieval by key -- sequential search, sorting plus binary search, and hashing. While the analysis of algorithms provides considerable insight into their behavior, empirical studies are important to choose between algorithms of the same asymptotic complexity or select the best algorithm for small problems. .pp For this assignment, you are to implement several of the algorithms we have discussed in class and experiment with them to determine their performance. The algorithms you will implement will be: .np Quicksort-insertion sort .np Heapsort .np Binary search .\".np .\"Hashing via open addressing .np Hashing via chaining .lp .pp To provide test data, the file ~cse214/program4/words contains the entire Unix dictionary, arranged in a random order, comprising almost 25,000 words. The words are given, one per line, and each word contains at most 15 characters. It is best to represent each word as a TEXT variable. A suitable hash function for a word is: .EQ sum from {i=1} to {m} ( ord(word[3i-2]) + ord(word[3i-1]) + ord(word[3i]) ) times 26 sup i-1 ~~(mod~ tablesize) .EN where $m$ is the number of characters we want to hash on. It is a constant, for example 1 to 5. .pp The results of your investigations will be written up in a three page paper. Your report will answer the following questions: .np What is the optimal switch point where a hybrid Quicksort-Insertion sort becomes fastest? This issue is discussed within Sedgewick's chapter on Quicksort. .np How does the run time of Quicksort compare to Heapsort? Which is faster on unsorted data? Sorted data? .np How does the value of $m$ in the hash function affect performance? If $m=1$, it is hashing on the three characters only. If $m=5$, it would hash on all characters, including padded blanks. Maybe an intermediate value is best? What happens when you hash only on the first character? .np How does the size of the hash table effect the running time? Try hash tables of 1000, 5000, 20000, 25000, and 50000 buckets. Does making the hash table a prime number make a difference? .\".np .\"How does the load factor affect the efficiency of hashing? .\"Consider load factors of 0.5, 0.6, 0.7, 0.8, and 0.9. .\"Try chaining, open addressing with linear probing, and open addressing .\"with quadratic probing for each of these load factors. .np How does the best hashing scheme compare with binary search and the best sorting algorithm, for both preprocessing (ie. building the hash table or sorting) and query times? .pp To measure the performance, your program can either explicitly count the number of element comparisons made and/or be broken into units which the unix program \fItime\fP can be applied to. To compute the average search cost, your program should search for each word in the data file, count the total time or comparisons it takes, and divide by the number of words. The best way to do this probably is to \fIclose\fP the input file and read it again. .pp In addition the paper analyzing your results, you must handin the programs and a printout of a short run to show what your programs do. .sp .sp .ce \fIChallenge Problems\fP .pp Each assignment will contain one or more extra credit problems for students who want a challenge, or to pursue the subject in greater depth. NOTE THAT THESE PROBLEMS ARE NOT REQUIRED. For this assignment, you are encouraged to conduct experiments of your own choosing, but here are some ideas. .np What effect does the quality of the hash function have? Can you find a better hash function? .np Implement hashing by open addressing. How does it compare to chaining for similar-sized hash tables? (remember to factor in the costs of pointers in measuring similar size) .\"For open addressing, what effect does the primality of the hash table size .\"have for quadratic probing? .lp .sp .ce \fIRules of the Game\fP .np This program is not conceptually difficult, in my opinion, since the algorithms are completely described and at least partially implemented in the text. However, it contains several different parts which require intimately understanding distinct algorithms, so it may require a significant amount of time to complete. .np You may copy to your heart's content from the texts, but not from other students. Copying functions from the book will be a frustrating experience if you do not understand the algorithms - trust me. .np Since the algorithms required for this project are all fair game for the midterm, use this programming assignment as an excuse to study the topics you do not understand well. .np Like a simulation, it is important to confirm that your program is working correctly before trusting the results. Thus it is important to build in a 'debug' mode so you (and we) can verify that your program is working correctly. You will want to debug your program on a file with only 50 words, which can be obtained with the command \fIhead -50 <~cse214/program4/words >testdata\fP. .np There is absolutely no reason to copy the full words file into your directory, since it will always be there to be read. %unless you want to prevent other students %from working. .np Do not run your programs on the entire data file until you are 100% sure your program works correctly, since it is likely to run slowly. If your program cannot complete an experiment in two minutes or less, there is no shame in making a smaller data file using \fIhead\fP. Just describe what you did in your report. .np Your report should honestly describe your results. There are no right or wrong results, since everyone's programs will differ. However, if any of your timings are very surprising, figure out why - you may have a bug in your implementation. .np Your report must be typed and printed using a text formatting program. You may use whatever program you have access to, including Microsoft Word or HTML (don't post the result). .\"You should view the written report as an opportunity to learn about the .\"text processing programs available on UNIX, which you can use for the .\"rest of your Stony Brook careers for all your papers. If you prefer, latex, my favorite UNIX text formatting program is invoked \fIlatex filename\fP, and a sample latex file can be obtained from \fI~cse214/program4/sample.tex\fP. Further documentation can be found in the book 'Latex: a document preparation system' by Leslie Lamport, on reserve in the CS Library. \fIYou do not have to use latex if you have access to some other text processing system.\fP .np Don't get too fancy with formatting. You can waste an incredible amount of time making documents pretty - don't for the purposes of this class. This is \fIespecially\fP true if you use latex. .np One program you should definitely know about is the spelling checker \fIispell\fP, which you can learn about by using \fIman\fP. Spelling checkers are a painless and useful tool. To get you in the habit of using one, the TA's have been instructed to deduct 20% from your grade for any obvious spelling errors. .np Unix time gives somewhat unpredicatable results. If you can find a better way to do it, go ahead. .np Don't forget to write out and sign the pledge - \*(lqOn my honor as a student I have neither given nor received aid on this program\*)rq. Your paper will not be graded without it.