[This work will be presented as a poster at GECCO 2002]
We began the fall 2001 semester with an introduction to genetic algorithms and then dove right into the first project: searching for optimal increment sequences for "shellsort." Shellsort, or "diminishing increment" sort, uses a sequence 1 = h(0) < h(1) < ... < h(k-1) of increments to sort selected subarrays of an array a. It looks like:
for i := k-1 downto 0 do
for j := 0 to h(i)-1 do
sort(a[j], a[j+h(i)], a[j+2h(i)], ...)
The subroutine "sort" is usually an insertion sort.
The literature on shellsort is vast, but Sedgewick has written a good survey of the topic (Sedgewick 1996). The problem is of interest for a number of reasons. Sorting is a very practical problem and shellsort is a particularly simple algorithm to implement, so an efficient shellsort algorithm would be of great value. However, a complete analysis of shellsort in the average case is not known. The question we are concerned with in our project is determining increment sequences that are good in the average case, using empirical data (actual elapsed time) rather than analytical tools as our basis for evaluation.
We are aware of only one other attempt to use genetic algorithms for determining shellsort increment sequences. Simpson and Yachavaram used a genetic algorithm to construct a good "tail sequence" consisting of seven increments, then tried to extend this unique tail sequence to longer sequences (Simpson and Yachavaram, 1999).
Our approach is quite different. Using the observation that most of the "good" sequences seem to increase in a roughly geometrical way, we generate an element of the initial population as follows:
The genetic algorithm, implemented "from scratch" by the student members of the team, evolves this initial population using what is called a steady-state genetic algorithm in which one element of the population is replaced in each successive generation. This implementation was chosen for both its simplicity and its ability to be rapidly developed. (A traditional generational GA would have taken much more time to implement and was deemed impractical for the time available.) Replacement is done by first performing a crossover in which the first part of one increment sequence is joined to the last part of another sequence, preserving the monotonicity of the increment sequence. Values in the result are perturbed to remove multiples of two and three. Mutation (performed after each crossover) perturbs a randomly chosen increment (other than the first one, which must equal 1.)
We found a sequence that outperformed that of Simpson and Yachavaram:
{1, 8, 23, 131, 149, 155, 877, 2585, 5267, 13229, 72985, 91433
The table below shows a representative set of data on several files of sizes 500000 and 1000000. Each file name is n.i, where n is the size of the array in the file and i is just an identifying number. We show the minimum, maximum, and median running times, out of five trials, for each of the listed files. The column headed "S/Y" shows the results of Simpson and Yachavaram's sequence on the same file.
Best/Median/Worst Sorting Times (ms) FILE OURS S/Y 500000.0 721/724/737 749/750/752 500000.1 722/724/726 749/750/751 500000.2 721/724/725 749/751/752 1000000.0 1585/1606/1610 1677/1678/1678 1000000.1 1602/1603/1614 1678/1680/1681 1000000.2 1610/1613/1616 1678/1680/1682
We did not examine convergence properties of the algorithm, as our primary goal was the construction of a good shellsort sequence. However, a generational, rather then steady state approach, would probably lead to better performance. Many of the parameters need to be optimized.
It would be interesting to see how our sequence compares to Sedgewick's and Simpson/Yachavaram's using the comparisons metric rather than running time.
Finally, the "conditioning" of the sequences needs to be further investigated to see what quantitative effect (if any) the removal of extra multiples of 2 and 3 had on the resulting sequences.
While we did succeed in finding a shellsort sequence with apparently good average case behavior for moderately large files, we were disappointed at its behavior on smaller files. This suggests that shellsort sequences are very sensitive to changes in a single increment (and suggests that a slight change in the S/Y sequence might result in an improved sequence that outperforms ours on all file sizes).
Simpson, Richard, and Shashidhar Yachavaram. 1999. Faster shellsort sequences: a genetic algorithm application. Proceedings of the International Society for Computers and Their Applications(ISCA), April 7-9, 1999. Presented at the conference, Cancun Mexico. First author's home page: http://cs.mwsu.edu/%7Esimpson/.