JOTA (to appear)
LIDS-P-2274
May 1994 (revised October 1994)

PARALLEL ASYNCHRONOUS LABEL CORRECTING METHODS FOR SHORTEST PATHS 1

Dimitri P. Bertsekas 2, Francesca Guerriero 3, and Roberto Musmanno 3

Abstract. In this paper we develop parallel asynchronous implementations of some known and some new label correcting methods for finding a shortest path from a single origin to all the other nodes of a directed graph. We compare these implementations on a shared memory multiprocessor, the Alliant FX/80, using several types of randomly generated problems. Excellent (sometimes superlinear) speedup is achieved with some of the methods, and it is found that the asynchronous versions of these methods are substantially faster than their synchronous counterparts.

Keywords. Shortest path problem, parallel asynchronous algorithms, shared memory multiprocessor, label correcting method.

1 Research supported by the National Science Foundation under Grants 9108058-CCR, 9221293-INT, and 9300494-DMI.
2 Laboratory for Information and Decision Systems, M.I.T., Cambridge, Mass. 02139, U.S.A.
3 Dipartimento di Elettronica, Informatica e Sistemistica, Università della Calabria, 87036 Rende, Italy.

1. INTRODUCTION

In this paper we consider the problem of finding a path of minimum length from an origin node to each of the other nodes in a directed graph (N, A), where N is the set of nodes and A is the set of arcs. The numbers of nodes and arcs are denoted by n and m, respectively. For each arc (i,j) ∈ A we are given a scalar length aij. For convenience, we assume that there is at most one arc from a node i to a node j, so that we can unambiguously refer to arc (i,j). A path starting at a node i1 and ending at a node ik consists of a sequence of forward arcs of the form (i1,i2), (i2,i3), ..., (ik-1,ik). The length of such a path is defined to be the sum of the lengths of its arcs,

    a_{i1 i2} + a_{i2 i3} + ... + a_{ik-1 ik}.

For each node j, we want to find a path of minimum length that starts at node 1 and ends at j. Throughout this paper we assume that all arc lengths are nonnegative and that there exists at least one path from node 1 to each other node.

The shortest path problem is very common in practice, either by itself or as a subroutine in algorithms for more complex problems. Its fast solution is thus of great practical interest. In this paper, we focus attention on the class of label correcting methods. A recent computational study by Gallo and Pallottino [1] has shown that for single origin-all destinations shortest path problems, the most efficient label correcting methods are faster than the most efficient label setting methods in a serial computational environment, particularly for sparse problems, that is, for problems involving graphs with a relatively small number of arcs. This conclusion agrees with our own experience. The results of this paper strongly suggest that the advantage of label correcting methods for sparse all-destinations problems carries over to a shared memory parallel computation setting.

The methods of this paper can be adapted to solve single origin-few destinations problems. For such problems, however, label correcting methods have been outperformed by label setting (Dijkstra) methods and also by auction algorithms, as reported in [3]-[5].
Parallel implementations of these methods for single-origin single-destination problems have been given in [5] and [6], and it is quite likely that for many problems of this type, the two-sided Dijkstra and the two-sided auction methods of [5] and [6], respectively, outperform the methods of the present paper in both a serial and a parallel computing environment.

The prototype label correcting algorithm, as given by Gallo and Pallottino [7], maintains a vector (d1, d2, ..., dn) of labels and a candidate list V of nodes. Initially, we have

    d1 = 0,   di = ∞ for i ≠ 1,   and   V = {1}.

The algorithm terminates when V is empty, and upon termination, each label di is the shortest distance to node i. Assuming V is nonempty, a typical iteration is as follows:

    Remove from V a node i.
    For each arc (i,j) ∈ A, if dj > di + aij, set
        dj := di + aij
    and add j to V if j does not already belong to V.

    Fig. 1: Typical iteration of the generic label correcting algorithm
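To make the generic iteration concrete, the following minimal Python sketch (ours; the codes of this paper are written in Fortran) implements the method of Fig. 1 with V kept as a FIFO queue, which is the Bellman-Ford removal rule discussed below.

    from collections import deque

    def label_correcting(n, adj, origin=1):
        # adj[i] is a list of (j, a_ij) pairs, one per arc (i,j); nodes are 1..n.
        INF = float('inf')
        d = [INF] * (n + 1)          # labels d_i
        d[origin] = 0
        V = deque([origin])          # candidate list, here a FIFO queue
        in_V = [False] * (n + 1)
        in_V[origin] = True
        while V:                     # iterate until V is empty
            i = V.popleft()          # remove from V a node i
            in_V[i] = False
            for j, a_ij in adj[i]:
                if d[j] > d[i] + a_ij:
                    d[j] = d[i] + a_ij
                    if not in_V[j]:  # add j to V if not already there
                        V.append(j)
                        in_V[j] = True
        return d

The methods discussed next differ only in how the candidate list is maintained and in where nodes enter and exit it.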
There are several different methods for selecting at each iteration the node to be removed from the candidate list V. If the node exiting V is the node with the minimum label, Dijkstra's method is obtained. In this case, each node will enter and exit V exactly once. Label correcting methods avoid the overhead associated with finding the minimum label node at the expense of multiple entrances of nodes into V.

The simplest label correcting method, the Bellman-Ford method [8], maintains V in a FIFO queue; nodes are removed from the top of the queue and are inserted at the bottom. More sophisticated label correcting methods maintain V in one or in two queues and use more complex removal and insertion strategies. The objective is to reduce the number of node reentries into V. These methods are significantly faster than the Bellman-Ford method, and will be discussed in the next two sections with an emphasis on a general principle enunciated in [2] for the case where the arc lengths are nonnegative. According to this principle, the number of node reentries is reduced if nodes with relatively small labels are removed from V. Dijkstra's method, the threshold algorithm of [9], and the SLF (Small Label First) method of [2] conform to this principle. A new method, the LLL (Large Label Last) method, which also conforms to this principle, will be presented in the next section. Other methods can be obtained by combinations of threshold, SLF, LLL, and also the D'Esopo-Pape method of [10]. For a recent textbook discussion and analysis of other shortest path methods, we refer the reader to [11].

Label correcting methods can be parallelized in straightforward fashion. Furthermore, they admit an asynchronous implementation, as first shown in [12] in the broader context of dynamic programming. In such an implementation, multiple nodes of the candidate list can be asynchronously and independently chosen for iteration by different processors, and the associated calculations may be done at the various processors with label information that is potentially out-of-date because of intermediate label updating operations by other processors; see also [13], p. 451. An extensive reference on parallel asynchronous algorithms, including shortest path methods, is [14], particularly Ch. 6. There is considerable computational evidence at present that asynchronous algorithms, when valid, can be substantially faster than their synchronous counterparts, primarily because they avoid the penalty associated with synchronizing the iterations at different processors. We note that with the exception of the auction algorithms of [6], all earlier implementations of parallel shortest path algorithms that we are aware of ([5], [15], [16]) are synchronous.

A major aim of this paper is to develop parallel synchronous and asynchronous implementations of a variety of label correcting methods, and to evaluate their speedup over their serial versions on a shared memory machine, the Alliant FX/80 with 8 processors. This is done in Sections 3 and 4. Our major findings are that (a) with proper implementation, excellent (close to linear) speedup can be obtained with some but not all label correcting methods, (b) asynchronous implementations are considerably faster than their synchronous counterparts, and (c) the threshold method, which in combination with the SLF and the LLL methods is the fastest serial method in our experiments, does not lend itself to substantial speedup. As a result, the pure SLF and SLF-LLL methods are the fastest in a parallel setting.

2. LABEL CORRECTING METHODS BASED ON THE SMALL LABEL PRINCIPLE

In this section we describe three methods motivated by a general principle given in [2] regarding the node selection policy of a label correcting method. According to this principle, for problems with nonnegative arc lengths, the number of iterations of the method is strongly correlated with the average rank of the node removed from V, where nodes are ranked in terms of the size of their label (nodes with small labels have small rank). Thus one should make an effort to select nodes with relatively small labels. This was verified by extensive testing reported in [2] with two methods based on this principle, the threshold and SLF methods, and their combinations. We describe these two methods and then propose a third, new method that can also be combined with the first two.

In the threshold algorithm of [9], the candidate list V is partitioned into two disjoint queues Q1 and Q2 on the basis of a threshold parameter s. At each iteration, the node removed from V is the top node of Q1, while a node entering V is added at the bottom of Q2 or Q1, depending on whether its label is greater than s, or smaller than or equal to s, respectively. In this way, the queue Q1 contains only nodes whose labels are not larger than s. When Q1 is exhausted, the entire list V is repartitioned into two queues according to an appropriately adjusted threshold parameter.

To understand the main idea of the threshold algorithm, suppose that at time t the threshold is set to a new value s, and that at some subsequent time t' > t the queue Q1 gets exhausted. Then at time t' all the nodes of the candidate list have labels greater than s. In view of the nonnegativity of the arc lengths, this implies that all nodes with labels less than or equal to s will not reenter the candidate list after time t'. In particular, all nodes that exited the candidate list between times t and t' become permanently labeled at time t' and never reenter the candidate list. We may thus interpret the threshold algorithm as a block version of Dijkstra's method, whereby a whole subset of nodes becomes permanently labeled when the queue Q1 gets exhausted.
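The two-queue mechanism is easy to state in code. In the Python sketch below (ours), the new threshold after Q1 is exhausted is simply set to the minimum label in Q2, purely as a placeholder; the heuristic adjustment rule of [9] that we actually use is not reproduced here.

    from collections import deque

    def threshold_method(n, adj, origin=1):
        INF = float('inf')
        d = [INF] * (n + 1)
        d[origin] = 0
        Q1, Q2 = deque([origin]), deque()    # the candidate list V is Q1 + Q2
        in_V = [False] * (n + 1)
        in_V[origin] = True
        s = 0                                # threshold parameter
        while Q1 or Q2:
            if not Q1:                       # Q1 exhausted: repartition V
                s = min(d[j] for j in Q2)    # placeholder adjustment rule
                Q1 = deque(j for j in Q2 if d[j] <= s)
                Q2 = deque(j for j in Q2 if d[j] > s)
            i = Q1.popleft()                 # remove the top node of Q1
            in_V[i] = False
            for j, a_ij in adj[i]:
                if d[j] > d[i] + a_ij:
                    d[j] = d[i] + a_ij
                    if not in_V[j]:          # entering node goes to Q1 or Q2
                        (Q1 if d[j] <= s else Q2).append(j)
                        in_V[j] = True
        return d

With the minimum label as the new threshold, the method degenerates toward Dijkstra's algorithm; the heuristic of [9] picks a larger value, so that Q1 holds a whole block of nodes.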
However, when one tries to parallelize the threshold algorithm, it is difficult to maintain the permanent labeling property described above. The reason is that this property depends on using a uniform threshold value for the entire candidate list. In particular, this property will not hold if the candidate list is divided into multiple (partial) candidate lists, each operated by a separate processor with its own independent threshold value. The alternative to maintaining multiple parallel lists with independent threshold values is either to maintain a single list, which is accessed by all processors, or to maintain a common threshold value across the independent lists of the different processors. Both of these alternatives require considerable synchronization between processors, and this is the reason why we were unable to parallelize the threshold method as efficiently as other methods.

We also note that the performance of the threshold method is very sensitive to the procedure used for adjusting the threshold parameter s. In particular, if s is chosen too small, the method becomes equivalent to an unsophisticated version of Dijkstra's algorithm, while if s is chosen too large, the method is quite similar to the Bellman-Ford algorithm. The original proposal of the threshold algorithm [9] gives a heuristic method for choosing the threshold that works remarkably well for many problems, as also verified in [1] and [2]. However, it appears that choosing appropriate threshold values becomes more complicated in a parallel setting.

In the Small Label First (SLF) algorithm the candidate list V is maintained as a double-ended queue Q. At each iteration, the node removed is the top node of Q. The rule for inserting new nodes is given below:

    Let i be the top node of Q, and let j be a node that enters Q.
    If dj < di, then enter j at the top of Q; else enter j at the bottom of Q.

    Fig. 2: SLF queue insertion strategy

The SLF method provides a rule for inserting nodes in the queue, but always removes nodes from the top of Q. We now propose a more sophisticated node removal strategy, which aims to remove from Q nodes with small labels. In particular, we suggest that, at each iteration, when the node at the top of Q has a larger label than the average node label in Q (defined as the sum of the labels of the nodes of Q divided by the cardinality |Q| of Q), this node is not removed from Q, but rather is repositioned to the bottom of Q. We refer to this as the Large Label Last strategy (LLL for short). Fig. 3 summarizes the LLL strategy for removing nodes from V.

    Let i be the top node of Q, and let s = ( Σ_{j ∈ Q} dj ) / |Q|.
    If di > s, then move i to the bottom of Q.
    Repeat until a node i such that di ≤ s is found and is removed from Q.

    Fig. 3: LLL node selection strategy

It is simple to combine the SLF queue insertion and the LLL node selection strategies, thereby obtaining a method referred to as SLF-LLL; a sketch is given below. We have found that the combined SLF-LLL method consistently requires a smaller number of iterations than either SLF or LLL, although the gain in the number of iterations is sometimes more than offset by the extra overhead per iteration.
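The following Python sketch (ours) combines the two strategies; a running sum of the labels in Q makes the comparison against the average label an O(1) test.

    from collections import deque

    def slf_lll(n, adj, origin=1):
        INF = float('inf')
        d = [INF] * (n + 1)
        d[origin] = 0
        Q = deque([origin])
        in_Q = [False] * (n + 1)
        in_Q[origin] = True
        label_sum = 0                            # sum of labels of nodes in Q
        while Q:
            # LLL removal (Fig. 3): send large-label nodes to the bottom until
            # the top node's label is no larger than the average label in Q.
            while d[Q[0]] * len(Q) > label_sum:
                Q.append(Q.popleft())
            i = Q.popleft()
            in_Q[i] = False
            label_sum -= d[i]
            for j, a_ij in adj[i]:
                if d[j] > d[i] + a_ij:
                    if in_Q[j]:                  # label of j changes in place
                        label_sum += d[i] + a_ij - d[j]
                    d[j] = d[i] + a_ij
                    if not in_Q[j]:
                        if Q and d[j] < d[Q[0]]: # SLF insertion (Fig. 2)
                            Q.appendleft(j)
                        else:
                            Q.append(j)
                        in_Q[j] = True
                        label_sum += d[j]
        return d

The inner repositioning loop always terminates, since the node of minimum label in Q has a label no larger than the average.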
The SLF and LLL strategies can also be combined with the threshold algorithm. In particular, the LLL strategy is used when selecting a node to exit the queue Q1 (the top node of Q1 is repositioned to the bottom of Q1 if its label is found larger than the average label in Q1). Furthermore, whenever a node enters the queue Q1, it is added to the bottom or the top of Q1 depending on whether its label is greater than the label of the top node of Q1 or not. The same policy is used when transferring to Q1 the nodes of Q2 whose labels do not exceed the current threshold parameter. Thus the nodes of Q2 are transferred to Q1 one-by-one, and they are added to the top or the bottom of Q1 according to the SLF strategy. Finally, the SLF strategy is also followed when a node enters the queue Q2.

It is also possible to combine the SLF and LLL strategies with the D'Esopo-Pape method [10], as has already been proposed (for the case of the SLF strategy) in [17]. In the D'Esopo-Pape method the candidate list V is maintained as a double-ended queue Q. At each iteration, the node removed is the top node of Q, but a new node is inserted at the bottom of Q if it has never entered Q before, and is inserted at the top of Q otherwise. The rationale for this queue insertion strategy is somewhat unclear, but the literature contains numerous reports of excellent performance of the D'Esopo-Pape method. However, the results of [2] show that the D'Esopo-Pape method is not consistently faster than the Bellman-Ford algorithm, and indeed in some cases it is dramatically slower. Following the suggestion of [17], we have also experimented with serial implementations of various combinations of the SLF and the SLF-LLL strategies with the D'Esopo-Pape method. We have verified that the use of the SLF strategy for nodes that enter Q for the first time reduces the number of iterations, and that the use of the LLL strategy, in addition to SLF, reduces the number of iterations even further. However, we found that in a serial environment, the combinations of SLF and LLL with the threshold algorithm are much faster than the corresponding combinations with the D'Esopo-Pape method. We have not experimented with combinations of the D'Esopo-Pape method with SLF and LLL in a parallel setting. We note, however, that parallel asynchronous implementations of such combinations based on the ideas of this paper are straightforward. It is plausible that these implementations will prove effective for problems where the D'Esopo-Pape method is much faster than the Bellman-Ford algorithm.

The results of [2] and [17], and the results of the present paper, demonstrate that for problems with nonnegative arc lengths the SLF and LLL strategies consistently improve the performance of the Bellman-Ford, the threshold, and the D'Esopo-Pape methods. It is seen therefore that SLF and LLL are complementary to the other basic label correcting methods and improve their performance when combined with them. We will see in the next two sections that the same is true in a parallel setting.

Regarding the theoretical worst-case performance of the SLF and the combined SLF-LLL algorithms, it is not known at present whether these algorithms have polynomial complexity. However, extensive computational experience has yielded no indication of nonpolynomial behavior. In any case, it is possible to construct provably polynomial versions of these algorithms as follows. Suppose that there is a set of increasing iteration indices t1, t2, ..., tn+1 such that t1 = 1 and, for i = 1, ..., n, all nodes that are in V at the start of iteration ti are removed from V at least once prior to iteration ti+1. Because all arc lengths are nonnegative, this guarantees that the minimum label node of V at the start of iteration ti will never reenter V after iteration ti+1. Thus the candidate list must have no more than n-i nodes at the start of iteration ti+1, and must become empty prior to iteration tn+1.
Thus, if the running time of the algorithm between iterations ti and ti+1 is bounded by R, the total running time of the algorithm will be bounded by nR, and if R is polynomially bounded, the running time of the algorithm will also be polynomially bounded.

Assume now, in particular, that between iterations ti and ti+1 each node is inserted at the top of Q a number of times that is bounded by a constant, and that (in the case of SLF-LLL) the total number of repositionings is bounded by a constant multiple of m. Then it can be seen that the running time of the algorithm between iterations ti and ti+1 is O(m), and therefore the complexity of the algorithm is O(nm). To modify SLF or SLF-LLL so that this result applies, it is sufficient to fix an integer k > 1 and to separate the iterations of the algorithm into successive blocks of kn iterations each. We then impose the additional restriction that, within each block of kn iterations, each node can be inserted at most k-1 times at the top of Q (that is, after the (k-1)th insertion of a node at the top of Q within a given block of kn iterations, all subsequent insertions of that node within that block must be at the bottom of Q). In the case of SLF-LLL, we also impose the additional restriction that the total number of repositionings within each block of kn iterations should be at most km (that is, once the maximum number of km repositionings is reached, the top node of Q is removed from Q regardless of the value of its label). The worst-case running times of the modified algorithms are then O(nm). In practice, it should be highly unlikely that the restrictions introduced to guarantee O(nm) complexity will be exercised if k is larger than, say, 3 or 4.
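As an illustration, the restriction on top insertions can be enforced with a per-block counter. The helper below is a sketch (ours) of the modified SLF insertion rule; the km cap on LLL repositionings can be enforced with an analogous counter.

    def slf_insert_bounded(Q, d, j, k, n, iteration, top_inserts):
        # Within each block of k*n iterations, node j may be inserted at
        # the top of the deque Q at most k-1 times; all later insertions
        # in the block go to the bottom. top_inserts is a dict counter,
        # initially empty.
        key = (j, iteration // (k * n))
        if Q and d[j] < d[Q[0]] and top_inserts.get(key, 0) < k - 1:
            Q.appendleft(j)
            top_inserts[key] = top_inserts.get(key, 0) + 1
        else:
            Q.append(j)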
3. PARALLEL LABEL CORRECTING METHODS

The general principle for parallelizing the generic label correcting method is straightforward. The basic idea is that several nodes can be simultaneously removed from the candidate list, and the labels of the adjacent nodes can be updated in parallel. In a shared memory machine, the label of a node is maintained in a unique memory location, which can be accessed by all processors. During the concurrent label updating it is possible that multiple processors will attempt to modify the label of the same node simultaneously. For this reason, the label updating operation must be executed with the use of a lock, which guarantees that only one processor at a time can modify a given label.

Two important characteristics of a parallel shared memory implementation of a label correcting method are whether:
1) the candidate list is organized in a single queue shared by all processors, or in multiple queues, that is, a separate queue for each processor;
2) the label updating is synchronous or asynchronous.

The issue of one versus multiple queues primarily involves the tradeoff between good load balancing among multiple processor queues and increased contention for access to a single queue. We will see, however, that multiple queues also enhance the effectiveness of the SLF and LLL strategies, because with multiple queues more nodes with small labels tend to rise to the top of the queues.

Our implementation of the various queue strategies is as follows:

Parallel One-Queue Algorithm. We have a single queue Q shared among all processors (in the case of the threshold algorithms, this queue is partitioned as discussed earlier). Each processor removes the node at the top of Q, updates the labels of its adjacent nodes, and adds these nodes (if necessary) to Q, according to the insertion strategy used. The procedure is repeated until Q is found empty, in which case the processor switches to an idle state and reawakens when Q becomes nonempty. The execution is stopped when the idle condition is reached by all processors. This algorithm suffers from substantial contention between the processors attempting to access the top node of Q and to insert nodes into Q.

Parallel Multiple-Queues Algorithm. In this algorithm, each processor uses a separate queue. It extracts nodes from the top of its queue, updates the labels of the adjacent nodes, and uses a heuristic procedure for choosing the queue in which to insert a node that enters V; a sketch of this heuristic is given below. In particular, the queue chosen is the one with the minimum current value of the sum of the out-degrees of the nodes assigned to the queue (the out-degree of a node i is the number of outgoing arcs of i). This heuristic is easy to implement and ensures good load balancing among the processors. In our implementations, a node can reside in at most one queue. In particular, a processor can check whether a node is present in the candidate list (that is, in some queue) by checking the value of a boolean variable, which is updated each time a node enters or exits the candidate list. In the case of the threshold algorithms, the threshold setting policy of the corresponding serial method was used independently for each of the queues.

For all algorithms tested, we have found that the multiple-queues versions were more efficient than their single-queue counterparts. The reason is that in the case of multiple queues there is much less contention for queue access than in the case of a single queue, because with multiple queues the likelihood of multiple processors attempting to insert a node in the same queue simultaneously is much smaller. For this reason, we concentrate in what follows on the multiple-queues implementation.
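In code, the load balancing heuristic amounts to the following sketch (ours; the locking needed to protect the shared load counters, and the SLF placement of the node within the chosen queue, are omitted).

    def choose_queue(queue_loads):
        # Pick the queue whose assigned nodes have the smallest total
        # out-degree; queue_loads[q] holds that running sum for queue q.
        return min(range(len(queue_loads)), key=queue_loads.__getitem__)

    def insert_node(queues, queue_loads, out_degree, j):
        # Insert node j into the least loaded queue and charge its
        # out-degree to that queue; the symmetric update (subtracting
        # out_degree[i] when a node i is extracted) is done by the
        # queue's owner.
        q = choose_queue(queue_loads)
        queues[q].append(j)
        queue_loads[q] += out_degree[j]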
The issue of synchronous versus asynchronous implementation is a tradeoff between orderliness of the computation and the penalty for synchronization. In a synchronous implementation, the computation proceeds in rounds of parallel iterations. During each round, each processor removes a different node from the candidate list (if the number of processors is greater than the number of nodes, some processors remain idle). The processors then update in parallel the labels of the corresponding adjacent nodes. Finally, a new round begins once all the label updating of the current round is finished.

In an asynchronous algorithm, there is no notion of rounds, and a new node may be removed from the candidate list by some processor while other processors are still updating the labels of various nodes. A single origin-single destination label correcting method resembling the ones considered here is given in p. 451 of [14]. More formally, for t = 0, 1, ..., let dj(t) denote the value of the label of node j at time t; this is the value of dj that is kept in shared memory. In our mathematical model of the asynchronous label correcting algorithm, the label dj is updated at a subset of times Tj ⊂ {0, 1, ...} by some processor that need not be specified further. The updating formula is

    dj(t+1) = min{ dj(t), di(τi(t)) + aij },   if t ∈ Tj,
    dj(t+1) = dj(t),                           otherwise,        (1)

where τi(t) is the time at which the label di was read from shared memory by the processor updating dj at time t. The asynchronism results from the possibility that we may have τi(t) < t and di(τi(t)) ≠ di(t), because the label di stored in shared memory may have been changed between the times τi(t) and t by another processor. Note, however, that before the label dj can be changed according to Eq. (1), the value di(τi(t)) + aij must be found smaller than the current value dj(t). One way to accomplish this is to lock the memory location storing dj after dj is read, to ensure that no other processor can change dj while the test

    di(τi(t)) + aij < dj(t)        (2)

is conducted. The drawback of this method is that the memory location of dj may be locked unnecessarily, while other processors are waiting to read the value of dj.

An alternative method, which we found much more efficient, is to first read dj(t') at some time t' and, without locking its memory location, compare it to di(τi(t')) + aij. If the test (2) is passed, the memory location of dj is locked and its current value dj(t) (which may have been changed by another processor while the test was being conducted) is read again. Depending on whether the test (2) is passed once more, the new value dj(t+1) is recorded according to Eq. (1), and the corresponding memory location is unlocked. This memory management method significantly reduced the number of locking operations and contributed substantially to the speedup of the algorithms.
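A minimal sketch of this locking discipline, in Python threads rather than the Fortran of our implementation (class and names ours):

    import threading

    class SharedLabels:
        # Shared labels with one lock per node, updated with the
        # test-then-lock-then-retest pattern described above.
        def __init__(self, n, origin=1):
            self.d = [float('inf')] * (n + 1)
            self.d[origin] = 0
            self.locks = [threading.Lock() for _ in range(n + 1)]

        def relax(self, di, j, a_ij):
            # Attempt dj := di + aij, where di is the (possibly outdated)
            # value read earlier by the caller. Returns True if the label
            # was improved, in which case j should enter the candidate list.
            cand = di + a_ij
            if cand >= self.d[j]:      # unlocked test (2); usually fails
                return False
            with self.locks[j]:        # lock only when the test passes
                if cand < self.d[j]:   # re-read and re-test: dj may have
                    self.d[j] = cand   # changed in the meantime
                    return True
            return False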
The convergence of the preceding algorithm to the correct shortest distances di, that is,

    di(t) = di,   for all t ≥ t̄,   i = 1, 2, ..., n,        (3)

where t̄ is some (finite) time, can be shown under very weak assumptions. In particular, what is needed is that Tj is an infinite set for each j ≠ 1, that if (i,j) is an arc, then the node i is used in Eq. (1) for an infinite subset of Tj, and that τi(t) → ∞ as t → ∞. These are the minimal conditions for asynchronous convergence, as discussed in [14], Ch. 6. Note that the computation can be terminated once a time t̄ such that Eq. (3) holds is found. In our shared memory context, the time where termination occurs as in Eq. (3) is recognized as the time where the queue Q is empty and all processors are idle. The proof of convergence closely resembles related proofs in [14], Section 6.4, in [11], and in [13], Section 5.2.4, and will not be given here.

It has often been found empirically that asynchronous algorithms, when valid, outperform their synchronous counterparts because they are not delayed by synchronization requirements. Examples are given in references [6] and [18], which give parallel asynchronous implementations of auction algorithms that bear similarity to the implementations given here. However, to our knowledge, the present paper is the first to address the implementation of asynchronous label correcting methods and to assess their performance.

The synchronous algorithms also use multiple queues, since we found the single-queue versions to be relatively inefficient. The insertion of nodes in the queues is done similarly to the corresponding asynchronous algorithms. Our implementation is depicted in Fig. 4, and involves two synchronization points, the first at the end of the label updating procedure and the second at the conclusion of the iteration. Each processor temporarily stores the values of the updated labels in a private memory area; in this way, the new labels of nodes can be computed by a processor without locking their shared memory locations, which would delay the reading of these labels by other processors. Thus, at the end of the label updating task, the same node may be stored in multiple private memory locations with different label values. Following the label updating task, the updated labels are transferred to their main (shared) memory locations, and the corresponding nodes are added to V, if they are not already present in V. We have also tried the alternative scheme where the node labels are directly updated at their shared memory locations, but this approach turned out to be less efficient. In our implementation of the asynchronous algorithms, by contrast, a processor upon completing an iteration does not wait at any time for the completion of the iterations of the other processors, but starts a new iteration instead (if V is not empty), thereby avoiding the corresponding synchronization penalty.

[Fig. 4: Flowchart of the parallel synchronous label correcting algorithm: while V is nonempty, at most p nodes are extracted from V, the labels of the adjacent nodes are updated, and nodes are possibly inserted in V, with a synchronization point after each stage.]
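The round structure of Fig. 4 can be pictured with barriers. The sketch below (ours, with hypothetical hooks extract, relax_private, merge, and all_queues_empty standing in for the queue machinery) is schematic rather than a reproduction of our implementation.

    import threading

    class SynchronousRunner:
        def __init__(self, p, extract, relax_private, merge, all_queues_empty):
            self.extract, self.relax_private, self.merge = extract, relax_private, merge
            self.all_queues_empty = all_queues_empty
            self.done = False
            self.b1 = threading.Barrier(p)                      # sync point 1
            self.b2 = threading.Barrier(p, action=self._check)  # sync point 2

        def _check(self):
            # Runs once per round, while all p workers wait at b2.
            self.done = self.all_queues_empty()

        def worker(self, pid):
            while not self.done:
                i = self.extract(pid)            # at most one node from V
                if i is not None:
                    self.relax_private(pid, i)   # new labels to a private area
                self.b1.wait()                   # all label updating finished
                self.merge(pid)                  # private labels -> shared d, V
                self.b2.wait()                   # iteration concluded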
4. NUMERICAL EXPERIMENTS

The SLF and SLF-LLL algorithms were implemented and tested using an Alliant FX/80. This computer is based on a vector-parallel architecture with 8 processors, each with 23 Mflops of peak performance, sharing a common memory of 32 MBytes. The compiler used was FX/Fortran 4.2. The vectorization capability of the processors was not used in our experiments.

In order to evaluate numerically the efficiency of the methods, we have tested the following six codes, which evolved from the codes of [1] and [2]:
- B-F: Bellman-Ford method;
- SLF: Small Label First method;
- SLF-LLL: Small Label First method, using in addition the Large Label Last strategy for node removal;
- THRESH: Threshold method; the method for setting the threshold parameter is the same as the one recommended in [9] and also used in [2];
- SLF-THRESH: Threshold method in combination with the SLF node insertion strategy;
- SLF-LLL-THRESH: The preceding method, using in addition the Large Label Last strategy for node removal.

We used four different types of randomly generated test problems, for which all arc lengths were chosen according to a uniform distribution from the range [1,1000], as well as four real-world road networks.

Grid/random problems (G1, G2, G3, G4). These are problems generated by a modified version of the GRIDGEN generator of [11]. The number of arcs is 1,000,000 for all problems, and the nodes are arranged in a square planar grid with the origin node 1 set in the southwest corner. Each pair of adjacent grid nodes is connected in both directions. We can also have additional arcs with random starting and ending nodes. The number of nodes was selected so that the total number of additional arcs is approximately 2, 3, 4, and 5 times the number of grid arcs.

Euclidean grid/random problems (E1, E2, E3, E4). These problems are generated similarly to the preceding class. The only difference is that the length of each nongrid arc from the grid node (i,j) to the grid node (h,k) is set to r·e_{ij,hk}, where e_{ij,hk} is the Euclidean distance of the nodes (the square root of (i-h)^2 + (j-k)^2), and r is an integer chosen according to a uniform distribution from the range [1,1000].

Netgen problems (N1, N2, N3, N4). These are problems generated with the public domain program NETGEN [19]. The number of arcs is 1,000,000, whereas the number of nodes was chosen as 31,622, 15,811, 11,952, and 10,000.

Fully dense problems (C1, C2, C3, C4). In these problems all the possible n(n-1) arcs are present.

Road networks (R1, R2, R3, R4). These are the Manhattan, Waltham, Boston, and Middlesex County road networks from the TIGER/Line(TM) Census Files, which were also tested in [17]. We thank Dr. T. Dung for providing these networks to us. In all our tests, node 0 was taken as the origin.

    Test      Nodes     Arcs
    G1, E1    70756     1000000
    G2, E2    50176     1000000
    G3, E3    40804     1000000
    G4, E4    35344     1000000
    N1        31622     1000000
    N2        15811     1000000
    N3        11952     1000000
    N4        10000     1000000
    C1          250       62250
    C2          500      249500
    C3          750      561750
    C4         1000      999000
    R1         4795       16458
    R2        26347       64708
    R3       102557      250616
    R4       108324      271340

    Tab. 1: List of test problems

5. EXPERIMENTAL RESULTS AND DISCUSSION

We now discuss our experimental results. For each category of test problems we give the sequential (one-processor) and the parallel (8-processor) solution times for each algorithm. We also give the speedup for 4 and 8 processors. We measured the speedup for a given problem and algorithm as the ratio of the one-processor time over the multiple-processor time required by the algorithm. A more detailed accounting of our experimental results is given in the report [20], and includes the number of iterations and the times required by the synchronous and the asynchronous version of each algorithm on each of the test problems. For the parallel algorithms we report results only with the more efficient multiple-queues versions.

Grid/random problems. Figure 5 gives the sequential execution times, and shows that the threshold methods are much faster than the others. For these problems, the threshold methods require a very small number of iterations, almost equal to the number of nodes, which is the lower bound attained by Dijkstra's algorithm. The combinations with the SLF and LLL strategies consistently require a smaller number of iterations than the pure threshold method. However, since the threshold method works very well for these problems, there is little or no further reduction in the serial execution time as a result of the combination and, in some cases, there is a slight time increase due to the extra overhead of the SLF and LLL strategies. However, the SLF and LLL strategies are also very helpful in reducing the number of iterations without a threshold, as can be inferred by comparing the results of the SLF, SLF-LLL, and B-F methods.

[Fig. 5: Time in secs required to solve grid/random problems with the sequential codes.]

The improvements due to parallelism are summarized in Table 2, where the speedup values using 4 processors and 8 processors are reported for the asynchronous parallel algorithms. Fig. 6 gives the corresponding times using 8 processors.

    Problem   B-F          SLF          SLF-LLL      THRESH       SLF-THRESH   SLF-LLL-THRESH
    G1        2.67 / 4.28  2.81 / 5.21  2.51 / 4.48  1.17 / 1.43  1.16 / 1.58  1.11 / 1.59
    G2        2.92 / 5.09  3.01 / 4.77  2.49 / 4.61  1.22 / 1.61  1.28 / 1.96  1.27 / 1.95
    G3        2.98 / 4.71  2.97 / 5.48  2.52 / 4.75  0.96 / 1.46  1.29 / 1.72  1.32 / 1.81
    G4        2.03 / 5.25  3.21 / 6.25  2.75 / 5.47  0.92 / 1.36  1.19 / 1.83  1.28 / 1.81

    Tab. 2: Speedup values for the asynchronous parallel codes (4 processors / 8 processors)

It can be seen from Table 2 that the performance of the parallel asynchronous threshold methods is poor; a maximum speedup value of only 1.96 is obtained. This is due in part to the difficulty in parallelizing the threshold methods, which involve operations, such as the threshold setting and the transfer of nodes between the two queues, that are inherently sequential.
Furthermore, with the use of multiple queues, the permanent labeling property of the threshold method is lost, as discussed in Section 2. In addition, in the threshold methods it is difficult to choose an appropriate threshold, especially in the parallel case, when a threshold must be set for each queue. The SLF and LLL strategies are very helpful in reducing the number of iterations and are well suited for parallelization. An interesting result, especially with SLF, is that the use of multiple queues substantially reduces the number of iterations relative to the sequential version. This phenomenon was also noted for the other test problems. One possible explanation is that by using multiple queues, the sorting process that places nodes with small labels near the top of the queues is enhanced. The reduction in the number of iterations accounts for the particularly good speedup achieved with SLF (up to 6.25 with 8 processors), and also with SLF-LLL.

[Fig. 6: Time in secs required to solve grid/random problems with the parallel asynchronous codes using 8 processors.]

Euclidean grid/random problems. These problems are more difficult than the preceding ones because of the considerable difference between the lengths of the grid arcs and the nongrid arcs. Here THRESH requires a substantially smaller number of iterations than B-F, but the number of iterations of THRESH is quite large (two or three times larger than the number of nodes). The SLF and LLL strategies substantially reduce the number of iterations, as can be inferred from Fig. 7. Also in the parallel case we observe a large speedup with SLF and SLF-LLL. In particular, with SLF we achieve a maximum speedup of around 6.82, whereas with the SLF-LLL version we achieve a maximum speedup of 5.46. Again, our explanation is that the use of multiple queues enhances the process of examining nodes with small labels first, and results in a reduced number of iterations.

[Fig. 7: Time in secs required to solve Euclidean grid/random problems with the sequential codes.]

    Problem   B-F          SLF          SLF-LLL      THRESH       SLF-THRESH   SLF-LLL-THRESH
    E1        2.65 / 4.69  2.83 / 5.49  2.28 / 4.54  1.40 / 2.32  1.40 / 2.70  1.13 / 2.12
    E2        2.95 / 4.49  3.08 / 6.03  2.50 / 5.10  1.33 / 2.37  1.38 / 2.37  1.03 / 1.54
    E3        3.13 / 5.50  3.36 / 6.82  2.78 / 5.46  1.17 / 1.97  1.15 / 2.30  0.89 / 1.82
    E4        3.18 / 4.90  3.15 / 6.28  2.51 / 4.41  1.21 / 2.15  0.89 / 2.28  0.95 / 1.97

    Tab. 3: Speedup values for the asynchronous parallel codes (4 processors / 8 processors)

[Fig. 8: Time in secs required to solve Euclidean grid/random problems with the parallel asynchronous codes using 8 processors.]

Netgen problems. These problems are substantially more dense than the preceding ones, and in the sequential case, the threshold algorithms are much faster than the others. The improvement in execution time relative to B-F is due to the substantial reduction of the number of iterations.

[Fig. 9: Time in secs required to solve Netgen problems with the sequential codes.]

In the parallel asynchronous case, using multiple queues in combination with the SLF strategy works very well and results in fewer iterations. The reduction in the number of iterations is so large for one of the problems that the speedup is greater than 8 with 8 processors.
As a result, the SLF method outperforms all other parallel methods.

    Problem   B-F          SLF          SLF-LLL      THRESH       SLF-THRESH   SLF-LLL-THRESH
    N1        3.37 / 6.07  3.10 / 6.12  2.46 / 4.73  1.03 / 1.33  1.11 / 1.68  1.25 / 1.71
    N2        3.37 / 6.37  4.46 / 8.56  2.60 / 6.31  0.81 / 1.44  1.06 / 1.85  1.10 / 2.01
    N3        3.67 / 6.85  3.93 / 7.12  3.08 / 4.44  0.96 / 1.50  1.19 / 2.16  0.89 / 2.08
    N4        3.47 / 6.76  4.51 / 8.45  2.65 / 5.35  0.80 / 1.48  1.06 / 2.04  1.01 / 2.05

    Tab. 4: Speedup values for the asynchronous parallel codes (4 processors / 8 processors)

[Fig. 10: Time in secs required to solve Netgen problems with the parallel asynchronous codes using 8 processors.]

Fully dense problems. For fully dense problems the results are quite similar to those for the preceding problems, as can be seen from Figs. 11 and 12. The speedup values are larger for these problems, and the parallel performance of the Bellman-Ford method is relatively better than for the preceding problems.

    Problem   B-F          SLF          SLF-LLL      THRESH       SLF-THRESH   SLF-LLL-THRESH
    C1        3.96 / 7.26  3.38 / 7.52  2.97 / 5.72  1.90 / 3.18  1.85 / 2.84  1.57 / 2.72
    C2        4.03 / 7.50  4.22 / 8.08  3.33 / 5.98  1.88 / 3.59  2.03 / 3.92  1.67 / 2.97
    C3        4.10 / 7.73  4.20 / 8.23  3.09 / 5.80  2.32 / 2.96  2.08 / 3.87  1.70 / 3.32
    C4        4.21 / 8.02  4.15 / 8.13  3.06 / 6.16  2.09 / 3.05  2.25 / 4.07  1.68 / 3.32

    Tab. 5: Speedup values for the asynchronous parallel codes (4 processors / 8 processors)

[Fig. 11: Time in secs required to solve fully dense problems with the sequential codes.]

[Fig. 12: Time in secs required to solve fully dense problems with the parallel asynchronous codes using 8 processors.]

Road networks. For these problems, the SLF and LLL strategies are remarkably effective. In a serial setting they improve a great deal the performance of the Bellman-Ford and the threshold algorithms, as can be seen from Fig. 13. In a parallel setting they exhibit excellent (often superlinear) speedup, due to a greatly reduced number of iterations, as can be seen from Tab. 6 and Fig. 14. The reduction in the number of iterations for the SLF and LLL strategies must be attributed to the use of multiple queues and the associated enhanced sorting that places nodes with small labels near the top of the queues.

    Problem   B-F          SLF           SLF-LLL        THRESH       SLF-THRESH   SLF-LLL-THRESH
    R1        1.81 / 2.49  2.53 /  5.39   2.49 /  5.05  1.16 / 1.70  1.04 / 1.24  0.93 / 1.33
    R2        1.94 / 3.35  3.88 /  5.08   4.17 /  7.12  0.69 / 1.20  0.69 / 0.90  0.72 / 1.05
    R3        2.21 / 3.60  5.78 / 10.45  17.53 / 18.38  0.59 / 0.49  0.59 / 1.24  1.23 / 0.81
    R4        1.84 / 2.43  7.37 / 12.66  11.33 / 21.13  0.64 / 0.40  0.63 / 0.77  0.64 / 0.76

    Tab. 6: Speedup values for the asynchronous parallel codes (4 processors / 8 processors)

[Fig. 13: Time in secs required to solve road network problems with the sequential codes.]

[Fig. 14: Time in secs required to solve road network problems with the parallel asynchronous codes using 8 processors.]
In Table 7 and Fig. 15 we summarize the performance of the various methods and also show the advantage of the asynchronous implementations over their synchronous counterparts. In particular, we compare the methods following an approach similar to the one proposed in [21], by giving to each method, for each test problem, a score equal to the ratio of the execution time of this method over the execution time of the fastest method for the given problem. For each method we thus obtain an average score, the sum of the scores of the method divided by the number of test problems. This average score, given in Table 7, indicates how much slower a particular method has been, on the average, than the most successful method.

    Code             SEQ     SYN     ASYN
    B-F              17.86   26.97   4.67
    SLF               9.39    5.72   1.21
    SLF-LLL           6.10    6.62   1.03
    THRESH            8.50   10.14   4.41
    SLF-THRESH        3.33    2.77   1.44
    SLF-LLL-THRESH    2.63    2.23   1.33

    Tab. 7: Average scores of all implemented methods

[Fig. 15: Plot of the average scores of all implemented methods as per Table 7.]

In conclusion, the use of multiple queues seems to work very well in conjunction with the SLF and LLL strategies, and the asynchronous parallel algorithms consistently outperform their synchronous counterparts. The threshold method, which is robust and efficient on serial computers, is not well suited for parallelization. Finally, the SLF and LLL strategies maintain their efficiency when implemented in parallel, and when combined with other methods, significantly improve their performance both in a serial and in a parallel environment.

6. ACKNOWLEDGMENT

We would like to acknowledge the director and the staff of CERFACS, Toulouse, France, for allowing us to use the Alliant FX/80.

7. REFERENCES

1. GALLO, G., and PALLOTTINO, S., Shortest Path Algorithms, Annals of Operations Research, Vol. 7, pp. 3-79, 1988.
2. BERTSEKAS, D. P., A Simple and Fast Label Correcting Algorithm for Shortest Paths, Networks, Vol. 23, pp. 703-709, 1993.
3. BERTSEKAS, D. P., An Auction Algorithm for the Shortest Path Problem, SIAM Journal on Optimization, Vol. 1, pp. 425-447, 1991.
4. BERTSEKAS, D. P., PALLOTTINO, S., and SCUTELLA', M. G., Polynomial Auction Algorithms for Shortest Paths, Report LIDS-P-2107, Massachusetts Institute of Technology, May 1992 (to appear in Computational Optimization and Applications).
5. HELGASON, R. V., and STEWART, D., One-to-One Shortest Path Problem: An Empirical Analysis with the Two-Tree Dijkstra Algorithm, Computational Optimization and Applications, Vol. 2, pp. 47-75, 1993.
6. POLYMENAKOS, L., and BERTSEKAS, D. P., Parallel Shortest Path Auction Algorithms, Parallel Computing, Vol. 20, pp. 1221-1247, 1994.
7. GALLO, G., and PALLOTTINO, S., Shortest Path Methods: A Unified Approach, Mathematical Programming Study, Vol. 26, pp. 38-64, 1986.
8. BELLMAN, R., Dynamic Programming, Princeton University Press, Princeton, N.J., 1957.
9. GLOVER, F., GLOVER, R., and KLINGMAN, D., The Threshold Shortest Path Algorithm, Networks, Vol. 14, 1986.
10. PAPE, U., Implementation and Efficiency of Moore-Algorithms for the Shortest Route Problem, Mathematical Programming, Vol. 7, pp. 212-222, 1974.
11. BERTSEKAS, D. P., Linear Network Optimization: Algorithms and Codes, M.I.T. Press, Cambridge, MA, 1991.
12. BERTSEKAS, D. P., Distributed Dynamic Programming, IEEE Transactions on Automatic Control, Vol. AC-27, pp. 610-616, 1982.
13. BERTSEKAS, D. P., and GALLAGER, R. G., Data Networks (2nd Ed.), Prentice-Hall, Englewood Cliffs, N.J., 1992.
14. BERTSEKAS, D. P., and TSITSIKLIS, J. N., Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, N.J., 1989.
15. MOHR, T., and PASCHE, C., Parallel Shortest Path Algorithm, Computing (Vienna/New York), Vol. 40, pp. 281-292, 1990.
16. TRAFF, J. L., Precis: Distributed Shortest Path Algorithms, Proceedings of the 5th International PARLE Conference, Munich, Germany, pp. 720-723, Springer-Verlag, 1993.
17. DUNG, T., HAO, J., and KOKUR, G., Label Correcting Shortest Path Algorithms: Analysis and Implementation, GTE Laboratories Incorporated, Waltham, MA, Unpublished Report, 1993.
18. BERTSEKAS, D. P., and CASTANON, D. A., Parallel Asynchronous Implementations of the Auction Algorithm, Parallel Computing, Vol. 17, pp. 707-732, 1991.
19. KLINGMAN, D., NAPIER, A., and STUTZ, J., NETGEN: A Program for Generating Large Scale (Un)Capacitated Assignment, Transportation, and Minimum Cost Flow Network Problems, Management Science, Vol. 20, pp. 814-822, 1974.
20. BERTSEKAS, D. P., GUERRIERO, F., and MUSMANNO, R., Parallel Asynchronous Label Correcting Methods for Shortest Paths, Report LIDS-P-2250, Massachusetts Institute of Technology, May 1994.
21. BROWN, A. A., and BARTHOLOMEW-BIGGS, M. C., Some Effective Methods for Unconstrained Optimization Based on the Solution of Systems of Ordinary Differential Equations, Tech. Report 78, Numerical Optimization Centre, The Hatfield Polytechnic, Hatfield, UK, 1987.