Each cluster in the cluster priority list is examined to determine if the ready operation can be scheduled on it. If the copies can be scheduled in their respective groups in previous cycles, the current operation and the corresponding copies are scheduled. The proposed scheduler based on graph matching finds the best connection of a set of instructions ready to be scheduled in the current cycle among all possible scheduling alternatives.
In case of a tie in communication cost, the instruction is assigned to a pool where it can be scheduled in a unit less needed in the current cycle with the same communication cost (this is discussed in more detail in the next section). The dequeue method simply returns the next untried node (operation) from the ready prioritized list, and this node is then used to initialize a prioritized list of stacks into which an instruction can be scheduled in the current loop.
An example
The ready list also has an associated resource vector that is used to keep track of the approximate resource requirements of the nodes currently in the ready list. The enqueue method of ready list places a priority on nodes through a three-component priority function described above. The selected cluster and the node are then used to select feasible functional units on the cluster for scheduling the instruction under consideration.
The dequeue method of the list of functional units returns the least needed resource in the current cycle for binding the statement. The cluster initialization routine guarantees that the free slot required for an explicit MV operation is available before the cluster is placed in the viable cluster list. If an operation cannot be scheduled on any cluster, the operation is marked as attempted and the next operation is considered.
Once all the operations have been considered, the cycle counter is advanced and ready list and the resource demand vector is updated with operations now ready for scheduling.
MVK .S1 8,V109
A GRAPH-MATCHING-BASED INTEGRATED SCHEDULING FRAMEWORK
The Algorithm
The cost of an edge is composed of various factors that contribute to the priority of scheduling an instruction in the current cycle over other instructions. Another problem is posed such that the space of all possible matches with the same cost as the previously obtained minimum cost is searched, and its solution maximizes the number of instructions in the match. To further reduce the communication and associated costs, some of the instructions in the last match that incur high communication costs but with sufficient scheduling freedom can be delayed for consideration in later cycles.
The graph construction process creates a bipartite matching graph consisting of a set of instruction nodes including all the instructions in the ready list and a set of resource nodes including all the resources in all the clusters. A more detailed description of graph construction is available in the accompanying technical report by Nagpal and Srikant [15]. After each cycle, the mobility of each instruction remaining in the ready list is reduced by a factor β to reflect the exact freedom remaining in scheduling each instruction in the next cycle.
The coverage factor is given a relatively small contribution and is used to break ties in the cost function. As the available freedom to plan an instruction is reduced, the probability of selection in the final match increases. Since our algorithm schedules as many instructions as possible in each cycle, some of the instructions with high communication costs may appear in the final match depending on the resource and communication requirements of other instructions being considered.
So to further reduce the ICC and add extra code, scheduling instructions with high communication costs and enough freedom can be deferred to future cycles in the hope of scheduling them later.
An example
MAX DEPARTMENT : Maximum of cover factors for instructions in the ready list MIN DEPARTMENT : Minimum of cover factors for instructions in the ready list MAX MOB : Maximum mobility of instructions in the ready list. The hedging factor can also be used to determine which candidate to reject in a better way. Thus, selective rejection provides a mechanism to explore the trade-off in code size and performance.
We present the performance impact of using and not using selective rejection in terms of runtime and code size in the next section. We enumerate all possible scheduling alternatives for all instructions and leave it to the scheduler to choose among the scheduling alternatives in order to schedule as many instructions as possible in each cycle. Tables IV and V present the mobility factor, detection factor, communication cost factor, and total cost calculated for all scheduling alternatives using the cost function described above.
Each instruction in the generated schedule is suffixed with the scheduling alternative number used by the scheduler. A detailed description and analysis of the schedule generated by the graph-matching-based scheduler and UAS is available in the accompanying technical report by Nagpal and Srikant [15].
EXPERIMENTAL RESULTS 1. Experimental setup
Performance results
Percent distribution of explicit MV statements between clusters versus other statements (see Figure 13 for legend). This is because it aims to schedule an instruction in the current cycle, even if it has enough freedom and requires explicit MV control. GS reduces the number of explicit MV operations by incorporating a selective rejection mechanism with some performance degradation attributed to operation serialization in the later cycles.
Our integrated algorithm offers performance comparable to GS and code size penalty (in terms of explicit MV operations) slightly higher than GS and LP. The integrated algorithm can achieve significant speedup over UC, UM and LP on program exhibiting high ILP (low L/N). GS and LP outperform the other algorithms in terms of the number of explicit MV operations used for ICC.
LP, which takes a global view of the DFG and tends to minimize ICC, has a smaller code size penalty in terms of explicit MV operations. The integrated algorithm performs consistently better than UC and UM and slightly worse than GS and LP in terms of the number of explicit MV instructions used for ICC. We solved the problem by using hundreds of nodes (for an eight-wide machine) in a cycle and were able to schedule the entire DFG in milliseconds.
Thus, even considering a practical 16-wide machine (with a 4- or 8-cluster configuration) that has an additional communication constraint will not increase the compilation time to an extent that may question the practicability of the approach given the quality of the plans created and the importance of the quality code in the integrated domain.
Performance evaluation
However, they do not suggest any particular order for considering instructions in the ready queue. However, effective functional unit binding is important in the case of the machine model under consideration because resources differ in their operational and communicative capabilities and resource and communication constraints are tightly coupled. Our integrated scheduling algorithm shows improvement over UAS by effectively exploiting the information about free crossings and functional units available in the current cycle as well as in the earlier cycles and also the number and types of operations ready to be scheduled in the current cycle and their resource requirements.
Functional unit binding is performed while taking care of other operations to be scheduled in the current cycle and their requirements for functional units and the intersection. This helps to combat the situation where previously naive decisions lead to unnecessary delay in planning an operation in the current cycle and the resulting schedule. Our scheme is very aggressive in using the free junction and tries to schedule an instruction in the current cycle by trying to schedule communication in the previous cycles using the free communication slot where necessary.
The scheduler simultaneously selects options for the instructions to be scheduled in the current cycle, taking advantage of the communication capacity and parallelism offered by the hardware. Scheduling alternatives for instructions with sufficient freedom and high communication costs are assigned high costs, so that scheduling instructions with low values of freedom are preferred in the current cycle, hoping to schedule high-cost instructions using a low-cost alternative in later cycles. Because the GM schedules as many instructions as possible in each cycle, some instructions with high communication overhead may appear in the final match, depending on the resources and communication requirements of the other instructions considered.
To further reduce ICC and associated overhead costs, GS incorporates a selective rejection mechanism to postpone the scheduling of instructions with high communication costs and sufficient freedom for future cycles in the hope of scheduling them later using a less costly alternative.
CONCLUSIONS AND FUTURE DIRECTIONS
GM further improves on the integrated algorithm by considering all the possible scheduling alternatives obtained by varying communication options, spatial locations, and resource consumption simultaneously instead of following a fixed order. A cost function composed of various dynamically varying factors such as the communication cost, the freedom available in the scheduling of the instruction, and the hedging factor of the instruction is used to choose between scheduling alternatives for instructions competing for limited resources while scheduling the maximum number of instructions. in each cycle. Instructions with less freedom are scheduled ahead of those with higher freedom values, despite their high communication costs, to avoid stretching the overall schedule.
However, performance results show that selective rejection incurs a performance penalty to reduce the code size.