TID, November 25, 2009 # SCATTERED CONTEXT GRAMMARS & VLIW ARCHITECTURE MODELING Jakub Křoustek ikroustek@fit.vutbr.cz #### Motivation - PhD thesis: Debugging Tools for Optimized Code of VLIW Architectures (doc. Kolář) - Applied research Exploitation of SCG in VLIW Architecture Modeling - 1. Generator of Proper VLIW Assembler Code - VLIW assembler code analysis (parsing) - 3. New techniques for instruction scheduling - Co-author: Stanislav Židek #### Contents #### 1. Introduction - VLIW architecture overview - Scattered Context Grammar - 2. Generator of Proper VLIW Assembler Code - Examples - SCG-based generators - 3. Future Research - SCG with Priority Rules # PART 1 # **INTRODUCTION** #### VLIW Architecture Overview - Very Long Instruction Word - 1980 (Josh Fisher ) - Instruction Level Parallelism - Control of many (10+) function units in every cycle - ALU, MEMORY, BRANCH UNIT, MMU ... - Instructions contain many independent operations <sup>2.</sup> Generators 3. (P)SCG With Priority #### **VLIW Constraints** - Very complicated compilers (schedulers) - High ILP = utilization of all units - Not always possible ⇒ NOPs - Planning of function unit utilizations at compile time! - Knowledge of operation latencies, dependencies, ... - Must control code for conflicts (no HW runtime check!) - RAW (read after write), WAR (write after read) - Write conflicts WAW (write after write) add **r1**, r2, r3 \ mul r4, r2, r3 \ load **r1**, [r5] \ nop # (Propagating) Scattered Context Grammar - (Propagating) SCG is quadruple G = (V, T, P, S) - V is finite set of symbols - T is set of terminals, T ⊂ V - S is the start symbol, S ∈ V T - P is a finite set of productions of the form $$(A_1, ..., A_n) \rightarrow (x_1, ..., x_n),$$ where $\forall A_i : A_i \in V - T, \ \forall x_i : x_i \in V^*$ (Propagating SCG: $x_i \in V^*$ ) # Derivation Step, Generated Language #### Derivation Step Let $$(A_1, ..., A_n) \rightarrow (x_1, ..., x_n) \in P$$ and for $1 \le i \le n+1$ let $u_i \in V^*$ : $$u_1A_1u_2A_2 \dots u_nA_nu_{n+1} \Rightarrow_G u_1x_1u_2x_2 \dots u_nx_nu_{n+1}$$ - Let ⇒\*<sub>G</sub> be a reflexive transitive closure of ⇒<sub>G</sub> - Generated Language - $L(G) = \{x \in T^* : S \Rightarrow_G^* x\}$ #### **Generative Power** - $\mathscr{L}(SC) = \mathscr{L}(RE)$ - $\mathscr{L}(CF) \subset \mathscr{L}(PSC) \subseteq \mathscr{L}(CS)$ # PART 2 # GENERATOR OF PROPER VLIW ASSEMBLER CODE #### Example - VLIW processor with 8 registers r<sub>1</sub>, ..., r<sub>8</sub> - 3 function units A, B, C (e.g. ALU, MUL, MEM) - A operations: op1 $r_1$ , $r_2$ , $r_3$ ; op2 $r_1$ , $r_2$ , $r_3$ ( $r_1 = r_2 \circ p r_3$ ) - B operations: op3 $r_1$ , $r_2$ , $r_3$ ( $r_1 = r_2 op r_3$ ) - C operations: op4 $r_1$ , $r_2$ ; op5 $r_1$ , $r_2$ ( $r_1 = [r_2]$ or $[r_1] = r_2$ ) - Function units are not pipelined - nop operation = no new job for function unit - Operation latency - A: op1, op2 1 cycle - B: op3 2 cycles - C: op4 2 cycles; op5 3 cycles - Write conflicts on instruction level are prohibited ## VLIW Assembler Code Illustration (1/4) ``` (1) op1 r1,r2,r3 op5 r2,r3 nop ;; (2) op2 r6,r3,r2 op3 r7,r3,r2 nop (3) nop nop nop (4) op1 r4,r2,r3 op4 r4,r2 nop ``` #### No conflicts #### VLIW Assembler Code Illustration (2/4) ``` (1) op5 r2,r3 op1 r1,r2,r3 nop (2) op3 r7,r3,r2 op2 r6,r3,r2 nop (3) nop nop nop op4 r4,r2 (4) op1 r4,r2,r3 nop ``` #### No conflicts <sup>2.</sup> Generators ## VLIW Assembler Code Illustration (3/4) ``` (1) op1 r1,r2,r3 nop op5 r2,r3 ;; (2) op2 r6,r3,r2 op3 r7,r3,r2 nop ;; (3) nop nop nop ;; (4) op1 r4,r2,r3 nop op4 r4,r2 ;; ``` #### No conflicts Introduction <sup>2.</sup> Generators 3. (P)SCG With Priority ## VLIW Assembler Code Illustration (4/4) ``` (1) op1 r1,r2,r3 nop op5 r2,r3 ;; (2) op2 r6,r3,r2 op3 r7,r3,r2 nop ;; (3) nop nop nop ;; (4) op1 r4,r2,r3 nop op4 r4,r2 ;; ``` #### Write conflict! Introduction <sup>2.</sup> Generators 3. (P)SCG With Priority #### Generator – Motivation - State of the art - Manually created assembler code generators - Reservation tables - Register allocation - ... - We need formal method => automation - Minimal number of rules - Easy to parse #### Methods 1 - Assume we have only ONE instruction: - Number of allowed combinations of operations is finite - Instruction is a sentence of language - Finite set of all allowed sentences is language (regular) - 67 millions of combinations! - 52 millions of allowed combinations! - But we need to model also latencies and other constraints - => This approach is not a solution! # "Slicing" Methods 2 (No Latencies) - Right regular grammar - 1255 rules $$S = op_1 A^{\emptyset} = op_1 r_1 B^{\{r1\}} = op_1 r_1, C^{\{r1\}} = op_1 r_1, r_2 D^{\{r1\}} = >^*$$ = $op_1 r_1, r_2, r_3 | op_3 r_3, r_5, r_4 | op_4 X^{\{r1,r3\}} = > ...$ - Right linear grammar - 1165 rules $$S = op_1 A^{\emptyset} = op_1 r_1, B^{\{r1\}} = op_1 r_1, r_2, C^{\{r1\}} = ...$$ - Context free grammar - 447 rules $$S = op_1 A^{\emptyset} = op_1 r_1$$ , R, R | $B^{\{r1\}} = op_1 r_1$ , R, R | $op_3 C^{\{r1\}} = op_1 r_1$ , R, R | $op_3 r_2$ , R, R | $op_3 r_1$ , R, R | $op_3 r_2$ , R, R | $op_3 r_2$ , R, R | $op_3 r_3$ , R, R | $op_3 r_4$ , R, R | $op_3 r_5$ $op_3$ ## "Slicing" Methods 3 (No Latencies) - Propagating SCG - 447 rules no improvement against CFG - SCG - 146 rules - $(@^{M}, W) \rightarrow (\varepsilon, r_{i} @^{MU\{ri\}})$ ``` S => @^{\emptyset}ABC\# => @^{\emptyset} op_{1} W, R, R | BC\# => op_{1} r_{1} @^{\{r1\}}, R, R | BC\# => op_{1} r_{1} @^{\{r1\}}, R, R | op_{3} W, R, R | C\# => => op_{1} r_{1}, R, R | op_{3} r_{2} @^{\{r1,r2\}}, R, R | C\# => => op_{1} r_{1}, R, R | op_{3} r_{2} @^{\{r1,r2\}}, R, R | op_{4} W, R \# => => op_{1} r_{1}, R, R | op_{3} r_{2}, R, R | op_{4} r_{3} @^{\{r1,r2,r3\}}, R\# => => op_{1} r_{1}, R, R | op_{3} r_{2}, R, R | op_{4} r_{3}, R;; => ... ``` # "Slicing" Methods 4 (With Latencies) - Right regular, right linear, CFG, ... - 2500+ rules - Propagating SCG - 654 rules - $(@^{M}, OP_{1}) \rightarrow (op_{1}, @^{M}); (@^{M}, R) \rightarrow (r_{1}, @^{M}); ...$ - SCG - 150 rules ``` S => @^{\emptyset}ABC\$L_{A}L_{B}L_{C}\# => @^{\emptyset} op_{1} W, R, R|BC\$AL_{B}L_{C}\# => \\ => op_{1} r_{1} @^{\{r1\}}, R, R|BC\$AL_{B}L_{C}\# => \\ => op_{1} r_{1} @^{\{r1\}}, R, R|op_{3} W, R, R|C\$AB_{1}L_{C}\# =>^{*} \\ => op_{1} r_{1}, r_{2}, r_{3}|op_{3} r_{2} @^{\{r1,r2\}}, r_{6}, r_{5}|nop\$AB_{1}C\# => \\ => op_{1} r_{1}, r_{2}, r_{3}|op_{3} r_{2}, r_{6}, r_{5}|nop;;@^{\emptyset}AB_{1}C\$L_{A}L_{B}L_{C}\# =>... ``` # PART 3 # **FUTURE RESEARCH** #### Motivation - SCG-based "slicing" generator pretty good, but… - The number of rules is still too high - @-rules are problem (large number of registers) - Replace @-rules with forbidding rules $$(W_{R1}, W_{R1}) \rightarrow (Z, Z)$$ - How to make sure that this rule will be applied instead of a rule $(W_{R1}) \rightarrow (r_1)$ ? - "Default" SCG is not good enough - We need to add priority to rules! # (Propagating) SCG with Priority Rules - (Propagating) SCG with Priority Rules is quintuple $G = (V, T, P, \pi, S)$ - V is finite set of symbols - T is set of terminals, T ⊂ V - S is the start symbol, S ∈ V T - P is a finite set of productions of the form $$(A_1, ..., A_n) \rightarrow (x_1, ..., x_n),$$ where $n \ge 1$ , $\forall A_i : A_i \in V - T$ , $\forall x_i : x_i \in V^*$ (Propagating SCG with Priority Rules: $x_i \in V^*$ ) • $\pi$ is priority function: $\pi$ : $P \rightarrow \mathbb{N}$ # Priority Derivation, Generated Language - Let G = (V, T, P, π, S) is (P)SCG-P, p ∈ P, for $1 \le i \le n+1$ let $u_i \in V^*$ : $u = u_1A_1 ... u_nA_nu_{n+1}$ , $v = u_1x_1 ... u_nx_nu_{n+1}$ , $w = u_1y_1 ... u_ny_nu_{n+1}$ - Define Priority Derivation Step as $$u \xrightarrow{\pi}_{G} v [p]$$ iff $u \Rightarrow_G v[p]$ and there is no $r \in P$ satisfying $\pi(r) > \pi(p)$ such that $u \Rightarrow_G w[r]$ . - Generated Language - $L(G) = \{x \in T^* : S_{\pi} \Rightarrow_{G}^* x\}$ #### **Generative Power** - $\mathscr{L}(SC-P) = ? \mathscr{L}(RE)$ - $\mathscr{L}(\mathsf{PSC}\text{-P}) = ? \mathscr{L}(\mathsf{CS})$ - **Proofs** needed ## "Forbidding" Method - (P)SCG with priority rules (no latency) - 42 rules - (P)SCG with priority rules (with latency) - 46 rules - $\pi((W_{R1}, W_{R1}) \rightarrow (Z, Z)) = 2$ (Z is "block" symbol) - $\pi((W_{R1}) \to (r_1)) = 1$ - $\pi((\$, \#) \to (;;, \$L_A L_B L_C \#)) = 0$ $$S => ABC$LALBLC# =>*$$ $$=> op_1 W_{R1}, r_2, r_3 | op_3 W_{R1}, r_6, r_5 | nop $AB_1C# =>$$ $$=> op_1 Z, r_2, r_3 | op_3 Z, r_6, r_5 | nop $AB_1C#$$ #### Conclusion - New approach of VLIW assembler generators - Minimal number of rules - Still easy to parse - Comparison of methods - (P)SCG-based generators - (P)SCG with priority #### References Fisher, Faraboschi, Young: Embedded Computing - A VLIW Approach to Architecture, Compilers, and Tools, 2005 Tory A. Titles - Frails Fathends - City burg. Embedded Computing A T. T. W. Approved to Andreasters. Catylina and East. Kolář: Exploitation of Scattered Context Grammars to Model Constraints between Components, ASIS 2009 Meduna, Techet: Scattered Context Grammars and their Applications, 2009 (2010) # Questions? # THANK YOU FOR YOUR ATTENTION