From: Michael Pratt Date: Tue, 21 Feb 2023 18:20:49 +0000 (-0500) Subject: cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements X-Git-Tag: go1.21rc1~1233 X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=2c6aaaea6497715b787e8d95a7e6f1fb7b9efcef;p=gostls13.git cmd/compile/internal/ssa: drop overwritten regalloc basic block input requirements For the following description, consider the following basic block graph: b1 ───┐┌──── b2 ││ ││ ▼▼ b3 For register allocator transitions between basic blocks, there are two key passes (significant paraphrasing): First, each basic block is visited in some predetermined visit order. This is the core visitOrder range loop in regAllocState.regalloc. The specific ordering heuristics aren't important here, except that the order guarantees that when visiting a basic block at least one of its predecessors has already been visited. Upon visiting a basic block, that block sets its expected starting register state (regAllocState.startRegs) based on the ending register state (regAlloc.State.endRegs) of one of its predecessors. (How it chooses which predecessor to use is not important here.) From that starting state, registers are assigned for all values in the block, ultimately resulting in some ending register state. After all blocks have been visited, the shuffle pass (regAllocState.shuffle) ensures that for each edge, endRegs of the predecessor == startRegs of the successor. That is, it makes sure that the startRegs assumptions actually hold true for each edge. It does this by adding moves to the end of the predecessor block to place values in the expected register for the successor block. These may be moves from other registers, or from memory if the value is spilled. Now on to the actual problem: Assume that b1 places some value v1 into register R10, and thus ends with endRegs containing R10 = v1. When b3 is visited, it selects b1 as its model predecessor and sets startRegs with R10 = v1. b2 does not have v1 in R10, so later in the shuffle pass, we will add a move of v1 into R10 to the end of b2 to ensure it is available for b3. This is all perfectly fine and exactly how things should work. Now suppose that b3 does not use v1. It does need to use some other value v2, which is not currently in a register. When assigning v2 to a register, it finds all registers are already in use and it needs to dump a value. Ultimately, it decides to dump v1 from R10 and replace it with v2. This is fine, but it has downstream effects on shuffle in b2. b3's startRegs still state that R10 = v1, so b2 will add a move to R10 even though b3 will unconditionally overwrite it. i.e., the move at the end of b2 is completely useless and can result in code like: // end of b2 MOV n(SP), R10 // R10 = v1 <-- useless // start of b3 MOV m(SP), R10 // R10 = v2 This is precisely what happened in #58298. This CL addresses this problem by dropping registers from startRegs if they are never used in the basic block prior to getting dumped. This allows the shuffle pass to avoid placing those useless values into the register. There is a significant limitation to this CL, which is that it only impacts the immediate predecessors of an overwriting block. We can discuss this by zooming out a bit on the previous graph: b4 ───┐┌──── b5 ││ ││ ▼▼ b1 ───┐┌──── b2 ││ ││ ▼▼ b3 Here we have the same graph, except we can see the two predecessors of b1. Now suppose that rather than b1 assigning R10 = v1 as above, the assignment is done in b4. b1 has startRegs R10 = v1, doesn't use the value at all, and simply passes it through to endRegs R10 = v1. Now the shuffle pass will require both b2 and b5 to add a move to assigned R10 = v1, because that is specified in their successor startRegs. With this CL, b3 drops R10 = v1 from startRegs, but there is no backwards propagation, so b1 still has R10 = v1 in startRegs, and b5 still needs to add a useless move. Extending this CL with such propagation may significantly increase the number of useless moves we can remove, though it will add complexity to maintenance and could potentially impact build performance depending on how efficiently we could implement the propagation (something I haven't considered carefully). As-is, this optimization does not impact much code. In bent .text size geomean is -0.02%. In the container/heap test binary, 18 of ~2500 functions are impacted by this CL. Bent and sweet do not show a noticeable performance impact one way or another, however #58298 does show a case where this can have impact if the useless instructions end up in the hot path of a tight loop. For #58298. Change-Id: I2fcef37c955159d068fa0725f995a1848add8a5f Reviewed-on: https://go-review.googlesource.com/c/go/+/471158 Run-TryBot: Michael Pratt TryBot-Result: Gopher Robot Reviewed-by: Keith Randall Reviewed-by: Cherry Mui Reviewed-by: David Chase --- diff --git a/src/cmd/compile/internal/ssa/regalloc.go b/src/cmd/compile/internal/ssa/regalloc.go index 80f6434e76..26dcda2c62 100644 --- a/src/cmd/compile/internal/ssa/regalloc.go +++ b/src/cmd/compile/internal/ssa/regalloc.go @@ -272,6 +272,9 @@ type regAllocState struct { // mask of registers currently in use used regMask + // mask of registers used since the start of the current block + usedSinceBlockStart regMask + // mask of registers used in the current instruction tmpused regMask @@ -289,6 +292,11 @@ type regAllocState struct { // saved state does not include the state of phi ops in the block. startRegs [][]startReg + // startRegsMask is a mask of the registers in startRegs[curBlock.ID]. + // Registers dropped from startRegsMask are later synchronoized back to + // startRegs by dropping from there as well. + startRegsMask regMask + // spillLive[blockid] is the set of live spills at the end of each block spillLive [][]ID @@ -406,7 +414,9 @@ func (s *regAllocState) allocReg(mask regMask, v *Value) register { // Pick an unused register if one is available. if mask&^s.used != 0 { - return pickReg(mask &^ s.used) + r := pickReg(mask &^ s.used) + s.usedSinceBlockStart |= regMask(1) << r + return r } // Pick a value to spill. Spill the value with the @@ -450,6 +460,7 @@ func (s *regAllocState) allocReg(mask regMask, v *Value) register { v2 := s.regs[r].v m := s.compatRegs(v2.Type) &^ s.used &^ s.tmpused &^ (regMask(1) << r) if m != 0 && !s.values[v2.ID].rematerializeable && countRegs(s.values[v2.ID].regs) == 1 { + s.usedSinceBlockStart |= regMask(1) << r r2 := pickReg(m) c := s.curBlock.NewValue1(v2.Pos, OpCopy, v2.Type, s.regs[r].c) s.copies[c] = false @@ -459,7 +470,21 @@ func (s *regAllocState) allocReg(mask regMask, v *Value) register { s.setOrig(c, v2) s.assignReg(r2, v2, c) } + + // If the evicted register isn't used between the start of the block + // and now then there is no reason to even request it on entry. We can + // drop from startRegs in that case. + if s.usedSinceBlockStart&(regMask(1) << r) == 0 { + if s.startRegsMask&(regMask(1) << r) == 1 { + if s.f.pass.debug > regDebug { + fmt.Printf("dropped from startRegs: %s\n", &s.registers[r]) + } + s.startRegsMask &^= regMask(1) << r + } + } + s.freeReg(r) + s.usedSinceBlockStart |= regMask(1) << r return r } @@ -513,6 +538,7 @@ func (s *regAllocState) allocValToReg(v *Value, mask regMask, nospill bool, pos if nospill { s.nospill |= regMask(1) << r } + s.usedSinceBlockStart |= regMask(1) << r return s.regs[r].c } @@ -532,6 +558,7 @@ func (s *regAllocState) allocValToReg(v *Value, mask regMask, nospill bool, pos if s.regs[r2].v != v { panic("bad register state") } + s.usedSinceBlockStart |= regMask(1) << r2 c = s.curBlock.NewValue1(pos, OpCopy, v.Type, s.regs[r2].c) } else if v.rematerializeable() { // Rematerialize instead of loading from the spill location. @@ -882,6 +909,8 @@ func (s *regAllocState) regalloc(f *Func) { fmt.Printf("Begin processing block %v\n", b) } s.curBlock = b + s.startRegsMask = 0 + s.usedSinceBlockStart = 0 // Initialize regValLiveSet and uses fields for this block. // Walk backwards through the block doing liveness analysis. @@ -1173,6 +1202,7 @@ func (s *regAllocState) regalloc(f *Func) { continue } regList = append(regList, startReg{r, v, s.regs[r].c, s.values[v.ID].uses.pos}) + s.startRegsMask |= regMask(1) << r } s.startRegs[b.ID] = make([]startReg, len(regList)) copy(s.startRegs[b.ID], regList) @@ -1878,6 +1908,23 @@ func (s *regAllocState) regalloc(f *Func) { u.next = s.freeUseRecords s.freeUseRecords = u } + + // allocReg may have dropped registers from startRegsMask that + // aren't actually needed in startRegs. Synchronize back to + // startRegs. + // + // This must be done before placing spills, which will look at + // startRegs to decide if a block is a valid block for a spill. + if c := countRegs(s.startRegsMask); c != len(s.startRegs[b.ID]) { + regs := make([]startReg, 0, c) + for _, sr := range s.startRegs[b.ID] { + if s.startRegsMask&(regMask(1) << sr.r) == 0 { + continue + } + regs = append(regs, sr) + } + s.startRegs[b.ID] = regs + } } // Decide where the spills we generated will go.