From 3393155abf77e460fe661ffccb0b21c500290613 Mon Sep 17 00:00:00 2001 From: Mauri de Souza Meneguzzo Date: Mon, 7 Aug 2023 00:53:33 +0000 Subject: [PATCH] internal/bytealg: optimize Count/CountString in amd64 MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Another optimization by aligning a hot loop. ``` │ sec/op │ sec/op vs base │ Count/10-16 11.29n ± 1% 10.50n ± 1% -7.04% (p=0.000 n=10) Count/32-16 11.06n ± 1% 11.36n ± 2% +2.76% (p=0.000 n=10) Count/4K-16 2.852µ ± 1% 1.953µ ± 1% -31.52% (p=0.000 n=10) Count/4M-16 2.884m ± 1% 1.958m ± 1% -32.11% (p=0.000 n=10) Count/64M-16 46.27m ± 1% 30.86m ± 0% -33.31% (p=0.000 n=10) CountEasy/10-16 9.873n ± 1% 9.669n ± 1% -2.07% (p=0.000 n=10) CountEasy/32-16 11.07n ± 1% 11.23n ± 1% +1.49% (p=0.000 n=10) CountEasy/4K-16 73.47n ± 1% 54.20n ± 0% -26.22% (p=0.000 n=10) CountEasy/4M-16 61.12µ ± 1% 49.42µ ± 0% -19.15% (p=0.000 n=10) CountEasy/64M-16 1.303m ± 3% 1.082m ± 4% -16.97% (p=0.000 n=10) CountSingle/10-16 4.150n ± 1% 3.679n ± 1% -11.36% (p=0.000 n=10) CountSingle/32-16 4.815n ± 1% 4.588n ± 1% -4.71% (p=0.000 n=10) CountSingle/4M-16 72.18µ ± 2% 75.38µ ± 1% +4.44% (p=0.000 n=10) CountHard3-16 462.6µ ± 1% 484.4µ ± 1% +4.73% (p=0.000 n=10) │ old.txt │ new.txt │ │ B/s │ B/s vs base │ Count/10-16 844.1Mi ± 1% 908.3Mi ± 1% +7.60% (p=0.000 n=10) Count/32-16 2.695Gi ± 1% 2.623Gi ± 2% -2.66% (p=0.000 n=10) Count/4K-16 1.337Gi ± 1% 1.953Gi ± 1% +46.06% (p=0.000 n=10) Count/4M-16 1.355Gi ± 1% 1.995Gi ± 1% +47.29% (p=0.000 n=10) Count/64M-16 1.351Gi ± 1% 2.026Gi ± 0% +49.95% (p=0.000 n=10) CountEasy/10-16 965.9Mi ± 1% 986.3Mi ± 1% +2.11% (p=0.000 n=10) CountEasy/32-16 2.693Gi ± 1% 2.653Gi ± 1% -1.48% (p=0.000 n=10) CountEasy/4K-16 51.93Gi ± 1% 70.38Gi ± 0% +35.54% (p=0.000 n=10) CountEasy/4M-16 63.91Gi ± 1% 79.05Gi ± 0% +23.68% (p=0.000 n=10) CountEasy/64M-16 47.97Gi ± 3% 57.77Gi ± 4% +20.44% (p=0.000 n=10) CountSingle/10-16 2.244Gi ± 1% 2.532Gi ± 1% +12.80% (p=0.000 n=10) CountSingle/32-16 6.190Gi ± 1% 6.496Gi ± 1% +4.94% (p=0.000 n=10) CountSingle/4M-16 54.12Gi ± 2% 51.82Gi ± 1% -4.25% (p=0.000 n=10) ``` Change-Id: I847b36125d2b11e2a88d31f48f6c160f041b3624 GitHub-Last-Rev: faacba662ee6bf41f69960060d48d340cfdbbbd6 GitHub-Pull-Request: golang/go#61793 Reviewed-on: https://go-review.googlesource.com/c/go/+/516455 TryBot-Result: Gopher Robot Auto-Submit: Keith Randall Reviewed-by: Michael Knyszek Reviewed-by: Keith Randall Reviewed-by: Keith Randall Run-TryBot: Keith Randall --- src/internal/bytealg/count_amd64.s | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/internal/bytealg/count_amd64.s b/src/internal/bytealg/count_amd64.s index efb17f84b7..807c289113 100644 --- a/src/internal/bytealg/count_amd64.s +++ b/src/internal/bytealg/count_amd64.s @@ -57,6 +57,7 @@ sse: LEAQ -16(SI)(BX*1), AX // AX = address of last 16 bytes JMP sseloopentry + PCALIGN $16 sseloop: // Move the next 16-byte chunk of the data into X1. MOVOU (DI), X1 @@ -163,6 +164,7 @@ avx2: MOVD AX, X0 LEAQ -32(SI)(BX*1), R11 VPBROADCASTB X0, Y1 + PCALIGN $32 avx2_loop: VMOVDQU (DI), Y2 VPCMPEQB Y1, Y2, Y3 -- 2.48.1