From: cui fliter Date: Sat, 28 Oct 2023 13:21:31 +0000 (+0800) Subject: internal/bytealg: optimize Count/CountString in arm64 X-Git-Tag: go1.22rc1~479 X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=289b823ac9495e2a0820296c986a2534a7b4af79;p=gostls13.git internal/bytealg: optimize Count/CountString in arm64 For #63678 goos: darwin goarch: arm64 pkg: strings │ count_old.txt │ count_new.txt │ │ sec/op │ sec/op vs base │ CountHard1-8 368.7µ ± 11% 332.0µ ± 1% -9.95% (p=0.002 n=10) CountHard2-8 348.8µ ± 5% 333.1µ ± 1% -4.51% (p=0.000 n=10) CountHard3-8 402.7µ ± 25% 359.5µ ± 1% -10.75% (p=0.000 n=10) CountTorture-8 10.536µ ± 23% 9.913µ ± 0% -5.91% (p=0.000 n=10) CountTortureOverlapping-8 74.86µ ± 9% 67.56µ ± 1% -9.75% (p=0.000 n=10) CountByte/10-8 6.905n ± 3% 6.690n ± 1% -3.11% (p=0.001 n=10) CountByte/32-8 3.247n ± 13% 3.207n ± 2% -1.23% (p=0.030 n=10) CountByte/4096-8 83.72n ± 1% 82.58n ± 1% -1.36% (p=0.007 n=10) CountByte/4194304-8 85.17µ ± 5% 84.02µ ± 8% ~ (p=0.075 n=10) CountByte/67108864-8 1.497m ± 8% 1.397m ± 2% -6.69% (p=0.000 n=10) geomean 9.977µ 9.426µ -5.53% │ count_old.txt │ count_new.txt │ │ B/s │ B/s vs base │ CountByte/10-8 1.349Gi ± 3% 1.392Gi ± 1% +3.20% (p=0.002 n=10) CountByte/32-8 9.180Gi ± 11% 9.294Gi ± 2% +1.24% (p=0.029 n=10) CountByte/4096-8 45.57Gi ± 1% 46.20Gi ± 1% +1.38% (p=0.007 n=10) CountByte/4194304-8 45.86Gi ± 5% 46.49Gi ± 7% ~ (p=0.075 n=10) CountByte/67108864-8 41.75Gi ± 8% 44.74Gi ± 2% +7.16% (p=0.000 n=10) geomean 16.10Gi 16.55Gi +2.85% Change-Id: Ifc2173ba3a926b0fa9598372d4404b8645929d45 Reviewed-on: https://go-review.googlesource.com/c/go/+/538116 Reviewed-by: Keith Randall Reviewed-by: Bryan Mills Run-TryBot: shuang cui Auto-Submit: Keith Randall Reviewed-by: Keith Randall LUCI-TryBot-Result: Go LUCI --- diff --git a/src/internal/bytealg/count_arm64.s b/src/internal/bytealg/count_arm64.s index 8cd703d943..e616627b1a 100644 --- a/src/internal/bytealg/count_arm64.s +++ b/src/internal/bytealg/count_arm64.s @@ -37,6 +37,7 @@ TEXT countbytebody<>(SB),NOSPLIT,$0 // Work with not 32-byte aligned head BIC $0x1f, R0, R3 ADD $0x20, R3 + PCALIGN $16 head_loop: MOVBU.P 1(R0), R5 CMP R5, R1 @@ -60,6 +61,7 @@ chunk: // Clear the low 64-bit element of V7 and V8 VEOR V7.B8, V7.B8, V7.B8 VEOR V8.B8, V8.B8, V8.B8 + PCALIGN $16 // Count the target byte in 32-byte chunk chunk_loop: VLD1.P (R0), [V1.B16, V2.B16]