mirror of https://github.com/emmansun/gmsm.git, synced 2025-10-13 23:00:47 +08:00
1905 lines
35 KiB
Go assembly (amd64)
//go:build !purego

#include "textflag.h"

#define B0 X0
#define B1 X1
#define B2 X2
#define B3 X3
#define B4 X4
#define B5 X5
#define B6 X6
#define B7 X7

#define TW X10

#define T0 X11
#define T1 X12
#define T2 X13
#define POLY X14
#define NIBBLE_MASK Y13
#define X_NIBBLE_MASK X13
#define BSWAP X15
#define DWBSWAP Y15

DATA gcmPoly<>+0x00(SB)/8, $0x0000000000000087
DATA gcmPoly<>+0x08(SB)/8, $0x0000000000000000

DATA gbGcmPoly<>+0x00(SB)/8, $0x0000000000000000
DATA gbGcmPoly<>+0x08(SB)/8, $0xe100000000000000

GLOBL gcmPoly<>(SB), (NOPTR+RODATA), $16
GLOBL gbGcmPoly<>(SB), (NOPTR+RODATA), $16

#include "aesni_macros_amd64.s"

#define mul2GBInline \
	PSHUFB BSWAP, TW; \
	\// TW * 2
	MOVOU TW, T0; \
	PSHUFD $0, TW, T1; \
	PSRLQ $1, TW; \
	PSLLQ $63, T0; \
	PSRLDQ $8, T0; \
	POR T0, TW; \
	\// reduction
	PSLLL $31, T1; \
	PSRAL $31, T1; \
	PAND POLY, T1; \
	PXOR T1, TW; \
	PSHUFB BSWAP, TW

#define avxMul2GBInline \
	VPSHUFB BSWAP, TW, TW; \
	\// TW * 2
	VPSLLQ $63, TW, T0; \
	VPSHUFD $0, TW, T1; \
	VPSRLQ $1, TW, TW; \
	VPSRLDQ $8, T0, T0; \
	VPOR T0, TW, TW; \
	\// reduction
	VPSLLD $31, T1, T1; \
	VPSRAD $31, T1, T1; \
	VPAND POLY, T1, T1; \
	VPXOR T1, TW, TW; \
	VPSHUFB BSWAP, TW, TW
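The two macros above compute TW = TW·x in GF(2^128) using the GB bit ordering: the tweak is byte-swapped, shifted right one bit across both 64-bit lanes (the shifted-out bit crosses lanes via T0), reduced with gbGcmPoly when a bit falls off the end, and swapped back. A byte-at-a-time pure-Go sketch of the same doubling (function name and layout are illustrative, not taken from this package):

```go
package main

import "fmt"

const blockSize = 16

// mul2GB doubles a 16-byte XTS tweak in the GB ordering: the 128-bit
// value is shifted right by one bit in big-endian byte order, and the
// carried-out bit folds back as 0xe1 into byte 0 (matching the
// gbGcmPoly constant above, whose top byte in memory is 0xe1).
func mul2GB(tweak *[blockSize]byte) {
	var carryIn byte
	for i := blockSize - 1; i >= 0; i-- {
		carryOut := tweak[i] & 1
		tweak[i] = tweak[i]>>1 | carryIn<<7
		carryIn = carryOut
	}
	if carryIn != 0 {
		tweak[0] ^= 0xe1
	}
}

func main() {
	tw := [blockSize]byte{0x02}
	mul2GB(&tw)
	fmt.Printf("%02x\n", tw[0]) // 0x02 shifted right once: prints 01
}
```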

#define prepareGB4Tweaks \
	MOVOU TW, (16*0)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*1)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*2)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*3)(SP); \
	mul2GBInline

#define prepareGB8Tweaks \
	prepareGB4Tweaks; \
	MOVOU TW, (16*4)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*5)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*6)(SP); \
	mul2GBInline; \
	MOVOU TW, (16*7)(SP); \
	mul2GBInline

#define avxPrepareGB4Tweaks \
	VMOVDQU TW, (16*0)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*1)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*2)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*3)(SP); \
	avxMul2GBInline

#define avxPrepareGB8Tweaks \
	avxPrepareGB4Tweaks; \
	VMOVDQU TW, (16*4)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*5)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*6)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*7)(SP); \
	avxMul2GBInline

#define avxPrepareGB16Tweaks \
	avxPrepareGB8Tweaks; \
	VMOVDQU TW, (16*8)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*9)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*10)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*11)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*12)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*13)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*14)(SP); \
	avxMul2GBInline; \
	VMOVDQU TW, (16*15)(SP); \
	avxMul2GBInline

#define mul2Inline \
	PSHUFD $0xff, TW, T0; \
	MOVOU TW, T1; \
	PSRAL $31, T0; \
	PAND POLY, T0; \
	PSRLL $31, T1; \
	PSLLDQ $4, T1; \
	PSLLL $1, TW; \
	PXOR T0, TW; \
	PXOR T1, TW

#define avxMul2Inline \
	VPSHUFD $0xff, TW, T0; \
	VPSRLD $31, TW, T1; \
	VPSRAD $31, T0, T0; \
	VPAND POLY, T0, T0; \
	VPSLLDQ $4, T1, T1; \
	VPSLLD $1, TW, TW; \
	VPXOR T0, TW, TW; \
	VPXOR T1, TW, TW
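mul2Inline and avxMul2Inline implement the conventional XTS doubling: a 128-bit left shift in little-endian bit order, with the bit shifted out of the top reduced by x^128 + x^7 + x^2 + x + 1, i.e. the 0x87 in gcmPoly. A pure-Go sketch of the same operation (illustrative, not the package's actual fallback code):

```go
package main

import "fmt"

// mul2 doubles a 16-byte XTS tweak: multiply by x in GF(2^128) in
// little-endian bit order, folding the carry back in as 0x87
// (the gcmPoly constant above).
func mul2(tweak *[16]byte) {
	var carryIn byte
	for i := range tweak {
		carryOut := tweak[i] >> 7
		tweak[i] = tweak[i]<<1 | carryIn
		carryIn = carryOut
	}
	if carryIn != 0 {
		tweak[0] ^= 0x87
	}
}

func main() {
	tw := [16]byte{15: 0x80} // only the very top bit set
	mul2(&tw)
	fmt.Printf("%02x\n", tw[0]) // carry out of bit 127 folds back: prints 87
}
```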

#define prepare4Tweaks \
	MOVOU TW, (16*0)(SP); \
	mul2Inline; \
	MOVOU TW, (16*1)(SP); \
	mul2Inline; \
	MOVOU TW, (16*2)(SP); \
	mul2Inline; \
	MOVOU TW, (16*3)(SP); \
	mul2Inline

#define prepare8Tweaks \
	prepare4Tweaks; \
	MOVOU TW, (16*4)(SP); \
	mul2Inline; \
	MOVOU TW, (16*5)(SP); \
	mul2Inline; \
	MOVOU TW, (16*6)(SP); \
	mul2Inline; \
	MOVOU TW, (16*7)(SP); \
	mul2Inline

#define avxPrepare4Tweaks \
	VMOVDQU TW, (16*0)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*1)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*2)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*3)(SP); \
	avxMul2Inline

#define avxPrepare8Tweaks \
	avxPrepare4Tweaks; \
	VMOVDQU TW, (16*4)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*5)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*6)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*7)(SP); \
	avxMul2Inline

#define avxPrepare16Tweaks \
	avxPrepare8Tweaks; \
	VMOVDQU TW, (16*8)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*9)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*10)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*11)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*12)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*13)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*14)(SP); \
	avxMul2Inline; \
	VMOVDQU TW, (16*15)(SP); \
	avxMul2Inline
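The prepare*Tweaks macros precompute 4, 8, or 16 consecutive tweaks into the $256-byte stack frame so that a whole batch of blocks can be XORed, run through the SM4 rounds, and XORed again in one pass. The scalar shape of the loop they serve looks like this in Go (a sketch; `encrypt` stands in for the SM4_*BLOCKS macros and all names are illustrative):

```go
package main

import "fmt"

const blockSize = 16

// mul2 doubles the tweak in GF(2^128), little-endian bit order,
// the same operation as mul2Inline above.
func mul2(tweak *[blockSize]byte) {
	var carryIn byte
	for i := range tweak {
		carryOut := tweak[i] >> 7
		tweak[i] = tweak[i]<<1 | carryIn
		carryIn = carryOut
	}
	if carryIn != 0 {
		tweak[0] ^= 0x87
	}
}

// xtsEncryptBlocks is the scalar shape of the assembly loop: for each
// full block, dst = E(src XOR tweak) XOR tweak, then tweak = tweak * x.
func xtsEncryptBlocks(encrypt func(dst, src []byte), tweak *[blockSize]byte, dst, src []byte) {
	for len(src) >= blockSize {
		var buf [blockSize]byte
		for i := 0; i < blockSize; i++ {
			buf[i] = src[i] ^ tweak[i]
		}
		encrypt(buf[:], buf[:])
		for i := 0; i < blockSize; i++ {
			dst[i] = buf[i] ^ tweak[i]
		}
		mul2(tweak)
		src, dst = src[blockSize:], dst[blockSize:]
	}
}

func main() {
	// An identity "cipher" makes the two tweak XORs cancel, which
	// shows the masking structure without a real SM4 implementation.
	identity := func(dst, src []byte) { copy(dst, src) }
	var tweak [blockSize]byte
	tweak[0] = 1
	src := make([]byte, 32)
	dst := make([]byte, 32)
	xtsEncryptBlocks(identity, &tweak, dst, src)
	fmt.Println(dst[0], tweak[0]) // prints 0 4: data unchanged, tweak doubled twice
}
```

The assembly amortizes the per-block mul2 by materializing all the tweaks first, then streaming the blocks through wide SIMD registers.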

#define sseLoad4Blocks \
	MOVOU (16*0)(DX), B0; \
	MOVOU (16*0)(SP), T0; \
	PXOR T0, B0; \
	MOVOU (16*1)(DX), B1; \
	MOVOU (16*1)(SP), T0; \
	PXOR T0, B1; \
	MOVOU (16*2)(DX), B2; \
	MOVOU (16*2)(SP), T0; \
	PXOR T0, B2; \
	MOVOU (16*3)(DX), B3; \
	MOVOU (16*3)(SP), T0; \
	PXOR T0, B3

#define sseStore4Blocks \
	MOVOU (16*0)(SP), T0; \
	PXOR T0, B0; \
	MOVOU B0, (16*0)(CX); \
	MOVOU (16*1)(SP), T0; \
	PXOR T0, B1; \
	MOVOU B1, (16*1)(CX); \
	MOVOU (16*2)(SP), T0; \
	PXOR T0, B2; \
	MOVOU B2, (16*2)(CX); \
	MOVOU (16*3)(SP), T0; \
	PXOR T0, B3; \
	MOVOU B3, (16*3)(CX)

#define sseLoad8Blocks \
	sseLoad4Blocks; \
	MOVOU (16*4)(DX), B4; \
	MOVOU (16*4)(SP), T0; \
	PXOR T0, B4; \
	MOVOU (16*5)(DX), B5; \
	MOVOU (16*5)(SP), T0; \
	PXOR T0, B5; \
	MOVOU (16*6)(DX), B6; \
	MOVOU (16*6)(SP), T0; \
	PXOR T0, B6; \
	MOVOU (16*7)(DX), B7; \
	MOVOU (16*7)(SP), T0; \
	PXOR T0, B7

#define sseStore8Blocks \
	sseStore4Blocks; \
	MOVOU (16*4)(SP), T0; \
	PXOR T0, B4; \
	MOVOU B4, (16*4)(CX); \
	MOVOU (16*5)(SP), T0; \
	PXOR T0, B5; \
	MOVOU B5, (16*5)(CX); \
	MOVOU (16*6)(SP), T0; \
	PXOR T0, B6; \
	MOVOU B6, (16*6)(CX); \
	MOVOU (16*7)(SP), T0; \
	PXOR T0, B7; \
	MOVOU B7, (16*7)(CX)

#define avxLoad4Blocks \
	VMOVDQU (16*0)(DX), B0; \
	VPXOR (16*0)(SP), B0, B0; \
	VMOVDQU (16*1)(DX), B1; \
	VPXOR (16*1)(SP), B1, B1; \
	VMOVDQU (16*2)(DX), B2; \
	VPXOR (16*2)(SP), B2, B2; \
	VMOVDQU (16*3)(DX), B3; \
	VPXOR (16*3)(SP), B3, B3

#define avxStore4Blocks \
	VPXOR (16*0)(SP), B0, B0; \
	VMOVDQU B0, (16*0)(CX); \
	VPXOR (16*1)(SP), B1, B1; \
	VMOVDQU B1, (16*1)(CX); \
	VPXOR (16*2)(SP), B2, B2; \
	VMOVDQU B2, (16*2)(CX); \
	VPXOR (16*3)(SP), B3, B3; \
	VMOVDQU B3, (16*3)(CX)

#define avxLoad8Blocks \
	avxLoad4Blocks; \
	VMOVDQU (16*4)(DX), B4; \
	VPXOR (16*4)(SP), B4, B4; \
	VMOVDQU (16*5)(DX), B5; \
	VPXOR (16*5)(SP), B5, B5; \
	VMOVDQU (16*6)(DX), B6; \
	VPXOR (16*6)(SP), B6, B6; \
	VMOVDQU (16*7)(DX), B7; \
	VPXOR (16*7)(SP), B7, B7

#define avxStore8Blocks \
	avxStore4Blocks; \
	VPXOR (16*4)(SP), B4, B4; \
	VMOVDQU B4, (16*4)(CX); \
	VPXOR (16*5)(SP), B5, B5; \
	VMOVDQU B5, (16*5)(CX); \
	VPXOR (16*6)(SP), B6, B6; \
	VMOVDQU B6, (16*6)(CX); \
	VPXOR (16*7)(SP), B7, B7; \
	VMOVDQU B7, (16*7)(CX)

#define avx2Load8Blocks \
	VMOVDQU (32*0)(DX), Y0; \
	VPXOR (32*0)(SP), Y0, Y0; \
	VMOVDQU (32*1)(DX), Y1; \
	VPXOR (32*1)(SP), Y1, Y1; \
	VMOVDQU (32*2)(DX), Y2; \
	VPXOR (32*2)(SP), Y2, Y2; \
	VMOVDQU (32*3)(DX), Y3; \
	VPXOR (32*3)(SP), Y3, Y3

#define avx2Load16Blocks \
	avx2Load8Blocks; \
	VMOVDQU (32*4)(DX), Y4; \
	VPXOR (32*4)(SP), Y4, Y4; \
	VMOVDQU (32*5)(DX), Y5; \
	VPXOR (32*5)(SP), Y5, Y5; \
	VMOVDQU (32*6)(DX), Y6; \
	VPXOR (32*6)(SP), Y6, Y6; \
	VMOVDQU (32*7)(DX), Y7; \
	VPXOR (32*7)(SP), Y7, Y7

#define avx2LE2BE8Blocks \
	VBROADCASTI128 ·flip_mask(SB), Y11; \
	VPSHUFB Y11, Y0, Y0; \
	VPSHUFB Y11, Y1, Y1; \
	VPSHUFB Y11, Y2, Y2; \
	VPSHUFB Y11, Y3, Y3

#define avx2LE2BE16Blocks \
	avx2LE2BE8Blocks; \
	VPSHUFB Y11, Y4, Y4; \
	VPSHUFB Y11, Y5, Y5; \
	VPSHUFB Y11, Y6, Y6; \
	VPSHUFB Y11, Y7, Y7

#define avx2Store8Blocks \
	VPXOR (32*0)(SP), Y0, Y0; \
	VMOVDQU Y0, (32*0)(CX); \
	VPXOR (32*1)(SP), Y1, Y1; \
	VMOVDQU Y1, (32*1)(CX); \
	VPXOR (32*2)(SP), Y2, Y2; \
	VMOVDQU Y2, (32*2)(CX); \
	VPXOR (32*3)(SP), Y3, Y3; \
	VMOVDQU Y3, (32*3)(CX)

#define avx2Store16Blocks \
	avx2Store8Blocks; \
	VPXOR (32*4)(SP), Y4, Y4; \
	VMOVDQU Y4, (32*4)(CX); \
	VPXOR (32*5)(SP), Y5, Y5; \
	VMOVDQU Y5, (32*5)(CX); \
	VPXOR (32*6)(SP), Y6, Y6; \
	VMOVDQU Y6, (32*6)(CX); \
	VPXOR (32*7)(SP), Y7, Y7; \
	VMOVDQU Y7, (32*7)(CX)

#define avx2ByteSwap8Blocks \
	VPSHUFB DWBSWAP, Y0, Y0; \
	VPSHUFB DWBSWAP, Y1, Y1; \
	VPSHUFB DWBSWAP, Y2, Y2; \
	VPSHUFB DWBSWAP, Y3, Y3

#define avx2ByteSwap16Blocks \
	avx2ByteSwap8Blocks; \
	VPSHUFB DWBSWAP, Y4, Y4; \
	VPSHUFB DWBSWAP, Y5, Y5; \
	VPSHUFB DWBSWAP, Y6, Y6; \
	VPSHUFB DWBSWAP, Y7, Y7

// func encryptSm4Xts(xk *uint32, tweak *[BlockSize]byte, dst, src []byte)
TEXT ·encryptSm4Xts(SB),0,$256-64
	MOVQ xk+0(FP), AX
	MOVQ tweak+8(FP), BX
	MOVQ dst+16(FP), CX
	MOVQ src+40(FP), DX
	MOVQ src_len+48(FP), DI

	CMPB ·useAVX2(SB), $1
	JE avx2XtsSm4Enc

	CMPB ·useAVX(SB), $1
	JE avxXtsSm4Enc

	MOVOU gcmPoly<>(SB), POLY

	MOVOU (0*16)(BX), TW

xtsSm4EncOctets:
	CMPQ DI, $128
	JB xtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	prepare8Tweaks
	// load 8 blocks for encryption
	sseLoad8Blocks

	SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	sseStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP xtsSm4EncOctets

xtsSm4EncNibbles:
	CMPQ DI, $64
	JB xtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	prepare4Tweaks
	// load 4 blocks for encryption
	sseLoad4Blocks

	SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	sseStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

xtsSm4EncSingles:
	CMPQ DI, $16
	JB xtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	MOVOU (16*0)(DX), B0

	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP xtsSm4EncSingles

xtsSm4EncTail:
	TESTQ DI, DI
	JE xtsSm4EncDone

	LEAQ -16(CX), R8
	MOVOU (16*0)(R8), B0
	MOVOU B0, (16*0)(SP)

	CMPQ DI, $8
	JB loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE xtsSm4EncTailEnc

loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE loop_1b

xtsSm4EncTailEnc:
	MOVOU (16*0)(SP), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(R8)

xtsSm4EncDone:
	MOVOU TW, (16*0)(BX)
	RET
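The xtsSm4EncTail path implements ciphertext stealing: the previous ciphertext block is re-read through R8 = CX-16 into the stack buffer, the partial plaintext tail is swapped into its head byte by byte (loop_1b), the displaced ciphertext bytes become the final partial output block, and the combined block is encrypted back into the previous block's position. The byte swap can be sketched in Go like this (names are illustrative; the sketch assumes one full block has already been produced):

```go
package main

import "fmt"

const blockSize = 16

// stealTail mirrors the loop_1b swap: the tail plaintext bytes
// overwrite the head of a copy of the previous ciphertext block,
// while the displaced ciphertext bytes become the final short
// output block. The caller then encrypts `block` under the current
// tweak and stores it where the previous ciphertext block was,
// exactly as xtsSm4EncTailEnc does.
func stealTail(prevCiphertext, tailPlain []byte) (block, tailOut []byte) {
	block = make([]byte, blockSize)
	copy(block, prevCiphertext)
	tailOut = make([]byte, len(tailPlain))
	copy(tailOut, block[:len(tailPlain)])
	copy(block[:len(tailPlain)], tailPlain)
	return block, tailOut
}

func main() {
	prev := make([]byte, blockSize)
	for i := range prev {
		prev[i] = byte(i)
	}
	block, tail := stealTail(prev, []byte{0xff, 0xff, 0xff})
	fmt.Println(block[0], tail[0]) // prints 255 0
}
```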

avxXtsSm4Enc:
	VMOVDQU gcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW

avxXtsSm4EncOctets:
	CMPQ DI, $128
	JB avxXtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepare8Tweaks
	// load 8 blocks for encryption
	avxLoad8Blocks

	AVX_SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	avxStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP avxXtsSm4EncOctets

avxXtsSm4EncNibbles:
	CMPQ DI, $64
	JB avxXtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepare4Tweaks
	// load 4 blocks for encryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avxXtsSm4EncSingles:
	CMPQ DI, $16
	JB avxXtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avxXtsSm4EncSingles

avxXtsSm4EncTail:
	TESTQ DI, DI
	JE avxXtsSm4EncDone

	LEAQ -16(CX), R8
	VMOVDQU (16*0)(R8), B0
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avxXtsSm4EncTailEnc

avx_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx_loop_1b

avxXtsSm4EncTailEnc:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)

avxXtsSm4EncDone:
	VMOVDQU TW, (16*0)(BX)
	RET

avx2XtsSm4Enc:
	VMOVDQU gcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW
	VBROADCASTI128 ·nibble_mask(SB), NIBBLE_MASK
	VBROADCASTI128 ·bswap_mask(SB), DWBSWAP

avx2XtsSm4Enc16Blocks:
	CMPQ DI, $256
	JB avx2XtsSm4EncOctets
	SUBQ $256, DI

	// prepare tweaks
	avxPrepare16Tweaks
	// load 16 blocks for encryption
	avx2Load16Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE16Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)

	AVX2_SM4_16BLOCKS(AX, Y8, Y9, X8, X9, Y11, Y12, Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)
	avx2ByteSwap16Blocks
	avx2Store16Blocks

	LEAQ 256(DX), DX
	LEAQ 256(CX), CX
	JMP avx2XtsSm4Enc16Blocks

avx2XtsSm4EncOctets:
	CMPQ DI, $128
	JB avx2XtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepare8Tweaks
	// load 8 blocks for encryption
	avx2Load8Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE8Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)

	AVX2_SM4_8BLOCKS(AX, Y8, Y9, X8, X9, Y7, Y0, Y1, Y2, Y3)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	avx2ByteSwap8Blocks
	avx2Store8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

avx2XtsSm4EncNibbles:
	CMPQ DI, $64
	JB avx2XtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepare4Tweaks

	// load 4 blocks for encryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avx2XtsSm4EncSingles:
	CMPQ DI, $16
	JB avx2XtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avx2XtsSm4EncSingles

avx2XtsSm4EncTail:
	TESTQ DI, DI
	JE avx2XtsSm4EncDone

	LEAQ -16(CX), R8
	VMOVDQU (16*0)(R8), B0
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx2_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avx2XtsSm4EncTailEnc

avx2_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx2_loop_1b

avx2XtsSm4EncTailEnc:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)

avx2XtsSm4EncDone:
	VMOVDQU TW, (16*0)(BX)
	VZEROUPPER
	RET

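The AVX2 paths above hinge on a 4x4 transpose of 32-bit words: after the LE-to-BE shuffle, TRANSPOSE_MATRIX regroups the words of four blocks so each YMM register holds the same word position from every block (the layout SM4's word-oriented rounds want), and a second transpose undoes it before the byte swap and store. The register-level effect, as a Go sketch over a 4x4 word matrix (illustrative only):

```go
package main

import "fmt"

// transpose4x4 swaps rows and columns of a 4x4 matrix of 32-bit
// words, the same regrouping TRANSPOSE_MATRIX performs so that one
// vector register ends up holding word i of four different blocks.
// Applying it twice restores the original layout, which is why the
// assembly transposes once before and once after the SM4 rounds.
func transpose4x4(m *[4][4]uint32) {
	for i := 0; i < 4; i++ {
		for j := i + 1; j < 4; j++ {
			m[i][j], m[j][i] = m[j][i], m[i][j]
		}
	}
}

func main() {
	m := [4][4]uint32{{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}, {12, 13, 14, 15}}
	transpose4x4(&m)
	fmt.Println(m[0]) // prints [0 4 8 12]
}
```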
// func encryptSm4XtsGB(xk *uint32, tweak *[BlockSize]byte, dst, src []byte)
TEXT ·encryptSm4XtsGB(SB),0,$256-64
	MOVQ xk+0(FP), AX
	MOVQ tweak+8(FP), BX
	MOVQ dst+16(FP), CX
	MOVQ src+40(FP), DX
	MOVQ src_len+48(FP), DI

	CMPB ·useAVX2(SB), $1
	JE avx2XtsSm4Enc

	CMPB ·useAVX(SB), $1
	JE avxXtsSm4Enc

	MOVOU gbGcmPoly<>(SB), POLY
	MOVOU ·bswap_mask(SB), BSWAP
	MOVOU (0*16)(BX), TW

xtsSm4EncOctets:
	CMPQ DI, $128
	JB xtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	prepareGB8Tweaks
	// load 8 blocks for encryption
	sseLoad8Blocks

	SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	sseStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP xtsSm4EncOctets

xtsSm4EncNibbles:
	CMPQ DI, $64
	JB xtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	prepareGB4Tweaks
	// load 4 blocks for encryption
	sseLoad4Blocks

	SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	sseStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

xtsSm4EncSingles:
	CMPQ DI, $16
	JB xtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	MOVOU (16*0)(DX), B0

	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP xtsSm4EncSingles

xtsSm4EncTail:
	TESTQ DI, DI
	JE xtsSm4EncDone

	LEAQ -16(CX), R8
	MOVOU (16*0)(R8), B0
	MOVOU B0, (16*0)(SP)

	CMPQ DI, $8
	JB loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE xtsSm4EncTailEnc

loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE loop_1b

xtsSm4EncTailEnc:
	MOVOU (16*0)(SP), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(R8)

xtsSm4EncDone:
	MOVOU TW, (16*0)(BX)
	RET

avxXtsSm4Enc:
	VMOVDQU gbGcmPoly<>(SB), POLY
	VMOVDQU ·bswap_mask(SB), BSWAP
	VMOVDQU (0*16)(BX), TW

avxXtsSm4EncOctets:
	CMPQ DI, $128
	JB avxXtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepareGB8Tweaks
	// load 8 blocks for encryption
	avxLoad8Blocks

	AVX_SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	avxStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP avxXtsSm4EncOctets

avxXtsSm4EncNibbles:
	CMPQ DI, $64
	JB avxXtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepareGB4Tweaks
	// load 4 blocks for encryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avxXtsSm4EncSingles:
	CMPQ DI, $16
	JB avxXtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avxXtsSm4EncSingles

avxXtsSm4EncTail:
	TESTQ DI, DI
	JE avxXtsSm4EncDone

	LEAQ -16(CX), R8
	VMOVDQU (16*0)(R8), B0
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avxXtsSm4EncTailEnc

avx_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx_loop_1b

avxXtsSm4EncTailEnc:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)

avxXtsSm4EncDone:
	VMOVDQU TW, (16*0)(BX)
	RET

avx2XtsSm4Enc:
	VMOVDQU gbGcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW
	VBROADCASTI128 ·nibble_mask(SB), NIBBLE_MASK
	VBROADCASTI128 ·bswap_mask(SB), DWBSWAP

avx2XtsSm4Enc16Blocks:
	CMPQ DI, $256
	JB avx2XtsSm4EncOctets
	SUBQ $256, DI

	// prepare tweaks
	avxPrepareGB16Tweaks
	// load 16 blocks for encryption
	avx2Load16Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE16Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)

	AVX2_SM4_16BLOCKS(AX, Y8, Y9, X8, X9, Y11, Y12, Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)
	avx2ByteSwap16Blocks
	avx2Store16Blocks

	LEAQ 256(DX), DX
	LEAQ 256(CX), CX
	JMP avx2XtsSm4Enc16Blocks

avx2XtsSm4EncOctets:
	CMPQ DI, $128
	JB avx2XtsSm4EncNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepareGB8Tweaks
	// load 8 blocks for encryption
	avx2Load8Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE8Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)

	AVX2_SM4_8BLOCKS(AX, Y8, Y9, X8, X9, Y7, Y0, Y1, Y2, Y3)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	avx2ByteSwap8Blocks
	avx2Store8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

avx2XtsSm4EncNibbles:
	CMPQ DI, $64
	JB avx2XtsSm4EncSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepareGB4Tweaks
	// load 4 blocks for encryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avx2XtsSm4EncSingles:
	CMPQ DI, $16
	JB avx2XtsSm4EncTail
	SUBQ $16, DI

	// load 1 block for encryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avx2XtsSm4EncSingles

avx2XtsSm4EncTail:
	TESTQ DI, DI
	JE avx2XtsSm4EncDone

	LEAQ -16(CX), R8
	VMOVDQU (16*0)(R8), B0
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx2_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avx2XtsSm4EncTailEnc

avx2_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx2_loop_1b

avx2XtsSm4EncTailEnc:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)

avx2XtsSm4EncDone:
	VMOVDQU TW, (16*0)(BX)
	VZEROUPPER
	RET

// func decryptSm4Xts(xk *uint32, tweak *[BlockSize]byte, dst, src []byte)
TEXT ·decryptSm4Xts(SB),0,$256-64
	MOVQ xk+0(FP), AX
	MOVQ tweak+8(FP), BX
	MOVQ dst+16(FP), CX
	MOVQ src+40(FP), DX
	MOVQ src_len+48(FP), DI

	CMPB ·useAVX2(SB), $1
	JE avx2XtsSm4Dec

	CMPB ·useAVX(SB), $1
	JE avxXtsSm4Dec

	MOVOU gcmPoly<>(SB), POLY
	MOVOU (0*16)(BX), TW

xtsSm4DecOctets:
	CMPQ DI, $128
	JB xtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	prepare8Tweaks
	// load 8 blocks for decryption
	sseLoad8Blocks

	SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	sseStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP xtsSm4DecOctets

xtsSm4DecNibbles:
	CMPQ DI, $64
	JB xtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	prepare4Tweaks
	// load 4 blocks for decryption
	sseLoad4Blocks

	SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	sseStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

xtsSm4DecSingles:
	CMPQ DI, $32
	JB xtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	MOVOU (16*0)(DX), B0

	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP xtsSm4DecSingles

xtsSm4DecTail:
	TESTQ DI, DI
	JE xtsSm4DecDone

	CMPQ DI, $16
	JE xtsSm4DecLastBlock

	// length > 16: ciphertext stealing — decrypt the last full block
	// with the next tweak, keep the saved tweak for the partial block
	// load 1 block for decryption
	MOVOU (16*0)(DX), B0
	MOVOU TW, B5
	mul2Inline
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	MOVOU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	MOVOU B0, (16*0)(SP)

	CMPQ DI, $8
	JB loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE xtsSm4DecTailDec

loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE loop_1b

xtsSm4DecTailDec:
	MOVOU (16*0)(SP), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(R8)
	JMP xtsSm4DecDone

xtsSm4DecLastBlock:
	MOVOU (16*0)(DX), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2Inline

xtsSm4DecDone:
	MOVOU TW, (16*0)(BX)
	RET

avxXtsSm4Dec:
	VMOVDQU gcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW

avxXtsSm4DecOctets:
	CMPQ DI, $128
	JB avxXtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepare8Tweaks

	// load 8 blocks for decryption
	avxLoad8Blocks

	AVX_SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	avxStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP avxXtsSm4DecOctets

avxXtsSm4DecNibbles:
	CMPQ DI, $64
	JB avxXtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepare4Tweaks
	// load 4 blocks for decryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avxXtsSm4DecSingles:
	CMPQ DI, $32
	JB avxXtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avxXtsSm4DecSingles

avxXtsSm4DecTail:
	TESTQ DI, DI
	JE avxXtsSm4DecDone

	CMPQ DI, $16
	JE avxXtsSm4DecLastBlock

	// length > 16
	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0
	VMOVDQU TW, B5
	avxMul2Inline
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	VMOVDQU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avxXtsSm4DecTailDec

avx_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx_loop_1b

avxXtsSm4DecTailDec:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)
	JMP avxXtsSm4DecDone

avxXtsSm4DecLastBlock:
	VMOVDQU (16*0)(DX), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

avxXtsSm4DecDone:
	VMOVDQU TW, (16*0)(BX)
	RET

avx2XtsSm4Dec:
	VMOVDQU gcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW
	VBROADCASTI128 ·nibble_mask(SB), NIBBLE_MASK
	VBROADCASTI128 ·bswap_mask(SB), DWBSWAP

avx2XtsSm4Dec16Blocks:
	CMPQ DI, $256
	JB avx2XtsSm4DecOctets
	SUBQ $256, DI

	// prepare tweaks
	avxPrepare16Tweaks
	// load 16 blocks for decryption
	avx2Load16Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE16Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)

	AVX2_SM4_16BLOCKS(AX, Y8, Y9, X8, X9, Y11, Y12, Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)
	avx2ByteSwap16Blocks
	avx2Store16Blocks

	LEAQ 256(DX), DX
	LEAQ 256(CX), CX

	JMP avx2XtsSm4Dec16Blocks

avx2XtsSm4DecOctets:
	CMPQ DI, $128
	JB avx2XtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepare8Tweaks
	// load 8 blocks for decryption
	avx2Load8Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE8Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)

	AVX2_SM4_8BLOCKS(AX, Y8, Y9, X8, X9, Y7, Y0, Y1, Y2, Y3)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	avx2ByteSwap8Blocks
	avx2Store8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

avx2XtsSm4DecNibbles:
	CMPQ DI, $64
	JB avx2XtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepare4Tweaks
	// load 4 blocks for decryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avx2XtsSm4DecSingles:
	CMPQ DI, $32
	JB avx2XtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avx2XtsSm4DecSingles

avx2XtsSm4DecTail:
	TESTQ DI, DI
	JE avx2XtsSm4DecDone

	CMPQ DI, $16
	JE avx2XtsSm4DecLastBlock

	// length > 16: ciphertext stealing — decrypt the last full block
	// with the next tweak, keep the saved tweak for the partial block
	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0
	VMOVDQU TW, B5
	avxMul2Inline
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	VMOVDQU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx2_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avx2XtsSm4DecTailDec

avx2_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx2_loop_1b

avx2XtsSm4DecTailDec:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)
	JMP avx2XtsSm4DecDone

avx2XtsSm4DecLastBlock:
	VMOVDQU (16*0)(DX), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2Inline

avx2XtsSm4DecDone:
	VMOVDQU TW, (16*0)(BX)
	VZEROUPPER
	RET

// func decryptSm4XtsGB(xk *uint32, tweak *[BlockSize]byte, dst, src []byte)
TEXT ·decryptSm4XtsGB(SB),0,$256-64
	MOVQ xk+0(FP), AX
	MOVQ tweak+8(FP), BX
	MOVQ dst+16(FP), CX
	MOVQ src+40(FP), DX
	MOVQ src_len+48(FP), DI

	CMPB ·useAVX2(SB), $1
	JE avx2XtsSm4Dec

	CMPB ·useAVX(SB), $1
	JE avxXtsSm4Dec

	MOVOU gbGcmPoly<>(SB), POLY
	MOVOU ·bswap_mask(SB), BSWAP
	MOVOU (0*16)(BX), TW

xtsSm4DecOctets:
	CMPQ DI, $128
	JB xtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	prepareGB8Tweaks
	// load 8 blocks for decryption
	sseLoad8Blocks

	SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	sseStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP xtsSm4DecOctets

xtsSm4DecNibbles:
	CMPQ DI, $64
	JB xtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	prepareGB4Tweaks
	// load 4 blocks for decryption
	sseLoad4Blocks

	SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	sseStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

xtsSm4DecSingles:
	CMPQ DI, $32
	JB xtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	MOVOU (16*0)(DX), B0

	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP xtsSm4DecSingles

xtsSm4DecTail:
	TESTQ DI, DI
	JE xtsSm4DecDone

	CMPQ DI, $16
	JE xtsSm4DecLastBlock

	// length > 16: ciphertext stealing — decrypt the last full block
	// with the next tweak, keep the saved tweak for the partial block
	// load 1 block for decryption
	MOVOU (16*0)(DX), B0
	MOVOU TW, B5
	mul2GBInline
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	MOVOU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	MOVOU B0, (16*0)(SP)

	CMPQ DI, $8
	JB loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE xtsSm4DecTailDec

loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE loop_1b

xtsSm4DecTailDec:
	MOVOU (16*0)(SP), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(R8)
	JMP xtsSm4DecDone

xtsSm4DecLastBlock:
	MOVOU (16*0)(DX), B0
	PXOR TW, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	PXOR TW, B0
	MOVOU B0, (16*0)(CX)
	mul2GBInline

xtsSm4DecDone:
	MOVOU TW, (16*0)(BX)
	RET

avxXtsSm4Dec:
	VMOVDQU gbGcmPoly<>(SB), POLY
	VMOVDQU ·bswap_mask(SB), BSWAP
	VMOVDQU (0*16)(BX), TW

avxXtsSm4DecOctets:
	CMPQ DI, $128
	JB avxXtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepareGB8Tweaks
	// load 8 blocks for decryption
	avxLoad8Blocks

	AVX_SM4_8BLOCKS(AX, X8, T0, T1, T2, B0, B1, B2, B3, B4, B5, B6, B7)

	avxStore8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

	JMP avxXtsSm4DecOctets

avxXtsSm4DecNibbles:
	CMPQ DI, $64
	JB avxXtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepareGB4Tweaks
	// load 4 blocks for decryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avxXtsSm4DecSingles:
	CMPQ DI, $32
	JB avxXtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avxXtsSm4DecSingles

avxXtsSm4DecTail:
	TESTQ DI, DI
	JE avxXtsSm4DecDone

	CMPQ DI, $16
	JE avxXtsSm4DecLastBlock

	// length > 16
	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0
	VMOVDQU TW, B5
	avxMul2GBInline
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	VMOVDQU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avxXtsSm4DecTailDec

avx_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx_loop_1b

avxXtsSm4DecTailDec:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)
	JMP avxXtsSm4DecDone

avxXtsSm4DecLastBlock:
	VMOVDQU (16*0)(DX), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

avxXtsSm4DecDone:
	VMOVDQU TW, (16*0)(BX)
	RET

avx2XtsSm4Dec:
	VMOVDQU gbGcmPoly<>(SB), POLY
	VMOVDQU (0*16)(BX), TW
	VBROADCASTI128 ·nibble_mask(SB), NIBBLE_MASK
	VBROADCASTI128 ·bswap_mask(SB), DWBSWAP

avx2XtsSm4Dec16Blocks:
	CMPQ DI, $256
	JB avx2XtsSm4DecOctets
	SUBQ $256, DI

	// prepare tweaks
	avxPrepareGB16Tweaks
	// load 16 blocks for decryption
	avx2Load16Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE16Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)

	AVX2_SM4_16BLOCKS(AX, Y8, Y9, X8, X9, Y11, Y12, Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	TRANSPOSE_MATRIX(Y4, Y5, Y6, Y7, Y8, Y9)
	avx2ByteSwap16Blocks
	avx2Store16Blocks

	LEAQ 256(DX), DX
	LEAQ 256(CX), CX

	JMP avx2XtsSm4Dec16Blocks

avx2XtsSm4DecOctets:
	CMPQ DI, $128
	JB avx2XtsSm4DecNibbles
	SUBQ $128, DI

	// prepare tweaks
	avxPrepareGB8Tweaks
	// load 8 blocks for decryption
	avx2Load8Blocks
	// Apply Byte Flip Mask: LE -> BE
	avx2LE2BE8Blocks
	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)

	AVX2_SM4_8BLOCKS(AX, Y8, Y9, X8, X9, Y7, Y0, Y1, Y2, Y3)

	// Transpose matrix 4 x 4 32bits word
	TRANSPOSE_MATRIX(Y0, Y1, Y2, Y3, Y8, Y9)
	avx2ByteSwap8Blocks
	avx2Store8Blocks

	LEAQ 128(DX), DX
	LEAQ 128(CX), CX

avx2XtsSm4DecNibbles:
	CMPQ DI, $64
	JB avx2XtsSm4DecSingles
	SUBQ $64, DI

	// prepare tweaks
	avxPrepareGB4Tweaks
	// load 4 blocks for decryption
	avxLoad4Blocks

	AVX_SM4_4BLOCKS(AX, B4, T0, T1, T2, B0, B1, B2, B3)

	avxStore4Blocks

	LEAQ 64(DX), DX
	LEAQ 64(CX), CX

avx2XtsSm4DecSingles:
	CMPQ DI, $32
	JB avx2XtsSm4DecTail
	SUBQ $16, DI

	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0

	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

	LEAQ 16(DX), DX
	LEAQ 16(CX), CX

	JMP avx2XtsSm4DecSingles

avx2XtsSm4DecTail:
	TESTQ DI, DI
	JE avx2XtsSm4DecDone

	CMPQ DI, $16
	JE avx2XtsSm4DecLastBlock

	// length > 16: ciphertext stealing — decrypt the last full block
	// with the next tweak, keep the saved tweak for the partial block
	// load 1 block for decryption
	VMOVDQU (16*0)(DX), B0
	VMOVDQU TW, B5
	avxMul2GBInline
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	VMOVDQU B5, TW

	SUBQ $16, DI
	LEAQ 16(DX), DX
	LEAQ 16(CX), CX
	LEAQ -16(CX), R8
	VMOVDQU B0, (16*0)(SP)

	CMPQ DI, $8
	JB avx2_loop_1b
	SUBQ $8, DI
	MOVQ (DX)(DI*1), R9
	MOVQ (SP)(DI*1), R10
	MOVQ R9, (SP)(DI*1)
	MOVQ R10, (CX)(DI*1)

	TESTQ DI, DI
	JE avx2XtsSm4DecTailDec

avx2_loop_1b:
	SUBQ $1, DI
	MOVB (DX)(DI*1), R9
	MOVB (SP)(DI*1), R10
	MOVB R9, (SP)(DI*1)
	MOVB R10, (CX)(DI*1)
	TESTQ DI, DI
	JNE avx2_loop_1b

avx2XtsSm4DecTailDec:
	VMOVDQU (16*0)(SP), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(R8)
	JMP avx2XtsSm4DecDone

avx2XtsSm4DecLastBlock:
	VMOVDQU (16*0)(DX), B0
	VPXOR TW, B0, B0
	SM4_SINGLE_BLOCK(AX, B4, T0, T1, T2, B0, B1, B2, B3)
	VPXOR TW, B0, B0
	VMOVDQU B0, (16*0)(CX)
	avxMul2GBInline

avx2XtsSm4DecDone:
	VMOVDQU TW, (16*0)(BX)
	VZEROUPPER
	RET