Table of Contents
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
最近对SM3的性能进行了优化,主要包括:
- 纯golang版本,减少W,W'预计算;避免使用条件判断。
- 三月初实现了针对AMD64架构的ASM版本。
下一步,看看能否实现使用AVX2指令集进行优化。
SHA256 SIMD Execution
下面是SHA256 message scheduler SIMD实现,四轮算出下一个4个DWORDs.
// Wt = SIGMA1(Wt-2) + Wt-7 + SIGMA0(Wt-15) + Wt-16; for 16 <= t <= 63
// SIGMA0(x) = ROTR(7,x) XOR ROTR(18,x) XOR SHR(3,x)
// SIGMA1(x) = ROTR(17,x) XOR ROTR(19,x) XOR SHR(10,x)
// Transpose data into high/low parts
VPERM2I128 $0x20, XTMP2, XTMP0, XDWORD0 // w3, w2, w1, w0
VPERM2I128 $0x31, XTMP2, XTMP0, XDWORD1 // w7, w6, w5, w4
VPERM2I128 $0x20, XTMP3, XTMP1, XDWORD2 // w11, w10, w9, w8
VPERM2I128 $0x31, XTMP3, XTMP1, XDWORD3 // w15, w14, w13, w12
VPALIGNR $4, XDWORD2, XDWORD3, XTMP0; \ // XTMP0 = W[-7] = {w12,w11,w10,w9}
VPADDD XDWORD0, XTMP0, XTMP0; \ // XTMP0 = W[-7] + W[-16] = {w12+w3, w11+w2, w10+w1, w9+w0}
VPALIGNR $4, XDWORD0, XDWORD1, XTMP1; \ // XTMP1 = W[-15] = {w4,w3,w2,w1}
VPSRLD $7, XTMP1, XTMP2; \ // XTMP2 = W[-15] >> 7 = {w4>>7,w3>>7,w2>>7,w1>>7}
VPSLLD $(32-7), XTMP1, XTMP3; \ // XTMP3 = W[-15] << 25 = {w4<<25,w3<<25,w2<<25,w1<<25}
VPOR XTMP2, XTMP3, XTMP3; \ // XTMP3 = W[-15] ror 7 = {ROTR(7,w4),ROTR(7,w3),ROTR(7,w2),ROTR(7,w1)}
VPSRLD $18, XTMP1, XTMP2;
VPSRLD $3, XTMP1, XTMP4; \ // XTMP4 = W[-15] >> 3
VPSLLD $(32-18), XTMP1, XTMP1; \ // why no VPOR?
VPXOR XTMP1, XTMP3, XTMP3; \
VPXOR XTMP2, XTMP3, XTMP3; \ // XTMP3 = W[-15] ror 7 ^ W[-15] ror 18
VPXOR XTMP4, XTMP3, XTMP1; \ // XTMP1 = W[-15] ror 7 ^ W[-15] ror 18 ^ (W[-15] >> 3)
VPSHUFD $0xFA, XDWORD3, XTMP2; \ // XTMP2 = W[-2] {BBAA} {w15,w15,w14,w14}
VPADDD XTMP1, XTMP0, XTMP0; \ // XTMP0 = W[-16] + W[-7] + s0
VPSRLD $10, XTMP2, XTMP4 // XTMP4 = W[-2] >> 10 {BBAA}
VPSRLQ $19, XTMP2, XTMP3; \ // XTMP3 = W[-2] ror 19 {xBxA}
VPSRLQ $17, XTMP2, XTMP2; \ // XTMP2 = W[-2] ror 17 {xBxA}
VPXOR XTMP3, XTMP2, XTMP2; \
VPXOR XTMP2, XTMP4, XTMP4; \ // XTMP4 = s1 {xBxA}
VPSHUFB shuff_00BA<>(SB), XTMP4, XTMP4;\ // XTMP4 = s1 {00BA}
VPADDD XTMP4, XTMP0, XTMP0; \ // XTMP0 = {..., ..., W[1], W[0]}
VPSHUFD $0x50, XTMP0, XTMP2; \ // XTMP2 = W[-2] {DDCC}
VPSRLD $10, XTMP2, XTMP5; \ // XTMP5 = W[-2] >> 10 {DDCC}
VPSRLQ $19, XTMP2, XTMP3; \ // XTMP3 = W[-2] ror 19 {xDxC}
VPSRLQ $17, XTMP2, XTMP2; \ // XTMP2 = W[-2] ror 17 {xDxC}
VPXOR XTMP3, XTMP2, XTMP2; \
VPXOR XTMP2, XTMP5, XTMP5; \ // XTMP5 = s1 {xDxC}
VPSHUFB shuff_DC00<>(SB), XTMP5, XTMP5;\ // XTMP5 = s1 {DC00}
VPADDD XTMP0, XTMP5, XDWORD0; \ // XDWORD0 = {W[3], W[2], W[1], W[0]}
SM3的message scheduler有两个显著差别:
- Wi+3依赖Wi,所以也和SHA256类似,不能一次生成四个DWORDs,但和SHA256不完全一样。
- 比SHA256需要多算4个DWORDs。
Intel 指令参考: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
SM3 SIMD Execution
SM3的第一版,比SHA256复杂,不知道有没有继续优化的空间。
// Wj ← P1(Wj−16 ⊕ Wj−9 ⊕ (Wj−3 ≪ 15)) ⊕ (Wj−13 ≪ 7) ⊕ Wj−6
// Transpose data into high/low parts
VPERM2I128 $0x20, XTMP2, XTMP0, XDWORD0 // w3, w2, w1, w0
VPERM2I128 $0x31, XTMP2, XTMP0, XDWORD1 // w7, w6, w5, w4
VPERM2I128 $0x20, XTMP3, XTMP1, XDWORD2 // w11, w10, w9, w8
VPERM2I128 $0x31, XTMP3, XTMP1, XDWORD3 // w15, w14, w13, w12
VPALIGNR $12, XDWORD0, XDWORD1, XTMP0; \ // XTMP0 = W[-13] = {w6,w5,w4,w3}
VPSLLD $7, XTMP0, XTMP1; \
VPSRLD $(32-7), XTMP0, XTMP0; \
VPOR XTMP0, XTMP1, XTMP1; \ // XTMP1 = W[-13] rol 7
VPALIGNR $8, XDWORD2, XDWORD3, XTMP0; \ // XTMP0 = W[-6] = {w13,w12,w11,w10}
VPXOR XTMP1, XTMP0, XTMP0; \ // XTMP0 = W[-6] XOR (W[-13] rol 7)
// Prepare P1 parameters
VPALIGNR $12, XDWORD1, XDWORD2, XTMP1; \ // XTMP1 = W[-9] = {w10,w9,w8,w7}
VPXOR XDWORD0, XTMP1, XTMP1; \ // XTMP1 = W[-9] XOR W[-16]
VPSHUFD $0xA5, XDWORD3, XTMP2; \ // XTMP2 = W[-3] {BBAA} {w14,w14,w13,w13}
VPSLLQ $15, XTMP2, XTMP2; \ // XTMP2 = W[-3] rol 15 {BxAx}
VPSHUFB shuff_00BA<>(SB), XTMP2, XTMP2;\ // XTMP2 = W[-3] rol 15 {00BA}
VPXOR XTMP1, XTMP2, XTMP2; \ // XTMP2 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {xxBA}
// P1
VPSLLD $15, XTMP2, XTMP3; \
VPSRLD $(32-15), XTMP2, XTMP4; \
VPOR XTMP3, XTMP4, XTMP4; \ // XTMP4 = XTMP2 rol 15 {xxBA}
VPXOR XTMP2, XTMP4, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA})
VPSLLD $23, XTMP2, XTMP3; \
VPSRLD $(32-23), XTMP2, XTMP5; \
VPOR XTMP3, XTMP5, XTMP5; \ //XTMP5 = XTMP2 rol 23 {xxBA}
VPXOR XTMP4, XTMP5, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA}) XOR (XTMP2 rol 23 {xxBA})
// First 2 words message schedule result
VPXOR XTMP4, XTMP0, XTMP2; \ // XTMP2 = {..., ..., W[1], W[0]}
// Prepare P1 parameters
VPALIGNR $12, XDWORD3, XTMP2, XTMP3; \ // XTMP3 = {..., W[1], W[0], w15}
VPSHUFD $80, XTMP3, XTMP4; \ // XTMP4 = = W[-3] {DDCC}
VPSLLQ $15, XTMP4, XTMP4; \ // XTMP4 = W[-3] rol 15 {DxCx}
VPSHUFB shuff_DC00<>(SB), XTMP4, XTMP4;\ // XTMP4 = W[-3] rol 15 {DC00}
VPXOR XTMP1, XTMP4, XTMP4; \ // XTMP4 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {DCxx}
// P1
VPSLLD $15, XTMP4, XTMP5; \
VPSRLD $(32-15), XTMP4, XTMP3; \
VPOR XTMP3, XTMP5, XTMP3; \ // XTMP3 = XTMP4 rol 15 {DCxx}
VPXOR XTMP3, XTMP4, XTMP3; \ // XTMP3 = XTMP4 XOR (XTMP4 rol 15 {DCxx})
VPSLLD $23, XTMP4, XTMP5; \
VPSRLD $(32-23), XTMP4, XTMP1; \
VPOR XTMP1, XTMP5, XTMP1; \ // XTMP1 = XTMP4 rol 23 {DCxx}
VPXOR XTMP3, XTMP1, XTMP1; \ // XTMP1 = XTMP4 XOR (XTMP4 rol 15 {DCxx}) XOR (XTMP4 rol 23 {DCxx})
// Last 2 words message schedule result
VPXOR XTMP1, XTMP0, XTMP1; \ // XTMP1 = {W[3], W[2], ..., ...}
// 拼接结果
VPALIGNR $8, XTMP1, XTMP2, XTMP3; \ // XTMP3 = {W[1], W[0], W[3], W[2]}
VPSHUFD $0x4E, XTMP3, XDWORD0; \ // XDWORD0 = {W[3], W[2], W[1], W[0]}
第二版(用VPSHUFB优化循环左移):
// Wj ← P1(Wj−16 ⊕ Wj−9 ⊕ (Wj−3 ≪ 15)) ⊕ (Wj−13 ≪ 7) ⊕ Wj−6
// Transpose data into high/low parts
VPERM2I128 $0x20, XTMP2, XTMP0, XDWORD0 // w3, w2, w1, w0
VPERM2I128 $0x31, XTMP2, XTMP0, XDWORD1 // w7, w6, w5, w4
VPERM2I128 $0x20, XTMP3, XTMP1, XDWORD2 // w11, w10, w9, w8
VPERM2I128 $0x31, XTMP3, XTMP1, XDWORD3 // w15, w14, w13, w12
VPALIGNR $12, XDWORD0, XDWORD1, XTMP0; \ // XTMP0 = W[-13] = {w6,w5,w4,w3}
VPSLLD $7, XTMP0, XTMP1; \
VPSRLD $(32-7), XTMP0, XTMP0; \
VPOR XTMP0, XTMP1, XTMP1; \ // XTMP1 = W[-13] rol 7
VPALIGNR $8, XDWORD2, XDWORD3, XTMP0; \ // XTMP0 = W[-6] = {w13,w12,w11,w10}
VPXOR XTMP1, XTMP0, XTMP0; \ // XTMP0 = W[-6] XOR (W[-13] rol 7)
// Prepare P1 parameters
VPALIGNR $12, XDWORD1, XDWORD2, XTMP1; \ // XTMP1 = W[-9] = {w10,w9,w8,w7}
VPXOR XDWORD0, XTMP1, XTMP1; \ // XTMP1 = W[-9] XOR W[-16]
VPSHUFD $0xA5, XDWORD3, XTMP2; \ // XTMP2 = W[-3] {BBAA} {w14,w14,w13,w13}
VPSLLQ $15, XTMP2, XTMP2; \ // XTMP2 = W[-3] rol 15 {BxAx}
VPSHUFB shuff_00BA<>(SB), XTMP2, XTMP2;\ // XTMP2 = W[-3] rol 15 {00BA}
VPXOR XTMP1, XTMP2, XTMP2; \ // XTMP2 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {xxBA}
// P1
VPSLLD $15, XTMP2, XTMP3; \
VPSRLD $(32-15), XTMP2, XTMP4; \
VPOR XTMP3, XTMP4, XTMP4; \ // XTMP4 = XTMP2 rol 15 {xxBA}
VPSHUFB r08_mask<>(SB), XTMP4, XTMP3; \ // XTMP3 = XTMP2 rol 23 {xxBA}
VPXOR XTMP2, XTMP4, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA})
VPXOR XTMP4, XTMP3, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA}) XOR (XTMP2 rol 23 {xxBA})
// First 2 words message schedule result
VPXOR XTMP4, XTMP0, XTMP2; \ // XTMP2 = {..., ..., W[1], W[0]}
// Prepare P1 parameters
VPALIGNR $12, XDWORD3, XTMP2, XTMP3; \ // XTMP3 = {..., W[1], W[0], w15}
VPSHUFD $80, XTMP3, XTMP4; \ // XTMP4 = = W[-3] {DDCC}
VPSLLQ $15, XTMP4, XTMP4; \ // XTMP4 = W[-3] rol 15 {DxCx}
VPSHUFB shuff_DC00<>(SB), XTMP4, XTMP4;\ // XTMP4 = W[-3] rol 15 {DC00}
VPXOR XTMP1, XTMP4, XTMP4; \ // XTMP4 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {DCxx}
// P1
VPSLLD $15, XTMP4, XTMP5; \
VPSRLD $(32-15), XTMP4, XTMP3; \
VPOR XTMP3, XTMP5, XTMP3; \ // XTMP3 = XTMP4 rol 15 {DCxx}
VPSHUFB r08_mask<>(SB), XTMP3, XTMP1; \ // XTMP1 = XTMP4 rol 23 {DCxx}
VPXOR XTMP3, XTMP4, XTMP3; \ // XTMP3 = XTMP4 XOR (XTMP4 rol 15 {DCxx})
VPXOR XTMP3, XTMP1, XTMP1; \ // XTMP1 = XTMP4 XOR (XTMP4 rol 15 {DCxx}) XOR (XTMP4 rol 23 {DCxx})
// Last 2 words message schedule result
VPXOR XTMP1, XTMP0, XTMP1; \ // XTMP1 = {W[3], W[2], ..., ...}
// 拼接结果
VPALIGNR $8, XTMP1, XTMP2, XTMP3; \ // XTMP3 = {W[1], W[0], W[3], W[2]}
VPSHUFD $0x4E, XTMP3, XDWORD0; \ // XDWORD0 = {W[3], W[2], W[1], W[0]}
第三版(去除shuff_00BA<>(SB),shuff_DC00<>(SB),先扩展一个word):
// Wj ← P1(Wj−16 ⊕ Wj−9 ⊕ (Wj−3 ≪ 15)) ⊕ (Wj−13 ≪ 7) ⊕ Wj−6
// Transpose data into high/low parts
VPERM2I128 $0x20, XTMP2, XTMP0, XDWORD0 // w3, w2, w1, w0
VPERM2I128 $0x31, XTMP2, XTMP0, XDWORD1 // w7, w6, w5, w4
VPERM2I128 $0x20, XTMP3, XTMP1, XDWORD2 // w11, w10, w9, w8
VPERM2I128 $0x31, XTMP3, XTMP1, XDWORD3 // w15, w14, w13, w12
VPALIGNR $12, XDWORD0, XDWORD1, XTMP0; \ // XTMP0 = W[-13] = {w6,w5,w4,w3}
VPSLLD $7, XTMP0, XTMP1; \
VPSRLD $(32-7), XTMP0, XTMP0; \
VPOR XTMP0, XTMP1, XTMP1; \ // XTMP1 = W[-13] rol 7
VPALIGNR $8, XDWORD2, XDWORD3, XTMP0; \ // XTMP0 = W[-6] = {w13,w12,w11,w10}
VPXOR XTMP1, XTMP0, XTMP0; \ // XTMP0 = W[-6] XOR (W[-13] rol 7)
// Prepare P1 parameters
VPALIGNR $12, XDWORD1, XDWORD2, XTMP1; \ // XTMP1 = W[-9] = {w10,w9,w8,w7}
VPXOR XDWORD0, XTMP1, XTMP1; \ // XTMP1 = W[-9] XOR W[-16]
VPSHUFD $0xA5, XDWORD3, XTMP2; \ // XTMP2 = W[-3] {BBAA} {w14,w14,w13,w13}
VPSRLQ $17, XTMP2, XTMP2; \ // XTMP2 = W[-3] ror 17 {xBxA}
VPXOR XTMP1, XTMP2, XTMP2; \ // XTMP2 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {xxxA}
// P1
VPSLLD $15, XTMP2, XTMP3; \
VPSRLD $(32-15), XTMP2, XTMP4; \
VPOR XTMP3, XTMP4, XTMP4; \ // XTMP4 = XTMP2 rol 15 {xxxA}
VPSHUFB r08_mask<>(SB), XTMP4, XTMP3; \ // XTMP3 = XTMP2 rol 23 {xxxA}
VPXOR XTMP2, XTMP4, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxxA})
VPXOR XTMP4, XTMP3, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxxA}) XOR (XTMP2 rol 23 {xxxA})
// First 1 words message schedule result
VPXOR XTMP4, XTMP0, XTMP2; \ // XTMP2 = {..., ..., ..., W[0]}
// Prepare P1 parameters
VPALIGNR $4, XDWORD3, XTMP2, XTMP3; \ // XTMP3 = {W[0], w15, w14, w13}
VPSLLD $15, XTMP3, XTMP4; \
VPSRLD $(32-15), XTMP3, XTMP3; \
VPOR XTMP3, XTMP4, XTMP4; \ // XTMP4 = W[-3] rol 15 {DCBA}
VPXOR XTMP1, XTMP4, XTMP4; \ // XTMP4 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {DCBA}
// P1
VPSLLD $15, XTMP4, XTMP2; \
VPSRLD $(32-15), XTMP4, XTMP3; \
VPOR XTMP3, XTMP2, XTMP3; \ // XTMP3 = XTMP4 rol 15 {DCBA}
VPSHUFB r08_mask<>(SB), XTMP3, XTMP1; \ // XTMP1 = XTMP4 rol 23 {DCBA}
VPXOR XTMP3, XTMP4, XTMP3; \ // XTMP3 = XTMP4 XOR (XTMP4 rol 15 {DCBA})
VPXOR XTMP3, XTMP1, XTMP1; \ // XTMP1 = XTMP4 XOR (XTMP4 rol 15 {DCBA}) XOR (XTMP4 rol 23 {DCBA})
// 4 words message schedule result
VPXOR XTMP1, XTMP0, XDWORD0; \ // XTMP1 = {W[3], W[2], W[1], W[0]}
第四版(优化第一次P1实现):
// Wj ← P1(Wj−16 ⊕ Wj−9 ⊕ (Wj−3 ≪ 15)) ⊕ (Wj−13 ≪ 7) ⊕ Wj−6
// Transpose data into high/low parts
VPERM2I128 $0x20, XTMP2, XTMP0, XDWORD0 // w3, w2, w1, w0
VPERM2I128 $0x31, XTMP2, XTMP0, XDWORD1 // w7, w6, w5, w4
VPERM2I128 $0x20, XTMP3, XTMP1, XDWORD2 // w11, w10, w9, w8
VPERM2I128 $0x31, XTMP3, XTMP1, XDWORD3 // w15, w14, w13, w12
VPALIGNR $12, XDWORD0, XDWORD1, XTMP0; \ // XTMP0 = W[-13] = {w6,w5,w4,w3}
VPSLLD $7, XTMP0, XTMP1; \
VPSRLD $(32-7), XTMP0, XTMP0; \
VPOR XTMP0, XTMP1, XTMP1; \ // XTMP1 = W[-13] rol 7
VPALIGNR $8, XDWORD2, XDWORD3, XTMP0; \ // XTMP0 = W[-6] = {w13,w12,w11,w10}
VPXOR XTMP1, XTMP0, XTMP0; \ // XTMP0 = W[-6] XOR (W[-13] rol 7)
// Prepare P1 parameters
VPALIGNR $12, XDWORD1, XDWORD2, XTMP1; \ // XTMP1 = W[-9] = {w10,w9,w8,w7}
VPXOR XDWORD0, XTMP1, XTMP1; \ // XTMP1 = W[-9] XOR W[-16]
VPSHUFD $0xA5, XDWORD3, XTMP2; \ // XTMP2 = W[-3] {BBAA} {w14,w14,w13,w13}
VPSRLQ $17, XTMP2, XTMP2; \ // XTMP2 = W[-3] ror 17 {xBxA}
VPXOR XTMP1, XTMP2, XTMP2; \ // XTMP2 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {xxxA}
// P1
VPSHUFD $0x00, XTMP2, XTMP2; \ // XTMP2 = {AAAA}
VPSRLQ $17, XTMP2, XTMP3; \ // XTMP3 = XTMP2 rol 15 {xxxA}
VPSRLQ $9, XTMP2, XTMP4; \ // XTMP4 = XTMP2 rol 23 {xxxA}
VPXOR XTMP2, XTMP4, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 23 {xxxA})
VPXOR XTMP4, XTMP3, XTMP4; \ // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxxA}) XOR (XTMP2 rol 23 {xxxA})
// First 1 words message schedule result
VPXOR XTMP4, XTMP0, XTMP2; \ // XTMP2 = {..., ..., ..., W[0]}
// Prepare P1 parameters
VPALIGNR $4, XDWORD3, XTMP2, XTMP3; \ // XTMP3 = {W[0], w15, w14, w13}
VPSLLD $15, XTMP3, XTMP4; \
VPSRLD $(32-15), XTMP3, XTMP3; \
VPOR XTMP3, XTMP4, XTMP4; \ // XTMP4 = W[-3] rol 15 {DCBA}
VPXOR XTMP1, XTMP4, XTMP4; \ // XTMP4 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {DCBA}
// P1
VPSLLD $15, XTMP4, XTMP2; \
VPSRLD $(32-15), XTMP4, XTMP3; \
VPOR XTMP3, XTMP2, XTMP3; \ // XTMP3 = XTMP4 rol 15 {DCBA}
VPSHUFB r08_mask<>(SB), XTMP3, XTMP1; \ // XTMP1 = XTMP4 rol 23 {DCBA}
VPXOR XTMP3, XTMP4, XTMP3; \ // XTMP3 = XTMP4 XOR (XTMP4 rol 15 {DCBA})
VPXOR XTMP3, XTMP1, XTMP1; \ // XTMP1 = XTMP4 XOR (XTMP4 rol 15 {DCBA}) XOR (XTMP4 rol 23 {DCBA})
// 4 words message schedule result
VPXOR XTMP1, XTMP0, XDWORD0; \ // XTMP1 = {W[3], W[2], W[1], W[0]}
ARM64 NEON 版(没有用VTBL表查询指令):
VEXT $12, XWORD1.B16, XWORD0.B16, XTMP0.B16; // XTMP0 = W[-13] = {w6,w5,w4,w3}, Vm = XWORD1, Vn = XWORD0
VSHL $7, XTMP0.S4, XTMP1.S4;
VSRI $25, XTMP0.S4, XTMP1.S4; // XTMP1 = W[-13] rol 7
VEXT $8, XWORD3.B16, XWORD2.B16, XTMP0.B16; // XTMP0 = W[-6] = {w13,w12,w11,w10}
VEOR XTMP1.B16, XTMP0.B16, XTMP0.B16; // XTMP0 = W[-6] ^ (W[-13] rol 7)
// Prepare P1 parameters
VEXT $12, XWORD2.B16, XWORD1.B16, XTMP1.B16; // XTMP1 = W[-9] = {w10,w9,w8,w7}, Vm = XWORD2, Vn = XWORD1
VEOR XWORD0.B16, XTMP1.B16, XTMP1.B16; // XTMP1 = W[-9] ^ W[-16]
VEXT $4, XWORD2.B16, XWORD3.B16, XTMP3.B16; // XTMP3 = W[-3] {w11,w15,w14,w13}
VSHL $15, XTMP3.S4, XTMP2.S4;
VSRI $17, XTMP3.S4, XTMP2.S4; // XTMP2 = W[-3] rol 15 {xxBA}
VEOR XTMP1.B16, XTMP2.B16, XTMP2.B16; // XTMP2 = W[-9] ^ W[-16] ^ (W[-3] rol 15) {xxBA}
// P1
VSHL $15, XTMP2.S4, XTMP4.S4;
VSRI $17, XTMP2.S4, XTMP4.S4; // XTMP4 = = XTMP2 rol 15 {xxBA}
VSHL $8, XTMP4.S4, XTMP3.S4;
VSRI $24, XTMP4.S4, XTMP3.S4; // XTMP3 = XTMP2 rol 23 {xxBA}
VEOR XTMP2.B16, XTMP4.B16, XTMP4.B16; // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA})
VEOR XTMP4.B16, XTMP3.B16, XTMP4.B16; // XTMP4 = XTMP2 XOR (XTMP2 rol 15 {xxBA}) XOR (XTMP2 rol 23 {xxBA})
// First 2 words message schedule result
VEOR XTMP4.B16, XTMP0.B16, XTMP2.B16; // XTMP2 = {..., ..., W[1], W[0]}
// Prepare P1 parameters
VEXT $4, XTMP2.B16, XWORD3.B16, XTMP3.B16; // XTMP3 = W[-3] {W[0],w15, w14, w13}, Vm = XTMP2, Vn = XWORD3
VSHL $15, XTMP3.S4, XTMP4.S4;
VSRI $17, XTMP3.S4, XTMP4.S4; // XTMP4 = W[-3] rol 15 {DCBA}
VEOR XTMP1.B16, XTMP4.B16, XTMP4.B16; // XTMP4 = W[-9] XOR W[-16] XOR (W[-3] rol 15) {DCBA}
// P1
VSHL $15, XTMP4.S4, XTMP3.S4;
VSRI $17, XTMP4.S4, XTMP3.S4; // XTMP3 = XTMP4 rol 15 {DCBA}
VSHL $8, XTMP3.S4, XTMP1.S4;
VSRI $24, XTMP3.S4, XTMP1.S4; // XTMP1 = XTMP4 rol 23 {DCBA}
VEOR XTMP3.B16, XTMP4.B16, XTMP3.B16; // XTMP3 = XTMP4 XOR (XTMP4 rol 15 {DCBA})
VEOR XTMP3.B16, XTMP1.B16, XTMP1.B16; // XTMP1 = XTMP4 XOR (XTMP4 rol 15 {DCBA}) XOR (XTMP4 rol 23 {DCBA})
// 4 words message schedule result
VEOR XTMP1.B16, XTMP0.B16, XTMP1.B16; // XTMP1 = {W[3], W[2], W[1], W[0]}
由于要算52个DWORDs,所以
Ingteger Execution | SIMD Execution |
---|---|
Execute Rounds 0...15 | Compute Message DWORDs 16…31 |
Execute Rounds 16...31 | Compute Message DWORDs 32…47 |
Execute Rounds 32...47 | Compute Message DWORDs 48…63 |
Execute Rounds 48...63 | Compute Message DWORDs 64…67 |
其它SIMD Execution
SHA 256的64轮压缩计算中,对Kt + Wt 进行了并行计算
VPADDD 0*32(TBL)(SRND*1), XDWORD0, XFER
VMOVDQU XFER, (_XFER + 0*32)(SP)(SRND*1)
SM3可以对Wt XOR Wt+4进行并行计算,
VPXOR XDWORD0, XDWORD1, XFER
VMOVDQU XFER, (_XFER + 0*32)(SP)(SRND*1)
但其依然需要Wt, 如何传入和使用Wt呢?难道把Wt和Wt'一起放到栈,譬如W0,W1,W2,W3,W0',W1',W2',W3',
VMOVDQU XDWORD0, (_XFER + 0*32)(SP)(SRND*1)
VPXOR XDWORD0, XDWORD1, XFER
VMOVDQU XFER, (_XFER + 1*32)(SP)(SRND*1)
没有大改变SHA 256 AVX2的主流程,但是栈的处理需要改变。
T常量处理
这个比较麻烦,要是作为参数传入,那就没法和SHA 256一样循环处理,必须把所有压缩轮都列出来。
RORXL
这是循环右移,SM3使用循环左移,没看到有ROLXL这样的指令,根据Intel的指令定义,应该能传入负数来实现(通过测试,用负数来做循环左移没有问题),
Operation
IF (OperandSize = 32)
y ← imm8 AND 1FH;
DEST ← (SRC >> y) | (SRC << (32-y));
ELSEIF (OperandSize = 64 )
y ← imm8 AND 3FH;
DEST ← (SRC >> y) | (SRC << (64-y));
ENDIF
实在不行只能用MOVL + ROLL来处理,但性能就没啥提高了。
主流程
如果只有(只剩)一个block (64字节),则只处理这个block,否则2个block(128字节)处理,两个block并行处理message scheduler。
结果(仅供参考)
AMD64,非SHA256-NI。 (性能的主要差距源于SM3压缩函数比SHA256更多的加法)
架构 | SHA256 | SM3 |
---|---|---|
AVX2+BMI2 | 100% | ~85% |
AVX | 不支持 | 支持 |
SSE | 不支持 | 支持 |
Normal ASM | 支持 | 支持 |
AVX2(SM3):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
404881 2663 ns/op 384.53 MB/s 0 B/op 0 allocs/op
AVX(SM3):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
451285 2809 ns/op 364.54 MB/s 0 B/op 0 allocs/op
SSE(SSSE3 SSE2)(SM3):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
390361 2936 ns/op 348.83 MB/s 0 B/op 0 allocs/op
普通汇编(SM3):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
314943 3955 ns/op 258.88 MB/s 0 B/op 0 allocs/op
纯golang(SM3):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
233209 4699 ns/op 217.91 MB/s 0 B/op 0 allocs/op
AVX2(SH256):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm3
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K_SH256
BenchmarkHash1K_SH256-6
508598 2281 ns/op 448.94 MB/s 0 B/op 0 allocs/op
普通汇编(SH256):
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sha256
cpu: Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
BenchmarkHash1K
BenchmarkHash1K-6
260756 4525 ns/op 226.30 MB/s 0 B/op 0 allocs/op