25
无进位乘法和GHASH
Sun Yimin edited this page 2023-08-21 15:15:43 +08:00
This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

概述

参考Reference 1Page 35 - 39

  • The PCLMULQDQ + AES-NI combination for AES-GCM
    • AES-NI facilitate high performance AES encryption and decryption
    • PCLMULQDQ 64 \times 64 \rightarrow 128 (carry-less)
      • Binary polynomial multiplication; speed up computations in binary fields
    • Using it for AES-GCM:
    • To use it for GHASH computations: GF(2^{128}) multiplication:
      1. Compute 128 \times 128 \rightarrow 256 via carry-less multiplication (of 64-bit operands)
      2. Reduction: {256 \rightarrow 128} \ modulo \ {x^{128} + x^7 + x^2 + x + 1} (done efficiently via software)
  • 128-bit Carry-less Multiplication using PCLMULQDQ
    (Gueron Kounavis, 2009) Multiply 128 \times 128 \rightarrow 256 \ [A_1 : A_0]\cdot[B_1 : B_0]
    • Schoolbook (4 PCLMULQDQ invocations)
      A_0 \cdot B_0 = [C_1 : C_0], \ A_1 \cdot B_1 = [D_1 : D_0]
      A_0 \cdot B_1 = [E_1 : E_0], \ A_1 \cdot B_0 = [F_1 : F_0]
      [A_1 : A_0] \cdot [B_1 : B_0] = [D_1:D_0 \oplus E_1 \oplus F_1:C_1 \oplus E_0 \oplus F_0 : C_0]
    • Carry-less Karatsuba (3 PCLMULQDQ invocations)
      A_1 \cdot B_1 = [C_1 : C_0], \ A_0 \cdot B_0 = [D_1 : D_0]
      (A_1 \oplus A_0) \cdot (B_1 \oplus B_0) = [E_1 : E_0]
      [A_1 : A_0] \cdot [B_1 : B_0] = [C_1:C_0 \oplus C_1 \oplus D_1 \oplus E_1 : D_1 \oplus C_0 \oplus D_0 \oplus E_0 : D_0]
  • A new interpretation to GHASH operations
    • GHASH does not use GF(2^{128}) computations "as expected"
      • Not in the usual polynomial representation convention
      • The bits inside the 128-bit operands are reflected
      • Actually - it is an operation on a permutation of elements of GF(2^{128})
        • T1 = reflect(A)
        • T2 = reflect(B)
        • T3 \times T2 \ modulo \ {x^{128} + x^7 + x^2 + x + 1} (a GF(2^{128}) multiplication)
        • reflect(T3)
    • It can be proved that this operation is:
      • A \times B \times x^{-127} \ mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1
        • A weird Montgomery Multiplication in GF(2^{128}) modulo a reversed poly
      • Better written as
        • A \times (B \times x) \times x^{-128} \ mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1
  • Fast reduction modulo x^{128} + x^{127} + x^{126} + x^{121} + 1
    • Input 256-bit operand [X_3:X_2:X_1:X_0]
    • [A_1:A_0] = X_0 \cdot 0xc200000000000000
    • [B_1:B_0] = [X_0 \oplus A_1 : X_1 \oplus A_0]
    • [C_1:C_0] = B_0 \cdot 0xc200000000000000
    • [D_1:D_0] = [B_0 \oplus C_1 : B_1 \oplus C_0]
    • Output: [D_1 \oplus X_3 : D_0 \oplus X_2]
; Input is in T1:T0
vmodqa     T3, [W]               ; poly
vpclmulqda T2, T3, T0, 0x01
vpshufd    T4, T0, 78
vpxor      T4, T4, T2
vpclmulqda T2, T3, T4, 0x01
vpshufd    T4, T4, 78
vpxor      T4, T4, T2
vpxor      T1, T1, T4            ; result in T1
  • Aggregated Reduction
    • In a Horner form (iterative computation)
      • Y_i = MM[(X_i \oplus Y_{i-1}), Hx] ... everyting mod \ x^{128} + x^{127} + x^{126} + x^{121} + 1
    • 4-way expanded Horner form (aggregate results to defer the reduction)
      • MM[X_i , Hx] \oplus MM[X_{i-1} , {(Hx)}^2] \oplus MM[X_{i-2} , {(Hx)}^3] \oplus MM[(X_{i-3} \oplus Y_{i-4}, {(Hx)}^4]
      • Can be expanded to N > 4 blocks, we use 8 blocks now.
      • Overhead: pre-calculate the powers of H \cdot x (amortized for reasonably long buffer)
      • The gain: reduction deferred to once per "N" blocks

参考