首页 > 其他分享 >Creating double-precision integer multiplication with a quad-precision result from single-precision

Creating double-precision integer multiplication with a quad-precision result from single-precision

时间:2023-02-26 21:00:10浏览次数:61  
标签:double precision xmm2 xmm1 64 result multiplication bit

Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result

Raymond Chen

 

December 8th, 20140 0

Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let’s say that your processor supports 32 × 32 → 64 multiplication and you want to implement 64 × 64 → 128 multiplication. (Sound like any processor you know?)

Oh boy, let’s do some high school algebra. Let’s start with unsigned multiplication.

Let x = A × 2³² + B and y = C × 2³² + D, where ABC, and D are all in the range 0 … 2³² − 1.

× y =  AC × 264 + (AD + BC) × 232 + BD
AC × 264 + BD  +  (AD + BC) × 232
  provisional result   cross-terms

Each of the multiplications (not counting the power-of-two multiplications) is a 32 × 32 → 64 multiplication, so they are something we have as a building block. And the basic implementation is simply to perform the four multiplications and add the pieces together. But if you have SSE, you can perform two multiplies in a single instruction.

    // Prepare our source registers
    movq xmm0, x         // xmm0 = { 0, 0, A, B } = { *, *, A, B }
    movq xmm1, y         // xmm1 = { 0, 0, C, D } = { *, *, C, D }
    punpckldq xmm0, xmm0 // xmm0 = { A, A, B, B } = { *, A, *, B }
    punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
    pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)
                         // xmm2 = { D, D, C, C } = { *, D, *, C }

The PMULUDQ instruction multiplies 32-bit lanes 0 and 2 of its source and destination registers, producing 64-bit results. The values in lanes 1 and 3 do not participate in the multiplication, so it doesn’t matter what we put there. It so happens that the PUNPCKLDQ instruction duplicates the value, but we really don’t care. I used * to represent a don’t-care value.

    pmuludq xmm1, xmm0 // xmm1 = { AC, BD } // provisional result
    pmuludq xmm2, xmm0 // xmm2 = { AD, BC } // cross-terms

In two PMULUDQ instructions, we created the provisional result and the cross-terms. Now we just need to add the cross-terms to the provisional result. Unfortunately, SSE does not have a 128-bit addition (or at least SSE2 doesn’t; who knows what they’ll add in the future), so we need to do that the old-fashioned way.

    movdqa result, xmm1
    movdqa crossterms, xmm2
    mov    eax, crossterms[0]
    mov    edx, crossterms[4] // edx:eax = BC
    add    result[4], eax
    adc    result[8], edx
    adc    result[12], 0      // add the first cross-term
    mov    eax, crossterms[8]
    mov    edx, crossterms[12] // edx:eax = AD
    add    result[4], eax
    adc    result[8], edx
    adc    result[12], 0      // add the second cross-term

There we go, a 64 × 64 → 128 multiply constructed from 32 × 32 → 64 multiplies.

Exercise: Why didn’t I use the rax register to perform the 64-bit addition? (This is sort of a trick question.)

That calculates an unsigned multiplication, but how do we do a signed multiplication? Let’s work modulo 2128 so that signed and unsigned multiplication are equivalent. This means that we need to expand x and y to 128-bit values X and Y.

Let s = the sign bit of x expanded to a 64-bit value, and similarly t = the sign bit of y expanded to a 64-bit value. In other words, s is 0xFFFFFFFF`FFFFFFFF if x < 0 and zero if x ≥ 0.

The 128-bit values being multiplied are

X =  s × 264 + x
Y =  t × 264 + y

The product is therefore

× Y =  st × 2128  (sy + tx) × 264   +  xy
  zero   adjustment   unsigned product

The first term is zero because it overflows the 128-bit result. That leaves the second term as the adjustment, which simplifies to “If x < 0 then subtract y from the high 64 bits; if y < 0 then subtract x from the high 64 bits.”

    if (x < 0) result.m128i_u64[1] -= y;
    if (y < 0) result.m128i_u64[1] -= x;

If we were still playing with SSE, we could compute this as follows:

    movdqa xmm0, result   // xmm0 = { high, low }
    movq   xmm1, x        // xmm1 = { 0, x }
    movq   xmm2, y        // xmm2 = { 0, y }
    pshufd xmm3, xmm1, _MM_SHUFFLE(1, 1, 3, 2) // xmm3 = { xhi, xhi, 0, 0 }
    pshufd xmm1, xmm1, _MM_SHUFFLE(1, 0, 3, 2) // xmm1 = { x, 0 }
    pshufd xmm4, xmm2, _MM_SHUFFLE(1, 1, 3, 2) // xmm4 = { yhi, yhi, 0, 0 }
    pshufd xmm2, xmm2, _MM_SHUFFLE(1, 0, 3, 2) // xmm2 = { y, 0 }
    psrad  xmm3, 31       // xmm3 = { s, s, 0, 0 } = { s, 0 }
    psrad  xmm4, 31       // xmm4 = { t, t, 0, 0 } = { t, 0 }
    pand   xmm3, xmm2     // xmm3 = { x < 0 ? y : 0, 0 }
    pand   xmm4, xmm1     // xmm4 = { y < 0 ? x : 0, 0 }
    psubq  xmm0, xmm3     // first adjustment
    psubq  xmm0, xmm4     // second adjustment
    movdqa result, xmm0   // update result

The code is a bit strange because SSE2 doesn’t have a full set of 64-bit integer opcodes. We would have liked to have used a psraq instruction to fill a 64-bit field with a sign bit. But there is no such instruction, so instead we duplicate the 64-bit sign bit into two 32-bit sign bits (one in lane 2 and one in lane 3) and then fill the lanes with that bit using psrad.

 

标签:double,precision,xmm2,xmm1,64,result,multiplication,bit
From: https://www.cnblogs.com/ioriwellings/p/17157659.html

相关文章

  • MyBatis_06(自定义映射resultMap)
    主题:自定义映射resultMap"自定义映射resultMap",可以解决什么问题:1-"属性"和"字段名"不一致的情况2-"多对一"的情况3-"一对多"的情况一、若"字段名"和"......
  • [Vulnhub] DOUBLETROUBLE: 1
    下载地址0x00配置攻击机IP:192.168.10.5靶机IP:192.168.10.60x01攻击用Namp扫描靶机开放的端口┌──(root㉿azwhikaru)-[~]└─#nmap-sC-sV-p-192.16......
  • jQuery + parseDouble问题
    只需使用parseFloat():varc=parseFloat(a)+parseFloat(b);......
  • SQL Server In and Not in when result contains null
    请注意,如果列表包含NULL,则IN或NOTIN的结果将为UNKNOWN,意味着将会被认为符合条件,例如:selectcasewhen'1'notin('0',null)then'notin'else'in'end结果如下:......
  • float和double类型在计算机中是怎么储存的
    我们先看一个代码。#include<stdio.h>intmain(){intn=9;float*pfloat=(float*)&n;printf("n的值为:%d",n);printf("*pfloat的值为%f\n",*pfloat);*pfloa......
  • mybatis配置文件resultMap标签的使用
    本文为博主原创,未经允许不得转载:      resultMap标签是为了映射select查询出来结果的集合,其主要作用是将实体类中的字段与数据库表中的字段进行关联映射。    ......
  • js double bitwise not ~~ All In One
    jsdoublebitwisenot~~AllInOnejs双非位运算~~0vsNaNconstarr=[];//[]~~arr.pop();//0+arr.pop();//NaNconstarr=[];//[]consta=ar......
  • mapToDouble.sum的精度丢失问题
    需要四舍五入一下packagecn.***;importjava.math.RoundingMode;importjava.text.DecimalFormat;/***小数位获取*/publicclassDecimalAcquisition{publicst......
  • 关于cout << fixed << setprecision(2)
    fixed:用于设置浮点数的输出精度为固定小数位数(6位)。只要出现了fixed,则后面都是以fixed输出(就是说,如果之后还要继续使用,不用再打一遍fixed了),如果取消使用unsetf函数。......
  • Mybatis-ResultHandler
    Mybatis接口——ResultHandlermybatis中的ResultHandler主要有2个作用:对取出的结果进行统一处理配合defaultFetchSize解决oom问题那如何使用这个接口呢?首先,......