Bicubic Interpolation and Its Optimization

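1. Mathematical Model
For a destination pixel, the inverse transform maps its coordinates to a floating-point position (i+u, j+v) in the source image, where i and j are non-negative integers and u and v are floating-point numbers in [0,1). Bicubic interpolation uses the 16 neighbours surrounding (i+u, j+v), and the destination pixel value f(i+u, j+v) is given by the following interpolation formula: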
$$f(i+u,\,j+v) = A\,B\,C$$

$$A = \begin{bmatrix} S(u+1) & S(u) & S(u-1) & S(u-2) \end{bmatrix}$$

$$B = \begin{bmatrix}
f(i-1,\,j-1) & f(i-1,\,j) & f(i-1,\,j+1) & f(i-1,\,j+2) \\
f(i,\,j-1) & f(i,\,j) & f(i,\,j+1) & f(i,\,j+2) \\
f(i+1,\,j-1) & f(i+1,\,j) & f(i+1,\,j+1) & f(i+1,\,j+2) \\
f(i+2,\,j-1) & f(i+2,\,j) & f(i+2,\,j+1) & f(i+2,\,j+2)
\end{bmatrix}$$

$$C = \begin{bmatrix} S(v+1) \\ S(v) \\ S(v-1) \\ S(v-2) \end{bmatrix}$$

$$S(x) = \begin{cases}
1 - 2|x|^2 + |x|^3, & 0 \le |x| < 1 \\
4 - 8|x| + 5|x|^2 - |x|^3, & 1 \le |x| < 2 \\
0, & |x| \ge 2
\end{cases}$$
S(x) is the interpolation kernel. It approximates sin(πx)/(πx), where π is the circle constant.
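The kernel can be written down directly from its piecewise definition. The following is a minimal sketch in C; the function names `bicubic_kernel` and `bicubic_weights` are hypothetical and do not come from the original code.

```c
/* Interpolation kernel S(x), following the piecewise definition above. */
static double bicubic_kernel(double x)
{
    double a = x < 0.0 ? -x : x;          /* |x| */
    if (a < 1.0)
        return 1.0 - 2.0 * a * a + a * a * a;
    if (a < 2.0)
        return 4.0 - 8.0 * a + 5.0 * a * a - a * a * a;
    return 0.0;
}

/* Fill w[0..3] with S(t+1), S(t), S(t-1), S(t-2) for t in [0,1).
 * With t = u this is the row vector [A]; with t = v it is the column [C]. */
static void bicubic_weights(double t, double w[4])
{
    w[0] = bicubic_kernel(t + 1.0);
    w[1] = bicubic_kernel(t);
    w[2] = bicubic_kernel(t - 1.0);
    w[3] = bicubic_kernel(t - 2.0);
}
```

For t = 0.5 the weights are -0.125, 0.625, 0.625 and -0.125. They sum to 1, which is why a flat image region keeps its value after interpolation.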

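2. Computation Flow
1. Fetch the 16 neighbouring pixels P1, P2, ..., P16, numbered row by row through matrix [B] (P1 to P4 form the first row).
2. Use the kernel S(x) to compute the interpolation weight vectors Su and Sv for the x and y directions.
3. Carry out the matrix product to obtain the interpolated value:
iTemp1 = Su0 * P1 + Su1 * P5 + Su2 * P9 + Su3 * P13
iTemp2 = Su0 * P2 + Su1 * P6 + Su2 * P10 + Su3 * P14
iTemp3 = Su0 * P3 + Su1 * P7 + Su2 * P11 + Su3 * P15
iTemp4 = Su0 * P4 + Su1 * P8 + Su2 * P12 + Su3 * P16
iResult = Sv0 * iTemp1 + Sv1 * iTemp2 + Sv2 * iTemp3 + Sv3 * iTemp4
4. The interpolated image showed visible "burr" artifacts, so we added a post-processing step: let pSrc be the value of the corresponding pixel in the source image; if abs(iResult - pSrc) exceeds a threshold, we treat the interpolated value as likely to corrupt the image and use the source value pSrc instead.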
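The four steps above can be sketched in scalar C as follows. This is an illustration under stated assumptions, not the code from Transform.cpp: the function names are hypothetical, the 16 neighbours are assumed to be passed as a flat array holding P1..P16 in order, and the clamping to [0,255] with rounding is an addition that the original text does not describe.

```c
/* Interpolation kernel S(x) from section 1. */
static double S(double x)
{
    double a = x < 0.0 ? -x : x;
    if (a < 1.0) return 1.0 - 2.0 * a * a + a * a * a;
    if (a < 2.0) return 4.0 - 8.0 * a + 5.0 * a * a - a * a * a;
    return 0.0;
}

/* Steps 2 and 3: p[0..15] holds P1..P16 row by row (assumed layout). */
static double bicubic_sample(const unsigned char *p, double u, double v)
{
    const double su[4] = { S(u + 1.0), S(u), S(u - 1.0), S(u - 2.0) };
    const double sv[4] = { S(v + 1.0), S(v), S(v - 1.0), S(v - 2.0) };
    double result = 0.0;
    int r, c;

    for (c = 0; c < 4; ++c) {
        double temp = 0.0;                 /* iTemp(c+1): column c weighted by Su */
        for (r = 0; r < 4; ++r)
            temp += su[r] * p[4 * r + c];
        result += sv[c] * temp;            /* iResult accumulates Sv * iTemp */
    }
    return result;
}

/* Step 4: round and clamp (assumption), then fall back to the source
 * pixel when the interpolated value deviates by more than the threshold. */
static unsigned char bicubic_postprocess(double result, unsigned char src,
                                         int threshold)
{
    int value, diff;

    if (result < 0.0)        value = 0;
    else if (result > 255.0) value = 255;
    else                     value = (int)(result + 0.5);

    diff = value - (int)src;
    if (diff < 0) diff = -diff;
    return (unsigned char)(diff > threshold ? src : value);
}
```

With u = v = 0 the weight vectors reduce to (0, 1, 0, 0), so `bicubic_sample` returns P6, the source pixel f(i, j). That makes it a natural candidate for pSrc, although the text does not say which source pixel was used.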

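3. Algorithm Optimization
Computing a single interpolated value with bicubic interpolation needs the 16 surrounding pixels and takes 20 multiplications and 15 additions (16 multiplications and 12 additions for the four iTemp values, then 4 multiplications and 3 additions for iResult). The cost is high, so optimization is necessary.

We chose Intel's SSE2 instruction set, which is only available on Pentium 4 and later processors.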

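Whether the current CPU supports SSE2 can be determined with the CPUID instruction (32-bit MSVC inline assembly):
BOOL g_bSSE2 = FALSE;
__asm
{
    mov eax, 1              // CPUID function 1: feature information
    cpuid
    test edx, 0x04000000    // EDX bit 26 is the SSE2 flag
    jz NotSupport
    mov g_bSSE2, 1
NotSupport:
}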
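Inline `__asm` blocks only build with 32-bit MSVC. The same check can be sketched with compiler-provided helpers instead: `__cpuid` from `<intrin.h>` on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC or Clang. The function name `cpu_has_sse2` is hypothetical, and the fallback of returning 0 on non-x86 targets is an assumption.

```c
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID function 1 reports SSE2 (EDX bit 26), otherwise 0. */
static int cpu_has_sse2(void)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4];                       /* EAX, EBX, ECX, EDX */
    __cpuid(info, 1);
    return (info[3] & 0x04000000) != 0;
#elif defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPUID function 1 not available */
    return (edx & 0x04000000u) != 0;
#else
    return 0;                          /* not an x86 target */
#endif
}
```

A caller would typically evaluate this once at start-up and choose between the SSE2 routine and a scalar one.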
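SSE2-capable CPUs provide eight 128-bit XMM registers, so a single register can hold four pixels (RGB, each padded to 4 bytes), which suits parallel computation.

See the function Optimize_Bicubic in Transform.cpp for the full code.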

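Problems encountered during optimization:
1. Each pixel consists of R, G and B channels, while an SSE2 register holds 16 bytes. Loading four pixels therefore wastes 4 bytes, and extra time is spent realigning the data from BRGB | RGBR | GBRG | BRGB to 0RGB | 0RGB | 0RGB | 0RGB.
2. When loading 16 bytes into a register, the image address is not guaranteed to be 16-byte aligned, so the slower MOVDQU instruction (six or more clock cycles) has to be used. If the address can be made 16-byte aligned, MOVDQA (one clock cycle) can be used instead.
3. To eliminate division and floating-point arithmetic, the weights are scaled by 256. Each kernel coefficient then needs 2 bytes, whereas the image data is 1 byte per channel, so aligning the operands for multiplication wastes half of each SSE2 register and lengthens the computation. Reducing the kernel precision so that a coefficient fits in 1 byte degrades the accuracy considerably.
4. We lacked experience with the cycle counts of individual instructions and with whether a given sequence of instructions can be pipelined in parallel.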
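The arithmetic behind problem 3 can be sketched in plain C (deliberately not SSE2, so that it runs anywhere). The names `to_fixed8` and `blend4_fixed` are hypothetical, and the rounding and clamping choices are assumptions rather than a description of what Optimize_Bicubic does.

```c
#include <stdint.h>

/* Convert a kernel weight to fixed point scaled by 256, rounding to nearest. */
static int16_t to_fixed8(double w)
{
    double scaled = w * 256.0;
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Weighted sum of four 8-bit samples using 16-bit weights scaled by 256. */
static uint8_t blend4_fixed(const uint8_t p[4], const int16_t w[4])
{
    int32_t acc = 0;
    int k;

    for (k = 0; k < 4; ++k)
        acc += (int32_t)w[k] * p[k];   /* 16-bit weight times 8-bit sample */

    if (acc < 0)
        return 0;                      /* negative lobes can undershoot */
    acc = (acc + 128) >> 8;            /* undo the scaling, with rounding */
    return (uint8_t)(acc > 255 ? 255 : acc);
}
```

A weight of 1.0 becomes 256 and the negative lobe of S(x) yields negative integers, so the scaled coefficients need a signed 16-bit type. That is the 2-byte cost the list describes.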

Appendix: SSE2 Instruction Summary
Arithmetic instructions:
ADDPD--Packed Double-Precision Floating-Point Add (SSE2)
Adds 2 pairs of doubles element-wise.
ADDPD xmm0, xmm1/m128
ADDPS--Packed Single-Precision Floating-Point Add (SSE)
Adds 4 pairs of floats element-wise.
ADDPS xmm0, xmm1/m128
ADDSD--Scalar Double-Precision Floating-Point Add (SSE2)
Adds the low double of each operand.
ADDSD xmm0, xmm1/m64
ADDSS--Scalar Single-Precision Floating-Point Add (SSE)
Adds the low float of each operand.
ADDSS xmm0, xmm1/m32
Opcode        Instruction              Description
0F F6 /r      PSADBW mm1, mm2/m64      Absolute difference of packed unsigned byte integers from mm2/m64 and mm1; differences are then summed to produce an unsigned word integer result.
66 0F F6 /r   PSADBW xmm1, xmm2/m128   Absolute difference of packed unsigned byte integers from xmm2/m128 and xmm1; the 8 low differences and 8 high differences are then summed separately to produce two word integer results.
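As a rough illustration of the 128-bit form of PSADBW, the sketch below uses the `_mm_sad_epu8` intrinsic from `<emmintrin.h>` and adds the two partial sums. The function name `sad16` is hypothetical, and the scalar branch is an assumption added so that the example also builds where SSE2 is not enabled.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_INTRINSICS 1
#include <emmintrin.h>
#endif

/* Sum of absolute differences over 16 pairs of unsigned bytes. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
#ifdef HAVE_SSE2_INTRINSICS
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);       /* PSADBW xmm, xmm */
    /* Low sum sits in bits 15-0, high sum in bits 79-64 (word 4). */
    return (unsigned)_mm_cvtsi128_si32(s) + (unsigned)_mm_extract_epi16(s, 4);
#else
    unsigned sum = 0;
    int k;
    for (k = 0; k < 16; ++k)
        sum += a[k] > b[k] ? (unsigned)(a[k] - b[k]) : (unsigned)(b[k] - a[k]);
    return sum;
#endif
}
```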
Opcode        Instruction              Description
0F F8 /r      PSUBB mm, mm/m64         Subtract packed byte integers in mm/m64 from packed byte integers in mm.
66 0F F8 /r   PSUBB xmm1, xmm2/m128    Subtract packed byte integers in xmm2/m128 from packed byte integers in xmm1.
0F F9 /r      PSUBW mm, mm/m64         Subtract packed word integers in mm/m64 from packed word integers in mm.
66 0F F9 PSUBW xmm1, Subtract packed word integers in xmm2/m128
------------------------------------------------------------------------------------------------------
Opcode        Instruction              Description
0F E5 /r      PMULHW mm, mm/m64        Multiply the packed signed word integers in mm1 register and mm2/m64, and store the high 16 bits of the results in mm1.
66 0F E5 /r   PMULHW xmm1, xmm2/m128   Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the high 16 bits of the results in xmm1.
Opcode        Instruction              Description
0F D5 /r      PMULLW mm, mm/m64        Multiply the packed signed word integers in mm1 register and mm2/m64, and store the low 16 bits of the results in mm1.
66 0F D5 /r   PMULLW xmm1, xmm2/m128   Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the low 16 bits of the results in xmm1.
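PMULLW and PMULHW each return only half of every 32-bit product, so a common pattern is to compute both and interleave them. The sketch below does this with the `_mm_mullo_epi16`, `_mm_mulhi_epi16` and `_mm_unpacklo_epi16` / `_mm_unpackhi_epi16` intrinsics. The function name `mul16x8_full` is hypothetical, and the scalar branch is an assumption for builds without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_INTRINSICS 1
#include <emmintrin.h>
#endif

/* Multiply eight pairs of signed words and keep the full 32-bit products. */
static void mul16x8_full(const int16_t a[8], const int16_t b[8], int32_t out[8])
{
#ifdef HAVE_SSE2_INTRINSICS
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i lo = _mm_mullo_epi16(va, vb);    /* PMULLW: low 16 bits  */
    __m128i hi = _mm_mulhi_epi16(va, vb);    /* PMULHW: high 16 bits */
    /* Interleaving lo and hi words rebuilds each 32-bit product. */
    _mm_storeu_si128((__m128i *)out,       _mm_unpacklo_epi16(lo, hi));
    _mm_storeu_si128((__m128i *)(out + 4), _mm_unpackhi_epi16(lo, hi));
#else
    int k;
    for (k = 0; k < 8; ++k)
        out[k] = (int32_t)a[k] * (int32_t)b[k];
#endif
}
```

This is the kind of widening needed when 8-bit pixels are multiplied by 16-bit weights scaled by 256, since the products no longer fit in 16 bits.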
Opcode        Instruction               Description
0F F4 /r      PMULUDQ mm1, mm2/m64      Multiply unsigned doubleword integer in mm1 by unsigned doubleword integer in mm2/m64, and store the quadword result in mm1.
66 0F F4 /r   PMULUDQ xmm1, xmm2/m128   Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1.
PMULUDQ instruction with 64-Bit operands:
DEST[63-0] ← DEST[31-0] * SRC[31-0];
PMULUDQ instruction with 128-Bit operands:
DEST[63-0] ← DEST[31-0] * SRC[31-0];
DEST[127-64] ← DEST[95-64] * SRC[95-64];
MULPD--Packed Double-Precision Floating-Point Multiply
Opcode        Instruction             Description
66 0F 59 /r   MULPD xmm1, xmm2/m128   Multiply packed double-precision floating-point values in xmm2/m128 by xmm1.
DEST[63-0] ← DEST[63-0] * SRC[63-0];
DEST[127-64] ← DEST[127-64] * SRC[127-64];
MULPS--Packed Single-Precision Floating-Point Multiply
Opcode        Instruction             Description
0F 59 /r      MULPS xmm1, xmm2/m128   Multiply packed single-precision floating-point values in xmm2/mem by xmm1.
DEST[31-0] ← DEST[31-0] * SRC[31-0];
DEST[63-32] ← DEST[63-32] * SRC[63-32];
DEST[95-64] ← DEST[95-64] * SRC[95-64];
DEST[127-96] ← DEST[127-96] * SRC[127-96];
MULSD--Scalar Double-Precision Floating-Point Multiply
Opcode        Instruction             Description
F2 0F 59 /r   MULSD xmm1, xmm2/m64    Multiply the low double-precision floating-point value in xmm2/mem64 by low double-precision floating-point value in xmm1.
DEST[63-0] ← DEST[63-0] * xmm2/m64[63-0];
* DEST[127-64] remains unchanged *;
Opcode        Instruction             Description
F3 0F 59 /r   MULSS xmm1, xmm2/m32    Multiply the low single-precision floating-point value in xmm2/mem by the low single-precision floating-point value in xmm1.
DEST[31-0] ← DEST[31-0] * SRC[31-0];
* DEST[127-32] remains unchanged *;
----------------------------------------------------------------------------------------------------------------------
DIVPD--Packed Double-Precision Floating-Point Divide
DIVPD xmm0, xmm1/m128
DEST[63-0] ← DEST[63-0] / (SRC[63-0]);
DEST[127-64] ← DEST[127-64] / (SRC[127-64]);
DIVPS--Packed Single-Precision Floating-Point Divide
DIVPS xmm0, xmm1/m128
DEST[31-0] ← DEST[31-0] / (SRC[31-0]);
DEST[63-32] ← DEST[63-32] / (SRC[63-32]);
DEST[95-64] ← DEST[95-64] / (SRC[95-64]);
DEST[127-96] ← DEST[127-96] / (SRC[127-96]);
DIVSD--Scalar Double-Precision Floating-Point Divide
DIVSD xmm0, xmm1/m64
DEST[63-0] ← DEST[63-0] / SRC[63-0];
* DEST[127-64] remains unchanged *;
DIVSS--Scalar Single-Precision Floating-Point Divide
DIVSS xmm0, xmm1/m32
DEST[31-0] ← DEST[31-0] / SRC[31-0];
* DEST[127-32] remains unchanged *;
--------------------------------------------------------------------------------------------------------------------
Opcode        Instruction             Description
0F E0 /r      PAVGB mm1, mm2/m64      Average packed unsigned byte integers from mm2/m64 and mm1, with rounding.
66 0F E0 /r   PAVGB xmm1, xmm2/m128   Average packed unsigned byte integers from xmm2/m128 and xmm1, with rounding.
0F E3 /r      PAVGW mm1, mm2/m64      Average packed unsigned word integers from mm2/m64 and mm1, with rounding.
66 0F E3 /r   PAVGW xmm1, xmm2/m128   Average packed unsigned word integers from xmm2/m128 and xmm1, with rounding.
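"With rounding" here means each result is (a + b + 1) >> 1, computed without overflow. A small sketch using the `_mm_avg_epu8` intrinsic follows. The function name `avg16` is hypothetical, and the scalar branch is an assumption that also serves to spell out the per-byte arithmetic.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_INTRINSICS 1
#include <emmintrin.h>
#endif

/* Rounded average of 16 pairs of unsigned bytes: out = (a + b + 1) >> 1. */
static void avg16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#ifdef HAVE_SSE2_INTRINSICS
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));   /* PAVGB */
#else
    int k;
    for (k = 0; k < 16; ++k)
        out[k] = (uint8_t)(((unsigned)a[k] + (unsigned)b[k] + 1u) >> 1);
#endif
}
```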
----------------------------------------------------------------------------------------------------------------------
RCPPS--Packed Single-Precision Floating-Point Reciprocal
DEST[31-0] ← APPROXIMATE(1.0/(SRC[31-0]));
DEST[63-32] ← APPROXIMATE(1.0/(SRC[63-32]));
DEST[95-64] ← APPROXIMATE(1.0/(SRC[95-64]));
DEST[127-96] ← APPROXIMATE(1.0/(SRC[127-96]));
RCPSS--Scalar Single-Precision Floating-Point Reciprocal
Opcode        Instruction            Description
F3 0F 53 /r   RCPSS xmm1, xmm2/m32   Returns to xmm1 the packed approximation of the reciprocal of the low single-precision floating-point value in xmm2/m32.
DEST[31-0] ← APPROX(1.0/(SRC[31-0]));
* DEST[127-32] remains unchanged *;
RSQRTPS--Packed Single-Precision Floating-Point Square Root Reciprocal
Opcode        Instruction               Description
0F 52 /r      RSQRTPS xmm1, xmm2/m128   Returns to xmm1 the packed approximations of the reciprocals of the square roots of the packed single-precision floating-point values in xmm2/m128.
DEST[31-0] ← APPROXIMATE(1.0/SQRT(SRC[31-0]));
DEST[63-32] ← APPROXIMATE(1.0/SQRT(SRC[63-32]));
DEST[95-64] ← APPROXIMATE(1.0/SQRT(SRC[95-64]));
DEST[127-96] ← APPROXIMATE(1.0/SQRT(SRC[127-96]));
Opcode        Instruction               Description
F3 0F 52 /r   RSQRTSS xmm1, xmm2/m32    Returns to xmm1 an approximation of the reciprocal of the square root of the low single-precision floating-point value in xmm2/m32.
DEST[31-0] ← APPROXIMATE(1.0/SQRT(SRC[31-0]));
* DEST[127-32] remains unchanged *;
Opcode Instruction Description
66 0F 51 SQRTPD xmm1, Computes square roots of the packed
Move instructions:
MASKMOVDQU--Mask Move of Double Quadword Unaligned
MASKMOVDQU xmm0, xmm1
MASKMOVQ--Mask Move of Quadword
MASKMOVQ mm0, mm1
MOVAPD--Move Aligned Packed Double-Precision Floating-Point Values
MOVAPD xmm0, xmm1/m128
MOVAPD xmm1/m128, xmm0
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
MOVAPS xmm0, xmm1/m128
Instruction         Description
MOVD mm, r/m32      Move doubleword from r/m32 to mm.
MOVD r/m32, mm      Move doubleword from mm to r/m32.
MOVD xmm, r/m32     Move doubleword from r/m32 to xmm.
MOVD r/m32, xmm     Move doubleword from xmm register to r/m32.
Instruction         Description
MOVDQ2Q mm, xmm     Move low quadword from xmm to mmx register.
Opcode     Instruction         Description
F3 0F D6   MOVQ2DQ xmm, mm     Move quadword from mmx to low quadword of xmm.
DEST[63-0] ← SRC[63-0];
DEST[127-64] ← 0000000000000000H;
Instruction               Description
MOVDQA xmm1, xmm2/m128    Move aligned double quadword from xmm2/m128 to xmm1.
MOVDQA xmm2/m128, xmm1    Move aligned double quadword from xmm1 to xmm2/m128.
Instruction               Description
MOVDQU xmm1, xmm2/m128    Move unaligned double quadword from xmm2/m128 to xmm1.
MOVDQU xmm2/m128, xmm1    Move unaligned double quadword from xmm1 to xmm2/m128.
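The choice between the two loads can be made at run time by testing the low four bits of the address. The sketch below uses `_mm_load_si128` (MOVDQA) when the source is 16-byte aligned and `_mm_loadu_si128` (MOVDQU) otherwise. The function name `copy16` is hypothetical, and the `memcpy` branch is an assumption for builds without SSE2.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_INTRINSICS 1
#include <emmintrin.h>
#endif

/* Copy 16 bytes, using the aligned load only when src allows it. */
static void copy16(const uint8_t *src, uint8_t *dst)
{
#ifdef HAVE_SSE2_INTRINSICS
    __m128i v;
    if (((uintptr_t)src & 15u) == 0)
        v = _mm_load_si128((const __m128i *)src);    /* MOVDQA: needs alignment */
    else
        v = _mm_loadu_si128((const __m128i *)src);   /* MOVDQU: any address */
    _mm_storeu_si128((__m128i *)dst, v);
#else
    memcpy(dst, src, 16);
#endif
}
```

In an image loop it is usually cheaper to arrange for aligned row starts once than to branch on every load; the per-call test here only illustrates the alignment condition.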
MOVHLPS--Move Packed Single-Precision Floating-Point Values High to Low
Instruction           Description
MOVHLPS xmm1, xmm2    Move two packed single-precision floating-point values from high quadword of xmm2 to low quadword of xmm1.
DEST[63-0] ← SRC[127-64];
* DEST[127-64] unchanged *;
Instruction           Description
MOVLHPS xmm1, xmm2    Move two packed single-precision floating-point values from low quadword of xmm2 to high quadword of xmm1.
Instruction           Description
MOVHPD xmm, m64       Move double-precision floating-point value from m64 to high quadword of xmm.
MOVHPD m64, xmm       Move double-precision floating-point value from high quadword of xmm to m64.
MOVHPD instruction for memory to XMM move:
DEST[127-64] ← SRC;
* DEST[63-0] unchanged *;
MOVHPD instruction for XMM to memory move:
DEST ← SRC[127-64];
Instruction           Description
MOVHPS xmm, m64       Move two packed single-precision floating-point values from m64 to high quadword of xmm.
MOVHPS m64, xmm       Move two packed single-precision floating-point values from high quadword of xmm to m64.
MOVLPD--Move Low Packed Double-Precision Floating-Point Value
MOVLPD m64, xmm       Move double-precision floating-point value from low quadword of xmm register to m64.
Opcode        Instruction          Description
0F 12 /r      MOVLPS xmm, m64      Move two packed single-precision floating-point values from m64 to low quadword of xmm.
0F 13 /r      MOVLPS m64, xmm      Move two packed single-precision floating-point values from low quadword of xmm to m64.
MOVMSKPD--Extract Packed Double-Precision Floating-Point Sign Mask
MOVMSKPD r32, xmm
DEST[0] ← SRC[63];
DEST[1] ← SRC[127];
DEST[3-2] ← 00B;
DEST[31-4] ← 0000000H;
MOVMSKPS--Extract Packed Single-Precision Floating-Point Sign Mask
MOVMSKPS r32, xmm
DEST[0] ← SRC[31];
DEST[1] ← SRC[63];
DEST[2] ← SRC[95];
DEST[3] ← SRC[127];
DEST[31-4] ← 0000000H;
Opcode        Instruction          Description
66 0F E7 /r   MOVNTDQ m128, xmm    Move double quadword from xmm to m128, minimizing pollution in the cache hierarchy. The xmm source is assumed to contain integer data (packed bytes, words, doublewords, or quadwords).
MOVNTPD--Move Packed Double-Precision Floating-Point Values
Opcode        Instruction              Description
66 0F 10 /r   MOVUPD xmm1, xmm2/m128   Move packed double-precision floating-point numbers from xmm2/m128 to xmm1.
66 0F 11 /r   MOVUPD xmm2/m128, xmm1   Move packed double-precision floating-point numbers from xmm1 to xmm2/m128.
Opcode        Instruction              Description
0F 10 /r      MOVUPS xmm1, xmm2/m128   Move packed single-precision floating-point numbers from xmm2/m128 to xmm1.
0F 11 /r      MOVUPS xmm2/m128, xmm1   Move packed single-precision floating-point numbers from xmm1 to xmm2/m128.
Opcode        Instruction          Description
0F D7 /r      PMOVMSKB r32, mm     Move the byte mask of mm to r32.
66 0F D7 /r   PMOVMSKB r32, xmm    Move the byte mask of xmm to r32.
PMOVMSKB instruction with 64-bit source operand:
r32[0] ← SRC[7];
r32[1] ← SRC[15];
* repeat operation for bytes 2 through 6;
r32[7] ← SRC[63];
r32[31-8] ← 000000H;
PMOVMSKB instruction with 128-bit source operand:
r32[0] ← SRC[7];
r32[1] ← SRC[15];
* repeat operation for bytes 2 through 14;
r32[15] ← SRC[127];
r32[31-16] ← 0000H;
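The 128-bit operation above collects the top bit of each byte into a 16-bit mask. A short sketch with the `_mm_movemask_epi8` intrinsic follows. The function name `byte_sign_mask` is hypothetical, and the scalar branch, added as an assumption for builds without SSE2, mirrors the pseudocode bit by bit.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_INTRINSICS 1
#include <emmintrin.h>
#endif

/* Bit k of the result is the most significant bit of bytes[k]. */
static unsigned byte_sign_mask(const uint8_t bytes[16])
{
#ifdef HAVE_SSE2_INTRINSICS
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    return (unsigned)_mm_movemask_epi8(v);    /* PMOVMSKB r32, xmm */
#else
    unsigned mask = 0;
    int k;
    for (k = 0; k < 16; ++k)
        mask |= (unsigned)(bytes[k] >> 7) << k;
    return mask;
#endif
}
```

Combined with a packed compare, the mask gives a cheap way to test whether any byte in a register met a condition, for example whether any pixel exceeded a threshold.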
Compare instructions:
CMPPD--Compare Packed Double-Precision Floating-Point Values (SSE2)
CMPPS--Compare Packed Single-Precision Floating-Point Values (SSE)
CMPSD--Compare Scalar Double-Precision Floating-Point Value (SSE2)
Compares the low double of each operand.
CMPSS--Compare Scalar Single-Precision Floating-Point Values (SSE)
Compares the low float of each operand.
COMISD--Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS (SSE2)
COMISS--Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS (SSE)
MAXPD--Maximum Packed Double-Precision Floating-Point Values
MAXPD xmm0, xmm1/m128
MAXPS--Maximum Packed Single-Precision Floating-Point Values
MAXPS xmm0, xmm1/m128
MAXSD--Maximum Scalar Double-Precision Floating-Point Value
MAXSD xmm0, xmm1/m64
MAXSS--Maximum Scalar Single-Precision Floating-Point Value
MAXSS xmm0, xmm1/m32
MINPD--Packed Double-Precision Floating-Point Minimum
MINPD xmm0, xmm1/m128
MINPS--Minimum Packed Single-Precision Floating-Point Values
MINPS xmm0, xmm1/m128
MINSD--Minimum Scalar Double-Precision Floating-Point Value
MINSD xmm0, xmm1/m64
MINSS--Minimum Scalar Single-Precision Floating-Point Value
MINSS xmm0, xmm1/m32
Logical instructions:
ANDNPD--Bitwise Logical AND NOT of Packed Double-Precision Floating-Point Values (SSE2)
16 bytes (2 doubles): operand1 is inverted, then ANDed bitwise with operand2.
ANDNPD xmm0, xmm1/m128
ANDNPS--Bit-wise Logical And Not For Single-FP (SSE)
16 bytes (4 floats): operand1 is inverted, then ANDed bitwise with operand2.
ANDNPS xmm0, xmm1/m128
ANDPD--Bitwise Logical AND of Packed Double-Precision Floating-Point Value (SSE2)
16 bytes (2 doubles): bitwise AND of operand1 and operand2.
ANDPD xmm0, xmm1/m128
ANDPS--Bitwise Logical AND of Packed Single-Precision Floating-Point Values (SSE)
16 bytes (4 floats): bitwise AND of operand1 and operand2.
ANDPS xmm0, xmm1/m128
ORPD--Bitwise Logical OR of Double-Precision Floating-Point Values
Opcode        Instruction             Description
66 0F 56 /r   ORPD xmm1, xmm2/m128    Bitwise OR of xmm2/m128 and xmm1.
DEST[127-0] ← DEST[127-0] BitwiseOR SRC[127-0];
ORPS--Bitwise Logical OR of Single-Precision Floating-Point Values
Opcode        Instruction             Description
0F 56 /r      ORPS xmm1, xmm2/m128    Bitwise OR of xmm2/m128 and xmm1.
DEST[127-0] ← DEST[127-0] BitwiseOR SRC[127-0];
Opcode        Instruction             Description
0F DB /r      PAND mm, mm/m64         Bitwise AND mm/m64 and mm.
66 0F DB /r   PAND xmm1, xmm2/m128    Bitwise AND of xmm2/m128 and xmm1.
Opcode        Instruction             Description
0F DF /r      PANDN mm, mm/m64        Bitwise AND NOT of mm/m64 and mm.
66 0F DF /r   PANDN xmm1, xmm2/m128   Bitwise AND NOT of xmm2/m128 and xmm1.
DEST ← (NOT DEST) AND SRC;
Opcode        Instruction             Description
0F EB /r      POR mm, mm/m64          Bitwise OR of mm/m64 and mm.
66 0F EB /r   POR xmm1, xmm2/m128     Bitwise OR of xmm2/m128 and xmm1.
Opcode           Instruction          Description
66 0F 73 /7 ib   PSLLDQ xmm1, imm8    Shift left xmm1 by imm8 bytes, clearing low-order bits.
Shifts the destination operand (first operand) to the left by the number of bytes specified in the count operand (second operand). The empty low-order bytes are cleared (set to all 0s). If the value specified by the count operand is greater than 15, the destination operand is set to all 0s. The destination operand is an XMM register. The count operand is an 8-bit immediate.
Conversion instructions:
CVTDQ2PD--Convert Packed Signed Doubleword Integers to Packed Double-Precision Floating-Point Values (SSE2)
CVTDQ2PD xmm0, xmm1/m64
DEST[63-0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31-0]);
DEST[127-64] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63-32]);
CVTDQ2PS--Convert Packed Signed Doubleword Integers to Packed Single-Precision Floating-Point Values
CVTDQ2PS xmm0, xmm1/m128
DEST[31-0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31-0]);
DEST[63-32] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63-32]);
DEST[95-64] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[95-64]);
DEST[127-96] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[127-96]);
CVTPD2DQ--Convert Packed Double-Precision Floating-Point Values to Packed Doubleword Integers
CVTPD2DQ xmm0, xmm1/m128
DEST[31-0] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[63-0]);
DEST[63-32] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[127-64]);
DEST[127-64] ← 0000000000000000H;
CVTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers
CVTPD2PI mm, xmm0/m128
DEST[31-0] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[63-0]);
DEST[63-32] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[127-64]);
CVTPD2PS--Convert Packed Double-Precision Floating-Point Values to Packed Single-Precision Floating-Point Values
CVTPD2PS xmm0, xmm1/m128
DEST[31-0] ← Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63-0]);
DEST[63-32] ← Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127-64]);
DEST[127-64] ← 0000000000000000H;
CVTPI2PD--Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values
CVTPI2PD xmm0, mm/m64
DEST[63-0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31-0]);
DEST[127-64] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63-32]);
CVTPI2PS--Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values
CVTPI2PS xmm0, mm/m64
DEST[31-0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31-0]);
DEST[63-32] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63-32]);
* high quadword of destination remains unchanged *;
CVTPS2DQ--Convert Packed Single-Precision Floating-Point Values to Packed Doubleword Integers
CVTPS2DQ xmm0, xmm1/m128
DEST[31-0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31-0]);
DEST[63-32] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[63-32]);
DEST[95-64]