MASM – Stack Memory Alignment
(originally published on codeproject.com)
SIMD instruction sets may expect a special alignment of memory, but when that memory is on the stack MASM does not provide alignment facilities.
In MASM, the ALIGN directive does not align local (or stack) variables, i.e. those variables that you declare at the start of a procedure by using the LOCAL directive. The only guarantee you have for local variables is that 32-bit Windows aligns them on a 4-byte boundary and 64-bit Windows aligns them on an 8-byte boundary.
Of course, MASM does align variables declared in the .DATA section of the source code, but these are static and may not be what you require, particularly if the code is meant to be thread safe.
The lack of stack data alignment facilities did not become critical until the SSE instruction set appeared. Many SSE instructions that read data from memory require the data to be aligned on a 16-byte boundary; otherwise a fault is raised.
Most recent C/C++ compilers have directives to align stack data, but we are dealing with MASM. If you are linking C/C++ with Assembly Language, or writing applications in Assembly Language, you need to be aware of the potential problems.
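For comparison, here is a minimal C++ sketch of what such a compiler directive looks like, using the standard alignas specifier. The helper names is_aligned16 and sum_aligned_local are hypothetical and exist only for this illustration; they are not part of the demo project.

```cpp
#include <cassert>
#include <cstdint>

// True when the address is a multiple of 16 (its low four bits are zero).
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical helper: declares a 16-byte aligned local array on the stack,
// checks its address, and returns the sum of its elements.
inline float sum_aligned_local() {
    alignas(16) float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    assert(is_aligned16(v));  // alignas(16) asks the compiler for this alignment
    return v[0] + v[1] + v[2] + v[3];
}
```

The compiler adjusts the stack frame behind the scenes to honour alignas. MASM's LOCAL directive has no equivalent, which is the gap the recipe in this article fills by hand.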
SSE provides instructions to load potentially misaligned data into registers or to store data from the SSE registers into potentially misaligned memory, namely the “movups” and “movdqu” instructions. The performance penalty is not as evident on modern CPUs as it used to be on the old Pentium 3 and 4, and this is the route to take in most cases.
Still, it continues to be useful to know how to align stack memory in MASM. For example, you might need to call external functions from within Assembly Language that expect the received data to be 16-byte aligned.
Using the code
The problem of aligning stack memory in Assembly Language has been discussed in various forums for years but I never found a really manageable solution, so I decided to propose my own recipe.
This solution permits an unlimited number of 16-byte aligned memory variables. It can easily be modified for 32-byte or higher alignment, if needed; this is left as an exercise.
It works in the following stages:
1) Save the current stack position, so that we can restore it later.
This is done through a macro (here is the 32-bit version; both the 32-bit and 64-bit versions can be downloaded from the link below):
mov TopOfAllocatedStackMem, esp ;; Save the current top of stack
2) Reserve a chunk of 16-byte aligned memory on the stack for some variable. A variable containing a pointer to it has been previously declared through a LOCAL directive, so that we can access it later.
Another macro does this job (here is the 32-bit version):
and esp, -10h ;; Align to 16 byte boundary
sub esp, memsize ;; Make room for the new variable (memsize must be a multiple of 16 to keep the alignment)
mov PtrToStackMem, esp ;; Keep a pointer to the aligned block
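The arithmetic of this step can be modelled on a plain integer to see why the result is aligned. The following C++ sketch is only a model of the two instructions, not the macro itself; the function name reserve_aligned is hypothetical, and the sketch assumes memsize is a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

// Models "and sp, -10h" followed by "sub sp, memsize" on a 32-bit value.
// Assumption: memsize is a multiple of 16; otherwise the subtraction would
// break the alignment that the AND has just established.
inline std::uint32_t reserve_aligned(std::uint32_t sp, std::uint32_t memsize) {
    assert(memsize % 16 == 0);
    sp &= ~std::uint32_t{15};  // same bit pattern as -10h: clears the low four bits
    sp -= memsize;             // the stack grows downwards, so reserving means subtracting
    return sp;
}
```

For example, a stack pointer of 0012FF7Ch is first rounded down to 0012FF70h, and reserving 16 bytes then yields 0012FF60h, which is still a multiple of 16.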
3) Now that we have memory for the variable, we can save some data there.
A third macro (here, the 32-bit version) does this job:
push eax ;; Now we are safe to "push" and "pop" registers.
mov eax, PtrToStackMem
movaps [eax], reg ;; If the memory were not aligned, an exception would occur here.
pop eax ;; Restore eax to balance the push above
4) When we need to retrieve a variable, we make use of a fourth macro (here, the 32-bit version):
mov eax, PtrToStackMem
movaps reg, [eax] ;; If the memory were not aligned, an exception would occur here.
5) When the procedure is returning to the caller we need to release all the memory we have allocated from the stack.
So insert the following macro (here, the 32-bit version) just before the “ret” instruction (the MASM assembler will emit a “leave” before the “ret”):
mov esp, TopOfAllocatedStackMem
Our demo consists of a callable ASM function (AsmMemAlignDemo) and a mini C++ project containing its caller. AsmMemAlignDemo is called with 2 parameters: a __m128, which corresponds in ASM to an XMMWORD, and a float, which corresponds in ASM to a REAL4. It returns a __m128.
Its C++ declaration is:
__m128 AsmMemAlignDemo(__m128 param1, float param2);
AsmMemAlignDemo is called with param1 containing a vector of 4 floats (1.0, 2.0, 3.0, 4.0) and param2 containing the float 10.0.
Within the ASM function, 4 operations will take place to obtain the final result.
1) The float is multiplied by the vector, obtaining the partial result:
(10.0, 20.0, 30.0, 40.0)
2) A value of 17.0 is added to the vector, obtaining the partial result:
(27.0, 37.0, 47.0, 57.0)
3) The vector is divided by 3, giving the final result of approximately:
(9.0, 12.333, 15.667, 19.0)
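The three steps can be cross-checked with ordinary scalar arithmetic. The sketch below is a reference computation in C++ and not the ASM routine; the names Vec4 and reference_result are hypothetical, and it assumes the routine performs the multiply, add and divide per lane in single precision.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Scalar reference for the demo arithmetic: (param1 * param2 + 17) / 3, lane by lane.
inline Vec4 reference_result(const Vec4& param1, float param2) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (param1[i] * param2 + 17.0f) / 3.0f;
    }
    return out;
}
```

Comparing the __m128 returned by AsmMemAlignDemo against such a reference, lane by lane and with a small tolerance, is one way to confirm that the aligned stores and loads did not corrupt the data.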
Of course, this is also an opportunity to demonstrate our recipe for 16-byte stack memory alignment, which is after all the main purpose of the article.
In the C++ project, we added a few comments about the way __m128 parameters are passed and the __m128 result is received under the cdecl and Microsoft x64 calling conventions in Visual Studio. You will notice that they depart from the specifications as they are understood by other compiler vendors.
Finally, the ASM compilation:
To compile the 32-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\ml /c asmtest32.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -coff asmtest32.asm
To compile the 64-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\amd64\ml64 -c asmtest64.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -c -win64 asmtest64.asm
After compilation you will have to link your Visual Studio 32-bit build with the asmtest32.obj and the Visual Studio 64-bit build with asmtest64.obj.
The reliability of this recipe is based on the assumption that all unwinding will take place after the call to RESTORE_STACK_POSITION.
This is what happens in our demo: the MASM assembler emits a ‘leave’ followed by a ‘ret’ after RESTORE_STACK_POSITION.
If an ASM module needs to handle SEH (Structured Exception Handling) or preserve some registers across the whole procedure (i.e., “pushes” registers on the stack before SAVE_STACK_POSITION), some extra care needs to be taken. The same if the ASM module is not a leaf (calls other procedures). JWasm makes it easy to deal with these cases, but MASM requires that you know exactly what you are doing.