Source: https://github.com/epicvrvs/x86-tutorial
- 1. Introduction
- 2. What is Assembly?
- 3. What is the Purpose of Learning Assembly?
- 4. Getting to know the x86 Architecture
- 5. Bits, Bytes and Numbers
- 6. Bit Operations
- 7. Signed Integers
- 8. The First Steps
- 9. Writing Windows Applications Using x86 32-bit Assembly
- 10. Writing Linux/BSD Applications Using x86 32-bit Assembly
- 11. Partial Registers
- 12. Addresses
- 13. Control Structures
- 14. Multiplication and Division of Integers
- 15. Floating Point Operations
Introduction
This document is an introduction to creating programs for microprocessors of the x86 architecture family - in particular 32-bit code. The reader is expected to be familiar with programming in C/C++ (or at least a similar language such as Java) and the essential API of the operating system they are using. Some mathematical knowledge up to high school/university level is essential for understanding a lot of things as well. I will try not to be too OS specific, but the environments I am going to focus on in this document are Windows, Linux and BSD. Pure knowledge of assembly in itself is useless if you do not know how to combine it with the API of the operating system you are going to run it on, so I want to make sure that this is demonstrated to a certain extent. It is not that different from the way you would do it in a high level programming language such as C++, so it is not difficult to understand.
What is Assembly?
Assembly (ASM for short) is the lowest level of programming possible - if you are a programmer and you want to get as close to the hardware as you can, this is the place you want to be. In assembly code you get to control every single instruction that your CPU (central processing unit) is going to execute. There are different assembly languages for different microprocessor architectures and all of them differ from each other. Usually they are totally incompatible, so you will have to write assembly code that is specific to the particular architecture you want your program to run on. In assembly you write instructions as plain text which directly represent the machine code instructions executed by your processor. The names of these instructions are usually extremely short and are often abbreviations of full names. These assembly instruction names are called mnemonics.
Before your microprocessor can actually run the program you have written in assembly, you have to run it through a program which translates all the mnemonics and arguments into numerical machine code. This program is called an assembler. Assemblers often support more features than just the pure instructions to make the programmer's job easier, but you will see how they do that later on.
What is the Purpose of Learning Assembly?
This is a very important question and subject to a lot of discussion. My answer to this question is a list of reasons, really. Better understanding of what goes on at the lowest level can make you a better programmer at a higher level. It allows you to see what goes on behind the scenes and it often gives you a totally new perspective on things. It has its uses in writing high performance parts of high level language programming where you need to use features of your processor that are not easily accessible in that high level language. Knowing assembly is obviously also necessary to be able to write compilers for a particular microprocessor architecture which convert high level language code to machine instructions.
The truth is that most people learn x86 ASM nowadays to crack commercial software, to reverse engineer closed source programs and to write cheats for computer games. Cracking is the one that made me learn it but I have to admit that I never got particularly good at it and I got totally distracted from my original goal in the process of understanding how it works. It is my experience that it is essential to learn how to write x86 ASM yourself first in order to be successful at cracking and reverse engineering. Knowing how to manually translate a C++ program to assembly is a valuable skill to have for this purpose.
Getting to know the x86 Architecture
So, what are we dealing with here? The IA-32 microprocessor is basically a register machine which uses a CISC (complex instruction set computer) instruction set. At first I am going to explain what a register machine is. After that I will move on to the CISC part.
A register machine is a computing device which stores the results of arithmetic operations and the like primarily in so called registers. These are small but highly efficient binary storage units inside the processor which can hold integer values. When you are doing assembly programming you deal with them all the time. They are incapable of holding a lot of data at once, but they are essential as temporary placeholders used in most instructions executed by the processor. In terms of the memory hierarchy of contemporary computers they are at the very top. The hierarchy looks like this:
- CPU registers
- CPU cache
- RAM (random access memory)
- HDD (hard disk drive storage)
- External storage, even optical media like CDs, DVDs, BluRay disks and such
Registers hold minimal amounts of data and are extremely fast. The cache holds far larger amounts of data but it is still pretty fast. The RAM holds even larger amounts of data and it is far slower (in terms of both bandwidth and latency) than any memory operation inside a CPU. Hard disks have an even larger capacity than your RAM and they are very slow in comparison to the objects at the top of the hierarchy.
At this point I should probably briefly explain what the CPU cache actually is. It is a small high performance storage unit inside your CPU to which chunks of memory from the RAM are copied whenever you perform memory accesses. This way the CPU does not have to access the RAM over and over again when it is processing the same piece of data. This speeds up the execution of code a lot - RAM access is very slow in comparison to cache access, after all. In reality this cache is actually not a single unit but is divided into multiple cache levels. The Level 1 cache is the smallest but fastest one. The next level is bigger but slower, and so on. Caching is not of much interest to somebody who is new to assembly, though. I might cover this topic later in sections which deal with optimising code for speed.
Let us get back to the CISC part I mentioned earlier. There are two major microprocessor architecture philosophies known as RISC (reduced instruction set computing) and CISC (complex instruction set computing). In RISC architectures instructions are rather simple and perform very fundamental operations. RISC instructions are incapable of performing multiple actions at once but they are very fast. The instruction format of RISC architectures is usually quite uniform and each instruction takes up the same number of bytes in memory. This makes them very easy to decode for the processor. CISC architectures generally feature complex instructions which perform multiple tasks sequentially, like loading a value from memory, performing an arithmetic operation on it and then writing the result of the operation back to memory. In a RISC architecture this would be divided into multiple instructions. CISC instructions are usually of variable length and they are quite complicated to decode for the CPU.
| Property | RISC | CISC (x86) |
|---|---|---|
| Instruction encoding | Uniform, all instructions are the same size | Variable, the size of instructions strongly varies |
| Complexity of operations | Simple, each instruction performs one fundamental operation | Complex, each instruction can perform many fundamental operations |
| Code density | Low | High |
| Decoding effort for the CPU | Low | High |

Bits, Bytes and Numbers
A bit is simply the smallest digital piece of information possible. To us, it is a rather abstract unit which can have one of two values, zero or one. A byte is composed of a certain number of bits. On all mainstream architectures a byte is composed of exactly 8 bits; it would be more precise to say "octet" instead of byte in that case, which is the terminology you usually encounter in literature on the subject of networking, but in this text I will stick to bytes. In this document we focus on the x86 32-bit architecture. The 32-bit part means that the word size of this processor is 32 bits. The word size of a microprocessor tells you what size most registers and operands are, so in this case all general purpose registers are 32 bits = 4 bytes in size. General purpose means that we, the programmers, can pretty much use these registers as we want without breaking anything. They are meant to be used for all major calculations. There are special registers such as the stack pointer which may not be used in such a way - in fact, that would usually result in a crash at some point, but we will get back to that later. Our registers consist of 32 single bits which can each assume two different values. This means that each register can assume one of \(2^{32}\) (that is: 2 to the power of 32) different values at a time.
There are different systems of representing numbers using digits (which can actually be arbitrary symbols) such as our decimal system. At this point it is important to introduce the concept of positional notation. Positional notation is a method which takes a base \(b \in \mathbb{N}\) and a set of digits \(D = \bigcup_{i = 0}^{b - 1} \{d_i\}\) with values \(v : D \to \mathbb{N}\), \(v(d_i) = i\). I hope I did not scare anybody away with those equations - I should probably clarify. \(b\) is a natural number - a positive integer like 1, 2, 3, etc. \(D\) is the set of digits which are associated with the numerical values 0 to \(b - 1\). In a positional system a number which is written as the sequence of \(n\) digits \(d_{(a_{n-1})} \ldots d_{(a_1)} d_{(a_0)}\) actually represents the natural number \(\sum_{i = 0}^{n - 1} v(d_{(a_i)}) b^i = \sum_{i = 0}^{n - 1} a_i b^i\).
The decimal system is actually a positional system with \(b = 10\). So we have the digits \(D_{10} = \{0, 1, 2, \ldots, 9\}\) with the corresponding values. According to the equation the decimal number 913 represents the natural number \(9 \cdot 100 + 1 \cdot 10 + 3 \cdot 1 = 913\).
Now that was fairly obvious but let us move on to other systems such as the hexadecimal one, which is quite frequently used in the world of programming, especially in assembly languages. In the hexadecimal system we deal with the base \(b = 16\) and the digits \(D_{16} = \{0 \ldots 9, A, B, C, D, E, F\}\). So A stands for 10, B for 11, and so on. So the hexadecimal number F4D7 actually represents the natural number \(15 \cdot 16^3 + 4 \cdot 16^2 + 13 \cdot 16^1 + 7 \cdot 16^0 = 15 \cdot 4096 + 4 \cdot 256 + 13 \cdot 16 + 7 \cdot 1 = 61440 + 1024 + 208 + 7 = 62679\).
Then there is of course the binary system which has the digits \(D_2 = \{0, 1\}\). The binary number 1101001 represents the natural number \(1 \cdot 2^6 + 1 \cdot 2^5 + 0 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 1 \cdot 64 + 1 \cdot 32 + 0 \cdot 16 + 1 \cdot 8 + 0 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 = 105\). This is pretty much exactly the way unsigned numbers are stored at the hardware level in the registers of your x86 microprocessor. The single binary digits represent the bits in the register. But the binary number 1101001 has 7 bits and our 32-bit registers have 32 bits. What happens to the other 25 bits? The upper bits are simply set to 0, so the register would actually be set to 00000000000000000000000001101001.
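The conversion from a digit string to a value can be expressed directly in code. Here is a small C++ sketch of my own that applies the positional notation formula above; the function name and the restriction to bases up to 16 are choices made purely for illustration:

C++

#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

// Convert a digit string in the given base (2 to 16) to an unsigned value
// by applying the positional notation formula: sum of v(digit_i) * base^i.
uint32_t parse_in_base(const std::string & digits, uint32_t base)
{
	uint32_t value = 0;
	for(char c : digits)
	{
		uint32_t digit_value;
		if(c >= '0' && c <= '9')
			digit_value = c - '0';
		else if(c >= 'A' && c <= 'F')
			digit_value = 10 + (c - 'A');
		else
			throw std::runtime_error("Invalid digit");
		if(digit_value >= base)
			throw std::runtime_error("Digit out of range for this base");
		value = value * base + digit_value;
	}
	return value;
}

int main()
{
	std::cout << parse_in_base("913", 10) << std::endl;    // 913
	std::cout << parse_in_base("F4D7", 16) << std::endl;   // 62679
	std::cout << parse_in_base("1101001", 2) << std::endl; // 105
	return 0;
}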
Bit Operations
In the introduction I stated that this document is intended for people who already know a high level programming language, which also implies knowledge of the essential bitwise operations, but I would still like to introduce them at this point. I will use the C operators & for AND, | for OR, ~ for NOT, ^ for XOR, << for SHIFT LEFT and >> for SHIFT RIGHT in this document. Let's start with AND. It takes two bits as input and outputs a single bit. If you do not fully grasp these operations yet I suggest taking a few moments to totally digest and appreciate their output. So what is the AND operation usually used for? We use it to set particular bits of an integer to 0, to "mask off" particular ranges of bits we do not care about.
| a | b | a & b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Now for the OR operator. Like AND it takes two bits as input and has a single bit as output. We use it to set particular bits of an integer to 1:
| a | b | a \| b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

The NOT operator inverts bits, so it takes one argument instead of two like the previous operators:
| a | ~a |
|---|---|
| 0 | 1 |
| 1 | 0 |

The XOR operator is used to "flip" bits. It is not an essential function, meaning that it can be expressed with the help of AND, OR and NOT. Its primary purpose, curiously, is setting registers to zero, but we will get to that later. It is also very useful for pseudo random number and encryption algorithms.
| a | b | a ^ b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The bit shift operations SHIFT LEFT and SHIFT RIGHT cannot be described in compact tables like the other ones. They shift the bits inside the register to the left or to the right, discarding information as bits "are pushed out" and shifting new 0-bits in. The first operand of a shift operation is the number that will get shifted. The second operand specifies the number of bits by which it is shifted. Here are a few examples on 32-bit numbers to clarify what these operations mean:
| a | b | a << b |
|---|---|---|
| 11010001010010100001101010001010 | 1 | 10100010100101000011010100010100 |
| 00000000100000001100000001011111 | 3 | 00000100000001100000001011111000 |
| 11100000010100110001010010111010 | 2 | 10000001010011000101001011101000 |
| 00001001101110000000001110000000 | 29 | 00000000000000000000000000000000 |
| a | b | a >> b |
|---|---|---|
| 11010001010010100001101010001010 | 1 | 01101000101001010000110101000101 |
| 00000000100000001100000001011111 | 3 | 00000000000100000001100000001011 |
| 11100000010100110001010010111010 | 2 | 00111000000101001100010100101110 |
| 00001001101110000000001110000000 | 27 | 00000000000000000000000000000001 |
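To tie the tables above together, here is a small C++ sketch that applies each of these operators to concrete 32-bit values; the particular values and variable names are just examples of my own:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t value = 0xF4D7;

	// AND masks off bits: keep only the lowest 8 bits.
	uint32_t low_byte = value & 0xFF;
	// OR sets bits: force bit 31 to 1.
	uint32_t with_top_bit = value | 0x80000000;
	// NOT inverts every bit.
	uint32_t inverted = ~value;
	// XOR flips bits: flipping a value with itself yields zero.
	uint32_t zero = value ^ value;
	// Shifts move the bits and fill the freed positions with zeroes.
	uint32_t shifted_left = value << 3;
	uint32_t shifted_right = value >> 3;

	std::cout << std::hex
		<< low_byte << " "
		<< with_top_bit << " "
		<< inverted << " "
		<< zero << " "
		<< shifted_left << " "
		<< shifted_right << std::endl;
	return 0;
}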
Signed Integers

We have seen how unsigned integers are encoded in the 32-bit registers of the microprocessor, but surely the x86 architecture has to deal with signed integers as well. There are different systems of representing signed integers using binary encodings. The most intuitive approach is to sacrifice one bit (the "left-most" one) to represent the sign of the integer. This system is called sign and magnitude and the bit we sacrificed is called the sign bit. A sign bit of 0 means that the number is positive and a sign bit of 1 means that the number is negative. The absolute value is expressed in all the other bits. In an 8-bit register the binary number 10011100 represents the integer \(- (16 + 8 + 4) = - 28\). One irritating property of this system is that there are two representations for the number 0:
00000000 and 10000000 - a "positive" zero and a "negative" zero - which is obviously a waste. The actual major problem with this system is that it is very annoying to perform addition and subtraction in it because there are three different cases to be dealt with (positive + positive, positive + negative, negative + negative).

Another possibility is to represent the negative version of the natural number \(n \in \mathbb{N}\), \(-n\), by inverting all its bits (see the NOT operation). Hence this system is called One's Complement. This way one bit is also sacrificed to represent the sign and there are two representations of the number zero again. So the negative version of 00001111 (which is 15) is 11110000 (which is -15) in our hypothetical 8-bit register. The advantage of this system is that it makes addition/subtraction easier. Consider the operation -7 + 3: 11111000 + 00000011 = 11111011, which is the One's Complement representation of -4. As you can see, we can reduce the complexity of arithmetic operations this way as long as we don't cross the zero barrier. Consider the case -1 + 2: 11111110 + 00000010 = 00000000. This is obviously the wrong result, -1 + 2 is not zero, it's 1! Every time you cross the "zero barrier" in One's Complement you have to fix the result by an offset of 1.

So to solve this problem smart computer scientists came up with Two's Complement. It is similar to One's Complement but all negative numbers have an additional offset of 1. This way there is no redundant zero and you can flawlessly perform arithmetic operations without really having to worry about the sign. Two's Complement is the system used by all important mainstream CPUs in desktop computers including the x86 architecture - so be sure to understand this. There is no inherent difference between storing an unsigned integer and a signed integer in a 32-bit register - it is all about how you treat it. The 32-bit value 11111111111111111111111111111111 can either stand for \(2^{32} - 1 = 4294967295\) (which is the greatest unsigned integer that can be represented with 32 bits) or, if interpreted as a negative number, for -1. To calculate the Two's Complement representation of \(-n\) with \(n \in \mathbb{N}\) you need to:
- Subtract 1
- Invert bits
Both steps are invertible, so to convert a negative number back to its positive counterpart you simply reverse the order and invert each operation: invert the bits, then add 1. Let's have a look at the representations of a few integers in Two's Complement encoding:
| Signed decimal number | Two's Complement encoding |
|---|---|
| 23 | 00000000000000000000000000010111 |
| -1 | 11111111111111111111111111111111 |
| -2 | 11111111111111111111111111111110 |
| -3 | 11111111111111111111111111111101 |
| -2147483648 | 10000000000000000000000000000000 |
| 2147483647 | 01111111111111111111111111111111 |
As you can see the greatest signed number that can be represented with Two's Complement is 2147483647 and the lowest one is -2147483648.
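Here is a short C++ sketch of the Two's Complement rules just described - the two equivalent conversion orders and the reinterpretation of the all-ones pattern; the variable names are my own:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t n = 23;

	// Two ways to build the Two's Complement representation of -n:
	// subtract 1 and then invert, or invert and then add 1.
	uint32_t negative_a = ~(n - 1);
	uint32_t negative_b = ~n + 1;

	// The same 32-bit pattern means different things depending on
	// whether you treat it as unsigned or signed.
	uint32_t all_ones = 0xFFFFFFFF;

	std::cout << negative_a << " " << negative_b << std::endl;                   // both print 4294967273, the bit pattern of -23
	std::cout << all_ones << " " << static_cast<int32_t>(all_ones) << std::endl; // 4294967295 -1
	return 0;
}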
The First Steps
Now that we have covered most of the basics we can finally get started with some x86 ASM. I am going to use Intel syntax for now which is far more popular and more intuitive than AT&T syntax. Let's have a look at the following code:
Assembly
mov eax, 1
mov ebx, 10
xor ecx, ecx
our_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle our_loop
Most of these lines represent an instruction that gets executed by the processor. The pattern is instruction [argument 1], [argument 2], ... Let's go through it line by line. mov eax, 1 calls the instruction mov with the arguments eax and 1. The mov destination, source instruction is kind of like the assignment operator in C++. It copies the value of source to destination. In this case it writes the constant 1 to the 32-bit general purpose register eax. That is one of the registers we can use for all kinds of calculations without breaking anything. Here is a list of the 32-bit general purpose registers of the x86 architecture:
- eax
- ebx
- ecx
- edx
- edi
- esi
- ebp
As you can see they follow different naming patterns, and historically some of them are used by special instructions which hardly appear in modern code. Those instructions are mostly still supported for legacy reasons, but they are inefficient and are not produced by modern compilers. After that instruction has been processed, the instruction pointer gets increased. The instruction pointer contains the address of the instruction that is currently being executed. The instruction pointer of the x86 architecture is stored in the 32-bit register eip. So eip gets increased by the length of the mov eax, 1 instruction after it has been executed, and eip then points to the next instruction, which is mov ebx, 10. As you might have guessed, mov ebx, 10 writes the value 10 to the register ebx.

But now it gets interesting. xor ecx, ecx is basically the x86 ASM equivalent of the C++ code ecx ^= ecx;. You might ask: What is the point? The educated mind sees immediately that a ^ a is equal to zero for all values of a. So what this does is set ecx to zero. You might wonder why you do not use mov ecx, 0 instead. Well, you could, but xor ecx, ecx is actually shorter to encode because it does not involve any constant. By always using xor reg, reg instead we reduce the overall file size of our binaries, and this can save time and space.

But this instruction does more than just modify the ecx register. The x86 microprocessor has a set of so called flags. These are 1-bit values which are used for conditional jumps and other operations. We do not think about them most of the time and they are not too relevant for writing conventional code, but it is absolutely essential to know that they exist, what they are good for and how they are set. All arithmetic-logical operations set some of these flags. The mov instruction does not modify the flags, but the xor instruction sets several of them. The Intel x86 Instruction Set Reference says: "The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined." You can look such data up in the original PDF documents released by Intel. Here is a list of the most important x86 flags and what they stand for:
| Abbreviation | Full name |
|---|---|
| CF | Carry Flag |
| PF | Parity Flag |
| ZF | Zero Flag |
| SF | Sign Flag |
| OF | Overflow Flag |

When you perform the addition of two large unsigned 32-bit numbers the result can obviously be too large to be stored in a 32-bit register, so what you get is a so called wraparound or an overflow. Consider the addition of the following numbers:
| Description | Binary value |
|---|---|
| Summand 1 | 11111111111111111111111111111111 |
| Summand 2 | + 00000000000000000000000000000001 |
| Sum | 100000000000000000000000000000000 |

The result requires at least 33 bits to be represented properly. That leading 33rd bit is basically discarded and not stored in the target register - instead it gets stored in the Carry Flag. The Carry Flag is basically always the "33rd" bit. Actually, it is not a standalone 1-bit value but a single bit at offset 0 in a 32-bit register known as the EFLAGS register. This is where all the flags are stored - so obviously there are far more flags than just the ones listed above, but most of them are not of any relevance for writing user land applications. The Parity Flag is set to 1 if the right-most 8 bits of the result contain an even number of 1-bits and to 0 otherwise. It is basically the result of XORing those 8 bits with each other and then inverting the result. The Zero Flag is set when the result is equal to zero - very useful. The Sign Flag is the left-most bit of the 32-bit result - it represents its sign. The Overflow Flag is set when a signed overflow occurred.
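The meaning of these flags can be reproduced in C++ when adding two 32-bit values. The following sketch of my own only illustrates what the flags express, not how the CPU computes them:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t a = 0xFFFFFFFF;
	uint32_t b = 0x00000001;
	uint32_t sum = a + b; // wraps around to 0

	// Carry Flag: the unsigned addition did not fit into 32 bits.
	bool carry = sum < a;
	// Zero Flag: the 32-bit result is zero.
	bool zero = sum == 0;
	// Sign Flag: the left-most bit of the result.
	bool sign = (sum >> 31) & 1;
	// Overflow Flag: both operands have the same sign but the result has the other one,
	// which is impossible for a correct signed addition.
	bool overflow = ((~(a ^ b) & (a ^ sum)) >> 31) & 1;

	std::cout << carry << " " << zero << " " << sign << " " << overflow << std::endl; // 1 1 0 0
	return 0;
}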
But let us get back to the actual code we were discussing. In this program none of the flags that are set by those arithmetic-logical operations actually get used - they are simply overwritten. The next line in the code, our_loop:, is not an actual instruction that gets executed but an instruction to your assembler which allows you to define labels. These are markings in the code which are essential for controlling the flow of execution by providing named locations to which you can jump. add ecx, eax is the x86 ASM equivalent of ecx += eax; and it sets the flags again - which again go unused. inc eax increments the value of eax by 1 - it is the equivalent of eax++;. We could have used add eax, 1 instead and it is pretty much the same, but the inc instruction is smaller so it is preferable. It does not affect the Carry Flag, but that does not make any difference in this case.

cmp eax, ebx is short for "compare" and it performs the subtraction eax - ebx, but the result is not stored anywhere. Instead it simply sets the flags we just talked about according to the result. This is an essential way to check whether two registers have equal values, for example. In case they are equal the result of the subtraction is zero, so the Zero Flag gets set and we can act accordingly. A cmp is the first part of the x86 ASM equivalent of a C++ if construct, so to say. The next part is one of the conditional jumps - in this case it is jle our_loop. jle is short for "jump if less or equal". So, what is a jump? It is a method of controlling the flow of execution of your program by basically telling your microprocessor where to resume the execution after the jump. In the case of the x86 32-bit architecture it means modifying the value of eip. When jumps actually get encoded they just contain numeric offsets which are either relative to the position of the jump ("jump 25 bytes backward") or even absolute ("go to the instruction at offset 0x12345678"). There are unconditional jumps which are executed no matter what and conditional jumps which only perform the jump under certain conditions. If a conditional jump is not taken, the execution simply continues with the next instruction after the jump as usual. our_loop refers to the label we previously defined, so if the jump gets executed the next instruction will be add ecx, eax.

So in this case we are dealing with a jump which is executed if the flags indicate that the result is "less or equal". What is this supposed to mean?! Less than or equal to what? The cmp operation is usually used in combination with a conditional jump. Together they are the x86 ASM equivalent of the C++ if construct. So what the "less or equal" part relates to is the cmp executed prior to the conditional jump. Read cmp eax, ebx in combination with jle our_loop as if(eax <= ebx) goto our_loop; - resume execution at label our_loop if eax is less than or equal to ebx. There is actually a whole family of mnemonics for such conditional jumps:
| Mnemonic | Meaning |
|---|---|
| ja | Jump if above (CF=0 and ZF=0). |
| jae | Jump if above or equal (CF=0). |
| jb | Jump if below (CF=1). |
| jbe | Jump if below or equal (CF=1 or ZF=1). |
| jc | Jump if carry (CF=1). |
| je | Jump if equal (ZF=1). |
| jz | Jump if 0 (ZF=1). |
| jg | Jump if greater (ZF=0 and SF=OF). |
| jge | Jump if greater or equal (SF=OF). |
| jl | Jump if less (SF!=OF). |
| jle | Jump if less or equal (ZF=1 or SF!=OF). |
| jna | Jump if not above (CF=1 or ZF=1). |
| jnae | Jump if not above or equal (CF=1). |
| jnb | Jump if not below (CF=0). |
| jnbe | Jump if not below or equal (CF=0 and ZF=0). |
| jnc | Jump if not carry (CF=0). |
| jne | Jump if not equal (ZF=0). |
| jng | Jump if not greater (ZF=1 or SF!=OF). |
| jnge | Jump if not greater or equal (SF!=OF). |
| jnl | Jump if not less (SF=OF). |
| jnle | Jump if not less or equal (ZF=0 and SF=OF). |
| jno | Jump if not overflow (OF=0). |
| jnp | Jump if not parity (PF=0). |
| jns | Jump if not sign (SF=0). |
| jnz | Jump if not zero (ZF=0). |
| jo | Jump if overflow (OF=1). |
| jp | Jump if parity (PF=1). |
| jpe | Jump if parity even (PF=1). |
| jpo | Jump if parity odd (PF=0). |
| js | Jump if sign (SF=1). |
| jz | Jump if 0 (ZF=1). |

As you can see several of these mnemonics basically do the same thing and they are actually encoded the same way in the actual executable files - they just exist to make the programmer's job easier. So let's translate this assembly code to C++ so we can see what it does:
C++
int eax, ebx, ecx;
//...
//mov eax, 1
eax = 1;
//mov ebx, 10
ebx = 10;
//xor ecx, ecx
ecx = 0;
our_loop:
//add ecx, eax
ecx += eax;
//inc eax
eax++;
//cmp eax, ebx
//jle our_loop
if (eax <= ebx) goto our_loop;
As you can see there is some kind of iteration going on. eax is the iterator which is involved in the decision when to leave the loop, ebx contains the limit for the iterator and ecx += eax is the actual body of the loop. So we can further simplify the program to:

C++
int sum = 0;
for (int i = 1; i <= 10; i++)
	sum += i;
Nice, now that is actually really short and easy to comprehend C/C++ code. So it actually calculates the sum 1 + 2 + ... + 9 + 10 = 55. Currently the upper limit is 10 - what do we do if we want to set this limit to an arbitrary number? In a high level language we would write a new function for this purpose which takes the limit for the loop as its sole argument and which returns an integer, something along the lines of:
C++
int calculate_sum(int limit)
{
	int sum = 0;
	for (int i = 1; i <= limit; i++)
		sum += i;
	return sum;
}
So, how do we translate this back to x86 ASM? There is a direct equivalent for functions, too. Let us check out what it looks like when we translate this function to assembly language and call it with a larger limit of 20:
Assembly
calculate_sum:
mov eax, 1
mov ebx, [esp + 4]
xor ecx, ecx
sum_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle sum_loop
mov eax, ecx
ret 4

start:
push 20
call calculate_sum
The first change you can see is the calculate_sum label that has been added. It is necessary so we have a marker in the code which we can use to call the function. The next change is the mov ebx, [esp + 4] part. The brackets mean that we perform a memory access (RAM). It is the x86 ASM equivalent of the C/C++ * operator used to dereference pointers. So in this case the pointer is esp + 4. The x86 architecture supports address expressions of the form register + constant_offset for this purpose. So in this case the address esp + 4 is calculated and dereferenced. But how many bytes are read and stored in ebx? The register is 32 bits in size, so 4 bytes are read, beginning at the address esp + 4 - the bytes at esp + 4, esp + 5, esp + 6 and esp + 7 are stored in ebx. But in what order? Which byte goes where? The x86 architecture uses the so called Little Endian byte order in memory. In this order the least significant byte is stored first in memory:
Binary value: 10010011000100001100000001110010

In this value the first (left-most) bit is the most significant bit and the last (right-most) bit is the least significant bit. Significance means how much of a difference the bit can make to the value of the integer. The most significant bit has a value of \(2^{31}\) whereas the least significant bit only makes a difference of \(2^0 = 1\). We do not have to worry about the byte order as long as we stick to writing values of a size of \(n \in \mathbb{N}\) bytes to memory and reading \(n\) bytes later. When we read bytes from memory with an instruction like mov ebx, [esi] the x86 microprocessor reads 4 bytes in Little Endian order, of course. The only time we have to worry about it is when we have an integer of \(n > 1\) bytes in memory and we want to access \(m\) particular bytes of it with \(m \in \mathbb{N} \land 1 \le m < n\). In that case we have to know which endianness the underlying system uses so we read the right bytes. The opposite of Little Endian byte order is Big Endian. In Big Endian byte order the most significant byte is stored first - just like we do in normal hexadecimal and decimal notation. We write down the number 931 to express \(9 \cdot 100 + 3 \cdot 10 + 1 \cdot 1\). If we applied the Little Endian philosophy to this we would write 139 instead of 931, so to say. Let us have a look at a few examples of how bytes are stored in memory with different endiannesses:
| Hexadecimal notation | Consecutive bytes in memory with Little Endian byte order | Consecutive bytes in memory with Big Endian byte order |
|---|---|---|
| 0xFF | 0xFF | 0xFF |
| 0x1234 | 0x34, 0x12 | 0x12, 0x34 |
| 0x09451143 | 0x43, 0x11, 0x45, 0x09 | 0x09, 0x45, 0x11, 0x43 |
So for single bytes it does not make any difference, but as soon as you deal with integers which require two or more bytes the endianness becomes an issue. Big Endian byte order is much more intuitive as you can see, since it matches the way we usually write numbers, so it is easier for humans to read when looking at hex dumps and the like. With Little Endian byte order you have to reverse the bytes first in order to parse them the way you usually do. Unluckily the x86 architecture uses the Little Endian byte order, so there is not much we can do about it. Even the internet is based on the Big Endian byte order - all the official protocols such as IP, TCP and UDP use it. This is why it is also called Network Byte Order. When you send multi-byte integer values in the headers of these protocols from a Little Endian machine, all the bytes must be reversed first in order to satisfy the specifications - what a bummer!
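You can observe the byte order of the machine you are on with a few lines of C++. This is a sketch of my own; on an x86 system it prints the bytes of the constant in Little Endian order:

C++

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
	uint32_t value = 0x09451143;
	unsigned char bytes[sizeof value];
	// Copy the 32-bit integer into a byte array to inspect how it is laid out in memory.
	std::memcpy(bytes, &value, sizeof value);
	// On a Little Endian machine (such as x86) this prints: 43 11 45 09
	for(std::size_t i = 0; i < sizeof value; i++)
		std::printf("%02x ", bytes[i]);
	std::printf("\n");
	return 0;
}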
Now that we have cleared this up we should get back to the statement we were originally discussing, mov ebx, [esp + 4]. There is a new register in that statement: esp. Its name is short for extended stack pointer. It is actually a 32-bit "general purpose register" just like the other 7 ones that we listed above, but messing with it can have disastrous results and there is not really a point in using it for any other purpose in normal code. The stack is a part of the memory your application uses. The C++ analogy to this mechanism is std::stack (which I hope you are familiar with) and in Java it would be java.util.Stack. A stack is one of the basic container/data structure types which (in theory) only allows you to access the element at the top of the stack by "popping" it off and allows you to add new elements to the stack by "pushing" them onto the top. In terms of microprocessors it is a bit different because you usually also access elements inside the stack which are not at the top, because that is more efficient. Let us have a look at a few examples of operations on an abstract stack of integers. This is what our stack looks like at the beginning (do not be confused by the "Bottom of the stack" part, the bottom does not serve a special purpose and is just marked to clarify things):
| Description | Value |
|---|---|
| Top of the stack | 4 |
| | 17 |
| Bottom of the stack | 2 |

We are going to execute a series of operations on this stack now to observe the changes. First we will push an 8 onto the top. The C++ pseudo code equivalent for this action shall be:

push(8);

| Description | Value |
|---|---|
| Top of the stack | 8 |
| | 4 |
| | 17 |
| Bottom of the stack | 2 |

As you can see the 4 is no longer the top of the stack and the value we just pushed onto the stack is now at the top. Next pseudo code to execute:

int a = pop();
int b = pop();

| Description | Value |
|---|---|
| Top of the stack | 4 |
| | 17 |
| Bottom of the stack | 2 |

| Description | Value |
|---|---|
| Top of the stack | 17 |
| Bottom of the stack | 2 |

The pseudo function pop() pops one value off the stack and returns its value. So the two variables we used in the pseudo code we just executed now have the values a == 8 and b == 4. So what is the point of this whole stack? What does it get used for in the x86 architecture and other microprocessor architectures? It is an essential data structure which is used to:
- Pass arguments to functions
- Store return addresses
- Allocate local variables which are too large for the registers
We will get back to what a return address is and how we allocate space for local variables later. First I would like to discuss the necessity of the stack for passing arguments to functions. You might wonder: "Why do you not simply pass the arguments in registers?" There actually is a special type of calling convention for functions which does use this mechanism and it is known as fastcall - because it is extremely fast, since it uses registers for the first few arguments. This, too, is not yet the point to discuss calling conventions, though. The problem with having all your arguments in registers is that you have a very limited number of registers but a large number of functions which call each other in complicated nested ways, so you will eventually run out of registers one way or the other. Recursive function calls of arbitrary call depth are physically not possible without storing the arguments somewhere in memory. This is why we pass most arguments on the stack. We push the arguments onto the "top", call our function and remove them after that.

The x86 architecture stores the current pointer to the top of the stack, the stack pointer, in the esp register. So mov eax, [esp] performs a memory access and copies the value of the top element of the stack into the eax register. We are dealing with a 32-bit architecture, so all elements on the stack are of the same size, which is 4 bytes. This is why esp will usually have a value which is a multiple of 4. You might expect that mov eax, [esp - 4] accesses the element right "below" the top one - but that is not so. On the x86 architecture the next element below the top is actually stored at esp + 4, so we have to use mov eax, [esp + 4] to copy its value to eax. The third element would be at esp + 8, the fourth at esp + 12 and so on. The reason for this somewhat counter-intuitive order is that the x86 stack grows downwards, towards lower addresses - pushing a new element decreases esp, as you will see in a moment. The x86 instruction to push a new element onto the top of the stack is push. It takes one argument which can be a register or an immediate value, for example. An immediate value is any constant value like 2 or 0x00ff1234 which is directly encoded into the instruction. So what push eax does is:
- sub esp, 4 (decrease the stack pointer by the word size of the microprocessor to make space for the new top element)
- mov [esp], eax (write the value of eax to the memory at the address contained in esp - the top of the stack is now equal to eax)
You can basically replace push with those two instructions all the time, but x86 is a CISC architecture, so we save space by using push instead. sub is short for subtract and, as you might have guessed already, it performs a subtraction like add performs an addition. The arguments are intuitive: sub register1, register2 is equivalent to the C++ code register1 -= register2; or register1 = register1 - register2;. The x86 instruction to pop an element off the stack is called pop. Like push, it takes one argument. Usually it is a register, so let us check out what pop ebx does:
- mov ebx, [esp] (read 4 bytes from the memory at the address specified by the register esp and store them in ebx)
- add esp, 4 (add 4 to the stack pointer, which makes the element "below" the old top the new top element of the stack)
So, to get back to the piece of x86 ASM we were previously discussing: what mov ebx, [esp + 4] really does is copy the first argument passed to the function into the ebx register. The first argument is the second element of the stack, the second argument is the third and so on. So what does the top of the stack contain? What is stored at [esp]? This is the so called return address, which is necessary for a function to know where it was originally called from, so that eip, the instruction pointer, can be set to the right value after the function returns. We want our program to resume execution with the instruction after our call, after all.
The start label is not really relevant for understanding the principle, but I simply wanted to point out where the execution actually starts in case somebody got confused by the stream of x86 ASM instructions. The assembly code push 20, call calculate_sum first pushes the immediate value 20 onto the top of the stack and then calls the function. Remember that we only see this label in our assembly language code. It does not exist as such in the actual binary file. In the executable file you will simply see the call with an address after it. This address points to the first instruction of the function we marked with the label calculate_sum. In this case it is the instruction mov eax, 1. This is what call address actually does:
- esp -= 4; (make space for the return address on the stack)
- *esp = address_of_next_instruction; (store the address of the instruction after the call on the stack)
- eip = address; (modify the instruction pointer so execution resumes in the body of the function)
As you can see I did not use x86 ASM instructions to describe what this instruction does. Instead, I resorted to the C++ pseudo code I use occasionally. This is because we cannot modify eip like a normal register. It takes a jump or a call (which is similar to a jump) to achieve that. What the combination of push 20 and call calculate_sum really translates to in C++ is calculate_sum(20);. The function calculate_sum is called with the argument 20 - which is what we originally wanted. You will see the push/call construct quite frequently. Usually functions have more than one argument, so you will see multiple pushes prior to a call. Perhaps you have noticed that another element has been added to the function. The ret instruction is used to return from a function after it has been called. Additionally it can take an argument, which in the case of ret 4 is equal to 4. Let us have a look at what this ret argument instruction does:
- eip = *esp; (copy the return address from the stack back into the instruction pointer to resume execution at the instruction after the call)
- esp += 4 + argument; (remove the return address from the stack and, if necessary, the arguments that were pushed onto the stack for the function)
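To summarise these stack operations, here is a small C++ sketch of my own that models push, pop, call and ret n with an array and an index. The downward growth and the use of one array slot per 4-byte stack element are simplifications made purely for illustration:

C++

#include <cstdint>
#include <iostream>

// A tiny model of the x86 stack: esp starts at the high end of the
// memory array and pushing decreases it, just like on the real CPU.
uint32_t memory[256];
uint32_t esp = 256;
uint32_t eip = 0;

void push(uint32_t value)
{
	esp -= 1;            // on the real CPU: sub esp, 4
	memory[esp] = value; // mov [esp], value
}

uint32_t pop()
{
	uint32_t value = memory[esp]; // mov target, [esp]
	esp += 1;                     // add esp, 4
	return value;
}

void call(uint32_t address, uint32_t return_address)
{
	push(return_address); // store where to resume after the function
	eip = address;        // jump into the function body
}

void ret(uint32_t argument_slots)
{
	eip = pop();           // restore the return address
	esp += argument_slots; // drop the arguments pushed by the caller
}

int main()
{
	push(20);     // push 20
	call(100, 5); // call calculate_sum (pretend it lives at 100, return to 5)
	ret(1);       // ret 4 - here one slot stands in for 4 bytes
	std::cout << "eip = " << eip << ", esp = " << esp << std::endl; // eip = 5, esp = 256
	return 0;
}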
If a function takes \(n \in \mathbb{N}\) arguments you will usually return from it with the instruction \(\mbox{ret } 4 \cdot n\). If the function is nullary (if it takes no arguments), you can simply use ret instead of ret 0. You also might have noticed the additional mov eax, ecx I inserted prior to the ret 4. I did this because it is a common calling convention to return the result of a function in the eax register, which is the first of the 8 (or rather 7 without esp) general purpose registers. Of course you do not have to do this, but if you interface with C/C++ code you will have to know this. Now that we have covered the basics of calling and returning we can move on to the topic of calling conventions. You might remember me talking about the fastcall convention - it is one of them. If you are an experienced C/C++ programmer you should already be familiar with a few of them. They differ in the following regards:
- How do arguments get passed? On the stack? In registers?
- In what order are arguments passed?
- Who is responsible for fixing the stack after the function has processed the arguments?
- What registers must be preserved by the function?
Let us take a look at the most common calling conventions:
| Calling convention | Where are the arguments stored? | In what order are the arguments passed? | Who is responsible for cleaning up the stack? | Registers that have to be preserved |
|---|---|---|---|---|
| cdecl (C declaration) | Stack | Right to left | Calling function | ebx edi esi ebp |
| stdcall (standard call) | Stack | Right to left | Called function | ebx edi esi ebp |
| fastcall (fast call) | Registers and stack | 1. ecx, 2. edx, 3. and further: stack, right to left | Called function | ebx edi esi ebp |
You will find the cdecl calling convention in pretty much all C based interfaces, especially in the world of UNIX-like operating systems. Windows uses the stdcall convention for most of its API. Right to left means that the rightmost argument in conventional notation is pushed first in the assembly code. The order obviously does not make any difference if there is only a single argument, as in the code we were discussing. If the called function is responsible for fixing the stack - that is, getting rid of the arguments below the return address which were passed to it - it simply does what our code does: it uses the ret instruction with the right argument to clean up the stack. If the calling function is responsible for cleaning up the stack, the function in question usually simply executes a plain ret and leaves all the work to whoever called it. Let us clarify by example:

C++
void __cdecl test1(int a, int b, int c)
{
}

void __stdcall test2(int a, int b, int c)
{
}

test1(1, 2, 3);
test2(1, 2, 3);
This C++ code would look something like this in x86 ASM:
Assembly
test1:
ret

test2:
ret 12

start:
push 3
push 2
push 1
call test1
add esp, 12
push 3
push 2
push 1
call test2
So the differences between these two function calls are:
| Calling convention | Description |
|---|---|
| cdecl (C declaration) | Uses ret to return and forces the calling function to add esp, 12 to fix the stack. |
| stdcall (standard call) | Uses ret 12 to return so the calling function does not have to fix the stack. |
stdcall is more efficient than cdecl here because it requires one instruction less to execute. Do not be fooled by the number of instructions - there are tons of examples in which code that uses more instructions to solve the same problem is actually faster, but in this case the intuition happens to be right because all these simple instructions execute at about the same speed. However, it is not yet the time to talk about optimisation in general, so let us return to the matter at hand. Now that we have covered some of the fundamentals of x86 assembly programming we can finally start writing simple programs that actually interact with the API of the underlying OS. We are going to implement the same basic command line program on different operating systems using various assemblers in the next section of this document. This program will simply be an extension of what we have already written.

Writing Windows Applications Using x86 32-bit Assembly
At first we are going to take a look at how to write assembly for the popular Windows operating system. We are going to check out the Microsoft (R) Macro Assembler Version 9.00.21022.08 (short MASM), the great open source program flat assembler 1.67.27 for Windows (short FASM) and the netwide assembler 2.04 RC 1 (short NASM).
MASM
This is the most used assembler for writing assembly code for the Microsoft Windows operating systems. Before I am going to just throw the code right into your face I would like to dedicate a few words to what is known as MASM32 - a great source of confusion and frustration.
MASM32 (from movsd.com/masm32.com) does not have anything to do with Microsoft or MASM as such. It is a package mostly made by an Australian known as Steve Hutchesson (aka hutch). This package contains lots of headers and example code and it uses the Microsoft ml/link binaries version 6.x from 1998 or so. It is absolutely useless for any modern x86 32-bit Windows assembly programming. Never make the mistake of mixing up MASM and MASM32. I see a lot of people making this mistake. I have even seen people give up on MASM because they thought that MASM32 was it. It is really sickening. As for the actual package itself, the example code may be useful for new ASM programmers but you should stay away from the MASM32 "library" and the outdated headers which give you parsing errors in new versions of MASM.
Let us get back to the actual code. We will go through it line by line:
Assembly
;Arithmetic sum program
.686p
.model flat, stdcall
option epilogue: none
option prologue: none

GetStdHandle proto :dword
wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data
sum_string db "Arithmetic sum for the limit %d is %d", 10, 0

.data?
buffer db 64 dup (?)
bytes_written dd ?

.code

calculate_sum proc limit:dword
xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret
calculate_sum endp

public main

main proc
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov ebp, eax
xor ebx, ebx
main_loop:
invoke calculate_sum, ebx
invoke wsprintfA, addr buffer, addr sum_string, ebx, esi
invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
inc ebx
cmp ebx, 20
jle main_loop
invoke ExitProcess, 0
main endp

end
;Arithmetic sum program is a comment. In MASM comments are made using the semicolon character. There are only single line comments, so it is pretty much the equivalent of the C/C++ //. Unluckily there is no multi line comment feature. .686p specifies which instructions are allowed to be used in the assembly code. There are numerous other options but this one is a no-brainer really - .686p is simply the most "modern" one. .model flat, stdcall specifies two things. For one it specifies the memory model to be used. The flat memory model is the only one supported by modern Windows operating systems, so it is hardly worth discussing. The stdcall part specifies the default calling convention. Since we are dealing with Windows API code we definitely should use stdcall. option epilogue: none and option prologue: none tell the assembler that we do not want to use the default MASM prologues/epilogues. These are instructions that are automatically inserted at the start of your functions and wherever you use ret. It is a nice feature to support, but it is very annoying that it actually puts instructions there you do not see in your code - instructions you never expected to be there until you run your binary in a debugger and find out that some strange feature is totally breaking your assembly! In the case of MASM the default code establishes a so called stack frame. This is mostly a nasty relic of the past which is still used by unenlightened copycats who never bother to check whether such features actually make sense or not. I am just going to demonstrate this dumb practice to you once:

Assembly
;non-sensical function
some_function proc
xor eax, eax
jmp return
ret
return:
mov eax, 1
ret
This function does not perform any meaningful task, it is merely an example. With the default epilogue/prologue code this will assemble to:
Assembly
push ebp
mov ebp, esp
xor eax, eax
jmp some_constant ;I do not know the immediate value that will be used here so I just made up some name.
pop ebp
ret
mov eax, 1
pop ebp
ret
So as you can see, every proc (a fancy way of declaring a function, short for "procedure") will generate an additional push ebp and mov ebp, esp, and an additional pop ebp prior to every single ret. So at first the value of ebp is supposed to get preserved by being copied to the stack. Then esp is copied to ebp. In the end the original value of ebp gets restored by pop ebp. So, what is the point of all this? When you manipulate esp in your function body you have to keep track of what offset the arguments which were passed to your function are at. The idea of the stack frame is to preserve the original value so you do not have to do any thinking to come up with the offset. The truth is that stack frames are useless unless you are performing operations which do not let you predict the argument offsets at assembly time. It is a waste of instructions. All you usually achieve is being able to write stuff like mov eax, [ebp + 4] instead of mov eax, [esp + 4], which is really idiotic in general. So do not use this feature. Now that we have cleared this up, let us get back to the code at hand.
GetStdHandle proto :dword declares the prototype (proto) of a function called GetStdHandle which uses the default calling convention we specified - we specified stdcall in the .model directive. We could even have used GetStdHandle proto stdcall :dword to be more explicit about it, but that is unnecessary in this case since we already specified a default calling convention. If you do not know what a particular function does you should definitely look it up on MSDN. You might even know some of these Windows API calls, but you are probably surprised by the -A suffixes some of them have. The Windows API uses the -A suffix for "ASCII" and the -W suffix for "wide chars" - Unicode. They are redirected according to the preprocessor definitions in the C interface. The -A variants usually simply call the -W ones after a few string conversions. Windows handles most strings internally as Unicode. It is more convenient to use ASCII in this case, though, so we use the -A interfaces. On the next few lines a few more prototypes are declared and the majority of them are just like this one, but the wsprintfA one stands out because of the vararg part. This nasty function takes a variable number of arguments and it is up to the caller to clean up the stack afterwards. It uses the C calling convention, hence the c part. These prototypes basically work like function declarations in C/C++. We definitely need them for external calls, so we have to declare them using the proto directive before we can use them.
STD_OUTPUT_HANDLE equ -11 is a directive which will replace all future occurrences of the name STD_OUTPUT_HANDLE with the constant -11. We need this one for GetStdHandle. The .data directive declares the start of the data section. This is where we store static data/global variables, so to say. sum_string db "Arithmetic sum for the limit %d is %d", 10, 0 declares a zero terminated string "Arithmetic sum for the limit %d is %d\n" by the name of sum_string. db is short for "declare byte(s)", I believe. Consecutive bytes can simply be declared using the comma notation you can see in that line. The .data? directive introduces the uninitialised data section. We store static variables which have no predefined value in this part of the program. buffer db 64 dup (?) declares a range of 64 bytes which have no defined value, as specified by the question mark in the parentheses (?). dup is short for "duplicate", or so I believe. bytes_written dd ? uses the dd directive, which is short for "declare DWORD", again with no predefined value. At this point I should probably explain the Microsoft/Intel terms for different byte counts:
| Term | Full name | Size in bytes |
|---|---|---|
| BYTE | Byte | 1 |
| WORD | Word | 2 |
| DWORD | Double word | 4 |
| QWORD | Quad word | 8 |

Actually, "word" usually refers to whatever the normal size of the general purpose registers of a microprocessor may be, which in the case of the x86 32-bit architecture happens to be 4 bytes. Intel and Microsoft somewhat hijacked these terms, so you have to be careful not to mix this stuff up.
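If you work with these sizes from C++, the fixed-width integer types line up with the Microsoft/Intel terms. A quick sketch to make the mapping explicit:

C++

#include <cstdint>

// BYTE, WORD, DWORD and QWORD in the Microsoft/Intel sense correspond to
// the fixed-width unsigned integer types of <cstdint>.
static_assert(sizeof(std::uint8_t) == 1, "BYTE is 1 byte");
static_assert(sizeof(std::uint16_t) == 2, "WORD is 2 bytes");
static_assert(sizeof(std::uint32_t) == 4, "DWORD is 4 bytes");
static_assert(sizeof(std::uint64_t) == 8, "QWORD is 8 bytes");

int main()
{
	return 0;
}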
.code marks the beginning of the code section where all the instructions are stored. calculate_sum proc limit:dword declares the beginning of a function (procedure) by the name of calculate_sum which takes one argument that happens to go by the name of limit - mostly for documentation purposes, though. MASM does support statements like mov eax, limit, but in this case it would translate to mov eax, [ebp + 4] because this inferior software simply assumes that you use the stupid stack frame thing I previously explained. So just stick to accessing the arguments using esp directly - it is much more comfortable and efficient in the long run. The rest of the function is actually pretty much the same - except for the ret. As you can see there no longer is any argument on the return instruction. This is managed by MASM automatically since we specified the number of arguments in the proc line. calculate_sum endp declares the end of the procedure. Requiring the programmer to repeat the name of the function there is really redundant and I have no idea why they coded it this way, but we do not have much of a choice if we want to use the proc feature in MASM. end marks the end of the source file.
public main makes the label/function main a public symbol so we can later specify it as the entry point of our program on the linker command line. main proc declares the beginning of the main procedure. I chose to call it main because of the meaning of the name in C/C++, but you could pretty much give it any name you want. The main procedure contains the code that gets executed first. invoke is not an actual x86 instruction - it is a MASM macro which makes calling functions with one or more arguments more comfortable. As you can see it allows you to call functions pretty much as in C/C++. The addr directive gets the address of a named object inside the data section. invoke wsprintfA, addr buffer, addr sum_string, ebx, esi calls the user32.dll function wsprintfA, which has a variable number of arguments and uses the C calling convention. We use it to fill the static buffer with the formatted string output, with the appropriate values plugged in where the %d placeholders are - but you should be familiar with this nasty stuff from C and possibly other high level programming languages. The rest of the program should be fairly obvious with the knowledge you now possess, so let us find out how to assemble, link and run this stuff.

At first you will need to get your hands on the relevant MASM binaries. mspdb80.dll and link.exe come with Microsoft Visual C++ Express 9 which can be downloaded for free from http://www.microsoft.com/express/Downloads/. I recommend that you get the offline installation instead of the annoying web installer - it simply gives you more control. I do not know where to get the current ml.exe from for free. You can get the 8.0 one at http://www.microsoft.com/downloads/details.aspx?FamilyID=7a1c9da0-0510-44a2-b042-7ef370530c64&displaylang=en apparently, but that is outdated. If you already own a legal or illegal copy of Microsoft Visual C++ 9 you will not need to download it, obviously. We only care about three files out of the installation anyway. They are:
- ml.exe (771 KiB)
- link.exe (349 KiB)
- mspdb80.dll (188 KiB)
Ok, actually we can use more of the files - the lib and dll files. Here is the command line I used to assemble and link this program:
D:\Code\Assembly\calculate_sum\source>"D:\Code\Assembly\binaries\32\ml.exe" /c /Cx /coff /nologo calculate_sum.asm
Assembling: calculate_sum.asm

D:\Code\Assembly\calculate_sum\binary>"D:\Code\Assembly\binaries\32\link.exe" /MACHINE:X86 /NODEFAULTLIB /NOLOGO /ENTRY:main /OUT:calculate_sum.exe /SUBSYSTEM:CONSOLE calculate_sum.obj /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" kernel32.lib "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\msvcrt.lib"
ml.exe is the actual Microsoft Macro Assembler which we use to produce object files. link.exe is the Microsoft linker which produces the actual binaries from the object files. Let us first quickly go through the arguments of each program.
/c tells it that you want to generate a single object file without having ml produce an .exe directly. /Cx makes it case sensitive: "Preserve case in publics, externs". /coff makes it generate COFF files which is the standard object format Windows uses. /nologo suppresses the Microsoft copyright message output. calculate_sum.asm is simply the name of the source file in which I stored the assembly code.

Now for the linker arguments. /MACHINE:X86 specifies the x86 machine as target platform. /NODEFAULTLIB specifies that it must not use default libraries. /NOLOGO does the same thing as before. /ENTRY:main specifies the main entry point of the program - this is the place where the execution starts first. This is directly related to the public main in our assembly code. /OUT:calculate_sum.exe specifies the output file name which is our binary - calculate_sum.exe. /SUBSYSTEM:CONSOLE specifies the subsystem to use. In this case we want a console so we use the console setting of the Windows PE loader. calculate_sum.obj is the output from ml.exe. /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" specifies the default path where link.exe should look for library files to link with. kernel32.lib is an essential library which we need for the ExitProcess call. "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\msvcrt.lib" links the object file with the Microsoft Visual C runtime. We need this to use all the C library functions we are using. So after executing both lines we should have a working small 2.5 KiB executable which we are going to run now:

D:\Code\Assembly\calculate_sum\binary>calculate_sum.exe
Arithmetic sum for the limit 0 is 0
Arithmetic sum for the limit 1 is 1
Arithmetic sum for the limit 2 is 3
Arithmetic sum for the limit 3 is 6
Arithmetic sum for the limit 4 is 10
Arithmetic sum for the limit 5 is 15
Arithmetic sum for the limit 6 is 21
Arithmetic sum for the limit 7 is 28
Arithmetic sum for the limit 8 is 36
Arithmetic sum for the limit 9 is 45
Arithmetic sum for the limit 10 is 55
Arithmetic sum for the limit 11 is 66
Arithmetic sum for the limit 12 is 78
Arithmetic sum for the limit 13 is 91
Arithmetic sum for the limit 14 is 105
Arithmetic sum for the limit 15 is 120
Arithmetic sum for the limit 16 is 136
Arithmetic sum for the limit 17 is 153
Arithmetic sum for the limit 18 is 171
Arithmetic sum for the limit 19 is 190
Arithmetic sum for the limit 20 is 210
D:\Code\Assembly\calculate_sum\binary>
FASM
Flat assembler is a great multi platform open source program which you can download at http://flatassembler.net/download.php. Here is the equivalent code:
Assembly
format PE Console 4.0
entry start
include 'D:\Downloads\fasmw16727\INCLUDE\win32a.inc'
include 'D:\Downloads\fasmw16727\INCLUDE\macro\masm.inc'
option prologue:none
option epilogue:none
section '.data' data readable writeable
sum_string db 'Arithmetic sum for the limit %d is %d',10,0
section '.bss' readable writeable
buffer db 64 dup (?)
bytes_written dd ?
section '.code' code readable executable
proc calculate_sum limit
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
endp
start:
    invoke GetStdHandle, STD_OUTPUT_HANDLE
    mov ebp, eax
    xor ebx, ebx
main_loop:
    stdcall calculate_sum, ebx
    invoke wsprintfA, buffer, sum_string, ebx, esi
    invoke WriteConsoleA, ebp, buffer, eax, bytes_written, 0
    inc ebx
    cmp ebx, 20
    jle main_loop
    invoke ExitProcess, 0
section '.idata' import data readable
library \
    kernel32, 'kernel32.dll', \
    user32, 'user32.dll'
import kernel32, \
    GetStdHandle, 'GetStdHandle', \
    WriteConsoleA, 'WriteConsoleA', \
    ExitProcess, 'ExitProcess'
import user32, \
    wsprintfA, 'wsprintfA'
format PE Console 4.0 specifies the format of the output file. As you can see from this, FASM is both an assembler and a linker. It specifies the portable executable format (PE) which is used by Windows and the console subsystem just like last time. entry start specifies the entry point - in this case it is the label start. The include directive includes another FASM file. In this case I included files from the FASM installation which contain information about the Win32 API and a file which contains macros. We require those macros for the next two lines: option prologue:none, option epilogue:none are basically equivalent to the MASM directives. Unluckily FASM, too, uses a stack frame for procs by default - what a shame. So we need to use those directives to get rid of the stuff again. The section directives declare new sections, their names and their properties. The following lines do not require any special explanations, with two exceptions. The .bss section is used for uninitialised data and it is basically equivalent to MASM's .data?. stdcall calculate_sum, ebx is different, aswell. For some peculiar reason you cannot use the invoke macro in that case and I really have no idea why. It will give you an assembler error if you do. The only difference between those calls is the fact that the invoke lines operate on imported functions whereas calculate_sum was defined by us. Another difference to MASM is the section '.idata' import data readable part which actually defines an import section. This part is not visible in MASM as such. We define which functions to import from dynamic Windows libraries (DLL) in this part of the code so we can call and use them. The command line is unspectacular:

D:\Downloads\fasmw16727>FASM.EXE D:\code\Assembly\calculate_sum_fasm\calculate_sum.asm D:\code\Assembly\calculate_sum_fasm\calculate_sum.exe
flat assembler version 1.67.27 (1207647 kilobytes memory)
3 passes, 2048 bytes.
D:\Downloads\fasmw16727>
The output of the executable is the same as before so we do not need to go over it again.
NASM
The netwide assembler is another multi platform open source assembler which I do not have much experience with. Windows include files come in a separate package known as NASMX and unluckily it is outdated and does not work well with the current version of NASM (produces warnings, possibly errors). I tried to use it at first for this small example but eventually I gave up on it and went the hard way. I shall link NASMX here aswell in case it gets updated at some point but I cannot recommend getting it right now except as example code. Go to http://nasm.sourceforge.net/ to download NASM and http://www.asmcommunity.net/projects/nasmx/ to get the outdated NASMX. I am using the Microsoft linker with the NASM object file output in this example but you can use other linkers aswell. The NASMX package comes with a different linker but I will stick with Microsoft's current linker which comes with Visual Studio 2008. Here is the equivalent code in NASM which is "rawer" than the MASM and FASM one:
Assembly
%define STD_OUTPUT_HANDLE -11
extern _GetStdHandle@4
extern _WriteConsoleA@20
extern _wsprintfA
extern _ExitProcess@4
global _start
[section .data]
format_string db 'Arithmetic sum for the limit %d is %d',10,0
[section .bss]
buffer resb 64
bytes_written resd 1
[section .text]
calculate_sum:
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
_start:
    push STD_OUTPUT_HANDLE
    call _GetStdHandle@4
    mov ebp, eax
    xor ebx, ebx
main_loop:
    push ebx
    call calculate_sum
    push esi
    push ebx
    push format_string
    push buffer
    call _wsprintfA
    add esp, 16
    push 0
    push bytes_written
    push eax
    push buffer
    push ebp
    call _WriteConsoleA@20
    inc ebx
    cmp ebx, 20
    jle main_loop
    push 0
    call _ExitProcess@4
%define STD_OUTPUT_HANDLE -11 simply defines the constant STD_OUTPUT_HANDLE again, just with a more C'ish feeling to it. extern _GetStdHandle@4 declares the external symbol _GetStdHandle@4 which is the actual name of GetStdHandle inside the Microsoft library kernel32.lib. MASM handles those on its own and can use the names as they actually appear in the Windows DLLs. The number behind the @ is the size of the arguments in bytes, so it is basically the number of arguments times 4 in most cases. global _start declares a symbol which gets exported. We require this in order to define _start as an entry point in the linker settings later on. The underscore prefix is required by the Microsoft linker as you will see later. The section declaration syntax is a bit different as you can see but the names are still the same. buffer resb 64 declares 64 uninitialised bytes in the .bss section. The b in resb stands for byte. bytes_written resd 1 declares a single uninitialised "dword", a 4 byte integer, hence the d in resd. The calls below show that NASM does not come with any fancy invoke (MASM, FASM) or stdcall (FASM) macros by default. NASMX has those but I do not use it for the aforementioned reasons. So we have to push all the arguments manually as necessary. The add esp, 16 stack fix is necessary because wsprintfA uses the C calling convention so it does not clean up after itself. We pushed 4 arguments onto the stack so we have to add \(4 \cdot 4 = 16\) to esp. Now for the command line arguments to assemble and link this program:

d:\Downloads\nasm-2.04rc1-win32\nasm-2.04rc1\nasm.exe -f win32 D:\code\Assembly\calculate_sum_nasm\calculate_sum.asm -o D:\code\Assembly\calculate_sum_nasm\calculate_sum.obj
D:\Code\Assembly\binaries\32\link.exe /MACHINE:X86 /NODEFAULTLIB /NOLOGO /ENTRY:start /OUT:calculate_sum.exe /SUBSYSTEM:CONSOLE d:\Code\Assembly\calculate_sum_nasm\calculate_sum.obj /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" kernel32.lib user32.lib
This does not require much explanation really since it is the same stuff all over again. One notable thing is the /ENTRY:start part. The Microsoft linker prefixes all those names with an underscore (_) when it actually looks them up inside the object files, which is why your entry point must have this prefix inside the code - hence the name _start in the source code. The output is the same as usual so we should move on.

Writing Linux/BSD Applications Using x86 32-bit Assembly
Now that we have covered Windows x86 programming with various assemblers it is time to find out how to implement this program on Linux/BSD, too. FASM runs on x86 Linux/BSD and NASM will run on any Linux/BSD aswell but I have already demonstrated the syntax for those so I am just going to cover the infamous GNU assembler version 2.17 in this section.
as
as is the command to call the GNU assembler (short gas) on Linux/BSD systems. It comes with any sane distro so I really do not need to tell you where to get it. Consult apt-get or emerge otherwise. It is generally agreed upon that the gas syntax is very archaic and esoteric. It uses the AT&T syntax (as opposed to the Intel syntax I have been using so far) in which the order of all arguments for the x86 instructions is basically reversed. Additionally it uses customised instruction names with -b, -w, -l suffixes to imply byte/word/long operand sizes (1 byte, 2 bytes, 4 bytes) and it introduces terribly redundant and complicated notations for offsets as well as cryptic register and number prefixes. Luckily, a feature was introduced which enables crude Intel syntax support. I am going to use this feature in the following code. Unluckily I had to use a silly hack to make it work properly because I could not find proper documentation on replacements for standard operations which are not possible in the Intel syntax mode using the conventional ways. I am going to show you some examples of conventional gas code later. For the record: I am using GNU assembler version 2.17 (i486-linux-gnu) using BFD version 2.17 Debian GNU/Linux and collect2 version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) (i386 Linux/ELF) in this example. Here is my code:
Assembly
.code32
.intel_syntax noprefix
.section .data
format_string:
    .string "Arithmetic sum for the limit %d is %d\n"
.section .text
calculate_sum:
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
.global main
main:
    xor ebx, ebx
main_loop:
    push ebx
    call calculate_sum
    push esi
    push ebx
    lea edi, format_string
    push edi
    call printf
    add esp, 12
    add ebx, 1
    cmp ebx, 20
    jle main_loop
    xor eax, eax
    ret
You can probably figure out most of this stuff on your own by now, it is the same stuff over and over again. It is merely the syntax that differs, the idea is usually still the same.
The .section directives declare sections and so on. .code32 forces the assembler to produce 32-bit code instead of 64-bit code. This is only necessary if you are running a 64-bit box. .intel_syntax noprefix does what I was talking about. It allows you to write code which is more similar to what you are (hopefully) used to - the Intel syntax. gas is pretty low level by default as you can see and it does not offer any fancy macros as such. I think it supports the C/C++ preprocessor, though, so you could write all kinds of crazy stuff using #define and such. The "hack" I was talking about in the introduction is lea edi, format_string. lea is short for "load effective address". It is used to store the address of some object in a register, so they say. In practice a simple lea is hardly any different from what mov does. In this case we could have achieved the same result using a mov. The problem is just that I have not figured out how to get the address of a string when you are using the .intel_syntax noprefix mode otherwise. In real code lea is mostly used to perform simple arithmetic operations because it possesses curious multiplicative and additive capabilities which allow you to perform certain simple tasks much faster than with other native instructions, but we will get back to that later. In this case it is really just a silly unnecessary hack so ignore it for now. In the next line I just push the value of the register onto the stack but what I really wanted to do should be clear from the previous source code flavours you have seen so far - I simply wanted to push the address of that label onto the stack. The only peculiar property of Linux/BSD you can see in this code is that you do not use any special function to exit the program. You can simply ret from the main function and it will return the program exit code in eax.

Figuring out how to assemble and link this application on my Debian box was a painful process. I recommend that you simply write a simple C program which performs something similar to what you intend to do first. Run gcc with -S in order to produce ASM output. Check out the .s file (the gas ASM file extension) to get an idea. Proceed to run gcc again, this time without -S and with -v. That switch will tell you every single command it actually executes. At first it produces the ASM code, then it assembles and links it. We can use this information to rip out the assembling and linking commands to use with our project. Be sure to include all libraries you intend to use in your ASM program in your C test program - the linking information will prove invaluable. Here is how I assembled and linked this monstrosity:

as -o calculate_sum_gas.o calculate_sum_gas.asm
/usr/lib/gcc/i486-linux-gnu/4.1.2/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o sum_test /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crt1.o /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crti.o /usr/lib/gcc/i486-linux-gnu/4.1.2/crtbegin.o -L/usr/lib/gcc/i486-linux-gnu/4.1.2 -L/usr/lib/gcc/i486-linux-gnu/4.1.2 -L/usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib -L/lib/../lib -L/usr/lib/../lib calculate_sum_gas.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/i486-linux-gnu/4.1.2/crtend.o /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crtn.o
As you can see there is an excessive amount of arguments used to actually link this program. I am not sure how many of these are even necessary but I will not bother to find out as long as it works. If you have a 64-bit Linux/BSD this might prove problematic. You might need the right gcc library binaries etc. to run this 32-bit example there. I previously claimed that the gas syntax is outrageously cryptic and difficult to understand. Here is the same program using "normal" gas syntax, without the lea hack:
Assembly
.code32
.section .data
format_string:
    .string "Arithmetic sum for the limit %d is %d\n"
.section .text
calculate_sum:
    xorl %esi, %esi
    movl $1, %edi
    jmp loop_start
sum_loop:
    addl %edi, %esi
    incl %edi
loop_start:
    cmpl 4(%esp), %edi
    jle sum_loop
    retl $4
.global main
main:
    xorl %ebx, %ebx
main_loop:
    pushl %ebx
    call calculate_sum
    pushl %esi
    pushl %ebx
    push $format_string
    call printf
    addl $12, %esp
    addl $1, %ebx
    cmp $20, %ebx
    jle main_loop
    xorl %eax, %eax
    ret
The real gas syntax uses so many special characters that you get the feeling of looking at some esoteric programming language like brainfuck. As long as I have a choice between various assemblers I definitely would not pick this one. gcc inline assembly has an even more outrageous and complicated syntax.

Partial Registers
One thing I did not tell you about, on purpose, until now, are the partial registers of the x86 architecture. This is because they frequently do harm and slow down programs when used incorrectly. On the x86 architecture it is possible to access lower parts of all general purpose registers according to the following pattern which I am going to demonstrate on eax:

| Register | Bits occupied |
|----------|---------------|
| eax      | 31 - 0        |
| ax       | 15 - 0        |
| ah       | 15 - 8        |
| al       | 7 - 0         |
So eax can contain 32 bits, alright, we already knew that. But eax also has the three partial registers ax, ah and al. These point to the data eax contains - they just show a smaller part of the same register. So ax represents the lower 16 bits of eax, ah represents the upper 8 bits of ax and al gives you the lower 8 bits of it. This is mostly for backward compatibility reasons. ax is the original 16-bit accumulator register the x86 architecture started with. The e in eax stands for "extended" since it is an extended version of ax. The l and the h in al/ah stand for low/high. Not all general purpose registers feature the "high"/"low" partial registers. They are only available on eax, ebx, ecx and edx. You can write to them and read from them just like normal registers. Here is a full list of all the other 32-bit registers and their partial registers:
| 32-bit register | Bits 15 - 0 | Bits 15 - 8 | Bits 7 - 0 |
|-----------------|-------------|-------------|------------|
| eax             | ax          | ah          | al         |
| ebx             | bx          | bh          | bl         |
| ecx             | cx          | ch          | cl         |
| edx             | dx          | dh          | dl         |
| ebp             | bp          | -           | -          |
| esp             | sp          | -           | -          |
| esi             | si          | -           | -          |
| edi             | di          | -           | -          |
The problem with using partial registers in 32-bit code is that they can cause so called partial register stalls. Reading from partial registers and writing to partial registers causes more dependencies than doing the same to the full registers. This can stall the execution pipelines of the processor which results in a slowdown of your code. Understanding what the execution pipeline is and how it works on modern CPUs requires a lot of explaining and it is not yet the time to cover this topic. So all you should keep in mind for now is that you should always avoid using partial registers. There usually is a faster solution which minimises partial register usage - usually totally eliminating it. Both the Intel Optimization Manual and the AMD Software Optimization Guide strongly discourage the use of partial registers for this very reason. Here are a few examples of how partial registers work:
Assembly
;in MASM syntax you use the following notation for hex numbers:
mov eax, 012345678h
;this writes 0x12345678 to eax, the decisive parts are the 0-prefix and the h-suffix
;the x86 instruction movzx is usually used to move values from 1-byte or 2-byte sized operands to 32-bit registers
;it writes them to the lower part of the register and sets the rest of the register to 0
;the "zx" part stands for "zero extend"
;read al, zero extend it and write it to ebx:
movzx ebx, al
;ebx now contains 0x00000078
;read ah, zero extend it and write it to ebx:
movzx ebx, ah
;ebx now contains 0x00000056
;read ax, zero extend it and write it to ebx:
movzx ebx, ax
;ebx now contains 0x00005678
;normal mov:
mov ebx, eax
;ebx now contains 0x12345678
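To show why this matters in practice, here is a small sketch of my own (not from the original text) contrasting a sequence that triggers the dependency problem with one that avoids it:
Assembly
;writing the partial register al and then reading the full eax afterwards
;creates the dependency pattern that can cause a partial register stall
mov al, byte ptr [esi]
add ebx, eax
;zero extending the byte into the full register avoids the partial access entirely
movzx eax, byte ptr [esi]
add ebx, eax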
Addresses
The x86 architecture allows us to use fairly complex expressions when we are addressing memory. The pattern it supports is:
address = base + index * scale + offset
The variables used in the pattern have the following meaning:
- base: a general purpose register
- index: a general purpose register
- scale: 1, 2, 4 or 8
- offset: a 32-bit integer constant

To fully comprehend what this means you should take a look at the following example:
Assembly
;ebx <<= 3, shift it left by 3 bits
shl ebx, 3
add eax, ebx
add eax, 17
;the "byte ptr" signifies that we want to read a single byte only and not 2 or even 4 bytes
;I have to use movzx because ecx is a 32-bit register and the byte is only 8 bits in size
movzx ecx, byte ptr [eax]
So what this basically does is read a single byte from memory. This byte is stored at the address eax + (ebx << 3) + 17. Shifts can be expressed as multiplications/divisions by powers of two, in case you have never heard about it. Shifting ebx 3 bits to the left is the same as multiplying it by 8 because two to the power of three is equal to 8. In general n <<= a; is equivalent to n *= static_cast<long>(std::pow(2.0, static_cast<double>(a))); and n >>= a; is equivalent to n /= static_cast<long>(std::pow(2.0, static_cast<double>(a))); assuming that a and n are both of type long (and, for the right shift, that n is not negative). This is a direct result of the positional notation I formulated mathematically at the beginning of this document. By shifting you basically add zeroes or remove digits from the sequence of digits. This leads to multiplications by powers of the base or integer divisions by the same. The great thing is that the x86 architecture allows us to do all of those things in a single step:
Assembly
movzx ecx, byte ptr [eax + ebx*8 + 17]
It is a really useful feature and has lots of applications. In combination with lea you can even use it to accelerate simple arithmetic operations which would usually require several steps of multiplication and addition, so do keep it in mind - the sketch below gives a first taste.
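Here is a small sketch of my own (not from the original text) of that trick - lea evaluates the same base + index * scale + offset pattern but simply stores the computed value in a register instead of reading memory, and it does not modify any flags:
Assembly
mov eax, 7
;ebx = eax + eax * 4 = eax * 5 = 35, done in a single instruction
lea ebx, [eax + eax*4]
;ecx = ebx + eax * 2 + 10 = 35 + 14 + 10 = 59
lea ecx, [ebx + eax*2 + 10]

Control Structures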
In the introduction I stated that this document is intended for people who are already familiar with imperative high level programming languages such as C/C++. At the beginning it is difficult to stop thinking in terms of if, for and while and to start thinking in terms of labels and conditional jumps, so I wanted to show you a few examples which demonstrate how to translate these standard control structures to x86 32-bit assembly. My apologies to Linux/BSD users - I am going to use Windows specific API calls in these examples occasionally.

The first thing we are going to check out is the basic if construct:
C++
if (lstrlen(some_string) > 32)
    return -1;
return 0;
Here is the equivalent x86 32-bit ASM code:
Assembly
invoke lstrlen, addr some_string
cmp eax, 32
jle return_zero
mov eax, -1
ret
return_zero:
xor eax, eax
ret
So the original comparison operator was "greater than" but in x86 assembly we jump away from the if body, so we need to check for the exact opposite of the original condition. The logical negation of "greater than" is "less than or equal to" so we use jle in this case. Here is a list of logical negations of standard operators:
| Original operator | Logical negation |
|-------------------|------------------|
| ==                | !=               |
| !=                | ==               |
| <                 | >=               |
| >                 | <=               |
| <=                | >                |
| >=                | <                |
The if-else construct is next:
C++
void divisor_check(int divisor)
{
    if (divisor == 0)
        puts("Division by zero!");
    else
        puts("Valid division");
    puts("This gets executed in either case");
}
The approximate equivalent in x86 ASM is:
Assembly
.data
division_by_zero db "Division by zero!",0
valid_division db "Valid division",0
either_case db "This gets executed in either case",0
.text
divisor_check proc divisor:dword
    cmp dword ptr [esp + 4], 0
    jne else_label
    invoke puts, addr division_by_zero
    jmp end_of_if
else_label:
    invoke puts, addr valid_division
end_of_if:
    invoke puts, addr either_case
    ret
So if the result of the comparison implies that the else body must be executed we jump to the else label and simply continue the execution after that because it gets us right where we are supposed to be anyways. In the other case the first jump is not taken but we must not fall through to the else body, so an unconditional jump is necessary to take us to the end of the if structure.

Let us move on to the while construct:
C++
long i = 0;
while (data[i] >= 128 || some_function(data[i]))
    i++;
Let us assume that i gets stored in the general purpose register esi (it cannot live in eax because the return value of some_function is placed there and would overwrite it) and that data is a string/an array of bytes. Then the approximate equivalent in x86 ASM is:
Assembly
    xor esi, esi
loop_start:
    movzx ebx, byte ptr [array + esi]
    cmp ebx, 128
    jge iterate
    invoke some_function, ebx
    test eax, eax
    jz end_of_loop
iterate:
    inc esi
    jmp loop_start
end_of_loop:
I am using the new instruction test in this example. test is to the logical AND what cmp is to subtraction. It performs a logical AND using the operands behind the instruction as arguments for the logical function but it does not store the result anywhere. It is simply used to set the flags. If you look closely at the truth table for AND you will realise that for every boolean value a the term a & a evaluates to a. This is called idempotence. An idempotent operation can be applied to a value over and over again and in the end it always evaluates to the same value again. The logical OR operation is idempotent aswell. So what test eax, eax really does is check the value of eax and set the flags accordingly. In this case we use it to determine whether the register is equal to zero. In case you are not used to this particular feature of C/C++: while (data[i] >= 128 || some_function(data[i])) is equivalent to while (data[i] >= 128 || some_function(data[i]) != 0) - which is why we jump out of the loop if that part of the boolean term fails, too. We could even optimise this loop by using an initial jump to enter the loop and putting the iteration at the beginning of the body:
Assembly
    xor esi, esi
    jmp enter_loop
iterate:
    inc esi
enter_loop:
    movzx ebx, byte ptr [array + esi]
    cmp ebx, 128
    jge iterate
    invoke some_function, ebx
    test eax, eax
    jnz iterate
We do have that additional jump at the beginning now but you always have to keep in mind that 99% of all CPU time is spent inside loops. Optimising the body of a loop greatly outweighs such a one-time operation. The new body looks unintuitive and does not remind one of the original C/C++ code at all so I did not want to start with it right away and used the less optimised version first.
Now that we have covered if and while the next part is not a big deal. It is time for the for statement:
C++
for (long i = 1; i <= 10; i++)
    printf("Iteration step %d\n", i);
Here is the equivalent x86 32-bit ASM code:
Assembly
.data
iteration_string db "Iteration step %d",10,0
.text
_start:
    mov ebx, 1
print_loop:
    invoke printf, addr iteration_string, ebx
    inc ebx
    cmp ebx, 10
    jle print_loop
;...
What we actually do here is skip the first comparison 1 <= 10 because we know it is true anyways. If you have a more complex iteration condition where you cannot tell whether the first iteration will always occur you obviously cannot skip it, so be careful with optimisations like these.

Multiplication and Division of Integers
Multiplication and division of integers on the x86 architecture are not as intuitive as the addition and subtraction of integers. The largest unsigned integer you can store in a 32-bit general purpose register has the value \(2^{32} - 1\) so when we have two of those and multiply them we get \((2^{32} - 1) \cdot (2^{32} - 1) = 2^{64} - 2 \cdot 2^{32} + 1 = 2^{64} - 2^{33} + 1 = 18446744065119617025 = m\). In order to calculate how many bits this number requires to be represented in a standard binary positional system we use the formula \(\lceil \frac{\log{m}}{\log{b}} \rceil\) with a base of \(b = 2\): \(\lceil \frac{\log{18446744065119617025}}{\log{2}} \rceil \approx \lceil 63.9999999993281927699 \ldots \rceil = 64\). This means that the result of a multiplication is in general up to 64 bits in size. But our x86 architecture has only 32-bit general purpose registers, so how is this dealt with? The result is stored in two registers. The mul reg instruction is used for the multiplication of two unsigned 32-bit integers. It takes a single argument - a general purpose register. It uses eax as the other factor so what it calculates is eax * reg. The result is stored in edx:eax which means that the upper 32 bits get stored in edx and the lower 32 bits of the result are stored in eax. Example:
Assembly
mov eax, 012345678h
mov edx, 012345678h
mul edx
;The result of this multiplication is:
;0x12345678 * 0x12345678 = 0x014B66DC1DF4D840
;Which produces the following values of edx and eax:
;edx = 0x014B66DC
;eax = 0x1DF4D840
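If you need the whole 64-bit product in memory you can simply store both halves of the edx:eax pair. A small sketch of my own (not from the original text), assuming a variable declared as product dq ? in the uninitialised data section:
Assembly
mov eax, 012345678h
mov ecx, 012345678h
mul ecx
;edx:eax now holds the 64-bit product
mov dword ptr [product], eax
;x86 is little endian, so the upper half goes into the higher address
mov dword ptr [product + 4], edx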
Multiplication is considerably more expensive than addition or subtraction in terms of execution time. The imul instruction is used for the multiplication of signed integers and it supports more operand forms. Like mul, it can store the result in edx:eax when it takes only a single argument. But it can also take multiple registers and even an immediate value - it will discard the upper 32 bits of the result then. Example:
Assembly
mov eax, -2
mov ebx, -2
imul ebx
;edx:eax = (-2) * (-2) = 4 = 0x0000000000000004
;Let us use the largest positive signed integers possible:
mov edi, 2147483647
mov esi, 2147483647
;this stores the result in edi and discards the upper 32 bits:
imul edi, esi
;2147483647 * 2147483647 = 4611686014132420609 = 0x3FFFFFFF00000001
;edi = 0x00000001
;0x3FFFFFFF is discarded
mov eax, 10
mov ebx, 20
;the three operand form multiplies the second operand by the immediate and stores the result in the first operand
;note that the previous value of eax (10) does not take part in the multiplication
imul eax, ebx, 3
;20 * 3 = 60
;the upper 32 bits would be discarded but they are 0 anyways, so:
;eax = 60 = 0x0000003C
Division is far more expensive than multiplication. The div reg instruction is used for unsigned division. It performs the division edx:eax / reg and stores the quotient in eax and the remainder in edx. In case you do not even know what those words mean I am going to explain it with a small example: \(17 : 5 = 3 + \frac{2}{5}\). The result of the division 17 / 5 is 3 and the remainder is 2 because \(3 \cdot 5 + 2 = 15 + 2 = 17\). You might wonder why the x86 architecture supports 64-bit dividends but only 32-bit divisors. I am not absolutely sure why this is, actually. But it was probably more convenient at the time and it is not a big deal to work around it. Usually you do not require such large numbers anyways. Enough of this primary school stuff, let us have a look at a few examples:
Assembly
mov eax, 100
xor edx, edx
mov ebx, 10
div ebx
;edx:eax / ebx = 100 / 10 = 10
;edx:eax % ebx = 0
;Quotient: eax = 10
;Remainder: edx = 0
mov eax, 21
div ebx
;edx:eax / ebx = 21 / 10 = 2
;edx:eax % ebx = 21 % 10 = 1
;Quotient: eax = 2
;Remainder: edx = 1
Just like there are mul and imul, there is the idiv instruction for signed division. It works just like div otherwise. The remainder always has the same sign as the dividend:
Assembly
;Set edx:eax to -100
mov eax, -100
mov edx, -1
;-1 because we want all upper bits to be 1 - it is two's complement!
mov ebx, 10
idiv ebx
;edx:eax / ebx = -100 / 10 = -10
;edx:eax % ebx = 0
;Quotient: eax = -10
;Remainder: edx = 0
mov eax, -21
mov edx, -1
idiv ebx
;edx:eax / ebx = -21 / 10 = -2
;edx:eax % ebx = -21 % 10 = -1
;Quotient: eax = -2
;Remainder: edx = -1
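As a side note (my own addition, not from the original text): instead of setting edx to -1 by hand you can let the cdq instruction sign extend eax into edx:eax before a signed division:
Assembly
mov eax, -100
;cdq copies the sign bit of eax into every bit of edx, so edx becomes 0FFFFFFFFh here
cdq
mov ebx, 10
idiv ebx
;Quotient: eax = -10
;Remainder: edx = 0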
Floating Point Operations
Now that we have covered the most important integer operations we should move on to floating point instructions. The CPU contains many different units which work together when a program is executed. Instructions like add, sub, xor, shl, mul and idiv are all executed on the so called Arithmetic Logic Unit (ALU). The operations we are going to concern ourselves with are executed in a completely different unit. It is called the Floating Point Unit (FPU). This is where stuff like 0.942 - 7.183 = -6.241 is calculated. The x86 architecture implements the IEEE-754 standard (http://en.wikipedia.org/wiki/IEEE_floating-point_standard) and offers support for three different precision formats:
- Single precision 32-bit floating point numbers (float in C/C++)
- Double precision 64-bit floating point numbers (double in C/C++)
- Extended precision 80-bit floating point numbers (long double in C/C++)

We are going to explore the mathematical concept behind this system first. The real numbers \(\mathbb{R}\) know no bounds and even if we specify boundaries for them they are still not discrete. There are infinitely many real values between the numbers 0 and 1, so we have to make sacrifices to represent real numbers with boundaries and approximative capabilities that meet our requirements. We can only represent approximations of most real values using this concept. A few of them, which are composed of sums of positive/negative powers of 2, can be represented flawlessly without error using IEEE-754. We are going to examine the smallest of the three formats in this part of the document only. The others are based on the same idea and just use different parameters really.
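As a small aside (my own sketch, not from the original text): MASM has a data directive for each of these three sizes, so you can define constants in any of the formats:
Assembly
single_value dd 1.5    ;32-bit single precision, 4 bytes
double_value dq 1.5    ;64-bit double precision, 8 bytes
extended_value dt 1.5  ;80-bit extended precision, 10 bytes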
The IEEE-754 32-bit precision format divides the 32-bit value into 3 different sections:
| Section         | Size    | Bit range |
|-----------------|---------|-----------|
| Sign            | 1 bit   | 31 - 31   |
| Biased exponent | 8 bits  | 30 - 23   |
| Mantissa        | 23 bits | 22 - 0    |

The mantissa expresses a fraction of the form 1.x where x is given by the mantissa bits. The bit fields above are written with the most significant bit first. Let \(s\) stand for the sign bit, \(be\) for the biased exponent and \(m_i\) with \(i \in \{1, 2, \ldots, 23\}\) for the mantissa bits \(m_1 m_2 \ldots m_{23}\) (they are read from the left to the right); then the value most IEEE-754 encoded reals represent can be determined using the following formula:
\(\mbox{value} = (-1)^s \cdot 2^{be - 127} \cdot (1 + \sum_{i = 1}^{23} m_i \cdot 2^{-i})\)

The \((-1)^s\) part simply changes the sign according to the sign bit. The value is negative if the sign bit is set to 1 and positive if it is set to 0. \(2^{be - 127}\) is where the power of 2 gets expressed through the actual exponent \(e = be - 127\). The "bias" I was talking about is a necessity to allow the exponent to be negative. Otherwise we would have positive/zero exponents only and we would be unable to express numbers \(r \in \mathbb{R} \land |r| < 1\) since this requires negative powers of 2. The mantissa represents the binary digits to the right of the point in a number like \(1.1101_2\) which had the leading "1." simply chopped off. Let us manually convert the number 149.8 to the IEEE-754 single precision format. At first we require a binary representation of this decimal number. We examine the part to the left of the point first and repeatedly perform divisions by 2, splitting the result into an integer part and a remainder after each step. Then we repeat the procedure with the integer part until it eventually becomes zero:
149 / 2 = 74 + 1 / 2
74 / 2 = 37 + 0 / 2
37 / 2 = 18 + 1 / 2
18 / 2 = 9 + 0 / 2
9 / 2 = 4 + 1 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2
You get the binary representation of the left part of the number by reading the "bit stream" on the right from bottom to top, which gives us the binary number 10010101 - indeed equal to the decimal representation 149. The part to the right of the point works in a similar fashion but it can be tricky at times because some numbers cannot be expressed with a finite number of digits in another base. We chop off the part to the left of the point and replace it with 0. Then we repeatedly multiply by 2, each step chopping off the digit that appears to the left of the point and replacing it with 0. We are done when only 0 is left, when we discover a periodic pattern or when we simply hit the limit of the precision:
0.8 * 2 = 1.6 => 1 <--+
0.6 * 2 = 1.2 => 1    | periodic pattern detected
0.2 * 2 = 0.4 => 0    |
0.4 * 2 = 0.8 => 0 ---+
We discovered a periodic pattern which means that 0.8 cannot be written in binary notation in a finite number of digits without any loss of precision. We conclude:
\(149.8_{10} = 10010101.\overline{1100}_2\)

We can prove this mathematically by making use of the geometric series with \(a \in \mathbb{R} \land 0 < a < 1\):

\(\sum_{i = 0}^{\infty} a^i = \frac{1}{1 - a}\)

All we need to do is perform some index shifts and split up the sum to show the desired result:

\(0.\overline{1100}_2 = \sum_{i = 1}^{\infty} (1 \cdot 2^{- (4i - 3)} + 1 \cdot 2^{- (4i - 2)} + 0 \cdot 2^{- (4i - 1)} + 0 \cdot 2^{- (4i)}) = 2^{-1} \sum_{i = 0}^{\infty} (2^{-4})^i + 2^{-2} \sum_{i = 0}^{\infty} (2^{-4})^i = \frac{8}{15} + \frac{4}{15} = \frac{4}{5} = 0.8\)

The next step in the conversion process is shifting our binary representation of the number to the left or to the right such that it has the form "1.x". We have to keep track of the number of shifting steps we performed and of the direction aswell. We store this information in the exponent of a power of two:
\(10010101.\overline{1100}_2 = 2^7 \cdot 1.0010101\overline{1100}_2 = (-1)^0 \cdot 2^{134 - 127} \cdot (1 + 0.0010101\overline{1100}_2)\)

As you can see I algebraically manipulated the term so it matches the pattern in the original formula I showed to you. Now we can simply read off the values:

\(s = 0 \land be = 134 \land m = (m_1, m_2 \ldots m_{23}) = (0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0 \ldots)\)

We are nearly done, we simply need to convert the biased exponent to its 8-bit binary representation now:
134 / 2 = 67 + 0 / 2
67 / 2 = 33 + 1 / 2
33 / 2 = 16 + 1 / 2
16 / 2 = 8 + 0 / 2
8 / 2 = 4 + 0 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2
This is the same algorithm we used before and it tells us that the binary representation is 10000110. So the IEEE-754 representation of the decimal value 149.8 is:
Sign : biased exponent : mantissa
0 : 10000110 : 00101011100110011001100
= 01000011 00010101 11001100 11001100
In the case of periodic binary representations you simply repeat the periodic digit group until you hit the limits imposed by the 23 bit boundary for the mantissa. Let us get back to the x86 architecture. The other formats simply use different biased exponent and mantissa sizes which also forces them to use different bias values, obviously. There are certain special cases for invalid numbers, infinity and zero and such but it is not relevant for this document, really. Check out http://en.wikipedia.org/wiki/IEEE_floating-point_standard if you are interested in the topic.
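To connect this back to the assembler, here is a small sketch of my own (not from the original text): you can let MASM perform the conversion with a dd directive and compare it with the bit pattern we derived by hand. Note that an assembler rounds the last mantissa bit instead of simply truncating the periodic pattern, so the lowest bit may differ:
Assembly
hand_made dd 04315CCCCh    ;the bit pattern derived above, mantissa truncated after 23 bits
assembled dd 149.8         ;the assembler's own IEEE-754 single precision encoding of 149.8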
The FPU contains 8 80-bit elements called st(0), st(1), ..., st(7). I did not say "register" because they are implemented as a stack, so st(0) always refers to the top element, st(1) to the element below the top, etc. This makes it quite tricky to use. Honestly, I am not even quite sure why they implemented it this way. Maybe it makes some common operations shorter or something like that. FPU code is very fragile and often hard to read. I strongly recommend keeping a good instruction set reference open when you are working with FPU code. There are lots of details you have to pay attention to and it is easy to make mistakes, so be extra careful with FPU code. Let us start out with some easy examples to see how the FPU stack works. Here is the state of the FPU at the beginning of the program:
| Position | Value     |
|----------|-----------|
| st(0)    | undefined |
| st(1)    | undefined |
| st(2)    | undefined |
| st(3)    | undefined |
| st(4)    | undefined |
| st(5)    | undefined |
| st(6)    | undefined |
| st(7)    | undefined |
All FPU instructions have the f- prefix. If they have a -p suffix it means that they pop the top element off the stack at the end of their execution. We are going to execute the following instructions on the FPU now:
Assembly
;...
;push one on top of the stack
fld1
;push zero on top of the stack
fldz
;add them up and store the result in st(0)
fadd st(0), st(1)
;add the top two elements up and store the result in st(1), pop the stack
faddp st(1), st(0)
;...
Let us go through the changes in the FPU stack, instruction by instruction. We start out with fld1:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | undefined |
| st(2) - st(7) | undefined |
fldz adds another well known constant to the top of the stack:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 0.0       |
| st(1)         | 1.0       |
| st(2) - st(7) | undefined |
Perform the addition st(0) = st(0) + st(1); with the specified stack elements using fadd:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | 1.0       |
| st(2) - st(7) | undefined |
As you can see the rest of the stack remains the same. We are going to look at the next one in two steps because it basically does two things. We have the FPU perform the addition st(1) = st(1) + st(0); using faddp, which discards the top element after the addition:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | 2.0       |
| st(2) - st(7) | undefined |

Then the top element gets popped off:

| Position      | Value     |
|---------------|-----------|
| st(0)         | 2.0       |
| st(1)         | undefined |
| st(2) - st(7) | undefined |
So how do you load floating point values from memory onto the FPU stack and how do you store them in memory again? Take a look at the following snippet:
Assembly
.data
some_float dd 0.123
some_double dq 4.556456823
.data?
float_storage dd ?
double_storage dq ?
.text
;...
;load float from memory and push it on the top of the stack:
fld some_float
;load double from memory and push it on the top of the stack:
fld some_double
;add them up and pop the stack:
faddp st(1), st(0)
;store the result as a float:
fst float_storage
;store the result as a double (in case we need more precision) and pop the stack so it's empty again:
fstp double_storage
;...
The first new thing in this code is the dq (declare quadword) assembler directive which we use to initialise double precision floats (64 bits, 8 bytes). In this case I use it to define static values which I load onto the FPU stack. Let us go through the changes again, instruction by instruction. In the beginning the stack is totally empty:
| Position      | Value     |
|---------------|-----------|
| st(0) - st(7) | undefined |
fld some_float pushes the approximation of 0.123 onto the stack:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 0.123     |
| st(1) - st(7) | undefined |
fld some_double does the same again, just with a double precision value which is an approximation of 4.556456823:
| Position      | Value       |
|---------------|-------------|
| st(0)         | 4.556456823 |
| st(1)         | 0.123       |
| st(2) - st(7) | undefined   |
faddp st(1), st(0) adds them up and pops the stack, just like we have seen before in the previous example:
| Position      | Value       |
|---------------|-------------|
| st(0)         | 4.679456823 |
| st(1) - st(7) | undefined   |
fst float_storage stores the top of the stack as a 32-bit single precision value in memory at the specified address. It does not modify the FPU stack in any way. fstp double_storage does, though. After storing the double precision value in double_storage it pops the top off the stack:
| Position      | Value     |
|---------------|-----------|
| st(0) - st(7) | undefined |
Now that we have covered the basics of the FPU stack it is time to move on to a more elaborate example of how to use the FPU. Here is the original C++ code we are going to translate to x86 32-bit assembly:
C++
#include <cmath>
#include <cstdio>
#include <iostream>

double f(double x)
{
    char buffer[10];
    return std::sin(x * x + 1.0) / 2.0;
}

int main()
{
    for (double i = 0.0; i < 1.0; i += 0.1)
    {
        double value = f(i);
        printf("f(%.1f) = %.4f, first two digits: %ld\n", i, value, static_cast<long>(100 * value));
    }
    std::cin.get();
    return 0;
}
The FPU example code I came up with is quite nasty I must say. I apologise in advance. Here is my translation to x86 ASM:
Assembly
.686p
.model flat, stdcall
option epilogue:none
option prologue:none

GetStdHandle proto :dword
wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data
format_string db "f(%s) = %s, first two digits: %d",10,0
iteration_step dq 0.1
hundred dq 100.0
ten dq 10.0
half dq 0.5
one_two_three dq 123.456789

.data?
number_buffer db 16 dup (?)
number_buffer2 db 16 dup (?)
buffer db 64 dup (?)
bytes_written dd ?
iterator dq ?

.code

f proc
    fmul st(0), st(0)
    fld1
    faddp st(1), st(0)
    fsin
    fmul half
    ret
f endp

write_double proc double_buffer:dword, precision:dword
    fst qword ptr [esp - 12]
    mov esi, [esp + 4]
    mov byte ptr [esi], '.'
    fld st(0)
    fisttp dword ptr [esp - 4]
    mov eax, [esp - 4]
    mov ecx, 10
write_left_digits:
    xor edx, edx
    div ecx
    add edx, '0'
    dec esi
    mov byte ptr [esi], dl
    test eax, eax
    jnz write_left_digits
    mov ecx, [esp + 4]
    mov eax, [esp + 8]
write_right_digits:
    fild dword ptr [esp - 4]
    fsubp st(1), st(0)
    fmul ten
    fld st(0)
    fisttp dword ptr [esp - 4]
    mov ebx, [esp - 4]
    add ebx, '0'
    inc ecx
    mov byte ptr [ecx], bl
    dec eax
    jnz write_right_digits
    mov byte ptr [ecx + 1], 0
    fstp st(0)
    fld qword ptr [esp - 12]
    ret 8
write_double endp

public main

main:
    invoke GetStdHandle, STD_OUTPUT_HANDLE
    mov ebp, eax
    fldz
main_loop:
    fst iterator
    invoke write_double, addr number_buffer + 8, 4
    mov edi, esi
    call f
    invoke write_double, addr number_buffer2 + 8, 6
    fld st(0)
    fmul hundred
    fisttp dword ptr [esp - 4]
    mov eax, [esp - 4]
    invoke wsprintfA, addr buffer, addr format_string, edi, esi, eax
    invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
    fstp st(0)
    fld iterator
    fadd iteration_step
    fld1
    fcomip st(0), st(1)
    jnb main_loop
    invoke ExitProcess, 0
end
Let us start with the main code. fldz pushes the zero constant onto the stack. So after executing this instruction st(0) contains zero while all other FPU registers are still undefined/empty. If you use such an undefined register in an FPU operation you will cause an exception. If you try to pop an element off an empty stack - meaning all values are undefined - you will get an exception, too. Another thing you have to watch out for is stack overflow. You cannot just keep on adding more elements onto the stack. If all registers have been set and you push an additional element onto the stack you will cause an FPU stack overflow, so keep track of the height of your FPU stack at all times. fst iterator reads the 64-bit value from st(0) and stores it at the memory location iterator - the stack remains the same. On the next line write_double gets called for the first time. This is a stupid workaround I had to come up with because the Windows API does not offer any functions which allow you to convert floating point numbers to strings, to my knowledge. I tried to use the Microsoft Visual C Runtime but those attempts did not bear any fruit either because using msvcrt in ASM is pretty tricky nowadays. You have to link the application with those silly manifest files and annoying stuff like that, so I eventually gave up on making, say, printf work and wrote my own double -> string converter.

The write_double body starts out with fst qword ptr [esp - 12]. This is not any different from the previous fst store really, it simply does not use an address which is known at compile time. Instead we use unused bytes on the stack to store the value temporarily. The qword ptr part is necessary because the assembler cannot magically know what format we actually want to store. It could be single precision, double precision or even extended precision - it simply cannot tell. With the prior fst call this was not necessary because the size of that object is defined in the data section. fld loads values and pushes them on top of the stack. The value can come from a location in memory or even from an FPU register. In this case we are using fld st(0) which basically just duplicates the top of the stack. fisttp dword ptr [esp - 4] performs a double to int conversion and stores the result at the specified memory location. This instruction simply truncates the digits to the right of the decimal point away. So it would convert something like 2.913 to 2. In the following loop we perform a conventional integer to decimal ASCII number conversion by repeatedly dividing the integer by 10, adding 0x30 to the remainder, and writing the lower byte of the register to a buffer. This buffer gets filled from right to left, starting at the decimal point. In the write_right_digits section we read the decimal digits to the right of the decimal point from the number by repeatedly multiplying it by 10, converting it to an integer, converting the integer to a decimal digit which gets added to the buffer and eventually subtracting the integer from the double again to get rid of the digit to the left of the decimal point. This buffer gets filled from the left to the right, starting with the byte to the right of the decimal point. It starts with fild dword ptr [esp - 4]. This instruction loads an integer from memory, converts it to a double and pushes it onto the FPU stack so we can access it via st(0). fsubp st(1), st(0) performs the calculation st(1) - st(0) and stores the result in st(1). Additionally it pops off the top of the stack so what was once st(1) is now st(0). You can tell this from the -p suffix in the mnemonic. fmul ten multiplies st(0) with the value 10.0 which is stored in memory. The result of the operation gets stored in st(0) again. It is followed by another fld st(0) which duplicates the top of the stack. This is necessary because we are going to use fisttp on the next line. Unluckily this instruction does not feature any non-p version so we are forced to perform that somewhat redundant additional FPU stack push. So what this loop does is simply write a certain number of digits to the right of the decimal point. The number is specified by the second argument, precision. At the end of the function you can see a fstp st(0). This stores st(0) in st(0) and pops the top element of the stack (which is st(0)). So it basically works like a pop - with the exception of being somewhat unintuitive and cryptic. At the very end fld qword ptr [esp - 12] is executed. This restores the original value st(0) had when the function was first called because we wrote it to that part of the memory in the beginning.

After returning from write_double for the first time the function f is called. This is pretty much a direct translation of the corresponding C/C++ function we defined earlier. fmul st(0), st(0) is basically equivalent to st(0) = st(0) * st(0); in C/C++ pseudo code. fld1 is another "load constant" instruction, just like fldz - it pushes the value 1.0 on top of the stack. faddp st(1), st(0) adds st(1) to st(0) and stores the result in st(1). After this step the top element of the stack gets popped off so the result ends up in st(0). fsin performs st(0) = sin(st(0)). And last but not least we divide the resulting value by 2.0 using fmul half. This actually multiplies st(0) with 0.5 but it is obviously the same as dividing by two since \(x / r = x \cdot \frac{1}{r}\). The value of f is returned in st(0) and written into a buffer. Most of the following instructions have already been explained, with the exception of fcomip st(0), st(1). This compares the values of st(0) and st(1), pops the top element off the stack and sets the Zero/Parity/Carry flags according to a particular pattern. More importantly, it sets CF = 1 if and only if st(0) < st(1) - which is exactly what we want. The jnb instruction therefore jumps back only as long as the iterator has not exceeded 1.0. Here is the actual output of the program:

f(0.0000) = 0.420735, first two digits: 42
f(0.1000) = 0.423415, first two digits: 42
f(0.2000) = 0.431202, first two digits: 43
f(0.3000) = 0.443313, first two digits: 44
f(0.4000) = 0.458401, first two digits: 45
f(0.5000) = 0.474492, first two digits: 47
f(0.6000) = 0.488932, first two digits: 48
f(0.7000) = 0.498368, first two digits: 49
f(0.7999) = 0.498803, first two digits: 49
f(0.9000) = 0.485763, first two digits: 48
f(0.9999) = 0.454648, first two digits: 45
As you can see there are a few errors in the iterator such as 0.7999 and
. You should probably already be familiar with the problem of rounding errors on floating point numbers from imperative high level programming languages. It is simply a result of the system we use. Not every number can be perfectly represented. Most of them have certain small errors and the results of arithmetic operations on them are inaccurate. There are many FPU instructions with extremely specific applications and I am not going to cover them here. Rule of thumb: If you want to know how something works, write it in C/C++, turn on assembly listings and read the code in combination with an x86 Instruction Set Reference.