Source: https://github.com/epicvrvs/x86-tutorial
- 1. Introduction
- 2. What is Assembly?
- 3. What is the Purpose of Learning Assembly?
- 4. Getting to know the x86 Architecture
- 5. Bits, Bytes and Numbers
- 6. Bit Operations
- 7. Signed Integers
- 8. The First Steps
- 9. Writing Windows Applications Using x86 32-bit Assembly
- 10. Writing Linux/BSD Applications Using x86 32-bit Assembly
- 11. Partial Registers
- 12. Addresses
- 13. Control Structures
- 14. Multiplication and Division of Integers
- 15. Floating Point Operations
Introduction
This document is an introduction to creating programs for microprocessors of the x86 architecture family - in particular 32-bit code. The reader is expected to be familiar with programming in C/C++ (or at least a similar language such as Java) and the essential API of the operating system they are using. Some mathematical knowledge up to high school/university level is essential for understanding a lot of things as well. I will try not to be too OS specific, but the environments I am going to focus on in this document are Windows, Linux and BSD. Pure knowledge of assembly in itself is useless if you do not know how to combine it with the API of the operating system you are going to run it on, so I want to make sure that this is demonstrated to a certain extent. It is not that different from the way you would do it in a high level programming language such as C++, so it is not difficult to understand.
What is Assembly?
Assembly (ASM for short) is the lowest level of programming possible - if you are a programmer and you want to get as close to the hardware as you can, this is the place you want to be. In assembly code you get to control every single instruction that your CPU (central processing unit) is going to execute. There are different assembly languages for different microprocessor architectures and all of them differ from each other. Usually they are totally incompatible, so you will have to write assembly code that is specific to the particular architecture you want your program to run on. In assembly you write instructions as plain text which directly represent the machine code instructions executed by your processor. The names of these instructions are usually extremely short and are often abbreviations of full names. These assembly instruction names are called mnemonics.
Before your microprocessor can actually run the program you have written in assembly, you have to run it through a program which translates all the mnemonics and arguments into numerical machine code. This program is called an assembler. Assemblers often support more features than just the pure instructions to make the programmer's job easier, but you will see how they do that later on.
What is the Purpose of Learning Assembly?
This is a very important question and subject to a lot of discussion. My answer to this question is a list of reasons, really. Better understanding of what goes on at the lowest level can make you a better programmer at a higher level. It allows you to see what goes on behind the scenes and it often gives you a totally new perspective on things. It has its uses in writing high performance parts of high level language programming where you need to use features of your processor that are not easily accessible in that high level language. Knowing assembly is obviously also necessary to be able to write compilers for a particular microprocessor architecture which convert high level language code to machine instructions.
The truth is that most people learn x86 ASM nowadays to crack commercial software, to reverse engineer closed source programs and to write cheats for computer games. Cracking is the one that made me learn it but I have to admit that I never got particularly good at it and I got totally distracted from my original goal in the process of understanding how it works. It is my experience that it is essential to learn how to write x86 ASM yourself first in order to be successful at cracking and reverse engineering. Knowing how to manually translate a C++ program to assembly is a valuable skill to have for this purpose.
Getting to know the x86 Architecture
So, what are we dealing with here? The IA-32 microprocessor is basically a register machine which uses a CISC (complex instruction set computer) instruction set. At first I am going to explain what a register machine is. After that I will move on to the CISC part.
A register machine is a computing device which stores the results of arithmetic operations and the like primarily in so called registers. These are small but highly efficient binary storage units inside the processor which can hold integer values. When you are doing assembly programming you deal with them all the time. They are incapable of holding a lot of data at once, but they are essential as temporary placeholders used in most instructions executed by the processor. In terms of the memory hierarchy of contemporary computers they are at the very top. The hierarchy looks like this:
- CPU registers
- CPU cache
- RAM (random access memory)
- HDD (hard disk drive storage)
- External storage, even optical media like CDs, DVDs, BluRay disks and such
Registers hold minimal amounts of data and are extremely fast. The cache holds far larger amounts of data but it is still pretty fast. The RAM holds even larger amounts of data and it is far slower (in terms of both bandwidth and latency) than any memory operation inside a CPU. Hard disks have an even larger capacity than your RAM and they are very slow in comparison to the objects at the top of the hierarchy.
At this point I should probably briefly explain what the CPU cache actually is. It is a small high performance storage unit inside your CPU to which chunks of memory from the RAM are copied whenever you perform memory accesses. This way the CPU does not have to access the RAM over and over again when it is processing the same piece of data. This speeds up the execution of code a lot - RAM access is very slow in comparison to cache access, after all. In reality this cache is actually not a single unit but is divided into multiple cache levels. The Level 1 cache is the smallest but fastest one. The next level is bigger but slower, and so on. Caching is not of much interest to somebody who is new to assembly, though. I might cover this topic later in sections which deal with optimising code for speed.
Let us get back to the CISC part I mentioned earlier. There are two major microprocessor architecture philosophies known as RISC (reduced instruction set computing) and CISC (complex instruction set computing). In RISC architectures instructions are rather simple and perform very fundamental operations. RISC instructions are incapable of performing multiple actions at once but they are very fast. The instruction format of RISC architectures is usually quite uniform and each instruction takes up the same number of bytes in memory. This makes them very easy to decode for the processor. CISC architectures generally feature complex instructions which perform multiple tasks sequentially, like loading a value from memory, performing an arithmetic operation on it and then writing the result of the operation back to memory. In a RISC architecture this would be divided into multiple instructions. CISC instructions are usually of variable length and they are quite complicated to decode for the CPU.
| Property | RISC | CISC (x86) |
|---|---|---|
| Instruction encoding | Uniform, all instructions are the same size | Variable, the size of instructions strongly varies |
| Complexity of operations | Simple, each instruction performs one fundamental operation | Complex, each instruction can perform many fundamental operations |
| Code density | Low | High |
| Decoding effort for the CPU | Low | High |

Bits, Bytes and Numbers
A bit is simply the smallest digital piece of information possible. To us, it is a rather abstract unit which can have one of two values, zero or one. A byte is composed of a certain number of bits. On all mainstream architectures a byte is composed of exactly 8 bits; it would be more precise to say "octet" instead of byte in that case, which is the terminology you usually encounter in literature on the subject of networking, but in this text I will stick to bytes. In this document we focus on the x86 32-bit architecture. The 32-bit part means that the word size of this processor is 32 bits. The word size of a microprocessor tells you what size most registers and operands are, so in this case all general purpose registers are 32 bits = 4 bytes in size. General purpose means that we, the programmers, can pretty much use these registers as we want without breaking anything. They are meant to be used for all major calculations. There are special registers such as the stack pointer which may not be used in such a way - in fact, that would usually result in a crash at some point, but we will get back to that later. Our registers consist of 32 single bits which can each assume two different values. This means that each register can assume one of \(2^{32}\) (that is: 2 to the power of 32) different values at a time.
There are different systems of representing numbers using digits (which can actually be arbitrary symbols) such as our decimal system. At this point it is important to introduce the concept of positional notation. Positional notation is a method which takes a base \(b \in \mathbb{N}\) and a set of digits \(D = \bigcup_{i = 0}^{b - 1} \{d_i\}\) with values \(v : D \to \mathbb{N}\), \(v(d_i) = i\). I hope I did not scare anybody away with those equations - I should probably clarify. \(b\) is a natural number - a positive integer like 1, 2, 3, etc. \(D\) is the set of digits which are associated with the numerical values 0 to \(b - 1\). In a positional system a number which is written as the sequence of \(n\) digits \(d_{(a_{n-1})} \ldots d_{(a_1)} d_{(a_0)}\) actually represents the natural number \(\sum_{i = 0}^{n - 1} v(d_{(a_i)}) b^i = \sum_{i = 0}^{n - 1} a_i b^i\).
The decimal system is actually a positional system with \(b = 10\). So we have the digits \(D_{10} = \{0, 1, 2, \ldots, 9\}\) with the corresponding values. According to the equation the decimal number 913 represents the natural number \(9 \cdot 100 + 1 \cdot 10 + 3 \cdot 1 = 913\).
Now that was fairly obvious but let us move on to other systems such as the hexadecimal one, which is quite frequently used in the world of programming, especially in assembly languages. In the hexadecimal system we deal with the base \(b = 16\) and the digits \(D_{16} = \{0 \ldots 9, A, B, C, D, E, F\}\). So A stands for 10, B for 11, and so on. So the hexadecimal number F4D7 actually represents the natural number \(15 \cdot 16^3 + 4 \cdot 16^2 + 13 \cdot 16^1 + 7 \cdot 16^0 = 15 \cdot 4096 + 4 \cdot 256 + 13 \cdot 16 + 7 \cdot 1 = 61440 + 1024 + 208 + 7 = 62679\).
Then there is of course the binary system which has the digits \(D_2 = \{0, 1\}\). The binary number 1101001 represents the natural number \(1 \cdot 2^6 + 1 \cdot 2^5 + 0 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 1 \cdot 64 + 1 \cdot 32 + 0 \cdot 16 + 1 \cdot 8 + 0 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 = 105\). This is pretty much exactly the way unsigned numbers are stored at the hardware level in the registers of your x86 microprocessor. The single binary digits represent the bits in the register. But the binary number 1101001 has 7 bits and our 32-bit registers have 32 bits. What happens to the other 25 bits? The upper bits are simply set to 0, so the register would actually be set to 00000000000000000000000001101001.
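The conversion from a digit string to a value can be expressed directly in code. Here is a small C++ sketch of my own that applies the positional notation formula above; the function name and the restriction to bases up to 16 are choices made purely for illustration:

C++

#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

// Convert a digit string in the given base (2 to 16) to an unsigned value
// by applying the positional notation formula: sum of v(digit_i) * base^i.
uint32_t parse_in_base(const std::string & digits, uint32_t base)
{
	uint32_t value = 0;
	for(char c : digits)
	{
		uint32_t digit_value;
		if(c >= '0' && c <= '9')
			digit_value = c - '0';
		else if(c >= 'A' && c <= 'F')
			digit_value = 10 + (c - 'A');
		else
			throw std::runtime_error("Invalid digit");
		if(digit_value >= base)
			throw std::runtime_error("Digit out of range for this base");
		value = value * base + digit_value;
	}
	return value;
}

int main()
{
	std::cout << parse_in_base("913", 10) << std::endl;    // 913
	std::cout << parse_in_base("F4D7", 16) << std::endl;   // 62679
	std::cout << parse_in_base("1101001", 2) << std::endl; // 105
	return 0;
}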
Bit Operations
In the introduction I stated that this document is intended for people who already know a high level programming language, which also implies knowledge of the essential bitwise operations, but I would still like to introduce them at this point. I will use the C operators & for AND, | for OR, ~ for NOT, ^ for XOR, << for SHIFT LEFT and >> for SHIFT RIGHT in this document. Let's start with AND. It takes two bits as input and outputs a single bit. If you do not fully grasp these operations yet I suggest taking a few moments to totally digest and appreciate their output. So what is the AND operation usually used for? We use it to set particular bits of an integer to 0, to "mask off" particular ranges of bits we do not care about.
| a | b | a & b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Now for the OR operator. Like AND it takes two bits as input and has a single bit as output. We use it to set particular bits of an integer to 1:
| a | b | a \| b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

The NOT operator inverts bits, so it takes one argument instead of two like the previous operators:
| a | ~a |
|---|---|
| 0 | 1 |
| 1 | 0 |

The XOR operator is used to "flip" bits. It is not an essential function, meaning that it can be expressed with the help of AND, OR and NOT. Its primary purpose, curiously, is setting registers to zero, but we will get to that later. It is also very useful for pseudo random number and encryption algorithms.
| a | b | a ^ b |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The bit shift operations SHIFT LEFT and SHIFT RIGHT cannot be described in compact tables like the other ones. They shift the bits inside the register to the left or to the right, discarding information as bits "are pushed out" and shifting new 0-bits in. The first operand of a shift operation is the number that will get shifted. The second operand specifies the number of bits by which it is shifted. Here are a few examples on 32-bit numbers to clarify what these operations mean:
| a | b | a << b |
|---|---|---|
| 11010001010010100001101010001010 | 1 | 10100010100101000011010100010100 |
| 00000000100000001100000001011111 | 3 | 00000100000001100000001011111000 |
| 11100000010100110001010010111010 | 2 | 10000001010011000101001011101000 |
| 00001001101110000000001110000000 | 29 | 00000000000000000000000000000000 |
| a | b | a >> b |
|---|---|---|
| 11010001010010100001101010001010 | 1 | 01101000101001010000110101000101 |
| 00000000100000001100000001011111 | 3 | 00000000000100000001100000001011 |
| 11100000010100110001010010111010 | 2 | 00111000000101001100010100101110 |
| 00001001101110000000001110000000 | 27 | 00000000000000000000000000000001 |
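To tie the tables above together, here is a small C++ sketch that applies each of these operators to concrete 32-bit values; the particular values and variable names are just examples of my own:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t value = 0xF4D7;

	// AND masks off bits: keep only the lowest 8 bits.
	uint32_t low_byte = value & 0xFF;
	// OR sets bits: force bit 31 to 1.
	uint32_t with_top_bit = value | 0x80000000;
	// NOT inverts every bit.
	uint32_t inverted = ~value;
	// XOR flips bits: flipping a value with itself yields zero.
	uint32_t zero = value ^ value;
	// Shifts move the bits and fill the freed positions with zeroes.
	uint32_t shifted_left = value << 3;
	uint32_t shifted_right = value >> 3;

	std::cout << std::hex
		<< low_byte << " "
		<< with_top_bit << " "
		<< inverted << " "
		<< zero << " "
		<< shifted_left << " "
		<< shifted_right << std::endl;
	return 0;
}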
Signed Integers

We have seen how unsigned integers are encoded in the 32-bit registers of the microprocessor, but surely the x86 architecture has to deal with signed integers as well. There are different systems of representing signed integers using binary encodings. The most intuitive approach is to sacrifice one bit (the "left-most" one) to represent the sign of the integer. This system is called sign and magnitude and the bit we sacrificed is called the sign bit. A sign bit of 0 means that the number is positive and a sign bit of 1 means that the number is negative. The absolute value is expressed in all the other bits. In an 8-bit register the binary number 10011100 represents the integer \(- (16 + 8 + 4) = - 28\). One irritating property of this system is that there are two representations for the number 0:
00000000 and 10000000 - a "positive" zero and a "negative" zero - which is obviously a waste. The actual major problem with this system is that it is very annoying to perform addition and subtraction in it because there are three different cases to be dealt with (positive + positive, positive + negative, negative + negative).

Another possibility is to represent the negative version of the natural number \(n \in \mathbb{N}\), \(-n\), by inverting all its bits (see the NOT operation). Hence this system is called One's Complement. This way one bit is also sacrificed to represent the sign and there are two representations of the number zero again. So the negative version of 00001111 (which is 15) is 11110000 (which is -15) in our hypothetical 8-bit register. The advantage of this system is that it makes addition/subtraction easier. Consider the operation -7 + 3: 11111000 + 00000011 = 11111011, which is the One's Complement representation of -4. As you can see, we can reduce the complexity of arithmetic operations this way as long as we don't cross the zero barrier. Consider the case -1 + 2: 11111110 + 00000010 = 00000000. This is obviously the wrong result, -1 + 2 is not zero, it's 1! Every time you cross the "zero barrier" in One's Complement you have to fix the result by an offset of 1.

So to solve this problem smart computer scientists came up with Two's Complement. It is similar to One's Complement but all negative numbers have an additional offset of 1. This way there is no redundant zero and you can flawlessly perform arithmetic operations without really having to worry about the sign. Two's Complement is the system used by all important mainstream CPUs in desktop computers including the x86 architecture - so be sure to understand this. There is no inherent difference between storing an unsigned integer and a signed integer in a 32-bit register - it is all about how you treat it. The 32-bit value 11111111111111111111111111111111 can either stand for \(2^{32} - 1 = 4294967295\) (which is the greatest unsigned integer that can be represented with 32 bits) or, if interpreted as a negative number, for -1. To calculate the Two's Complement representation of \(-n\) with \(n \in \mathbb{N}\) you need to:
- Subtract 1
- Invert bits
Both steps are invertible, so to convert a negative number back to its positive counterpart you simply reverse the order and invert each operation: invert the bits, then add 1. Let's have a look at the representations of a few integers in Two's Complement encoding:
| Signed decimal number | Two's Complement encoding |
|---|---|
| 23 | 00000000000000000000000000010111 |
| -1 | 11111111111111111111111111111111 |
| -2 | 11111111111111111111111111111110 |
| -3 | 11111111111111111111111111111101 |
| -2147483648 | 10000000000000000000000000000000 |
| 2147483647 | 01111111111111111111111111111111 |
As you can see the greatest signed number that can be represented with Two's Complement is 2147483647 and the lowest one is -2147483648.
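Here is a short C++ sketch of the Two's Complement rules just described - the two equivalent conversion orders and the reinterpretation of the all-ones pattern; the variable names are my own:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t n = 23;

	// Two ways to build the Two's Complement representation of -n:
	// subtract 1 and then invert, or invert and then add 1.
	uint32_t negative_a = ~(n - 1);
	uint32_t negative_b = ~n + 1;

	// The same 32-bit pattern means different things depending on
	// whether you treat it as unsigned or signed.
	uint32_t all_ones = 0xFFFFFFFF;

	std::cout << negative_a << " " << negative_b << std::endl;                   // both print 4294967273, the bit pattern of -23
	std::cout << all_ones << " " << static_cast<int32_t>(all_ones) << std::endl; // 4294967295 -1
	return 0;
}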
The First Steps
Now that we have covered most of the basics we can finally get started with some x86 ASM. I am going to use Intel syntax for now which is far more popular and more intuitive than AT&T syntax. Let's have a look at the following code:
Assembly
mov eax, 1
mov ebx, 10
xor ecx, ecx
our_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle our_loop
Most of these lines represent an instruction that gets executed by the processor. The pattern is instruction [argument 1], [argument 2], ... Let's go through it line by line. mov eax, 1 calls the instruction mov with the arguments eax and 1. The mov destination, source instruction is kind of like the assignment operator in C++. It copies the value of source to destination. In this case it writes the constant 1 to the 32-bit general purpose register eax. That is one of the registers we can use for all kinds of calculations without breaking anything. Here is a list of the 32-bit general purpose registers of the x86 architecture:
- eax
- ebx
- ecx
- edx
- edi
- esi
- ebp
As you can see they follow different naming patterns, and historically some of them are used by special instructions which hardly appear in modern code. Those instructions are mostly still supported for legacy reasons, but they are inefficient and are not produced by modern compilers. After that instruction has been processed, the instruction pointer gets increased. The instruction pointer contains the address of the instruction that is currently being executed. The instruction pointer of the x86 architecture is stored in the 32-bit register eip. So eip gets increased by the length of the mov eax, 1 instruction after it has been executed, and eip then points to the next instruction, which is mov ebx, 10. As you might have guessed, mov ebx, 10 writes the value 10 to the register ebx.

But now it gets interesting. xor ecx, ecx is basically the x86 ASM equivalent of the C++ code ecx ^= ecx;. You might ask: What is the point? The educated mind sees immediately that a ^ a is equal to zero for all values of a. So what this does is set ecx to zero. You might wonder why you do not use mov ecx, 0 instead. Well, you could, but xor ecx, ecx is actually shorter to encode because it does not involve any constant. By always using xor reg, reg instead we reduce the overall file size of our binaries, and this can save time and space.

But this instruction does more than just modify the ecx register. The x86 microprocessor has a set of so called flags. These are 1-bit values which are used for conditional jumps and other operations. We do not think about them most of the time and they are not too relevant for writing conventional code, but it is absolutely essential to know that they exist, what they are good for and how they are set. All arithmetic-logical operations set some of these flags. The mov instruction does not modify the flags, but the xor instruction sets several of them. The Intel x86 Instruction Set Reference says: "The OF and CF flags are cleared; the SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined." You can look such data up in the original PDF documents released by Intel. Here is a list of the most important x86 flags and what they stand for:
| Abbreviation | Full name |
|---|---|
| CF | Carry Flag |
| PF | Parity Flag |
| ZF | Zero Flag |
| SF | Sign Flag |
| OF | Overflow Flag |

When you perform the addition of two large unsigned 32-bit numbers the result can obviously be too large to be stored in a 32-bit register, so what you get is a so called wraparound or an overflow. Consider the addition of the following numbers:
| Description | Binary value |
|---|---|
| Summand 1 | 11111111111111111111111111111111 |
| Summand 2 | + 00000000000000000000000000000001 |
| Sum | 100000000000000000000000000000000 |

The result requires at least 33 bits to be represented properly. That leading 33rd bit is basically discarded and not stored in the target register - instead it gets stored in the Carry Flag. The Carry Flag is basically always the "33rd" bit. Actually, it is not a standalone 1-bit value but a single bit at offset 0 in a 32-bit register known as the EFLAGS register. This is where all the flags are stored - so obviously there are far more flags than just the ones listed above, but most of them are not of any relevance for writing user land applications. The Parity Flag is set to 1 if the right-most 8 bits of the result contain an even number of 1-bits and to 0 otherwise. It is basically the result of XORing those 8 bits with each other and then inverting the result. The Zero Flag is set when the result is equal to zero - very useful. The Sign Flag is the left-most bit of the 32-bit result - it represents its sign. The Overflow Flag is set when a signed overflow occurred.
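The meaning of these flags can be reproduced in C++ when adding two 32-bit values. The following sketch of my own only illustrates what the flags express, not how the CPU computes them:

C++

#include <cstdint>
#include <iostream>

int main()
{
	uint32_t a = 0xFFFFFFFF;
	uint32_t b = 0x00000001;
	uint32_t sum = a + b; // wraps around to 0

	// Carry Flag: the unsigned addition did not fit into 32 bits.
	bool carry = sum < a;
	// Zero Flag: the 32-bit result is zero.
	bool zero = sum == 0;
	// Sign Flag: the left-most bit of the result.
	bool sign = (sum >> 31) & 1;
	// Overflow Flag: both operands have the same sign but the result has the other one,
	// which is impossible for a correct signed addition.
	bool overflow = ((~(a ^ b) & (a ^ sum)) >> 31) & 1;

	std::cout << carry << " " << zero << " " << sign << " " << overflow << std::endl; // 1 1 0 0
	return 0;
}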
But let us get back to the actual code we were discussing. In this program none of the flags that are set by those arithmetic-logical operations actually get used - they are simply overwritten. The next line in the code, our_loop:, is not an actual instruction that gets executed but an instruction to your assembler which allows you to define labels. These are markings in the code which are essential for controlling the flow of execution by providing named locations to which you can jump. add ecx, eax is the x86 ASM equivalent of ecx += eax; and it sets the flags again - which again go unused. inc eax increments the value of eax by 1 - it is the equivalent of eax++;. We could have used add eax, 1 instead and it is pretty much the same, but the inc instruction is smaller so it is preferable. It does not affect the Carry Flag, but that does not make any difference in this case.

cmp eax, ebx is short for "compare" and it performs the subtraction eax - ebx, but the result is not stored anywhere. Instead it simply sets the flags we just talked about according to the result. This is an essential way to check whether two registers have equal values, for example. In case they are equal the result of the subtraction is zero, so the Zero Flag gets set and we can act accordingly. A cmp is the first part of the x86 ASM equivalent of a C++ if construct, so to say. The next part is one of the conditional jumps - in this case it is jle our_loop. jle is short for "jump if less or equal". So, what is a jump? It is a method of controlling the flow of execution of your program by basically telling your microprocessor where to resume the execution after the jump. In the case of the x86 32-bit architecture it means modifying the value of eip. When jumps actually get encoded they just contain numeric offsets which are either relative to the position of the jump ("jump 25 bytes backward") or even absolute ("go to the instruction at offset 0x12345678"). There are unconditional jumps which are executed no matter what and conditional jumps which only perform the jump under certain conditions. If a conditional jump is not taken, the execution simply continues with the next instruction after the jump as usual. our_loop refers to the label we previously defined, so if the jump gets executed the next instruction will be add ecx, eax.

So in this case we are dealing with a jump which is executed if the flags indicate that the result is "less or equal". What is this supposed to mean?! Less than or equal to what? The cmp operation is usually used in combination with a conditional jump. Together they are the x86 ASM equivalent of the C++ if construct. So what the "less or equal" part relates to is the cmp executed prior to the conditional jump. Read cmp eax, ebx in combination with jle our_loop as if(eax <= ebx) goto our_loop; - resume execution at label our_loop if eax is less than or equal to ebx. There is actually a whole family of mnemonics for such conditional jumps:
| Mnemonic | Meaning |
|---|---|
| ja | Jump if above (CF=0 and ZF=0). |
| jae | Jump if above or equal (CF=0). |
| jb | Jump if below (CF=1). |
| jbe | Jump if below or equal (CF=1 or ZF=1). |
| jc | Jump if carry (CF=1). |
| je | Jump if equal (ZF=1). |
| jz | Jump if 0 (ZF=1). |
| jg | Jump if greater (ZF=0 and SF=OF). |
| jge | Jump if greater or equal (SF=OF). |
| jl | Jump if less (SF!=OF). |
| jle | Jump if less or equal (ZF=1 or SF!=OF). |
| jna | Jump if not above (CF=1 or ZF=1). |
| jnae | Jump if not above or equal (CF=1). |
| jnb | Jump if not below (CF=0). |
| jnbe | Jump if not below or equal (CF=0 and ZF=0). |
| jnc | Jump if not carry (CF=0). |
| jne | Jump if not equal (ZF=0). |
| jng | Jump if not greater (ZF=1 or SF!=OF). |
| jnge | Jump if not greater or equal (SF!=OF). |
| jnl | Jump if not less (SF=OF). |
| jnle | Jump if not less or equal (ZF=0 and SF=OF). |
| jno | Jump if not overflow (OF=0). |
| jnp | Jump if not parity (PF=0). |
| jns | Jump if not sign (SF=0). |
| jnz | Jump if not zero (ZF=0). |
| jo | Jump if overflow (OF=1). |
| jp | Jump if parity (PF=1). |
| jpe | Jump if parity even (PF=1). |
| jpo | Jump if parity odd (PF=0). |
| js | Jump if sign (SF=1). |
| jz | Jump if 0 (ZF=1). |

As you can see several of these mnemonics basically do the same thing and they are actually encoded the same way in the actual executable files - they just exist to make the programmer's job easier. So let's translate this assembly code to C++ so we can see what it does:
C++
int eax, ebx, ecx;
//...
//mov eax, 1
eax = 1;
//mov ebx, 10
ebx = 10;
//xor ecx, ecx
ecx = 0;
our_loop:
//add ecx, eax
ecx += eax;
//inc eax
eax++;
//cmp eax, ebx
//jle our_loop
if (eax <= ebx) goto our_loop;
As you can see there is some kind of iteration going on. eax is the iterator which is involved in the decision when to leave the loop, ebx contains the limit for the iterator and ecx += eax is the actual body of the loop. So we can further simplify the program to:

C++
int sum = 0;
for (int i = 1; i <= 10; i++)
	sum += i;
Nice, now that is actually really short and easy to comprehend C/C++ code. So it actually calculates the sum 1 + 2 + ... + 9 + 10 = 55. Currently the upper limit is 10 - what do we do if we want to set this limit to an arbitrary number? In a high level language we would write a new function for this purpose which takes the limit for the loop as its sole argument and which returns an integer, something along the lines of:
C++
int calculate_sum(int limit)
{
	int sum = 0;
	for (int i = 1; i <= limit; i++)
		sum += i;
	return sum;
}
So, how do we translate this back to x86 ASM? There is a direct equivalent for functions, too. Let us check out what it looks like when we translate this function to assembly language and call it with a larger limit of 20:
Assembly
calculate_sum:
mov eax, 1
mov ebx, [esp + 4]
xor ecx, ecx
sum_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle sum_loop
mov eax, ecx
ret 4

start:
push 20
call calculate_sum
The first change you can see is the calculate_sum label that has been added. It is necessary so we have a marker in the code which we can use to call the function. The next change is the mov ebx, [esp + 4] part. The brackets mean that we perform a memory access (RAM). It is the x86 ASM equivalent of the C/C++ * operator used to dereference pointers. So in this case the pointer is esp + 4. The x86 architecture supports address expressions of the form register + constant_offset for this purpose. So in this case the address esp + 4 is calculated and dereferenced. But how many bytes are read and stored in ebx? The register is 32 bits in size, so 4 bytes are read, beginning at the address esp + 4 - the bytes at esp + 4, esp + 5, esp + 6 and esp + 7 are stored in ebx. But in what order? Which byte goes where? The x86 architecture uses the so called Little Endian byte order in memory. In this order the least significant byte is stored first in memory:
Binary value: 10010011000100001100000001110010

In this value the first (left-most) bit is the most significant bit and the last (right-most) bit is the least significant bit. Significance means how much of a difference the bit can make to the value of the integer. The most significant bit has a value of \(2^{31}\) whereas the least significant bit only makes a difference of \(2^0 = 1\). We do not have to worry about the byte order as long as we stick to writing values of a size of \(n \in \mathbb{N}\) bytes to memory and reading \(n\) bytes later. When we read bytes from memory with an instruction like mov ebx, [esi] the x86 microprocessor reads 4 bytes in Little Endian order, of course. The only time we have to worry about it is when we have an integer of \(n > 1\) bytes in memory and we want to access \(m\) particular bytes of it with \(m \in \mathbb{N} \land 1 \le m < n\). In that case we have to know which endianness the underlying system uses so we read the right bytes. The opposite of Little Endian byte order is Big Endian. In Big Endian byte order the most significant byte is stored first - just like we do in normal hexadecimal and decimal notation. We write down the number 931 to express \(9 \cdot 100 + 3 \cdot 10 + 1 \cdot 1\). If we applied the Little Endian philosophy to this we would write 139 instead of 931, so to say. Let us have a look at a few examples of how bytes are stored in memory with different endiannesses:
| Hexadecimal notation | Consecutive bytes in memory with Little Endian byte order | Consecutive bytes in memory with Big Endian byte order |
|---|---|---|
| 0xFF | 0xFF | 0xFF |
| 0x1234 | 0x34, 0x12 | 0x12, 0x34 |
| 0x09451143 | 0x43, 0x11, 0x45, 0x09 | 0x09, 0x45, 0x11, 0x43 |
So for single bytes it does not make any difference, but as soon as you deal with integers which require two or more bytes the endianness becomes an issue. Big Endian byte order is much more intuitive as you can see, since it matches the way we usually write numbers, so it is easier for humans to read when looking at hex dumps and the like. With Little Endian byte order you have to reverse the bytes first in order to parse them the way you usually do. Unluckily the x86 architecture uses the Little Endian byte order, so there is not much we can do about it. Even the internet is based on the Big Endian byte order - all the official protocols such as IP, TCP and UDP use it. This is why it is also called Network Byte Order. When you send multi-byte integer values in the headers of these protocols from a Little Endian machine, all the bytes must be reversed first in order to satisfy the specifications - what a bummer!
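You can observe the byte order of the machine you are on with a few lines of C++. This is a sketch of my own; on an x86 system it prints the bytes of the constant in Little Endian order:

C++

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
	uint32_t value = 0x09451143;
	unsigned char bytes[sizeof value];
	// Copy the 32-bit integer into a byte array to inspect how it is laid out in memory.
	std::memcpy(bytes, &value, sizeof value);
	// On a Little Endian machine (such as x86) this prints: 43 11 45 09
	for(std::size_t i = 0; i < sizeof value; i++)
		std::printf("%02x ", bytes[i]);
	std::printf("\n");
	return 0;
}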
Now that we have cleared this up we should get back to the statement we were originally discussing, mov ebx, [esp + 4]. There is a new register in that statement: esp. Its name is short for extended stack pointer. It is actually a 32-bit "general purpose register" just like the other 7 ones that we listed above, but messing with it can have disastrous results and there is not really a point in using it for any other purpose in normal code. The stack is a part of the memory your application uses. The C++ analogy to this mechanism is std::stack (which I hope you are familiar with) and in Java it would be java.util.Stack. A stack is one of the basic container/data structure types which (in theory) only allows you to access the element at the top of the stack by "popping" it off and allows you to add new elements to the stack by "pushing" them onto the top. In terms of microprocessors it is a bit different because you usually also access elements inside the stack which are not at the top, because that is more efficient. Let us have a look at a few examples of operations on an abstract stack of integers. This is what our stack looks like at the beginning (do not be confused by the "Bottom of the stack" part, the bottom does not serve a special purpose and is just marked to clarify things):
| Description | Value |
|---|---|
| Top of the stack | 4 |
| | 17 |
| Bottom of the stack | 2 |

We are going to execute a series of operations on this stack now to observe the changes. First we will push an 8 onto the top. The C++ pseudo code equivalent for this action shall be:

push(8);

| Description | Value |
|---|---|
| Top of the stack | 8 |
| | 4 |
| | 17 |
| Bottom of the stack | 2 |

As you can see the 4 is no longer the top of the stack and the value we just pushed onto the stack is now at the top. Next pseudo code to execute:

int a = pop();
int b = pop();

| Description | Value |
|---|---|
| Top of the stack | 4 |
| | 17 |
| Bottom of the stack | 2 |

| Description | Value |
|---|---|
| Top of the stack | 17 |
| Bottom of the stack | 2 |

The pseudo function pop() pops one value off the stack and returns its value. So the two variables we used in the pseudo code we just executed now have the values a == 8 and b == 4. So what is the point of this whole stack? What does it get used for in the x86 architecture and other microprocessor architectures? It is an essential data structure which is used to:
- Pass arguments to functions
- Store return addresses
- Allocate local variables which are too large for the registers
We will get back to what a return address is and how we allocate space for local variables later. First I would like to discuss the necessity of the stack for passing arguments to functions. You might wonder: "Why do you not simply pass the arguments in registers?" There actually is a special type of calling convention for functions which does use this mechanism and it is known as fastcall - because it is extremely fast, since it uses registers for the first few arguments. This, too, is not yet the point to discuss calling conventions, though. The problem with having all your arguments in registers is that you have a very limited number of registers but a large number of functions which call each other in complicated nested ways, so you will eventually run out of registers one way or the other. Recursive function calls of arbitrary call depth are physically not possible without storing the arguments somewhere in memory. This is why we pass most arguments on the stack. We push the arguments onto the "top", call our function and remove them after that.

The x86 architecture stores the current pointer to the top of the stack, the stack pointer, in the esp register. So mov eax, [esp] performs a memory access and copies the value of the top element of the stack into the eax register. We are dealing with a 32-bit architecture, so all elements on the stack are of the same size, which is 4 bytes. This is why esp will usually have a value which is a multiple of 4. You might expect that mov eax, [esp - 4] accesses the element right "below" the top one - but that is not so. On the x86 architecture the next element below the top is actually stored at esp + 4, so we have to use mov eax, [esp + 4] to copy its value to eax. The third element would be at esp + 8, the fourth at esp + 12 and so on. The reason for this somewhat counter-intuitive order is that the x86 stack grows downwards, towards lower addresses - pushing a new element decreases esp, as you will see in a moment. The x86 instruction to push a new element onto the top of the stack is push. It takes one argument which can be a register or an immediate value, for example. An immediate value is any constant value like 2 or 0x00ff1234 which is directly encoded into the instruction. So what push eax does is:
- sub esp, 4 (decrease the stack pointer by the word size of the microprocessor to make space for the new top element)
- mov [esp], eax (write the value of eax to the memory at the address contained in esp - the top of the stack is now equal to eax)
You can basically replace push with those two instructions all the time, but x86 is a CISC architecture, so we save space by using push instead. sub is short for subtract and, as you might have guessed already, it performs a subtraction like add performs an addition. The arguments are intuitive: sub register1, register2 is equivalent to the C++ code register1 -= register2; or register1 = register1 - register2;. The x86 instruction to pop an element off the stack is called pop. Like push, it takes one argument. Usually it is a register, so let us check out what pop ebx does:
- mov ebx, [esp] (read 4 bytes from the memory at the address specified by the register esp and store them in ebx)
- add esp, 4 (add 4 to the stack pointer, which makes the element "below" the old top the new top element of the stack)
So, to get back to the piece of x86 ASM we were previously discussing: what mov ebx, [esp + 4] really does is copy the first argument passed to the function into the ebx register. The first argument is the second element of the stack, the second argument is the third and so on. So what does the top of the stack contain? What is stored at [esp]? This is the so called return address, which is necessary for a function to know where it was originally called from, so that eip, the instruction pointer, can be set to the right value after the function returns. We want our program to resume execution with the instruction after our call, after all.
The start label is not really relevant for understanding the principle, but I simply wanted to point out where the execution actually starts in case somebody got confused by the stream of x86 ASM instructions. The assembly code push 20, call calculate_sum first pushes the immediate value 20 onto the top of the stack and then calls the function. Remember that we only see this label in our assembly language code. It does not exist as such in the actual binary file. In the executable file you will simply see the call with an address after it. This address points to the first instruction of the function we marked with the label calculate_sum. In this case it is the instruction mov eax, 1. This is what call address actually does:
- esp -= 4; (make space for the return address on the stack)
- *esp = address_of_next_instruction; (store the address of the instruction after the call on the stack)
- eip = address; (modify the instruction pointer so execution resumes in the body of the function)
As you can see I did not use x86 ASM instructions to describe what this instruction does. Instead, I resorted to the C++ pseudo code I use occasionally. This is because we cannot modify eip like a normal register. It takes a jump or a call (which is similar to a jump) to achieve that. What the combination of push 20 and call calculate_sum really translates to in C++ is calculate_sum(20);. The function calculate_sum is called with the argument 20 - which is what we originally wanted. You will see the push/call construct quite frequently. Usually functions have more than one argument, so you will see multiple pushes prior to a call. Perhaps you have noticed that another element has been added to the function. The ret instruction is used to return from a function after it has been called. Additionally it can take an argument, which in the case of ret 4 is equal to 4. Let us have a look at what this ret argument instruction does:
- eip = *esp; (copy the return address from the stack back into the instruction pointer to resume execution at the instruction after the call)
- esp += 4 + argument; (remove the return address from the stack and, if necessary, the arguments that were pushed onto the stack for the function)
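To summarise these stack operations, here is a small C++ sketch of my own that models push, pop, call and ret n with an array and an index. The downward growth and the use of one array slot per 4-byte stack element are simplifications made purely for illustration:

C++

#include <cstdint>
#include <iostream>

// A tiny model of the x86 stack: esp starts at the high end of the
// memory array and pushing decreases it, just like on the real CPU.
uint32_t memory[256];
uint32_t esp = 256;
uint32_t eip = 0;

void push(uint32_t value)
{
	esp -= 1;            // on the real CPU: sub esp, 4
	memory[esp] = value; // mov [esp], value
}

uint32_t pop()
{
	uint32_t value = memory[esp]; // mov target, [esp]
	esp += 1;                     // add esp, 4
	return value;
}

void call(uint32_t address, uint32_t return_address)
{
	push(return_address); // store where to resume after the function
	eip = address;        // jump into the function body
}

void ret(uint32_t argument_slots)
{
	eip = pop();           // restore the return address
	esp += argument_slots; // drop the arguments pushed by the caller
}

int main()
{
	push(20);     // push 20
	call(100, 5); // call calculate_sum (pretend it lives at 100, return to 5)
	ret(1);       // ret 4 - here one slot stands in for 4 bytes
	std::cout << "eip = " << eip << ", esp = " << esp << std::endl; // eip = 5, esp = 256
	return 0;
}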
If a function takes \(n \in \mathbb{N}\) arguments you will usually return from it with the instruction \(\mbox{ret } 4 \cdot n\). If the function is nullary (if it takes no arguments), you can simply use ret instead of ret 0. You also might have noticed the additional mov eax, ecx I inserted prior to the ret 4. I did this because it is a common calling convention to return the result of a function in the eax register, which is the first of the 8 (or rather 7 without esp) general purpose registers. Of course you do not have to do this, but if you interface with C/C++ code you will have to know this. Now that we have covered the basics of calling and returning we can move on to the topic of calling conventions. You might remember me talking about the fastcall convention - it is one of them. If you are an experienced C/C++ programmer you should already be familiar with a few of them. They differ in the following regards:
- How do arguments get passed? On the stack? In registers?
- In what order are arguments passed?
- Who is responsible for fixing the stack after the function has processed the arguments?
- What registers must be preserved by the function?
Let us take a look at the most common calling conventions:
| Calling convention | Where are the arguments stored? | In what order are the arguments passed? | Who is responsible for cleaning up the stack? | Registers that have to be preserved |
|---|---|---|---|---|
| cdecl (C declaration) | Stack | Right to left | Calling function | ebx edi esi ebp |
| stdcall (standard call) | Stack | Right to left | Called function | ebx edi esi ebp |
| fastcall (fast call) | Registers and stack | 1. ecx, 2. edx, 3. and further: stack, right to left | Called function | ebx edi esi ebp |
You will find the cdecl calling convention in pretty much all C based interfaces, especially in the world of UNIX-like operating systems. Windows uses the stdcall convention for most of its API. Right to left means that the rightmost argument in conventional notation is pushed first in the assembly code. The order obviously does not make any difference if there is only a single argument, as in the code we were discussing. If the called function is responsible for fixing the stack - that is, getting rid of the arguments below the return address which were passed to it - it simply does what our code does: it uses the ret instruction with the right argument to clean up the stack. If the calling function is responsible for cleaning up the stack, the function in question usually simply executes a plain ret and leaves all the work to whoever called it. Let us clarify by example:

C++
void __cdecl test1(int a, int b, int c)
{
}

void __stdcall test2(int a, int b, int c)
{
}

test1(1, 2, 3);
test2(1, 2, 3);
This C++ code would look something like this in x86 ASM:
Assembly
test1:
ret

test2:
ret 12

start:
push 3
push 2
push 1
call test1
add esp, 12
push 3
push 2
push 1
call test2
So the differences between these two function calls are:
| Calling convention | Description |
|---|---|
| cdecl (C declaration) | Uses ret to return and forces the calling function to add esp, 12 to fix the stack. |
| stdcall (standard call) | Uses ret 12 to return so the calling function does not have to fix the stack. |
stdcall is more efficient than cdecl here because it requires one instruction less to execute. Do not be fooled by the number of instructions - there are tons of examples in which code that uses more instructions to solve the same problem is actually faster, but in this case the intuition happens to be right because all these simple instructions execute at about the same speed. However, it is not yet the time to talk about optimisation in general, so let us return to the matter at hand. Now that we have covered some of the fundamentals of x86 assembly programming we can finally start writing simple programs that actually interact with the API of the underlying OS. We are going to implement the same basic command line program on different operating systems using various assemblers in the next section of this document. This program will simply be an extension of what we have already written.

Writing Windows Applications Using x86 32-bit Assembly
At first we are going to take a look at how to write assembly for the popular Windows operating system. We are going to check out the Microsoft (R) Macro Assembler Version 9.00.21022.08 (short MASM), the great open source program flat assembler 1.67.27 for Windows (short FASM) and the netwide assembler 2.04 RC 1 (short NASM).
MASM
This is the most used assembler for writing assembly code for the Microsoft Windows operating systems. Before I am going to just throw the code right into your face I would like to dedicate a few words to what is known as MASM32 - a great source of confusion and frustration.
MASM32 (from movsd.com/masm32.com) does not have anything to do with Microsoft or MASM as such. It is a package mostly made by an Australian known as Steve Hutchesson (aka hutch). This package contains lots of headers and example code and it uses the Microsoft ml/link binaries version 6.x from 1998 or so. It is absolutely useless for any modern x86 32-bit Windows assembly programming. Never make the mistake of mixing up MASM and MASM32. I see a lot of people making this mistake. I have even seen people give up on MASM because they thought that MASM32 was it. It is really sickening. As for the actual package itself, the example code may be useful for new ASM programmers but you should stay away from the MASM32 "library" and the outdated headers which give you parsing errors in new versions of MASM.
Let us get back to the actual code. We will go through it line by line:
Assembly
;Arithmetic sum program
.686p
.model flat, stdcall
option epilogue: none
option prologue: none

GetStdHandle proto :dword
wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data
sum_string db "Arithmetic sum for the limit %d is %d", 10, 0

.data?
buffer db 64 dup (?)
bytes_written dd ?

.code

calculate_sum proc limit:dword
xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret
calculate_sum endp

public main

main proc
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov ebp, eax
xor ebx, ebx
main_loop:
invoke calculate_sum, ebx
invoke wsprintfA, addr buffer, addr sum_string, ebx, esi
invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
inc ebx
cmp ebx, 20
jle main_loop
invoke ExitProcess, 0
main endp

end
;Arithmetic sum program is a comment. In MASM comments are made using the semicolon character. There are only single line comments, so it is pretty much the equivalent of the C/C++ //. Unluckily there is no multi line comment feature. .686p specifies which instructions are allowed to be used in the assembly code. There are numerous other options but this one is a no-brainer really - .686p is simply the most "modern" one. .model flat, stdcall specifies two things. For one it specifies the memory model to be used. The flat memory model is the only one supported by modern Windows operating systems, so it is hardly worth discussing. The stdcall part specifies the default calling convention. Since we are dealing with Windows API code we definitely should use stdcall. option epilogue: none and option prologue: none tell the assembler that we do not want to use the default MASM prologues/epilogues. These are instructions that are automatically inserted at the start of your functions and wherever you use ret. It is a nice feature to support, but it is very annoying that it actually puts instructions there you do not see in your code - instructions you never expected to be there until you run your binary in a debugger and find out that some strange feature is totally breaking your assembly! In the case of MASM the default code establishes a so called stack frame. This is mostly a nasty relic of the past which is still used by unenlightened copycats who never bother to check whether such features actually make sense or not. I am just going to demonstrate this dumb practice to you once:

Assembly
;non-sensical function
some_function proc
xor eax, eax
jmp return
ret
return:
mov eax, 1
ret
This function does not perform any meaningful task, it is merely an example. With the default epilogue/prologue code this will assemble to:
Assembly
push ebp
mov ebp, esp
xor eax, eax
jmp some_constant ;I do not know the immediate value that will be used here so I just made up some name.
pop ebp
ret
mov eax, 1
pop ebp
ret
So as you can see, every proc (a fancy way of declaring a function, short for "procedure") will generate an additional push ebp and mov ebp, esp, and an additional pop ebp prior to every single ret. So at first the value of ebp is supposed to get preserved by being copied to the stack. Then esp is copied to ebp. In the end the original value of ebp gets restored by pop ebp. So, what is the point of all this? When you manipulate esp in your function body you have to keep track of what offset the arguments which were passed to your function are at. The idea of the stack frame is to preserve the original value so you do not have to do any thinking to come up with the offset. The truth is that stack frames are useless unless you are performing operations which do not let you predict the argument offsets at assembly time. It is a waste of instructions. All you usually achieve is being able to write stuff like mov eax, [ebp + 4] instead of mov eax, [esp + 4], which is really idiotic in general. So do not use this feature. Now that we have cleared this up, let us get back to the code at hand.
GetStdHandle proto :dword declares the prototype (proto) of a function called GetStdHandle which uses the default calling convention we specified - we specified stdcall in the .model directive. We could even have used GetStdHandle proto stdcall :dword to be more explicit about it, but that is unnecessary in this case since we already specified a default calling convention. If you do not know what a particular function does you should definitely look it up on MSDN. You might even know some of these Windows API calls, but you are probably surprised by the -A suffixes some of them have. The Windows API uses the -A suffix for "ASCII" and the -W suffix for "wide chars" - Unicode. They are redirected according to the preprocessor definitions in the C interface. The -A variants usually simply call the -W ones after a few string conversions. Windows handles most strings internally as Unicode. It is more convenient to use ASCII in this case, though, so we use the -A interfaces. On the next few lines a few more prototypes are declared and the majority of them are just like this one, but the wsprintfA one stands out because of the vararg part. This nasty function takes a variable number of arguments and it is up to the caller to clean up the stack afterwards. It uses the C calling convention, hence the c part. These prototypes basically work like function declarations in C/C++. We definitely need them for external calls, so we have to declare them using the proto directive before we can use them.
STD_OUTPUT_HANDLE equ -11 is a directive which will replace all future occurrences of the name STD_OUTPUT_HANDLE with the constant -11. We need this one for GetStdHandle. The .data directive declares the start of the data section. This is where we store static data/global variables, so to say. sum_string db "Arithmetic sum for the limit %d is %d", 10, 0 declares a zero terminated string "Arithmetic sum for the limit %d is %d\n" by the name of sum_string. db is short for "declare byte(s)", I believe. Consecutive bytes can simply be declared using the comma notation you can see in that line. The .data? directive introduces the uninitialised data section. We store static variables which have no predefined value in this part of the program. buffer db 64 dup (?) declares a range of 64 bytes which have no defined value, as specified by the question mark in the parentheses (?). dup is short for "duplicate", or so I believe. bytes_written dd ? uses the dd directive, which is short for "declare DWORD", again with no predefined value. At this point I should probably explain the Microsoft/Intel terms for different byte counts:
| Term | Full name | Size in bytes |
|---|---|---|
| BYTE | Byte | 1 |
| WORD | Word | 2 |
| DWORD | Double word | 4 |
| QWORD | Quad word | 8 |

Actually, "word" usually refers to whatever the normal size of the general purpose registers of a microprocessor may be, which in the case of the x86 32-bit architecture happens to be 4 bytes. Intel and Microsoft somewhat hijacked these terms, so you have to be careful not to mix this stuff up.
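If you work with these sizes from C++, the fixed-width integer types line up with the Microsoft/Intel terms. A quick sketch to make the mapping explicit:

C++

#include <cstdint>

// BYTE, WORD, DWORD and QWORD in the Microsoft/Intel sense correspond to
// the fixed-width unsigned integer types of <cstdint>.
static_assert(sizeof(std::uint8_t) == 1, "BYTE is 1 byte");
static_assert(sizeof(std::uint16_t) == 2, "WORD is 2 bytes");
static_assert(sizeof(std::uint32_t) == 4, "DWORD is 4 bytes");
static_assert(sizeof(std::uint64_t) == 8, "QWORD is 8 bytes");

int main()
{
	return 0;
}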
.code marks the beginning of the code section where all the instructions are stored. calculate_sum proc limit:dword declares the beginning of a function (procedure) by the name of calculate_sum which takes one argument that happens to go by the name of limit - mostly for documentation purposes, though. MASM does support statements like mov eax, limit, but in this case it would translate to mov eax, [ebp + 4] because this inferior software simply assumes that you use the stupid stack frame thing I previously explained. So just stick to accessing the arguments using esp directly - it is much more comfortable and efficient in the long run. The rest of the function is actually pretty much the same - except for the ret. As you can see there no longer is any argument on the return instruction. This is managed by MASM automatically since we specified the number of arguments in the proc line. calculate_sum endp declares the end of the procedure. Requiring the programmer to repeat the name of the function there is really redundant and I have no idea why they coded it this way, but we do not have much of a choice if we want to use the proc feature in MASM. end marks the end of the source file.
public main makes the label/function main a public symbol so we can later specify it as the entry point of our program on the linker command line. main proc declares the beginning of the main procedure. I chose to call it main because of the meaning of the name in C/C++, but you could pretty much give it any name you want. The main procedure contains the code that gets executed first. invoke is not an actual x86 instruction - it is a MASM macro which makes calling functions with one or more arguments more comfortable. As you can see it allows you to call functions pretty much as in C/C++. The addr directive gets the address of a named object inside the data section. invoke wsprintfA, addr buffer, addr sum_string, ebx, esi calls the user32.dll function wsprintfA, which has a variable number of arguments and uses the C calling convention. We use it to fill the static buffer with the formatted string output, with the appropriate values plugged in where the %d placeholders are - but you should be familiar with this nasty stuff from C and possibly other high level programming languages. The rest of the program should be fairly obvious with the knowledge you now possess, so let us find out how to assemble, link and run this stuff.

At first you will need to get your hands on the relevant MASM binaries. mspdb80.dll and link.exe come with Microsoft Visual C++ Express 9 which can be downloaded for free from http://www.microsoft.com/express/Downloads/. I recommend that you get the offline installation instead of the annoying web installer - it simply gives you more control. I do not know where to get the current ml.exe from for free. You can get the 8.0 one at http://www.microsoft.com/downloads/details.aspx?FamilyID=7a1c9da0-0510-44a2-b042-7ef370530c64&displaylang=en apparently, but that is outdated. If you already own a legal or illegal copy of Microsoft Visual C++ 9 you will not need to download it, obviously. We only care about three files out of the installation anyway. They are:
- ml.exe (771 KiB)
- link.exe (349 KiB)
- mspdb80.dll (188 KiB)
Ok, actually we can use more of the files - the lib and dll files. Here is the command line I used to assemble and link this program:
D:\Code\Assembly\calculate_sum\source>"D:\Code\Assembly\binaries\32\ml.exe" /c /Cx /coff /nologo calculate_sum.asm
Assembling: calculate_sum.asm

D:\Code\Assembly\calculate_sum\binary>"D:\Code\Assembly\binaries\32\link.exe" /MACHINE:X86 /NODEFAULTLIB /NOLOGO /ENTRY:main /OUT:calculate_sum.exe /SUBSYSTEM:CONSOLE calculate_sum.obj /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" kernel32.lib "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\msvcrt.lib"
ml.exe is the actual Microsoft Macro Assembler which we use to produce object files. link.exe is the Microsoft linker which produces the actual binaries from the object files. Let us first quickly go through the arguments of each program.
/c tells it that you want to generate a single object file without having ml produce an .exe directly. /Cx makes it case sensitive: "Preserve case in publics, externs". /coff makes it generate COFF files which is the standard object format Windows uses. /nologo suppresses the Microsoft copyright message output. calculate_sum.asm is simply the name of the source file in which I stored the assembly code.

Now for the linker arguments. /MACHINE:X86 specifies the x86 machine as target platform. /NODEFAULTLIB specifies that it must not use default libraries. /NOLOGO does the same thing as before. /ENTRY:main specifies the main entry point of the program - this is the place where the execution starts first. This is directly related to the public main in our assembly code. /OUT:calculate_sum.exe specifies the output file name which is our binary - calculate_sum.exe. /SUBSYSTEM:CONSOLE specifies the subsystem to use. In this case we want a console so we use the console setting of the Windows PE loader. calculate_sum.obj is the output from ml.exe. /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" specifies the default path where link.exe should look for library files to link with. kernel32.lib is an essential library which we need for the ExitProcess call. "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\msvcrt.lib" links the object file with the Microsoft Visual C runtime. We need this to use all the C library functions we are using. So after executing both lines we should have a working small 2.5 KiB executable which we are going to run now:

D:\Code\Assembly\calculate_sum\binary>calculate_sum.exe
Arithmetic sum for the limit 0 is 0
Arithmetic sum for the limit 1 is 1
Arithmetic sum for the limit 2 is 3
Arithmetic sum for the limit 3 is 6
Arithmetic sum for the limit 4 is 10
Arithmetic sum for the limit 5 is 15
Arithmetic sum for the limit 6 is 21
Arithmetic sum for the limit 7 is 28
Arithmetic sum for the limit 8 is 36
Arithmetic sum for the limit 9 is 45
Arithmetic sum for the limit 10 is 55
Arithmetic sum for the limit 11 is 66
Arithmetic sum for the limit 12 is 78
Arithmetic sum for the limit 13 is 91
Arithmetic sum for the limit 14 is 105
Arithmetic sum for the limit 15 is 120
Arithmetic sum for the limit 16 is 136
Arithmetic sum for the limit 17 is 153
Arithmetic sum for the limit 18 is 171
Arithmetic sum for the limit 19 is 190
Arithmetic sum for the limit 20 is 210
D:\Code\Assembly\calculate_sum\binary>
FASM
Flat assembler is a great multi platform open source program which you can download at http://flatassembler.net/download.php. Here is the equivalent code:
Assembly
format PE Console 4.0
entry start
include 'D:\Downloads\fasmw16727\INCLUDE\win32a.inc'
include 'D:\Downloads\fasmw16727\INCLUDE\macro\masm.inc'
option prologue:none
option epilogue:none
section '.data' data readable writeable
sum_string db 'Arithmetic sum for the limit %d is %d',10,0
section '.bss' readable writeable
buffer db 64 dup (?)
bytes_written dd ?
section '.code' code readable executable
proc calculate_sum limit
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
endp
start:
    invoke GetStdHandle, STD_OUTPUT_HANDLE
    mov ebp, eax
    xor ebx, ebx
main_loop:
    stdcall calculate_sum, ebx
    invoke wsprintfA, buffer, sum_string, ebx, esi
    invoke WriteConsoleA, ebp, buffer, eax, bytes_written, 0
    inc ebx
    cmp ebx, 20
    jle main_loop
    invoke ExitProcess, 0
section '.idata' import data readable
library \
    kernel32, 'kernel32.dll', \
    user32, 'user32.dll'
import kernel32, \
    GetStdHandle, 'GetStdHandle', \
    WriteConsoleA, 'WriteConsoleA', \
    ExitProcess, 'ExitProcess'
import user32, \
    wsprintfA, 'wsprintfA'
format PE Console 4.0 specifies the format of the output file. As you can see from this, FASM is both an assembler and a linker. It specifies the portable executable format (PE) which is used by Windows and the console subsystem just like last time. entry start specifies the entry point - in this case it is the label start. The include directive includes another FASM file. In this case I included files from the FASM installation which contain information about the Win32 API and a file which contains macros. We require those macros for the next two lines: option prologue:none, option epilogue:none are basically equivalent to the MASM directives. Unluckily FASM, too, uses a stack frame for procs by default - what a shame. So we need to use those directives to get rid of the stuff again. The section directives declare new sections, their names and their properties. The following lines do not require any special explanations, with two exceptions. The .bss section is used for uninitialised data and it is basically equivalent to MASM's .data?. stdcall calculate_sum, ebx is different, aswell. For some peculiar reason you cannot use the invoke macro in that case and I really have no idea why. It will give you an assembler error if you do. The only difference between those calls is the fact that the invoke lines operate on imported functions whereas calculate_sum was defined by us. Another difference to MASM is the section '.idata' import data readable part which actually defines an import section. This part is not visible in MASM as such. We define which functions to import from dynamic Windows libraries (DLL) in this part of the code so we can call and use them. The command line is unspectacular:

D:\Downloads\fasmw16727>FASM.EXE D:\code\Assembly\calculate_sum_fasm\calculate_sum.asm D:\code\Assembly\calculate_sum_fasm\calculate_sum.exe
flat assembler version 1.67.27 (1207647 kilobytes memory)
3 passes, 2048 bytes.
D:\Downloads\fasmw16727>
The output of the executable is the same as before so we do not need to go over it again.
NASM
The netwide assembler is another multi platform open source assembler which I do not have much experience with. Windows include files come in a separate package known as NASMX and unluckily it is outdated and does not work well with the current version of NASM (produces warnings, possibly errors). I tried to use it at first for this small example but eventually I gave up on it and went the hard way. I shall link NASMX here aswell in case it gets updated at some point but I cannot recommend getting it right now except as example code. Go to http://nasm.sourceforge.net/ to download NASM and http://www.asmcommunity.net/projects/nasmx/ to get the outdated NASMX. I am using the Microsoft linker with the NASM object file output in this example but you can use other linkers aswell. The NASMX package comes with a different linker but I will stick with Microsoft's current linker which comes with Visual Studio 2008. Here is the equivalent code in NASM which is "rawer" than the MASM and FASM one:
Assembly
%define STD_OUTPUT_HANDLE -11
extern _GetStdHandle@4
extern _WriteConsoleA@20
extern _wsprintfA
extern _ExitProcess@4
global _start
[section .data]
format_string db 'Arithmetic sum for the limit %d is %d',10,0
[section .bss]
buffer resb 64
bytes_written resd 1
[section .text]
calculate_sum:
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
_start:
    push STD_OUTPUT_HANDLE
    call _GetStdHandle@4
    mov ebp, eax
    xor ebx, ebx
main_loop:
    push ebx
    call calculate_sum
    push esi
    push ebx
    push format_string
    push buffer
    call _wsprintfA
    add esp, 16
    push 0
    push bytes_written
    push eax
    push buffer
    push ebp
    call _WriteConsoleA@20
    inc ebx
    cmp ebx, 20
    jle main_loop
    push 0
    call _ExitProcess@4
%define STD_OUTPUT_HANDLE -11 simply defines the constant STD_OUTPUT_HANDLE again, just with a more C'ish feeling to it. extern _GetStdHandle@4 declares the external symbol _GetStdHandle@4 which is the actual name of GetStdHandle inside the Microsoft library kernel32.lib. MASM handles those on its own and can use the names as they actually appear in the Windows DLLs. The number behind the @ is the size of the arguments in bytes, so it is basically the number of arguments times 4 in most cases. global _start declares a symbol which gets exported. We require this in order to define _start as an entry point in the linker settings later on. The underscore prefix is required by the Microsoft linker as you will see later. The section declaration syntax is a bit different as you can see but the names are still the same. buffer resb 64 declares 64 uninitialised bytes in the .bss section. The b in resb stands for byte. bytes_written resd 1 declares a single uninitialised "dword", a 4 byte integer, hence the d in resd. The calls below show that NASM does not come with any fancy invoke (MASM, FASM) or stdcall (FASM) macros by default. NASMX has those but I do not use it for the aforementioned reasons. So we have to push all the arguments manually as necessary. The add esp, 16 stack fix is necessary because wsprintfA uses the C calling convention so it does not clean up after itself. We pushed 4 arguments onto the stack so we have to add \(4 \cdot 4 = 16\) to esp. Now for the command line arguments to assemble and link this program:

d:\Downloads\nasm-2.04rc1-win32\nasm-2.04rc1\nasm.exe -f win32 D:\code\Assembly\calculate_sum_nasm\calculate_sum.asm -o D:\code\Assembly\calculate_sum_nasm\calculate_sum.obj
D:\Code\Assembly\binaries\32\link.exe /MACHINE:X86 /NODEFAULTLIB /NOLOGO /ENTRY:start /OUT:calculate_sum.exe /SUBSYSTEM:CONSOLE d:\Code\Assembly\calculate_sum_nasm\calculate_sum.obj /LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" kernel32.lib user32.lib
This does not require much explanation really since it is the same stuff all over again. One notable thing is the /ENTRY:start part. The Microsoft linker prefixes all those names with an underscore (_) when it actually looks them up inside the object files, which is why your entry point must have this prefix inside the code - hence the name _start in the source code. The output is the same as usual so we should move on.

Writing Linux/BSD Applications Using x86 32-bit Assembly
Now that we have covered Windows x86 programming with various assemblers it is time to find out how to implement this program on Linux/BSD, too. FASM runs on x86 Linux/BSD and NASM will run on any Linux/BSD aswell but I have already demonstrated the syntax for those so I am just going to cover the infamous GNU assembler version 2.17 in this section.
as
as is the command to call the GNU assembler (short gas) on Linux/BSD systems. It comes with any sane distro so I really do not need to tell you where to get it. Consult apt-get or emerge otherwise. It is generally agreed upon that the gas syntax is very archaic and esoteric. It uses the AT&T syntax (as opposed to the Intel syntax I have been using so far) in which the order of all arguments for the x86 instructions is basically reversed. Additionally it uses customised instruction names with -b, -w, -l suffixes to imply byte/word/long operand sizes (1 byte, 2 bytes, 4 bytes) and it introduces terribly redundant and complicated notations for offsets as well as cryptic register and number prefixes. Luckily, a feature was introduced which enables crude Intel syntax support. I am going to use this feature in the following code. Unluckily I had to use a silly hack to make it work properly because I could not find proper documentation on replacements for standard operations which are not possible in the Intel syntax mode using the conventional ways. I am going to show you some examples of conventional gas code later. For the record: I am using GNU assembler version 2.17 (i486-linux-gnu) using BFD version 2.17 Debian GNU/Linux and collect2 version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) (i386 Linux/ELF) in this example. Here is my code:
Assembly
.code32
.intel_syntax noprefix
.section .data
format_string:
    .string "Arithmetic sum for the limit %d is %d\n"
.section .text
calculate_sum:
    xor esi, esi
    mov edi, 1
    jmp loop_start
sum_loop:
    add esi, edi
    inc edi
loop_start:
    cmp edi, [esp + 4]
    jle sum_loop
    ret 4
.global main
main:
    xor ebx, ebx
main_loop:
    push ebx
    call calculate_sum
    push esi
    push ebx
    lea edi, format_string
    push edi
    call printf
    add esp, 12
    add ebx, 1
    cmp ebx, 20
    jle main_loop
    xor eax, eax
    ret
You can probably figure out most of this stuff on your own by now, it is the same stuff over and over again. It is merely the syntax that differs, the idea is usually still the same.
The .section directives declare sections and so on. .code32 forces the assembler to produce 32-bit code instead of 64-bit code. This is only necessary if you are running a 64-bit box. .intel_syntax noprefix does what I was talking about. It allows you to write code which is more similar to what you are (hopefully) used to - the Intel syntax. gas is pretty low level by default as you can see and it does not offer any fancy macros as such. I think it supports the C/C++ preprocessor, though, so you could write all kinds of crazy stuff using #define and such. The "hack" I was talking about in the introduction is lea edi, format_string. lea is short for "load effective address". It is used to store the address of some object in a register, so they say. In practice a simple lea is hardly any different from what mov does. In this case we could have achieved the same result using a mov. The problem is just that I have not figured out how to get the address of a string when you are using the .intel_syntax noprefix mode otherwise. In real code lea is mostly used to perform simple arithmetic operations because it possesses curious multiplicative and additive capabilities which allow you to perform certain simple tasks much faster than with other native instructions, but we will get back to that later. In this case it is really just a silly unnecessary hack so ignore it for now. In the next line I just push the value of the register onto the stack but what I really wanted to do should be clear from the previous source code flavours you have seen so far - I simply wanted to push the address of that label onto the stack. The only peculiar property of Linux/BSD you can see in this code is that you do not use any special function to exit the program. You can simply ret from the main function and it will return the program exit code in eax.

Figuring out how to assemble and link this application on my Debian box was a painful process. I recommend that you simply write a simple C program which performs something similar to what you intend to do first. Run gcc with -S in order to produce ASM output. Check out the .s file (the gas ASM file extension) to get an idea. Proceed to run gcc again, this time without -S and with -v. That switch will tell you every single command it actually executes. At first it produces the ASM code, then it assembles and links it. We can use this information to rip out the assembling and linking commands to use with our project. Be sure to include all libraries you intend to use in your ASM program in your C test program - the linking information will prove invaluable. Here is how I assembled and linked this monstrosity:

as -o calculate_sum_gas.o calculate_sum_gas.asm
/usr/lib/gcc/i486-linux-gnu/4.1.2/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o sum_test /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crt1.o /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crti.o /usr/lib/gcc/i486-linux-gnu/4.1.2/crtbegin.o -L/usr/lib/gcc/i486-linux-gnu/4.1.2 -L/usr/lib/gcc/i486-linux-gnu/4.1.2 -L/usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib -L/lib/../lib -L/usr/lib/../lib calculate_sum_gas.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/i486-linux-gnu/4.1.2/crtend.o /usr/lib/gcc/i486-linux-gnu/4.1.2/../../../../lib/crtn.o
As you can see there is an excessive amount of arguments used to actually link this program. I am not sure how many of these are even necessary but I will not bother to find out as long as it works. If you have a 64-bit Linux/BSD this might prove problematic. You might need the right gcc library binaries etc. to run this 32-bit example there. I previously claimed that the gas syntax is outrageously cryptic and difficult to understand. Here is the same program using "normal" gas syntax, without the lea hack:
Assembly
.code32
.section .data
format_string:
    .string "Arithmetic sum for the limit %d is %d\n"
.section .text
calculate_sum:
    xorl %esi, %esi
    movl $1, %edi
    jmp loop_start
sum_loop:
    addl %edi, %esi
    incl %edi
loop_start:
    cmpl 4(%esp), %edi
    jle sum_loop
    retl $4
.global main
main:
    xorl %ebx, %ebx
main_loop:
    pushl %ebx
    call calculate_sum
    pushl %esi
    pushl %ebx
    push $format_string
    call printf
    addl $12, %esp
    addl $1, %ebx
    cmp $20, %ebx
    jle main_loop
    xorl %eax, %eax
    ret
The real gas syntax uses so many special characters that you get the feeling of looking at some esoteric programming language like brainfuck. As long as I have a choice between various assemblers I definitely would not pick this one. gcc inline assembly has an even more outrageous and complicated syntax.

Partial Registers
One thing I did not tell you about, on purpose, until now, are the partial registers of the x86 architecture. This is because they frequently do harm and slow down programs when used incorrectly. On the x86 architecture it is possible to access lower parts of all general purpose registers according to the following pattern which I am going to demonstrate on eax:

| Register | Bits occupied |
|----------|---------------|
| eax      | 31 - 0        |
| ax       | 15 - 0        |
| ah       | 15 - 8        |
| al       | 7 - 0         |
So eax can contain 32 bits, alright, we already knew that. But eax also has the three partial registers ax, ah and al. These point to the data eax contains - they just show a smaller part of the same register. So ax represents the lower 16 bits of eax, ah represents the upper 8 bits of ax and al gives you the lower 8 bits of it. This is mostly for backward compatibility reasons. ax is the original 16-bit accumulator register the x86 architecture started with. The e in eax stands for "extended" since it is an extended version of ax. The l and the h in al/ah stand for low/high. Not all general purpose registers feature the "high"/"low" partial registers. They are only available on eax, ebx, ecx and edx. You can write to them and read from them just like normal registers. Here is a full list of all the other 32-bit registers and their partial registers:
| 32-bit register | Bits 15 - 0 | Bits 15 - 8 | Bits 7 - 0 |
|-----------------|-------------|-------------|------------|
| eax             | ax          | ah          | al         |
| ebx             | bx          | bh          | bl         |
| ecx             | cx          | ch          | cl         |
| edx             | dx          | dh          | dl         |
| ebp             | bp          | -           | -          |
| esp             | sp          | -           | -          |
| esi             | si          | -           | -          |
| edi             | di          | -           | -          |
The problem with using partial registers in 32-bit code is that they can cause so called partial register stalls. Reading from partial registers and writing to partial registers causes more dependencies than doing the same to the full registers. This can stall the execution pipelines of the processor which results in a slowdown of your code. Understanding what the execution pipeline is and how it works on modern CPUs requires a lot of explaining and it is not yet the time to cover this topic. So all you should keep in mind for now is that you should always avoid using partial registers. There usually is a faster solution which minimises partial register usage - usually totally eliminating it. Both the Intel Optimization Manual and the AMD Software Optimization Guide strongly discourage the use of partial registers for this very reason. Here are a few examples of how partial registers work:
Assembly
;in MASM syntax you use the following notation for hex numbers:
mov eax, 012345678h
;this writes 0x12345678 to eax, the decisive parts are the 0-prefix and the h-suffix
;the x86 instruction movzx is usually used to move values from 1-byte or 2-byte sized operands to 32-bit registers
;it writes them to the lower part of the register and sets the rest of the register to 0
;the "zx" part stands for "zero extend"
;read al, zero extend it and write it to ebx:
movzx ebx, al
;ebx now contains 0x00000078
;read ah, zero extend it and write it to ebx:
movzx ebx, ah
;ebx now contains 0x00000056
;read ax, zero extend it and write it to ebx:
movzx ebx, ax
;ebx now contains 0x00005678
;normal mov:
mov ebx, eax
;ebx now contains 0x12345678
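To show why this matters in practice, here is a small sketch of my own (not from the original text) contrasting a sequence that triggers the dependency problem with one that avoids it:
Assembly
;writing the partial register al and then reading the full eax afterwards
;creates the dependency pattern that can cause a partial register stall
mov al, byte ptr [esi]
add ebx, eax
;zero extending the byte into the full register avoids the partial access entirely
movzx eax, byte ptr [esi]
add ebx, eax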
Addresses
The x86 architecture allows us to use fairly complex expressions when we are addressing memory. The pattern it supports is:
address = base + index * scale + offset
The variables used in the pattern have the following meaning:
- base: a general purpose register
- index: a general purpose register
- scale: 1, 2, 4 or 8
- offset: a 32-bit integer constant

To fully comprehend what this means you should take a look at the following example:
Assembly
;ebx <<= 3, shift it left by 3 bits
shl ebx, 3
add eax, ebx
add eax, 17
;the "byte ptr" signifies that we want to read a single byte only and not 2 or even 4 bytes
;I have to use movzx because ecx is a 32-bit register and the byte is only 8 bits in size
movzx ecx, byte ptr [eax]
So what this basically does is read a single byte from memory. This byte is stored at the address eax + (ebx << 3) + 17. Shifts can be expressed as multiplications/divisions by powers of two, in case you have never heard about it. Shifting ebx 3 bits to the left is the same as multiplying it by 8 because two to the power of three is equal to 8. In general n <<= a; is equivalent to n *= static_cast<long>(std::pow(2.0, static_cast<double>(a))); and n >>= a; is equivalent to n /= static_cast<long>(std::pow(2.0, static_cast<double>(a))); assuming that a and n are both of type long (and, for the right shift, that n is not negative). This is a direct result of the positional notation I formulated mathematically at the beginning of this document. By shifting you basically add zeroes or remove digits from the sequence of digits. This leads to multiplications by powers of the base or integer divisions by the same. The great thing is that the x86 architecture allows us to do all of those things in a single step:
Assembly
movzx ecx, byte ptr [eax + ebx*8 + 17]
It is a really useful feature and has lots of applications. In combination with lea you can even use it to accelerate simple arithmetic operations which would usually require several steps of multiplication and addition, so do keep it in mind - the sketch below gives a first taste.
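Here is a small sketch of my own (not from the original text) of that trick - lea evaluates the same base + index * scale + offset pattern but simply stores the computed value in a register instead of reading memory, and it does not modify any flags:
Assembly
mov eax, 7
;ebx = eax + eax * 4 = eax * 5 = 35, done in a single instruction
lea ebx, [eax + eax*4]
;ecx = ebx + eax * 2 + 10 = 35 + 14 + 10 = 59
lea ecx, [ebx + eax*2 + 10]

Control Structures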
In the introduction I stated that this document is intended for people who are already familiar with imperative high level programming languages such as C/C++. At the beginning it is difficult to stop thinking in terms of if, for and while and to start thinking in terms of labels and conditional jumps, so I wanted to show you a few examples which demonstrate how to translate these standard control structures to x86 32-bit assembly. My apologies to Linux/BSD users - I am going to use Windows specific API calls in these examples occasionally.

The first thing we are going to check out is the basic if construct:
C++
if (lstrlen(some_string) > 32)
    return -1;
return 0;
Here is the equivalent x86 32-bit ASM code:
Assembly
invoke lstrlen, addr some_string
cmp eax, 32
jle return_zero
mov eax, -1
ret
return_zero:
xor eax, eax
ret
So the original comparison operator was "greater than" but in x86 assembly we jump away from the if body, so we need to check for the exact opposite of the original condition. The logical negation of "greater than" is "less than or equal to" so we use jle in this case. Here is a list of logical negations of standard operators:
| Original operator | Logical negation |
|-------------------|------------------|
| ==                | !=               |
| !=                | ==               |
| <                 | >=               |
| >                 | <=               |
| <=                | >                |
| >=                | <                |
The if-else construct is next:
C++
void divisor_check(int divisor)
{
    if (divisor == 0)
        puts("Division by zero!");
    else
        puts("Valid division");
    puts("This gets executed in either case");
}
The approximate equivalent in x86 ASM is:
Assembly
.data
division_by_zero db "Division by zero!",0
valid_division db "Valid division",0
either_case db "This gets executed in either case",0
.text
divisor_check proc divisor:dword
    cmp dword ptr [esp + 4], 0
    jne else_label
    invoke puts, addr division_by_zero
    jmp end_of_if
else_label:
    invoke puts, addr valid_division
end_of_if:
    invoke puts, addr either_case
    ret
So if the result of the comparison implies that the else body must be executed we jump to the else label and simply continue the execution after that because it gets us right where we are supposed to be anyways. In the other case the first jump is not taken but we must not fall through to the else body, so an unconditional jump is necessary to take us to the end of the if structure.

Let us move on to the while construct:
C++
long i = 0;
while (data[i] >= 128 || some_function(data[i]))
    i++;
Let us assume that i gets stored in the general purpose register esi (it cannot live in eax because the return value of some_function is placed there and would overwrite it) and that data is a string/an array of bytes. Then the approximate equivalent in x86 ASM is:
Assembly
    xor esi, esi
loop_start:
    movzx ebx, byte ptr [array + esi]
    cmp ebx, 128
    jge iterate
    invoke some_function, ebx
    test eax, eax
    jz end_of_loop
iterate:
    inc esi
    jmp loop_start
end_of_loop:
I am using the new instruction test in this example. test is to the logical AND what cmp is to subtraction. It performs a logical AND using the operands behind the instruction as arguments for the logical function but it does not store the result anywhere. It is simply used to set the flags. If you look closely at the truth table for AND you will realise that for every boolean value a the term a & a evaluates to a. This is called idempotence. An idempotent operation can be applied to a value over and over again and in the end it always evaluates to the same value again. The logical OR operation is idempotent aswell. So what test eax, eax really does is check the value of eax and set the flags accordingly. In this case we use it to determine whether the register is equal to zero. In case you are not used to this particular feature of C/C++: while (data[i] >= 128 || some_function(data[i])) is equivalent to while (data[i] >= 128 || some_function(data[i]) != 0) - which is why we jump out of the loop if that part of the boolean term fails, too. We could even optimise this loop by using an initial jump to enter the loop and putting the iteration at the beginning of the body:
Assembly
    xor esi, esi
    jmp enter_loop
iterate:
    inc esi
enter_loop:
    movzx ebx, byte ptr [array + esi]
    cmp ebx, 128
    jge iterate
    invoke some_function, ebx
    test eax, eax
    jnz iterate
We do have that additional jump at the beginning now but you always have to keep in mind that 99% of all CPU time is spent inside loops. Optimising the body of a loop greatly outweighs such a one-time operation. The new body looks unintuitive and does not remind one of the original C/C++ code at all so I did not want to start with it right away and used the less optimised version first.
Now that we have covered if and while the next part is not a big deal. It is time for the for statement:
C++
for (long i = 1; i <= 10; i++)
    printf("Iteration step %d\n", i);
Here is the equivalent x86 32-bit ASM code:
Assembly
.data
iteration_string db "Iteration step %d",10,0
.text
_start:
    mov ebx, 1
print_loop:
    invoke printf, addr iteration_string, ebx
    inc ebx
    cmp ebx, 10
    jle print_loop
;...
What we actually do here is skip the first comparison 1 <= 10 because we know it is true anyways. If you have a more complex iteration condition where you cannot tell whether the first iteration will always occur you obviously cannot skip it, so be careful with optimisations like these.

Multiplication and Division of Integers
Multiplication and division of integers on the x86 architecture are not as intuitive as the addition and subtraction of integers. The largest unsigned integer you can store in a 32-bit general purpose register has the value \(2^{32} - 1\) so when we have two of those and multiply them we get \((2^{32} - 1) \cdot (2^{32} - 1) = 2^{64} - 2 \cdot 2^{32} + 1 = 2^{64} - 2^{33} + 1 = 18446744065119617025 = m\). In order to calculate how many bits this number requires to be represented in a standard binary positional system we use the formula \(\lceil \frac{\log{m}}{\log{b}} \rceil\) with a base of \(b = 2\): \(\lceil \frac{\log{18446744065119617025}}{\log{2}} \rceil \approx \lceil 63.9999999993281927699 \ldots \rceil = 64\). This means that the result of a multiplication is in general up to 64 bits in size. But our x86 architecture has only 32-bit general purpose registers, so how is this dealt with? The result is stored in two registers. The mul reg instruction is used for the multiplication of two unsigned 32-bit integers. It takes a single argument - a general purpose register. It uses eax as the other factor so what it calculates is eax * reg. The result is stored in edx:eax which means that the upper 32 bits get stored in edx and the lower 32 bits of the result are stored in eax. Example:
Assembly
mov eax, 012345678h
mov edx, 012345678h
mul edx
;The result of this multiplication is:
;0x12345678 * 0x12345678 = 0x014B66DC1DF4D840
;Which produces the following values of edx and eax:
;edx = 0x014B66DC
;eax = 0x1DF4D840
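If you need the whole 64-bit product in memory you can simply store both halves of the edx:eax pair. A small sketch of my own (not from the original text), assuming a variable declared as product dq ? in the uninitialised data section:
Assembly
mov eax, 012345678h
mov ecx, 012345678h
mul ecx
;edx:eax now holds the 64-bit product
mov dword ptr [product], eax
;x86 is little endian, so the upper half goes into the higher address
mov dword ptr [product + 4], edx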
Multiplication is considerably more expensive than addition or subtraction in terms of execution time. The imul instruction is used for the multiplication of signed integers and it supports more operand forms. Like mul, it can store the result in edx:eax when it takes only a single argument. But it can also take multiple registers and even an immediate value - it will discard the upper 32 bits of the result then. Example:
Assembly
mov eax, -2
mov ebx, -2
imul ebx
;edx:eax = (-2) * (-2) = 4 = 0x0000000000000004
;Let us use the largest positive signed integers possible:
mov edi, 2147483647
mov esi, 2147483647
;this stores the result in edi and discards the upper 32 bits:
imul edi, esi
;2147483647 * 2147483647 = 4611686014132420609 = 0x3FFFFFFF00000001
;edi = 0x00000001
;0x3FFFFFFF is discarded
mov eax, 10
mov ebx, 20
;the three operand form multiplies the second operand by the immediate and stores the result in the first operand
;note that the previous value of eax (10) does not take part in the multiplication
imul eax, ebx, 3
;20 * 3 = 60
;the upper 32 bits would be discarded but they are 0 anyways, so:
;eax = 60 = 0x0000003C
Division is far more expensive than multiplication. The div reg instruction is used for unsigned division. It performs the division edx:eax / reg and stores the quotient in eax and the remainder in edx. In case you do not even know what those words mean I am going to explain it with a small example: \(17 : 5 = 3 + \frac{2}{5}\). The result of the division 17 / 5 is 3 and the remainder is 2 because \(3 \cdot 5 + 2 = 15 + 2 = 17\). You might wonder why the x86 architecture supports 64-bit dividends but only 32-bit divisors. I am not absolutely sure why this is, actually. But it was probably more convenient at the time and it is not a big deal to work around it. Usually you do not require such large numbers anyways. Enough of this primary school stuff, let us have a look at a few examples:
Assembly
mov eax, 100
xor edx, edx
mov ebx, 10
div ebx
;edx:eax / ebx = 100 / 10 = 10
;edx:eax % ebx = 0
;Quotient: eax = 10
;Remainder: edx = 0
mov eax, 21
div ebx
;edx:eax / ebx = 21 / 10 = 2
;edx:eax % ebx = 21 % 10 = 1
;Quotient: eax = 2
;Remainder: edx = 1
Just like there are mul and imul, there is the idiv instruction for signed division. It works just like div otherwise. The remainder always has the same sign as the dividend:
Assembly
;Set edx:eax to -100
mov eax, -100
mov edx, -1
;-1 because we want all upper bits to be 1 - it is two's complement!
mov ebx, 10
idiv ebx
;edx:eax / ebx = -100 / 10 = -10
;edx:eax % ebx = 0
;Quotient: eax = -10
;Remainder: edx = 0
mov eax, -21
mov edx, -1
idiv ebx
;edx:eax / ebx = -21 / 10 = -2
;edx:eax % ebx = -21 % 10 = -1
;Quotient: eax = -2
;Remainder: edx = -1
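As a side note (my own addition, not from the original text): instead of setting edx to -1 by hand you can let the cdq instruction sign extend eax into edx:eax before a signed division:
Assembly
mov eax, -100
;cdq copies the sign bit of eax into every bit of edx, so edx becomes 0FFFFFFFFh here
cdq
mov ebx, 10
idiv ebx
;Quotient: eax = -10
;Remainder: edx = 0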
Floating Point Operations
Now that we have covered the most important integer operations we should move on to floating point instructions. The CPU contains many different units which work together when a program is executed. Instructions like add, sub, xor, shl, mul and idiv are all executed on the so called Arithmetic Logic Unit (ALU). The operations we are going to concern ourselves with are executed in a completely different unit. It is called the Floating Point Unit (FPU). This is where stuff like 0.942 - 7.183 = -6.241 is calculated. The x86 architecture implements the IEEE-754 standard (http://en.wikipedia.org/wiki/IEEE_floating-point_standard) and offers support for three different precision formats:
- Single precision 32-bit floating point numbers (float in C/C++)
- Double precision 64-bit floating point numbers (double in C/C++)
- Extended precision 80-bit floating point numbers (long double in C/C++)

We are going to explore the mathematical concept behind this system first. The real numbers \(\mathbb{R}\) know no bounds and even if we specify boundaries for them they are still not discrete. There are infinitely many real values between the numbers 0 and 1, so we have to make sacrifices to represent real numbers with boundaries and approximative capabilities that meet our requirements. We can only represent approximations of most real values using this concept. A few of them, which are composed of sums of positive/negative powers of 2, can be represented flawlessly without error using IEEE-754. We are going to examine the smallest of the three formats in this part of the document only. The others are based on the same idea and just use different parameters really.
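As a small aside (my own sketch, not from the original text): MASM has a data directive for each of these three sizes, so you can define constants in any of the formats:
Assembly
single_value dd 1.5    ;32-bit single precision, 4 bytes
double_value dq 1.5    ;64-bit double precision, 8 bytes
extended_value dt 1.5  ;80-bit extended precision, 10 bytes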
The IEEE-754 32-bit precision format divides the 32-bit value into 3 different sections:
| Section         | Size    | Bit range |
|-----------------|---------|-----------|
| Sign            | 1 bit   | 31 - 31   |
| Biased exponent | 8 bits  | 30 - 23   |
| Mantissa        | 23 bits | 22 - 0    |

The mantissa expresses a fraction of the form 1.x where x is given by the mantissa bits. The bit fields above are written with the most significant bit first. Let \(s\) stand for the sign bit, \(be\) for the biased exponent and \(m_i\) with \(i \in \{1, 2, \ldots, 23\}\) for the mantissa bits \(m_1 m_2 \ldots m_{23}\) (they are read from the left to the right); then the value most IEEE-754 encoded reals represent can be determined using the following formula:
\(\mbox{value} = (-1)^s \cdot 2^{be - 127} \cdot (1 + \sum_{i = 1}^{23} m_i \cdot 2^{-i})\)

The \((-1)^s\) part simply changes the sign according to the sign bit. The value is negative if the sign bit is set to 1 and positive if it is set to 0. \(2^{be - 127}\) is where the power of 2 gets expressed through the actual exponent \(e = be - 127\). The "bias" I was talking about is a necessity to allow the exponent to be negative. Otherwise we would have positive/zero exponents only and we would be unable to express numbers \(r \in \mathbb{R} \land |r| < 1\) since this requires negative powers of 2. The mantissa represents the binary digits to the right of the point in a number like \(1.1101_2\) which had the leading "1." simply chopped off. Let us manually convert the number 149.8 to the IEEE-754 single precision format. At first we require a binary representation of this decimal number. We examine the part to the left of the point first and repeatedly perform divisions by 2, splitting the result into an integer part and a remainder after each step. Then we repeat the procedure with the integer part until it eventually becomes zero:
149 / 2 = 74 + 1 / 2
74 / 2 = 37 + 0 / 2
37 / 2 = 18 + 1 / 2
18 / 2 = 9 + 0 / 2
9 / 2 = 4 + 1 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2
You get the binary representation of the left part of the number by reading the "bit stream" on the right from bottom to top, which gives us the binary number 10010101 - indeed equal to the decimal representation 149. The part to the right of the point works in a similar fashion but it can be tricky at times because some numbers cannot be expressed with a finite number of digits in another base. We chop off the part to the left of the point and replace it with 0. Then we repeatedly multiply by 2, each step chopping off the digit that appears to the left of the point and replacing it with 0. We are done when only 0 is left, when we discover a periodic pattern or when we simply hit the limit of the precision:
0.8 * 2 = 1.6 => 1 <--+
0.6 * 2 = 1.2 => 1    | periodic pattern detected
0.2 * 2 = 0.4 => 0    |
0.4 * 2 = 0.8 => 0 ---+
We discovered a periodic pattern which means that 0.8 cannot be written in binary notation in a finite number of digits without any loss of precision. We conclude:
\(149.8_{10} = 10010101.\overline{1100}_2\)

We can prove this mathematically by making use of the geometric series with \(a \in \mathbb{R} \land 0 < a < 1\):

\(\sum_{i = 0}^{\infty} a^i = \frac{1}{1 - a}\)

All we need to do is perform some index shifts and split up the sum to show the desired result:

\(0.\overline{1100}_2 = \sum_{i = 1}^{\infty} (1 \cdot 2^{- (4i - 3)} + 1 \cdot 2^{- (4i - 2)} + 0 \cdot 2^{- (4i - 1)} + 0 \cdot 2^{- (4i)}) = 2^{-1} \sum_{i = 0}^{\infty} (2^{-4})^i + 2^{-2} \sum_{i = 0}^{\infty} (2^{-4})^i = \frac{8}{15} + \frac{4}{15} = \frac{4}{5} = 0.8\)

The next step in the conversion process is shifting our binary representation of the number to the left or to the right such that it has the form "1.x". We have to keep track of the number of shifting steps we performed and of the direction aswell. We store this information in the exponent of a power of two:
\(10010101.\overline{1100}_2 = 2^7 \cdot 1.0010101\overline{1100}_2 = (-1)^0 \cdot 2^{134 - 127} \cdot (1 + 0.0010101\overline{1100}_2)\)

As you can see I algebraically manipulated the term so it matches the pattern in the original formula I showed to you. Now we can simply read off the values:

\(s = 0 \land be = 134 \land m = (m_1, m_2 \ldots m_{23}) = (0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0 \ldots)\)

We are nearly done, we simply need to convert the biased exponent to its 8-bit binary representation now:
134 / 2 = 67 + 0 / 2
67 / 2 = 33 + 1 / 2
33 / 2 = 16 + 1 / 2
16 / 2 = 8 + 0 / 2
8 / 2 = 4 + 0 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2
This is the same algorithm we used before and it tells us that the binary representation is 10000110. So the IEEE-754 representation of the decimal value 149.8 is:
Sign : biased exponent : mantissa
0 : 10000110 : 00101011100110011001100
= 01000011 00010101 11001100 11001100
In the case of periodic binary representations you simply repeat the periodic digit group until you hit the limits imposed by the 23 bit boundary for the mantissa. Let us get back to the x86 architecture. The other formats simply use different biased exponent and mantissa sizes which also forces them to use different bias values, obviously. There are certain special cases for invalid numbers, infinity and zero and such but it is not relevant for this document, really. Check out http://en.wikipedia.org/wiki/IEEE_floating-point_standard if you are interested in the topic.
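To connect this back to the assembler, here is a small sketch of my own (not from the original text): you can let MASM perform the conversion with a dd directive and compare it with the bit pattern we derived by hand. Note that an assembler rounds the last mantissa bit instead of simply truncating the periodic pattern, so the lowest bit may differ:
Assembly
hand_made dd 04315CCCCh    ;the bit pattern derived above, mantissa truncated after 23 bits
assembled dd 149.8         ;the assembler's own IEEE-754 single precision encoding of 149.8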
The FPU contains 8 80-bit elements called st(0), st(1), ..., st(7). I did not say "register" because they are implemented as a stack, so st(0) always refers to the top element, st(1) to the element below the top, etc. This makes it quite tricky to use. Honestly, I am not even quite sure why they implemented it this way. Maybe it makes some common operations shorter or something like that. FPU code is very fragile and often hard to read. I strongly recommend keeping a good instruction set reference open when you are working with FPU code. There are lots of details you have to pay attention to and it is easy to make mistakes, so be extra careful with FPU code. Let us start out with some easy examples to see how the FPU stack works. Here is the state of the FPU at the beginning of the program:
| Position | Value     |
|----------|-----------|
| st(0)    | undefined |
| st(1)    | undefined |
| st(2)    | undefined |
| st(3)    | undefined |
| st(4)    | undefined |
| st(5)    | undefined |
| st(6)    | undefined |
| st(7)    | undefined |
All FPU instructions have the f- prefix. If they have a -p suffix it means that they pop the top element off the stack at the end of their execution. We are going to execute the following instructions on the FPU now:
Assembly
;...
;push one on top of the stack
fld1
;push zero on top of the stack
fldz
;add them up and store the result in st(0)
fadd st(0), st(1)
;add the top two elements up and store the result in st(1), pop the stack
faddp st(1), st(0)
;...
Let us go through the changes in the FPU stack, instruction by instruction. We start out with fld1:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | undefined |
| st(2) - st(7) | undefined |
fldz adds another well known constant to the top of the stack:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 0.0       |
| st(1)         | 1.0       |
| st(2) - st(7) | undefined |
Perform the addition st(0) = st(0) + st(1); with the specified stack elements using fadd:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | 1.0       |
| st(2) - st(7) | undefined |
As you can see the rest of the stack remains the same. We are going to look at the next one in two steps because it basically does two things. We have the FPU perform the addition st(1) = st(1) + st(0); using faddp, which discards the top element after the addition:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 1.0       |
| st(1)         | 2.0       |
| st(2) - st(7) | undefined |

Then the top element gets popped off:

| Position      | Value     |
|---------------|-----------|
| st(0)         | 2.0       |
| st(1)         | undefined |
| st(2) - st(7) | undefined |
So how do you load floating point values from memory onto the FPU stack and how do you store them in memory again? Take a look at the following snippet:
Assembly
.data
some_float dd 0.123
some_double dq 4.556456823
.data?
float_storage dd ?
double_storage dq ?
.text
;...
;load float from memory and push it on the top of the stack:
fld some_float
;load double from memory and push it on the top of the stack:
fld some_double
;add them up and pop the stack:
faddp st(1), st(0)
;store the result as a float:
fst float_storage
;store the result as a double (in case we need more precision) and pop the stack so it's empty again:
fstp double_storage
;...
The first new thing in this code is the dq (declare quadword) assembler directive which we use to initialise double precision floats (64 bits, 8 bytes). In this case I use it to define static values which I load onto the FPU stack. Let us go through the changes again, instruction by instruction. In the beginning the stack is totally empty:
| Position      | Value     |
|---------------|-----------|
| st(0) - st(7) | undefined |
fld some_float pushes the approximation of 0.123 onto the stack:
| Position      | Value     |
|---------------|-----------|
| st(0)         | 0.123     |
| st(1) - st(7) | undefined |
fld some_double does the same again, just with a double precision value which is an approximation of 4.556456823:
| Position      | Value       |
|---------------|-------------|
| st(0)         | 4.556456823 |
| st(1)         | 0.123       |
| st(2) - st(7) | undefined   |
faddp st(1), st(0) adds them up and pops the stack, just like we have seen before in the previous example:
| Position      | Value       |
|---------------|-------------|
| st(0)         | 4.679456823 |
| st(1) - st(7) | undefined   |
fst float_storage stores the top of the stack as a 32-bit single precision value in memory at the specified address. It does not modify the FPU stack in any way. fstp double_storage does, though. After storing the double precision value in double_storage it pops the top off the stack:
| Position      | Value     |
|---------------|-----------|
| st(0) - st(7) | undefined |
Now that we have covered the basics of the FPU stack it is time to move on to a more elaborate example of how to use the FPU. Here is the original C++ code we are going to translate to x86 32-bit assembly:
C++
#include <cmath>
#include <cstdio>
#include <iostream>

double f(double x)
{
    char buffer[10];
    return std::sin(x * x + 1.0) / 2.0;
}

int main()
{
    for (double i = 0.0; i < 1.0; i += 0.1)
    {
        double value = f(i);
        printf("f(%.1f) = %.4f, first two digits: %ld\n", i, value, static_cast<long>(100 * value));
    }
    std::cin.get();
    return 0;
}
The FPU example code I came up with is quite nasty I must say. I apologise in advance. Here is my translation to x86 ASM:
Assembly
.686p
.model flat, stdcall
option epilogue:none
option prologue:none

GetStdHandle proto :dword
wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data
format_string db "f(%s) = %s, first two digits: %d",10,0
iteration_step dq 0.1
hundred dq 100.0
ten dq 10.0
half dq 0.5
one_two_three dq 123.456789

.data?
number_buffer db 16 dup (?)
number_buffer2 db 16 dup (?)
buffer db 64 dup (?)
bytes_written dd ?
iterator dq ?

.code

f proc
    fmul st(0), st(0)
    fld1
    faddp st(1), st(0)
    fsin
    fmul half
    ret
f endp

write_double proc double_buffer:dword, precision:dword
    fst qword ptr [esp - 12]
    mov esi, [esp + 4]
    mov byte ptr [esi], '.'
    fld st(0)
    fisttp dword ptr [esp - 4]
    mov eax, [esp - 4]
    mov ecx, 10
write_left_digits:
    xor edx, edx
    div ecx
    add edx, '0'
    dec esi
    mov byte ptr [esi], dl
    test eax, eax
    jnz write_left_digits
    mov ecx, [esp + 4]
    mov eax, [esp + 8]
write_right_digits:
    fild dword ptr [esp - 4]
    fsubp st(1), st(0)
    fmul ten
    fld st(0)
    fisttp dword ptr [esp - 4]
    mov ebx, [esp - 4]
    add ebx, '0'
    inc ecx
    mov byte ptr [ecx], bl
    dec eax
    jnz write_right_digits
    mov byte ptr [ecx + 1], 0
    fstp st(0)
    fld qword ptr [esp - 12]
    ret 8
write_double endp

public main

main:
    invoke GetStdHandle, STD_OUTPUT_HANDLE
    mov ebp, eax
    fldz
main_loop:
    fst iterator
    invoke write_double, addr number_buffer + 8, 4
    mov edi, esi
    call f
    invoke write_double, addr number_buffer2 + 8, 6
    fld st(0)
    fmul hundred
    fisttp dword ptr [esp - 4]
    mov eax, [esp - 4]
    invoke wsprintfA, addr buffer, addr format_string, edi, esi, eax
    invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
    fstp st(0)
    fld iterator
    fadd iteration_step
    fld1
    fcomip st(0), st(1)
    jnb main_loop
    invoke ExitProcess, 0
end
Let us start with the main code. fldz pushes the zero constant onto the stack. So after executing this instruction st(0) contains zero while all other FPU registers are still undefined/empty. If you use such an undefined register in an FPU operation you will cause an exception. If you try to pop an element off an empty stack - meaning all values are undefined - you will get an exception, too. Another thing you have to watch out for is stack overflow. You cannot just keep on adding more elements onto the stack. If all registers have been set and you push an additional element onto the stack you will cause an FPU stack overflow, so keep track of the height of your FPU stack at all times. fst iterator reads the 64-bit value from st(0) and stores it at the memory location iterator - the stack remains the same. On the next line write_double gets called for the first time. This is a stupid workaround I had to come up with because the Windows API does not offer any functions which allow you to convert floating point numbers to strings, to my knowledge. I tried to use the Microsoft Visual C Runtime but those attempts did not bear any fruit either because using msvcrt in ASM is pretty tricky nowadays. You have to link the application with those silly manifest files and annoying stuff like that, so I eventually gave up on making, say, printf work and wrote my own double -> string converter.

The write_double body starts out with fst qword ptr [esp - 12]. This is not any different from the previous fst store really, it simply does not use an address which is known at compile time. Instead we use unused bytes on the stack to store the value temporarily. The qword ptr part is necessary because the assembler cannot magically know what format we actually want to store. It could be single precision, double precision or even extended precision - it simply cannot tell. With the prior fst call this was not necessary because the size of that object is defined in the data section. fld loads values and pushes them on top of the stack. The value can come from a location in memory or even from an FPU register. In this case we are using fld st(0) which basically just duplicates the top of the stack. fisttp dword ptr [esp - 4] performs a double to int conversion and stores the result at the specified memory location. This instruction simply truncates the digits to the right of the decimal point away. So it would convert something like 2.913 to 2. In the following loop we perform a conventional integer to decimal ASCII number conversion by repeatedly dividing the integer by 10, adding 0x30 to the remainder, and writing the lower byte of the register to a buffer. This buffer gets filled from right to left, starting at the decimal point. In the write_right_digits section we read the decimal digits to the right of the decimal point from the number by repeatedly multiplying it by 10, converting it to an integer, converting the integer to a decimal digit which gets added to the buffer and eventually subtracting the integer from the double again to get rid of the digit to the left of the decimal point. This buffer gets filled from the left to the right, starting with the byte to the right of the decimal point. It starts with fild dword ptr [esp - 4]. This instruction loads an integer from memory, converts it to a double and pushes it onto the FPU stack so we can access it via st(0). fsubp st(1), st(0) performs the calculation st(1) - st(0) and stores the result in st(1). Additionally it pops off the top of the stack so what was once st(1) is now st(0). You can tell this from the -p suffix in the mnemonic. fmul ten multiplies st(0) with the value 10.0 which is stored in memory. The result of the operation gets stored in st(0) again. It is followed by another fld st(0) which duplicates the top of the stack. This is necessary because we are going to use fisttp on the next line. Unluckily this instruction does not feature any non-p version so we are forced to perform that somewhat redundant additional FPU stack push. So what this loop does is simply write a certain number of digits to the right of the decimal point. The number is specified by the second argument, precision. At the end of the function you can see a fstp st(0). This stores st(0) in st(0) and pops the top element of the stack (which is st(0)). So it basically works like a pop - with the exception of being somewhat unintuitive and cryptic. At the very end fld qword ptr [esp - 12] is executed. This restores the original value st(0) had when the function was first called because we wrote it to that part of the memory in the beginning.

After returning from write_double for the first time the function f is called. This is pretty much a direct translation of the corresponding C/C++ function we defined earlier. fmul st(0), st(0) is basically equivalent to st(0) = st(0) * st(0); in C/C++ pseudo code. fld1 is another "load constant" instruction, just like fldz - it pushes the value 1.0 on top of the stack. faddp st(1), st(0) adds st(1) to st(0) and stores the result in st(1). After this step the top element of the stack gets popped off so the result ends up in st(0). fsin performs st(0) = sin(st(0)). And last but not least we divide the resulting value by 2.0 using fmul half. This actually multiplies st(0) with 0.5 but it is obviously the same as dividing by two since \(x / r = x \cdot \frac{1}{r}\). The value of f is returned in st(0) and written into a buffer. Most of the following instructions have already been explained, with the exception of fcomip st(0), st(1). This compares the values of st(0) and st(1), pops the top element off the stack and sets the Zero/Parity/Carry flags according to a particular pattern. More importantly, it sets CF = 1 if and only if st(0) < st(1) - which is exactly what we want. The jnb instruction therefore jumps back only as long as the iterator has not exceeded 1.0. Here is the actual output of the program:

f(0.0000) = 0.420735, first two digits: 42
f(0.1000) = 0.423415, first two digits: 42
f(0.2000) = 0.431202, first two digits: 43
f(0.3000) = 0.443313, first two digits: 44
f(0.4000) = 0.458401, first two digits: 45
f(0.5000) = 0.474492, first two digits: 47
f(0.6000) = 0.488932, first two digits: 48
f(0.7000) = 0.498368, first two digits: 49
f(0.7999) = 0.498803, first two digits: 49
f(0.9000) = 0.485763, first two digits: 48
f(0.9999) = 0.454648, first two digits: 45
As you can see there are a few errors in the iterator such as 0.7999 and
. You should probably already be familiar with the problem of rounding errors on floating point numbers from imperative high level programming languages. It is simply a result of the system we use. Not every number can be perfectly represented. Most of them have certain small errors and the results of arithmetic operations on them are inaccurate. There are many FPU instructions with extremely specific applications and I am not going to cover them here. Rule of thumb: If you want to know how something works, write it in C/C++, turn on assembly listings and read the code in combination with an x86 Instruction Set Reference.