The x86 Architecture

This architecture was originally introduced with the 8086 processor in 1978. It was a 16-bit architecture with a 20-bit address space, giving a maximum of 1 megabyte of memory. A less expensive version using an 8-bit bus (the 8088) was introduced the next year. The instruction sets for the two processors were identical; the only difference was the glue logic on the motherboard. However, the use of segments (each of which was 64K in size) soon became limiting, and a version with a protected mode, a larger 24-bit address space and virtual memory support (the 80286) was introduced in 1982. The 80286 still used 64K segments; a flat memory model (no segments) only became practical with the 32-bit 80386.
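The way 16-bit registers reach a 20-bit address space can be sketched in a few lines of Python. This is only an illustration: the function name physical_address is made up for this example, and it assumes the usual 8086 real-mode rule that the segment value is shifted left by 4 bits and added to the offset.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address.

    Assumed rule (8086 real mode): address = segment * 16 + offset,
    wrapped to 20 bits because the 8086 has only 20 address lines.
    """
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


SEGMENT_SIZE = 1 << 16   # 64K bytes reachable through one segment
ADDRESS_SPACE = 1 << 20  # 1 megabyte in total

print(hex(physical_address(0x1234, 0x0010)))
print(SEGMENT_SIZE, ADDRESS_SPACE)
```

Each segment covers only 64K of the 1 megabyte, which is why programs with large data had to keep switching segments.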

The 80386 changed the architecture in 1985 to a 32-bit architecture with some backward compatibility modes. This is essentially the same instruction architecture that has remained in use until today. The hardware implementing the instructions has changed, and some extensions have been made, such as adding MMX registers. More extensive descriptions of the 386 instruction set are available on the net.

To start with, the architecture is a 32-bit architecture with 32-bit registers and a 32-bit memory space. It can be used in a segmented mode or a flat mode. Linux uses the flat model; therefore, so will we. In this model, the entire virtual address space is visible as a single sequence of bytes.

Registers

The x86 model consists of three sets of registers: the data registers, the index registers and the segment registers. We will only be using the first two. There are four data registers: EAX, EBX, ECX and EDX. They can be subdivided into smaller registers. The low 16 bits of each register are accessed by dropping the E, giving AX, BX, CX and DX (these were the original names on the 8086). The lower 8 bits of each register are accessed by changing the X to an L, while the upper 8 bits of the lower 16 bits are accessed by changing it to an H. So moving the lower 8 bits of the A register to the upper 8 bits of the low word of the B register is a move from AL to BH. Figure 1 shows a diagram that may help explain this:

Figure 1 - Register subdivision
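The subdivision can also be written out as bit masks. The following Python sketch is a toy model, not part of the lab tools; the class name Register32 and its attribute names (e, x, l, h) are invented here to mirror the E, X, L and H parts of the register names.

```python
class Register32:
    """Toy model of one 32-bit data register (for example EAX)."""

    def __init__(self, value: int = 0) -> None:
        self.e = value & 0xFFFFFFFF  # the full 32 bits, e.g. EAX

    @property
    def x(self) -> int:
        """Low 16 bits, e.g. AX."""
        return self.e & 0xFFFF

    @property
    def l(self) -> int:
        """Low 8 bits, e.g. AL."""
        return self.e & 0xFF

    @property
    def h(self) -> int:
        """Bits 8 to 15 (upper half of the low word), e.g. AH."""
        return (self.e >> 8) & 0xFF

    @h.setter
    def h(self, byte: int) -> None:
        # Replace bits 8 to 15 and leave all other bits untouched.
        self.e = (self.e & 0xFFFF00FF) | ((byte & 0xFF) << 8)


eax = Register32(0x12345678)
ebx = Register32(0xAABBCCDD)
ebx.h = eax.l  # the move from AL to BH described in the text
print(hex(ebx.e))  # 0xaabb78dd
```

Only bits 8 to 15 of the B register change; the upper 16 bits and the low byte keep their old values.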

The four index registers are ESI, EDI, EBP and ESP. The last two are already in use by the system. ESP is the stack pointer. EBP is the base pointer, which is used to implement local variables in high-level languages such as C and C++. We will see how it is used very shortly. The other two are available for general use and stand for the Source Index (ESI) and Destination Index (EDI). If you want to use a pointer to memory in this lab, use one of these registers (32-bit code can also use the data registers as pointers, but we will not rely on that). You can access the lower 16 bits as SI, DI, BP and SP. We will not be using these lower bits.

Instructions and Assembler Syntax

Unlike the 68HC11, the x86 instruction set is not accumulator based. Most instructions have two operands, the source and the destination. At most one of these operands can be a memory location; the other must be a register or a literal value. Some operations, such as jmp and call (jump to subroutine), obviously have only a single operand, while others, such as nop (no operation), have no operands.

You can look at the assembly language that is generated by the compiler by using the -S flag. For example,

cc -S -c lab4.c

produces the file lab4.s, which will contain the Pentium assembly language that the compiler has generated. There is one issue: assembly syntax. The assembler used by Linux (the GNU assembler) is designed to work on multiple systems and uses a syntax originally created by DEC for the PDP series of minicomputers. That syntax was modified for the early versions of Unix and is now called AT&T syntax. In this syntax, the source is the first operand and the destination is the second operand. Use the command

info as

to find out more. One problem with the built-in assembler on Linux (at least for this lab) is that the machine code generated by the assembler is not fully optimized and has redundant null bytes, which cause problems when transmitted across the network. We will be using the NASM assembler, which uses the original Intel syntax for this architecture. In Intel syntax the destination is the first operand and the source is the second. Once you have downloaded and installed the nasm package on your system, the command

info nasm

will give you more information about the assembler. We will give a short tutorial for the necessary elements of the assembler.

Let's start with an example. The line

mov ax,bx

will move the lower 16 bits of the B register to the lower 16 bits of the A register. The line

mov al,[ebp-20]

will subtract 20 from the 32-bit contents of EBP (the base pointer), use the result as a pointer to access a single byte value in memory, and copy that byte to the low 8 bits of the A register. EBP itself is not changed. Literal values are used directly. For example:

mov eax,0x42

will take the hexadecimal value 42 and move it into the 32-bit A register. The size of the operation is inferred from the operands. The first move was 16 bits since the operands are 16-bit registers (ax and bx). The second move was 8 bits since it uses the 8-bit version of the A register: ebp is a pointer to memory, and al determines how much memory is copied. The last move example is a 32-bit move since the eax register is the destination. Here is a table of the instructions you will need for the assignment:

Instruction     Description
nop             No operation; used to fill up space
jmp loc         Transfer control to loc (i.e. move the address of loc into the instruction pointer)
call loc        Push the address of the next instruction on the stack and jump to loc
pop dest        Pop a value from the stack and store it in dest
int code        Generate a trap with the code given by code. Example: "int 0x80"
mov dest,src    Move src to dest
xor dest,src    Perform an xor operation using the src and dest operands and store the result in dest
lea dest,src    src must be a location in memory; compute the address of src and move it to dest. dest must be a register
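The stack behaviour of call and pop in the table can be sketched with a toy model in Python. Everything here is hypothetical and for illustration only: the class ToyCPU, its method names and the example addresses are not part of NASM or of the lab code, and the address of the next instruction is passed in by hand rather than decoded.

```python
class ToyCPU:
    """Minimal model of jmp, call, pop and xor as described in the table."""

    def __init__(self) -> None:
        self.eip = 0                      # instruction pointer
        self.stack = []                   # top of stack is the end of the list
        self.regs = {"eax": 0, "esi": 0}

    def jmp(self, loc: int) -> None:
        self.eip = loc                    # just load the instruction pointer

    def call(self, loc: int, next_instruction: int) -> None:
        self.stack.append(next_instruction)  # push the return address
        self.eip = loc                       # then jump

    def pop(self, dest: str) -> None:
        self.regs[dest] = self.stack.pop()

    def xor(self, dest: str, src: str) -> None:
        self.regs[dest] = (self.regs[dest] ^ self.regs[src]) & 0xFFFFFFFF


cpu = ToyCPU()
cpu.call(0x2000, next_instruction=0x1005)  # example addresses, chosen arbitrarily
cpu.pop("esi")                             # esi now holds the pushed address
cpu.regs["eax"] = 0x42
cpu.xor("eax", "eax")                      # any value xor itself is zero
print(hex(cpu.eip), hex(cpu.regs["esi"]), cpu.regs["eax"])
```

A call followed immediately by a pop therefore leaves the address of the location after the call in a register, and xor of a register with itself clears it.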

Several operand types (addressing modes) are available. The ones we might use in this lab are:

Name                      Description                                                                                                  Example
literal                   The actual value of the operand                                                                              0x80
register                  The name of a register                                                                                       eax
direct                    A location in memory, given by its address                                                                   [0xbffff5d0]
index and displacement    Use an index register as a pointer and add/subtract a displacement. The displacement can be an expression    [esi+endstart+1]
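To make the index-and-displacement form concrete, here is a small Python sketch of how the address in an operand such as [ebp-20] is formed and used. The names memory and effective_address are invented for this example, and the 64-byte bytearray merely stands in for real memory.

```python
memory = bytearray(64)  # stand-in for a small region of memory


def effective_address(index_value: int, displacement: int = 0) -> int:
    """Address used by [reg+disp]: register contents plus displacement, modulo 2**32."""
    return (index_value + displacement) & 0xFFFFFFFF


ebp = 40            # pretend contents of the base pointer
memory[20] = 0x5A   # the byte that lives at ebp-20

# mov al,[ebp-20]: compute the address, then load one byte from it.
address = effective_address(ebp, -20)
al = memory[address]

print(address, hex(al), ebp)  # ebp is unchanged by the address calculation
```

The register is only read during the address calculation; the displacement is added to a temporary copy of its value.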

Finally, NASM provides a set of assembler directives. The ones we need are:

Name    Description                   Example
bits    Set execution mode of code    bits 32
db      Allocate one or more bytes    db "foobar"
                                      db 0x76,0x86
dd      Allocate a 4 byte word        dd end-start
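The bytes that db and dd place in the output can be imitated in Python. This is a sketch under stated assumptions: the helper functions db and dd below are stand-ins written for this example (they are not NASM itself), and the label values start and end are made-up numbers. The 4 byte word is stored low byte first, as the x86 is a little-endian machine.

```python
import struct


def db(*items) -> bytes:
    """Imitate db: strings become their ASCII bytes, numbers become single bytes."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            out += item.encode("ascii")
        else:
            out.append(item & 0xFF)
    return bytes(out)


def dd(value: int) -> bytes:
    """Imitate dd: one 32-bit value, stored little-endian."""
    return struct.pack("<I", value & 0xFFFFFFFF)


start, end = 0x10, 0x1B  # hypothetical label addresses

print(db("foobar"))
print(db(0x76, 0x86).hex())
print(dd(end - start).hex())  # 0b000000
```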

Since the roots of this architecture are in the 8-bit era, when memory was at a premium, many opcodes are only a single byte, with extra bytes for a memory location or for the second operand. For example, the nop instruction is assembled as the single byte 0x90. The line

mov ax,bx

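is assembled as the two-byte sequence 0x89, 0xD8 in 16-bit code. In 32-bit code (bits 32) the assembler puts the operand-size prefix 0x66 in front of these two bytes, because the operands are 16-bit registers. By default, displacements involving expressions are assembled as 32-bit quantities, so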

mov [esi+arg2-dstart-1],al

becomes 0x88,0x86,0x0B,0x00,0x00,0x00. The actual value of the displacement is 0x0B (11 decimal). The reason for a 32-bit value (0x0000000B) is that the assembler doesn't know the value of all labels until the end of the first pass, so it uses a conservative allocation. We can specifically tell the assembler that the value will fit into a byte using the following:

mov [byte esi+arg2-dstart-1],al

This addition tells the assembler that the value of "arg2-dstart-1" will fit into a single byte. The resulting bytes are 0x88,0x46,0x0B.
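The difference between the two encodings is in the second byte (the ModRM byte), whose top two bits say how large the displacement is. The following Python sketch rebuilds both byte sequences for this one instruction form. It is an illustration only: the function encode_mov_mem_reg8 and the two lookup tables are written for this example and cover just the case of a byte register stored through a base register plus displacement.

```python
import struct

# Register numbers used in the ModRM byte.
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}
# esp (number 4) needs an extra SIB byte and is left out of this sketch.
BASE32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "ebp": 5, "esi": 6, "edi": 7}


def encode_mov_mem_reg8(base: str, disp: int, src: str, force_byte: bool = False) -> bytes:
    """Encode mov [base+disp],src for an 8-bit source register."""
    opcode = 0x88  # mov r/m8, r8
    if force_byte:
        if not -128 <= disp <= 127:
            raise ValueError("displacement does not fit in one byte")
        mod = 0b01                            # 8-bit displacement follows
        disp_bytes = struct.pack("<b", disp)
    else:
        mod = 0b10                            # 32-bit displacement follows
        disp_bytes = struct.pack("<i", disp)
    modrm = (mod << 6) | (REG8[src] << 3) | BASE32[base]
    return bytes([opcode, modrm]) + disp_bytes


long_form = encode_mov_mem_reg8("esi", 0x0B, "al")
short_form = encode_mov_mem_reg8("esi", 0x0B, "al", force_byte=True)
print(long_form.hex(" "))   # 88 86 0b 00 00 00
print(short_form.hex(" "))  # 88 46 0b
```

The short form saves three bytes and, just as importantly for this lab, contains none of the zero bytes that the 32-bit displacement introduces.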