Friday, April 30, 2010

assembler tutorial: 2 - goodbye, mr. chips


and bade them curt 'hello,' and then 'good-bye.'


for our second heavily-annotated example, we'll modify the 'hello world' program a bit. beyond simply saying "goodbye" this new app will allow the user to provide a name on the command line to which we may bid adieu. if no argument is provided, the program will simply say goodbye to mr. chips. this time we will be assembling a dos com file.

in dosbox, move to the mycode directory and launch editv to create a new file called goodbye.asm.

comment, comment, comment. note that this file will not require a linker.


1 ; goodbye.asm
2 ; This variation on the 'Hello World' program expands upon the original
3 ; a little by accepting a name from the command line and printing that
4 ; in the message. If no parameter is supplied, the program will use a
5 ; default value. This one is written as a COM file, so it doesn't
6 ; require linking.
7 ; To assemble: nasm -o goodbye.com goodbye.asm
8 ;
9 ; Robert Ritter
10 ; 25 Apr 2010
11


we begin with a directive for the assembler. org is not a machine instruction, but rather a note to the assembler telling it the offset at which the program will be loaded, so that it can work out the right address for every label. in a dos com file the entire program fits into a single 64kb segment, and dos points all of the segment registers at that one segment before our code runs, so loading them up ourselves would be rather silly. org 100h tells nasm that the program starts at offset 100h, which is where dos loads every com file. why this offset? well, the first 256 bytes (0h through ffh) make up the program segment prefix, or psp. byte 100h is the first place real code can be loaded. since i'm defining my data at the top of my source file, i'm putting a jmp (unconditional jump) at this address to tell the system to skip right to the good stuff. jmp works like the much-maligned goto statement in other languages: it transfers execution to the code found at the given label. we're going to let the program flow jump to the label called 'Start' and we'll catch up in a bit.


12 ; ----------------------------------------------------------------------
13 org 100h
14 ; We set up a COM file by defining the address of the program location
15 ; in memory, which will always be 100h. Then we jump to the start of
16 ; the code block.
17 ;
18 jmp Start
19


you've seen data before, and you will recognize db from our last program. one new thing here is the equ directive. this creates a constant. data defined with db occupies memory and may be modified during program execution, but a name defined with equ is just a number that the assembler substitutes wherever the name appears, so there is nothing to modify at runtime. any attempt to define endMsgLen a second time in this program will cause the assembler to balk with the message that the label has been redefined.

another new thing is the use of the dollar sign outside the quotation marks. what does that mean? well, we're going to be copying strings into a buffer and we'll need to tell the cpu exactly how many bytes to copy. it's easy to find the size of the beginMsg string: we subtract its address from the address of defaultMsg.
defaultMsg - beginMsg
will give us the length of beginMsg. remember that labels are just aliases for memory addresses. we can use the same technique to find the length of defaultMsg. to find the length of the last string, endMsg, we subtract endMsg from $. the dollar sign in line 31 means "this byte right here," that is, the address the assembler has reached at that point in the file. so endMsgLen will contain the difference between the address just past the end of endMsg and the address of endMsg itself, which is the length of the string (four bytes.) that's pretty cool.


20 ; ----------------------------------------------------------------------
21 section .data
22 ; DOS COM files don't use segmented memory. The whole program fits
23 ; into a single 64KB block, so there's no need to worry about segments
24 ; at all. The assembler still expects to find defined data and code
25 ; sections, though, and it helps us to organize our source if we keep
26 ; things compartmentalized like this.
27 ;
28 beginMsg db 'Goodbye, '
29 defaultMsg db 'Mr. Chips'
30 endMsg db '!', 0dh, 0ah, '$'
31 endMsgLen equ $ - endMsg
32
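
here's a minimal sketch of the same length trick on a smaller scale. the labels part1 and part2 are made up for illustration and aren't part of goodbye.asm. the thing to notice is that an equ built on $ has to come immediately after the data it measures, because $ always means "wherever the assembler is right now."

```nasm
part1     db 'Hi, '            ; 4 bytes
part2     db 'Bob'             ; 3 bytes
part2Len  equ $ - part2        ; 3: $ is the address just past 'Bob'
part1Len  equ part2 - part1    ; 4: label arithmetic works anywhere
```

if another db were slipped in between part2 and part2Len, the constant would quietly grow to include it.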


the bss section (so named for historic reasons) contains data that is not initialized to a specific value; at least, no value that we care about. here we're creating a working buffer to which we may copy the elements of our final string before we send it to standard output.


33 ; ----------------------------------------------------------------------
34 section .bss
35 ; This section contains uninitialized storage space. We allocate space
36 ; here for data that we won't have until runtime. COM files don't
37 ; require an explicit STACK section. DOS sets up the stack for us
38 ; when it loads the program.
39 ;
40 fullMsg resb 1024 ; This is the message we will print.
41 ; We'll assemble it from parts and
42 ; copy each part into this memory area.
43


the rep movsb instruction copies a sequence of bytes from one place in memory to another. the number of bytes that get copied is taken from the cx register. so what we're doing here is concatenating strings and storing the result in fullMsg. first strings first...

 
44 ; ----------------------------------------------------------------------
45 section .code
46
47 Start:
48 ; First we'll copy the beginning of the message, 'Goodbye,' to our
49 ; allocated memory. The number of bytes to copy (the length of our
50 ; data) goes into CX.
51 mov cx, defaultMsg - beginMsg
52 ; The address of the data goes into SI (think Source Index) and the
53 ; address of the allocated memory into DI (as in Destination Index.)
54 mov si, beginMsg
55 mov di, fullMsg
56 rep movsb ; REP MOVSB copies CX bytes from SI to DI.
57 ; SI and DI are automatically incremented.
58
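
if rep movsb feels like magic, it may help to see roughly what it does spelled out by hand. this is only a sketch: the labels CopyLoop and CopyDone are made up, it assumes ds and es point at the same segment (true in a com file), and unlike the real instruction it clobbers al and the flags.

```nasm
        cld                 ; clear the direction flag so addresses count upward
        jcxz CopyDone       ; nothing to copy if CX is already zero
CopyLoop:
        mov al, [si]        ; fetch one byte from the source
        mov [es:di], al     ; store it at the destination
        inc si              ; advance both pointers
        inc di
        loop CopyLoop       ; decrement CX and repeat until it reaches zero
CopyDone:
```

when the loop finishes, di points just past the last byte written, which is why the program can keep appending strings without reloading di.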


remember that psp? the first 128 bytes (00h through 7fh) are full of stuff that we're really not interested in, but the second 128 bytes (80h through ffh) contain information from the command line that we used to run our program. since we want to get a name from the parameter list on the command line, we want to read this part of the psp. byte 80h tells us how long the parameter list is, so if it's zero (the program was run without any parameters) we'll say goodbye to the default name; otherwise we'll read the parameter list and take our name from there.

throughout our sojourn in assembler we've been working with memory addresses. the programming savvy among you may have said to yourself, "ah, these are pointers." most of what we work with in assembler is addresses, or pointers. if you learned in computer programming class that pointers were hard, then you learned them incorrectly; but that's a rant for another post. suffice it to say that pointers are the way to manipulate data in assembler. however, sometimes we need to get at the data in a memory location directly. on line 71 we need to compare zero to the value at address 80h, not the address itself. nasm makes this pretty easy: we use square brackets around an address to access the value inside. the instruction in line 69 means, "copy the value stored at address 80h into the cl register." there, you've just dereferenced a pointer. no big deal.
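
to make the difference between an address and the value stored there concrete, here is a tiny sketch. the choice of registers is arbitrary.

```nasm
mov si, 80h       ; SI = the number 80h: an address, a pointer
mov cl, [80h]     ; CL = the byte stored at address 80h: the value
mov cl, [si]      ; same byte again, this time dereferencing through SI
```

without the brackets you get the address; with them you get whatever lives there.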

this section contains a couple of logic branches. the cmp instruction compares two values and records the result in the cpu's flags, the jz instruction jumps to a particular label if that comparison came out equal (the zero flag is set), and an unconditional jmp skips the part of the program that won't be used if a name was given on the command line. remember that labels are just memory addresses in assembler.


59 ; Next we'll copy the command line parameter into our allocated memory.
60 ; When we start a program, DOS creates a data structure for it called
61 ; the PSP (Program Segment Prefix) that loads ahead of it in the first
62 ; 256 (100h) bytes of memory. (This is why the COM file has to point to
63 ; address 100h to start.) The first 128 bytes of the PSP is "stuff," so
64 ; we won't worry about that. The last half of the PSP contains the
65 ; parameter string. Byte 80h contains the length of the string and the
66 ; remaining bytes contain the parameter string terminated by a carriage
67 ; return (0dh.)
68 xor cx, cx ; Set CX to zero.
69 mov cl, [80h] ; Put the parameter length
70 ; into CL.
71 cmp cl, 0 ; Test CL to see if it's zero.
72 jz NoParam ; If CL contains zero jump to
73 ; another part of the program.
74 ; If the JZ (Jump if Zero) wasn't executed, then the user ran the
75 ; program with a command line parameter. CX now contains the number of
76 ; bytes in the string, but the first byte is always a space, so we'll
77 ; decrement CX and start copying the string from byte 82h. DI already
78 ; points to the end of the last thing we copied to memory.
79 dec cx
80 mov si, 82h
81 rep movsb
82 jmp FinishString ; Skip the NoParam part since
83 ; there was a parameter.
84
85 NoParam:
86 ; No parameter was given on the command line. We'll use the default
87 ; goodbye message.
88 mov cx, endMsg - defaultMsg
89 mov si, defaultMsg
90 rep movsb
91
92 FinishString:
93 ; Now we copy the last part of the message to memory.
94 mov cx, endMsgLen
95 mov si, endMsg
96 rep movsb
97
98 ; Use the DOS service call to print the string that is in memory.
99 mov dx, fullMsg
100 mov ah, 09h
101 int 21h
102
103 ; Exit with no error code.
104 mov ax, 4c00h
105 int 21h
106


save the file and exit editv. assemble the program directly into a com file:
nasm -o goodbye.com goodbye.asm
run the program with and without parameters.

now that was a pretty sophisticated program. you fetched data from the command line and used a condition to branch to a specific part of your program, much like if..then logic found in so-called high-level languages, and you used a default parameter if one wasn't provided. your skills are progressing nicely, padawan. your training is almost complete.

Monday, April 12, 2010

assembler tutorial: 1 - hello, world!


but thunder interrupted all their fears


now that you're ready to begin writing we'll dig right in. here is our first heavily-annotated program, hello.exe.



open up your dosbox and change to the mycode directory.
cd mycode
run editv with a new file, hello.asm.
editv hello.asm
i strongly advise you to turn on line numbering. you can do this by pressing ctrl-o followed by b, or you can just select the options menu with alt-o, scroll down to line numbers and press enter.

we begin with comments. if you've ever read a book on programming it has probably emphasized the need for good comments. in assembler comments are even more important because the language syntax is so terse. a comment begins with a semi-colon and continues to the end of the line of text.


1 ; hello.asm
2 ; Demonstrates how to write an assembly program for DOS with NASM using
3 ; the ubiquitous 'Hello World' string. To create an EXE file we'll
4 ; first assemble an OBJ file then link it. I'm using the public domain
5 ; linker WarpLink.
6 ; To assemble: nasm -f obj hello.asm
7 ; To link: warplink hello.obj
8 ;
9 ; Robert Ritter <rritter@centriq.com>
10 ; 12 Apr 2010
11


remember that dos addresses memory in segments. the first thing we'll need to do is reserve some memory for these segments. since this program doesn't work with a lot of data our data segment is pretty small. it defines a label, message, that will be the memory address of the first byte of the message that we're going to print out. the db (define byte) pseudo-instruction identifies a sequence of bytes that make up our data. there is also a dw (define word) for 16-bit values, and dd (define double word) for 32-bit values. most of the time you'll probably just treat data as a sequence of bytes, so you'll likely use db more than the others.

you may have noticed the characters that follow the obvious string, "Hello World." if we want to advance our output to the next line we must insert a newline character. this is like pressing the enter key on a keyboard. in high-level languages like c we use a string like "\n" to represent a newline, but in dos this is actually a two-byte sequence: 0dh (carriage return) and 0ah (linefeed.) the dollar sign character is a terminator that marks the end of the string. not all strings must be terminated with a dollar sign, but the dos printing service that we're going to use requires it.

notice that the characters that make up a string are enclosed in quotes. double or single quotes, it makes no difference. values outside the quotes, like 0dh and 0ah, are stored as literal bytes.
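
here are a few more data definitions to show the three sizes side by side. the labels are made up for illustration. note that the cpu stores multi-byte values low byte first.

```nasm
crlf     db 0dh, 0ah           ; two bytes: carriage return, linefeed
quoted   db "it's", ' ok'      ; double and single quotes both work
answer   dw 1234h              ; one word, stored in memory as 34h, 12h
big      dd 12345678h          ; one double word: 78h, 56h, 34h, 12h
```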


12 ; ----------------------------------------------------------------------
13 segment data
14 ; DOS EXE files use segmented memory which allows them to address more
15 ; than 64KB at a time. Here we define the data segment to store the
16 ; message that we're going to print on the screen.
17 ;
18 message db 'Hello World', 0dh, 0ah, '$'
19


the next thing that we want to do is reserve some memory for our stack segment. the resb (reserve byte) pseudo-instruction is used to set aside an uninitialized piece of memory of a given size. there is also a resw (reserve word) for 16-bit values and resd (reserve double word) for 32-bit values. we're going to allocate a 64-byte hunk'o'ram for the stack and set the label stackTop to point to the address immediately following the stack. for more info on how the stack works, see my previous post.


20 ; ----------------------------------------------------------------------
21 segment stack stack
22 ; The stack is used as temporary storage for values during the
23 ; program's execution. Sometimes we use it in our code, and sometimes
24 ; DOS uses it, especially when we call DOS interrupts. We'll set up a
25 ; small but serviceable stack for this program since we're going to be
26 ; calling on DOS services.
27 ;
28 resb 64
29 stackTop ; The label 'stackTop' is the address of the end (top)
30 ; of the stack. We'll need this to initialize the
31 ; stack pointer in the CPU.
32
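
the number after resb, resw or resd is a count of items, not a count of bytes. a quick sketch, with made-up labels:

```nasm
buffer   resb 80     ; 80 bytes
table    resw 10     ; 10 words = 20 bytes
totals   resd 4      ; 4 double words = 16 bytes
```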


the code segment is where the cool stuff happens. remember that a dos exe file may have more than one code segment to get around that pesky 64kb barrier we discussed last time. though multiple code segments are allowed, only one can be the actual entry point of our program. this is defined with a special label, ..start. note that i used a colon at the end of this label. a label may end with a colon, but this is not required. you may find code examples that are pretty inconsistent on the use of colons in labels. even examples in the official nasm documentation waffle a little on this. personally, i choose to use a colon when the label refers to a block of code, and to forgo the colon when the label refers to data. remember, though, that to the assembler they're all just addresses.

we're giving the mov instruction a real workout here. the instruction
mov dest, src
tells the cpu to copy the data at src into dest. yes, it goes right to left, but you get used to it pretty quickly. in this instance we're loading segment addresses into their respective cpu registers. since we can't copy immediate data directly into a segment register, we'll use ax for temporary storage.


33 ; ----------------------------------------------------------------------
34 segment code
35 ; The code segment is where our program actually does stuff. Executable
36 ; instructions go here.
37 ;
38 ..start:
39 ; First we need to do some housekeeping. Our program needs to know at
40 ; what addresses its segments can be found. The Intel CPU contains some
41 ; special registers just to hold this information, so we'll load them
42 ; up now. Since we can't put addresses directly into these registers,
43 ; we'll copy them to the AX general purpose register first.
44 mov ax, data
45 mov ds, ax ; DS: data segment register
46 mov ax, stack
47 mov ss, ax ; SS: stack segment register
48 mov sp, stackTop ; SP: stack pointer register
49


now we're going to call on dos to print our message on the screen. dos and the system bios have several services that they offer to our programs. these are accessed by triggering an interrupt with the int instruction. each service has its own requirements, so we need to look up the particular service we want in our handy dos developer's guide in order to properly use it. the dos service we're using here is service 09h of the general purpose interrupt 21h. to use it we place the address of a dollar-sign-terminated string into register dx, place the service id 09h into register ah, then call interrupt 21h.


50 ; We're going to use a DOS service to write a string to the screen.
51 ; The documentation for this service says that we have to terminate the
52 ; string we want to print with a dollar sign (see how we did this in
53 ; the data segment above) and we must put the address of the string
54 ; into the DX register and call the service. DOS interrupt 21h provides
55 ; all kinds of cool services. To use it we place the service ID in
56 ; register AH and call INT 21h.
57 mov dx, message
58 mov ah, 09h
59 int 21h
60


finally we exit the program. we'll use service 4ch of dos interrupt 21h. if you have a specific exit code (for example, to signal an error) you place it into register al. just as before, we put the service id into ah and call the interrupt. since we have no error condition we'll do a clean exit. here we load al and ah at the same time by putting 4c00h into ax.


61 ; We also use INT 21h to exit our program. The exit function is 4Ch,
62 ; which goes into AH. The exit code that is used to report errors back
63 ; to the operating system goes into AL. We'll just load both at the
64 ; same time, then call INT 21h.
65 mov ax, 4c00h
66 int 21h
67
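
if you did want to report a problem, a sketch like this would exit with a code of 1 instead of 0. loading ah and al separately is the same as loading 4c01h into ax; the value 1 is just an example.

```nasm
mov ah, 4ch     ; service 4Ch: terminate the program
mov al, 1       ; exit code 1, which a batch file can test with 'if errorlevel 1'
int 21h
```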


you have just written a program in assembly language. save the file and exit editv. assemble the file with nasm:
nasm -f obj hello.asm
this will create an object file suitable for linking into a dos exe. link the file with warplink:
warplink hello.obj
this creates the file hello.exe. notice that i included these instructions in the comments at the top of the source file. this is useful if you come back to the program at a later time and want to make changes. now run your program and bask in the warmth of the knowledge that you have made this cpu do your explicit bidding. a little bit more of this and you'll be ready for live minions.

next time we'll pass command-line parameters into our program, and we'll shake things up a bit with the dos com file format.

Monday, April 5, 2010

assembler tutorial: intro 2 of 2


the lovers moved to flee from heaven's tears


warning: this post contains frequent references to explicit hex, and may be inappropriate for readers under the age of 11h

last time i said, "before we begin in earnest, two things." here comes thing number two.



second, a word about how assembler works. you are no doubt aware that your computer has long-term storage (disks) and short-term storage (random access memory, or ram.) if we use an office allegory to describe a computer, we might say that the disks are like the filing cabinets in the back room: they can hold lots and lots of stuff, and are generally pretty well organized, but inconvenient. constantly going to them to fetch new work or to put something away would be a chore, so we tend to use them only when we need to grab something we plan to use soon or to put something away when we won't be using it for a good long while. ram is like the in/out trays on my desk: i can stack all kinds of stuff there (though much less than i can put in the filing cabinets) and my work is quickly and easily accessible. the cpu is like my desktop, where all the work actually happens. to do some work i have to take it from the trays and move it to the desktop, and to clear the desk for some other task i need to move what's on the desktop back to the trays. so where in the cpu do we store this really temporary stuff while working on it?

registers

cpus have built-in memory storage spaces called registers. in the intel x86 architecture, 16-bit general purpose registers go by the names ax, bx, cx and dx. each 16-bit register can be broken into two parts, a high-order byte and a low-order byte. for register ax, these would be called ah and al respectively. the specific meanings of high- and low-order aren't too important right now, and the topic delves deep into ancient religious wars of cpu design, but suffice it to say that putting the 16-bit word c725h into register ax will load c7h into ah and 25h into al.

in modern cpus each 16-bit register is only the low half of one of the 32-bit registers, which bear the names eax, ebx, ecx and edx. there are other special purpose registers that we'll talk about as we move along, but you get the idea.
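
a short sketch of how ax, ah and al overlap. one assumption-free gotcha worth knowing: nasm wants a hex number with an h suffix to begin with a digit, hence the leading zero on 0c725h.

```nasm
mov ax, 0c725h   ; AX = c725h, so AH = c7h and AL = 25h
mov al, 0ffh     ; only the low byte changes:  AX = c7ffh
mov ah, 00h      ; only the high byte changes: AX = 00ffh
```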

so writing an assembly language program is like shuffling paperwork around. you copy data into a register, you tell the cpu to process it, then you do something (or nothing, if you wish) with the result. here is a simple set of instructions that you'll see frequently in assembler. we'll talk about what it does next time.

1 ; These instructions are commonly found in DOS programs.
2 mov ax, 4c00h
3 int 21h


segments

another thing you must know is how dos accesses memory. to mov (copy) data to or from ram you need an address. since dos works with 16-bit registers, the largest address that fits in a single register is 16 bits long, so one register can address up to 65,536 (64k) bytes of ram. that's it. a long time ago 64k was a lot. remember all the great programs we ran on the commodore 64? but as consumers demanded more from their applications 64k became a barrier. the cpu handles this by viewing memory as a series of segments, each up to 64kb in size. a program can contain many segments of code and data so long as none of them exceeds 64kb.

to address memory, then, we need two registers: a special segment register for the segment address and a normal general purpose register for the offset within that segment. if the ds register, which points to a data segment, contains 24a0h and we mov 0fh into register dx, then ds:dx refers to the 16th byte of that segment, written as 24a0:000fh. if we later load ds with 4110h we'll find that ds:dx now points to 4110:000fh. it's not too complicated, but it's up to the programmer to keep track of which segment he's using at any point in time. fortunately you need not know the exact addresses of your segments (dos actually determines that at runtime, so there is no way you could know as you're writing your source.) in assembler we use labels, friendly names to refer to addresses. so you may see code like the following to initialize the data segment register:

4 ; Load the DS register with the address of the data segment.
5 mov ax, data ; "data" is the address of our
6 mov ds, ax ; data segment
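
to tie the segment:offset numbers from above to actual instructions, here is a sketch using the same hard-coded values. in real mode the cpu finds the physical address by multiplying the segment by 16 (10h) and adding the offset. the segment values are illustrative only; in a real program you'd load a label, as shown above. i've put the offset in bx because the cpu only accepts bx, bp, si or di inside the brackets.

```nasm
mov ax, 24a0h
mov ds, ax        ; DS = 24a0h
mov bx, 000fh     ; offset of the 16th byte in the segment
mov al, [bx]      ; reads DS:BX = 24a0:000fh
                  ; physical address = 24a0h * 10h + 000fh = 24a0fh
mov ax, 4110h
mov ds, ax        ; switch to a different segment
mov al, [bx]      ; same offset, new segment: 4110:000fh = 4110fh
```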


stack

finally, there is the stack. this is a handy little place in ram to put things temporarily, such as when you want to pass data from one procedure to another. we push data onto the stack to store it, and we pop data off of the stack to retrieve it. the stack is like that little cart with the clean plates at the head of a buffet line. the most recently cleaned plates are warm, damp and on the top of the stack, and the ones that have been there awhile and are much drier are at the bottom. when you take a plate off the top, you're taking the one that was most recently placed on the stack.

data stacks work the same way. the topmost item is the most recently pushed data, the oldest data is at the bottom. data is always popped in reverse order from how it was pushed onto the stack.
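
here's the plate cart in code, as a minimal sketch with arbitrary values. because the last value pushed is the first one popped, the two registers end up swapped.

```nasm
mov ax, 1111h
mov bx, 2222h
push ax          ; stack, top first: 1111h
push bx          ; stack, top first: 2222h, 1111h
pop ax           ; AX = 2222h, the most recently pushed value
pop bx           ; BX = 1111h, the older value
```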

of course, we all know what happens when you put too many plates in a stack on one of those carts. bad, loud things happen. if we were to overfill our stack segment in our program, we could overwrite some other segment, or worse, some other program's segment. this could also lead to bad, loud things, so the intel cpu does a funny thing when it sets up the stack: it fills it backward, from the top down. you need to see this to get it…

let's say that you decide to create a stack segment for your program that is only four bytes long (don't use such a small stack in real life.) the ss register (stack segment) will contain the address of the beginning, or bottom of the stack. the first byte would be at ss:0h, the second at ss:1h, the third at ss:2h and the fourth at ss:3h. the stack pointer (another register called sp) will point to the top of the stack, 4h.

"wait!" you cry. "4h isn't in the stack segment, because it's only four bytes long!" you're right. what's at the address that sp is pointing to right now? we don't know for sure. "isn't that dangerous?" you ask. perhaps, but wait until you see how the thing comes off.

when we push a byte onto the stack, sp is first decremented, so now it points to 3h. then the pushed data is copied to 3h. see? everything is okay, because we don't actually write to 4h, so we don't corrupt some other program's stuff. when we push a second byte, sp is decremented and the new data is written to 2h. when we pop the data off of the stack, the data that sp points to at 2h is copied and sp is incremented to point to 3h. (a real x86 push always moves a 16-bit word and changes sp by two; i'm counting single bytes here only to keep the arithmetic simple.) but what if we try to push more than four bytes onto the stack? well, consider that when the fourth byte is pushed sp has been decremented four times, so it now points to 0h. nothing in real-mode dos checks for this. the next push decrements sp again, it wraps around past zero to the high end of the 64kb range, and the data lands in memory that we never set aside for the stack. there is no stack overflow error to save you, so it's up to you to make the stack big enough for your program. likewise, there is nothing to keep you from popping the data at 4h before you've pushed anything onto the stack. just expect really bad things to happen when you try to use that unknown value. it's best to not go there.
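
to watch sp move on a real cpu, here is a short trace using 16-bit pushes. it's a sketch that assumes ss already points at a 64-byte stack segment, so sp starts at 40h; the value in ax is arbitrary.

```nasm
; assumes SS already points at a 64-byte stack segment (offsets 00h-3fh)
mov sp, 40h      ; SP starts one past the last byte of the stack
mov ax, 1234h
push ax          ; SP = 3eh; 34h is stored at SS:3eh, 12h at SS:3fh
push ax          ; SP = 3ch
pop bx           ; BX = 1234h, SP = 3eh
pop bx           ; BX = 1234h, SP = 40h, back where we started
```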

why use the stack if it's so much potential trouble? remember that programs love it, more than programmers love their buffets. sometimes procedures pass parameters by the stack so that they can do things. after branching to another line of execution the stack can act like a trail of breadcrumbs, helping your code to wend its way back to where it came from. even if you never consciously use it, the services you request through dos or bios interrupts will use the stack. buck up, young padawan: you can't escape your destiny.
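
the breadcrumb trail is easiest to see with call and ret. in this sketch call pushes the address of the next instruction onto the stack before jumping, and ret pops that address to find its way home. it's written as a small dos com file so it can be assembled without a linker; the label PrintMsg and the message text are made up for illustration.

```nasm
        org 100h
        mov dx, msg         ; DX = address of the string to print
        call PrintMsg       ; push the address of the next instruction, then jump
        mov ax, 4c00h       ; execution resumes here after RET
        int 21h

PrintMsg:
        mov ah, 09h         ; DOS service 09h: print the '$'-terminated string at DS:DX
        int 21h
        ret                 ; pop the saved address off the stack and jump back to it

msg     db 'home again', 0dh, 0ah, '$'
```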

now then, you're all set to write your first program in assembly language…