My first x86-64 assembly in macOS
I am reading the book Coders at Work. Each chapter is an interview with an expert in computer programming, engineering, or science. Most of them started programming in the 70s and 80s, when there were far fewer programming languages to choose from than there are today, and most of them programmed at a low level, in assembly or binary. Reading about their journeys thrilled me and motivated me to learn a little assembly and to get to know the machine more closely. This blog post is about my journey of learning assembly.
I have a MacBook Pro. It is a 64-bit, Intel-based machine, so the assembly I am learning is x86-64 assembly. I started by collecting resources and tips on getting started. I found few resources on assembly for the Mac; most of what I found was written for Linux. Along the way I collected some good blogs and resources, which I share at the end of this post. This post is more about what I am discovering and learning than a tutorial.
I started with the evergreen “Hello World” program. Here is the short program; I will try to explain what I have figured out about it so far.
.data
str:
    .asciz "Hello World\n"

.section __TEXT,__text
.globl _main
_main:
    pushq %rbp
    movq %rsp, %rbp
    subq $32, %rsp
    leaq str(%rip), %rdi
    callq _printf
    addq $32, %rsp
    popq %rbp
    retq
Let’s talk about registers. A general definition is:
The registers are like variables built in the processor.
Let’s start with the registers available in x86 (i386):
In our hello world code we used registers like rbp and rsp. In the diagram above we see ebp, esp, and so on; those are the 32-bit variants. We used the 64-bit variants, whose names start with r instead of e. Apart from that first letter, the naming pattern is the same. I found this link, where short but detailed information on each register is given. Each register has a specific conventional role, but from what I have read in various sources, most of them are general purpose and can be used for other tasks when they are free.
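To see how the differently sized names relate, here is a minimal sketch that writes to one register through three of its names. The names %eax and %al (the 32-bit and 8-bit views of %rax) and the constants are my own choices for illustration and do not appear in the hello world sample. It assumes the file is assembled and linked the same way as the hello world program in this post.

```asm
    .section __TEXT,__text
    .globl _main
_main:
    movabsq $0x1122334455667788, %rax   # fill all 64 bits of %rax
    movl    $42, %eax                   # 32-bit write: the upper 32 bits of %rax are zeroed, %rax = 42
    movb    $7, %al                     # 8-bit write: only the lowest byte changes, %rax = 7
    retq                                # main's return value is taken from %eax
```

Running the program and then `echo $?` in the shell should show the value left in %eax as the exit status, which is a handy way to inspect a register without calling printf.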
Our hello world code uses AT&T syntax. I have found that assembly syntax comes in two flavors: Intel and AT&T. Here is a diagram showing the main differences in syntax:
I did find myself comfortable with AT&T syntax. Let’s continue with it.
.data is a directive that defines the section, or region, where variables can be declared.
I will declare two string variables named str1 and str2. I will also declare one long type variable called num1.
.data
str1:
    .asciz "string variable 1\n"
str2:
    .asciz "string variable 2\n"
num1:
    .long 10
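Various resources explain that variables declared inside .data are treated as global variables. I am not entirely sure how globals are treated in assembly. As far as I can tell, a label in .data names storage that exists for the whole run of the program, and it is only visible to other object files if it is also marked with .globl. I may be wrong in my understanding of this directive, so please do correct me.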
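As a small sketch of how such a variable can be used, the program below reads num1, changes it, and stores it back. The arithmetic and the use of the exit status to show the result are assumptions of mine for illustration, not something from the resources I read. The num1(%rip) form is the same kind of addressing that the hello world sample uses for str.

```asm
    .data
num1:
    .long 10                    # one 32-bit variable, initial value 10

    .section __TEXT,__text
    .globl _main
_main:
    movl num1(%rip), %eax       # read the value stored at num1 (10)
    addl $5, %eax               # %eax = 15
    movl %eax, num1(%rip)       # write it back; the .data section is writable
    movl num1(%rip), %eax       # read it again to show the store took effect
    retq                        # exit status is the value in %eax
```

Note the difference between movl, which copies the value stored at num1, and leaq, which would copy the address of num1.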
In our hello world sample, .section __TEXT,__text can be replaced with the built-in .text directive. This is because, according to the documentation:
This is equivalent to .section __TEXT, __text, regular, pure_instructions when the default -dynamic flag is in effect and equivalent to .section __TEXT, __text, regular when the -static flag is specified.
The .text directive is used in most of the examples that I have found. It marks the section that holds our instructions.
Moving on, we see .globl _main. Turning our assembly code into an executable program takes two steps: assemble and link. The program needs an entry point to start executing from, like the main() function in C. The label _main is that entry point, and .globl makes it visible to the linker. If we don’t export the entry point with .globl, we get the following error:
Undefined symbols for architecture x86_64:
  "_main", referenced from:
     implicit entry/start for main executable
ld: symbol(s) not found for architecture x86_64
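The flip side is that labels without .globl stay private to the file. A minimal sketch, assuming a hypothetical routine named helper that I made up for this example: _main is exported so the linker can find it, while helper is not exported and is still callable from inside the same file.

```asm
    .section __TEXT,__text
    .globl _main              # exported: the linker can see _main

_main:
    pushq %rbp
    movq  %rsp, %rbp
    callq helper              # defined in this same file, so no export is needed
    popq  %rbp
    retq                      # returns with helper's result still in %eax

helper:                       # hypothetical private routine, no .globl
    movl  $3, %eax
    retq
```

If another source file tried to call helper, I would expect the linker to report it as an undefined symbol, just like _main in the error above.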
The next lines of our sample are the first instructions. Before calling anything, we set up a stack frame.
    pushq %rbp
    movq %rsp, %rbp
    subq $32, %rsp
To maintain the stack, we first push the base pointer register %rbp onto the stack with pushq %rbp, then we copy the stack pointer %rsp into the base pointer register.
The %rbp base pointer is used by convention as a point of reference for finding parameters and local variables on the stack. When a subroutine is executing, the base pointer holds a copy of the stack pointer value from when the subroutine started executing. Parameters and local variables will always be located at known, constant offsets away from the base pointer value. We push the old base pointer value at the beginning of the subroutine so that we can later restore the appropriate base pointer value for the caller when the subroutine returns. Remember, the caller is not expecting the subroutine to change the value of the base pointer. We then move the stack pointer into %rbp to obtain our point of reference for accessing parameters and local variables.
subq $32, %rsp is interesting: it lowers the stack pointer (which is in %rsp) by 32, reserving 32 bytes of stack space for us. The address range from (%rsp) up to 31(%rsp) is now at our disposal for storing data. Because the earlier movq %rsp, %rbp copied the old value of %rsp into %rbp, the same 32 bytes can also be addressed as -32(%rbp) through -1(%rbp). Remember that the stack grows toward lower addresses. See the diagram example of the stack:
Now we display the Hello World string:
    leaq str(%rip), %rdi
    callq _printf
At the moment I don’t have much information on lea, but the documentation says:
The lea instruction places the address specified by source into the register specified by destination.
leaq str(%rip), %rdi copies the address of str, not the bytes stored there, into %rdi.
str(%rip) can be read as [%rip + offset]. This is RIP-relative addressing: the assembler and linker work out the distance from the next instruction (which is where %rip points while this one executes) to str, and store that distance in the instruction. So the offset is not the number of bytes consumed by the str variable; it is how far str is from the code that refers to it. On 64-bit macOS this is the normal way to reach variables in .data.
Why are we copying the address into %rdi? At first I had no proper explanation, so I tried copying it into another register, and it failed. The register documentation describes the traditional role of this register:
Used for string, memory array copying and setting and for far pointer addressing with ES
That is not the reason here, though. In the calling convention used on 64-bit macOS and Linux (the System V AMD64 ABI), the first integer or pointer argument to a function is passed in %rdi, and printf expects its format string as the first argument.
callq _printf calls the printf function from the C library. On the Mac, C symbols are prefixed with an underscore, unlike on Linux (as I mentioned, most code samples on the web are for Linux). printf reads the address in %rdi and prints the string it points to. Again, I apologise for the lack of depth here.
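To make the role of %rdi more concrete, here is a sketch that passes two arguments to printf. It assumes the System V AMD64 convention, where the second integer argument goes in %rsi (written here as its 32-bit name %esi), and that %al should hold the number of vector registers used when calling a variadic function like printf (zero here). The names fmt and num1 and the format string are made up for this example.

```asm
    .data
fmt:
    .asciz "num1 = %d\n"
num1:
    .long 10

    .section __TEXT,__text
    .globl _main
_main:
    pushq %rbp
    movq  %rsp, %rbp
    leaq  fmt(%rip), %rdi     # 1st argument: address of the format string
    movl  num1(%rip), %esi    # 2nd argument: the value stored at num1
    xorl  %eax, %eax          # variadic call: no vector registers are used
    callq _printf
    xorl  %eax, %eax          # return 0 from main
    popq  %rbp
    retq
```

Linked against the C library like the hello world program, this should print num1 = 10. Swapping the two registers would hand printf a number where it expects a pointer, which I suspect is the kind of failure I ran into when I tried another register.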
We clean up our stack at the end of our instructions:
    addq $32, %rsp
    popq %rbp
addq $32, %rsp releases the 32 bytes that we reserved earlier.
popq %rbp pops the saved value off the stack and restores the caller's base pointer. Finally, retq returns to the caller.
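Putting the setup and cleanup together, here is a sketch that actually uses the reserved space for two local variables. The offsets -8 and -16 and the values stored are arbitrary choices of mine; any 8-byte slot inside the 32 reserved bytes would do.

```asm
    .section __TEXT,__text
    .globl _main
_main:
    pushq %rbp                # save the caller's base pointer
    movq  %rsp, %rbp          # %rbp is now our fixed point of reference
    subq  $32, %rsp           # reserve 32 bytes: -32(%rbp) through -1(%rbp)

    movq  $40, -8(%rbp)       # first local variable
    movq  $2, -16(%rbp)       # second local variable
    movq  -8(%rbp), %rax      # load the first local
    addq  -16(%rbp), %rax     # add the second: %rax = 42

    addq  $32, %rsp           # release the 32 bytes
    popq  %rbp                # restore the caller's base pointer
    retq                      # the sum is returned in %eax
```

Because the locals are addressed relative to %rbp, their offsets stay the same even if %rsp moves again later in the function.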
Time to assemble and link our program:
as hello.s -o helloas.o
ld -e _main -macosx_version_min 10.10 -arch x86_64 helloas.o -o helloas
-e tells the linker the start/main entry point
-macosx_version_min specifies the minimum macOS version (if not provided, the current OS version is used)
-arch names the architecture you are targeting
-o saves the executable under a meaningful/custom name
I got an error:
Undefined symbols for architecture x86_64:
  "_printf", referenced from:
      _main in hello2as.o
ld: symbol(s) not found for architecture x86_64
Reading the message, it looks like the printf function was not found, which makes sense: printf is not part of our assembly file, it lives in the C library. We need to link the C library too.
ld -macosx_version_min 10.10 -arch x86_64 /usr/lib/libc.dylib helloas.o -o helloas
We are able to create and run our executable:
I apologise if this post seems to lack depth. There are lots of areas that I am still discovering. I am trying to learn assembly in small steps, and I want to share what I learn with others who are also looking to dive in.
Writing 64 bit assembly on mac os x - idryman.org
x86 Registers - eecg.toronto
mac OS X assembly - fabiensanglard.net
Understanding C by learning assembly - recurse.com
Intel and AT&T assembly syntax - imada.sdu.dk
Guide to x86 - cs.virginia.edu