My first x86-64 assembly in macOS - Part 1
Update: The second part of this series is available.
I am reading the book Coders at Work. Each chapter is an interview with an expert in the field of computer programming, engineering or science. Most of them started programming during the 70s and 80s, a time when there were far fewer programming languages to choose from than today. Many of them programmed at a low level, in assembly or even binary. I was thrilled and excited reading about their journeys. It motivated me to learn a little assembly and to get to know the machine more closely. This blog post is about my journey of learning assembly.
I have a MacBook Pro. It's a 64-bit, Intel-based machine, so the assembly I will be writing is x86-64. I started collecting resources and tips to get started. I found few resources on writing assembly on the Mac, and most of what I found was for Linux. During the search I collected some good blogs and resources that I'll be sharing. This post is more about what I am discovering and learning than a tutorial.
I started with the evergreen “Hello world” program. Here is the short program; I will try to explain what I have figured out so far.
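.data
str: .asciz "Hello World\n"      # NUL-terminated string
.section __TEXT,__text
.globl _main                     # make _main visible to the linker
_main:
    pushq %rbp                   # save the caller's base pointer
    movq %rsp, %rbp              # set up our own stack frame
    subq $32, %rsp               # reserve 32 bytes of stack space
    leaq str(%rip), %rdi         # first argument: address of str
    callq _printf                # call printf(str)
    addq $32, %rsp               # release the reserved bytes
    popq %rbp                    # restore the caller's base pointer
    retq                         # return to the caller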
Let's talk about registers. A general definition is:
The registers are like variables built in the processor.
Let's start with the registers available in x86 (i386):
In our hello world code we used registers like rbp and rsp. In the above diagram we see ebp, esp and so on; those are the 32-bit variants. We, on the other hand, used the 64-bit variants, whose names start with r. Apart from that first letter the naming pattern is the same. I found this link, which gives short, detailed information on each register. Each register has a conventional role, but after studying various sources it seems that most of them can be used for other tasks when they are free.
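The relationship between the 32-bit and 64-bit names can be checked with a small program. This is only a sketch and not part of the original hello world: the file name regs.s, the label fmt and the constant are made up for illustration, and it assumes the same assemble and link steps used for the hello world program. It relies on %eax being the name for the low 32 bits of %rax.

```asm
# regs.s: %eax names the low 32 bits of %rax (illustrative sketch)
        .data
fmt:    .asciz "rax = 0x%llx, eax = 0x%x\n"   # hypothetical format string

        .text
        .globl _main
_main:
        pushq   %rbp
        movq    %rsp, %rbp

        movabsq $0x1122334455667788, %rax     # fill all 64 bits of %rax
        movq    %rax, %rsi                    # 2nd printf argument: the full %rax
        movl    %eax, %edx                    # 3rd argument: only the low 32 bits
        leaq    fmt(%rip), %rdi               # 1st argument: the format string
        xorl    %eax, %eax                    # variadic call: no vector registers used
        callq   _printf

        xorl    %eax, %eax                    # return 0 from main
        popq    %rbp
        retq
```

When run, this should print rax = 0x1122334455667788, eax = 0x55667788: the second value is just the lower half of the first.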
The assembly instructions in our hello world code use AT&T syntax. I have found assembly syntax in two flavors: Intel and AT&T. Here is a diagram showing the brief difference in syntax:
I found myself comfortable with AT&T syntax, so let's continue with it.
.data is a directive that defines the section (region) where variables can be declared.
I will declare two string variables named str1 and str2, and one long variable called num1:
.data
str1: .asciz "string variable 1\n"
str2: .asciz "string variable 2\n"
num1: .long 10
Various resources explain that variables inside .data are treated as global variables. I am not quite sure how globals are treated in assembly. My understanding of this directive may be wrong, so please do correct me.
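A minimal sketch of how these three variables could be used from code, assuming the same assemble and link steps as the hello world program. The file name data.s and the fmt label are hypothetical additions; str1, str2 and num1 are the variables declared above. Note that the strings are referenced by address (leaq), while the number is loaded by value (movl).

```asm
# data.s: print the variables declared in .data (illustrative sketch)
        .data
str1:   .asciz "string variable 1\n"
str2:   .asciz "string variable 2\n"
num1:   .long 10
fmt:    .asciz "num1 = %d\n"          # hypothetical format string for num1

        .text
        .globl _main
_main:
        pushq   %rbp
        movq    %rsp, %rbp

        leaq    str1(%rip), %rdi      # address of str1 as the first argument
        xorl    %eax, %eax            # variadic call: no vector registers used
        callq   _printf

        leaq    str2(%rip), %rdi      # address of str2
        xorl    %eax, %eax
        callq   _printf

        leaq    fmt(%rip), %rdi       # format string
        movl    num1(%rip), %esi      # the 32-bit value stored at num1
        xorl    %eax, %eax
        callq   _printf

        xorl    %eax, %eax            # return 0 from main
        popq    %rbp
        retq
```

This should print the two strings followed by num1 = 10.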
In our hello world sample, .section __TEXT, __text can be replaced with the built-in .text directive. This is because, according to the documentation:
.text
This is equivalent to .section __TEXT, __text, regular, pure_instructions when the default -dynamic flag is in effect and equivalent to .section __TEXT, __text, regular when the -static flag is specified.
The .text directive is used in most of the examples that I have found. It gives the impression that our main instructions go inside it.
Moving further, we see .globl _main. To make an executable program from our assembly code, two steps are required: assemble and link. We need an entry point for the program to start executing from, like the main() function in C. Here the _main label plays that role, and .globl _main makes the label visible to the linker. If we don't declare the entry point with .globl we get the following error:
Undefined symbols for architecture x86_64
"_main", referenced from:
implicit entry/start for main executable
ld: symbol(s) not found for architecture x86_64
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
To maintain the stack, we first push the base pointer register %rbp onto the stack with pushq %rbp, then we copy the stack pointer %rsp into the base pointer %rbp.
The %rbp base pointer is used by convention as a point of reference for finding parameters and local variables on the stack. When a subroutine is executing, the base pointer holds a copy of the stack pointer value from when the subroutine started executing. Parameters and local variables will always be located at known, constant offsets away from the base pointer value. We push the old base pointer value at the beginning of the subroutine so that we can later restore the appropriate base pointer value for the caller when the subroutine returns. Remember, the caller is not expecting the subroutine to change the value of the base pointer. We then move the stack pointer into EBP to obtain our point of reference for accessing parameters and local variables.
The instruction subq $32, %rsp is interesting: we are reserving 32 bytes on the stack by lowering the stack pointer (which is in %rsp). The address range from (%rsp) to 31(%rsp) is now at our disposal for storing data. Note that the earlier instruction movq %rsp, %rbp copied %rsp to %rbp, so we can reach the same locations as -32(%rbp) up to -1(%rbp). Remember that the stack grows downward, toward lower addresses. See the diagram example of the stack.
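To make the reserved area more concrete, here is a small sketch that stores two local values in it and reads them back through %rbp. The file name locals.s, the label fmt and the values 40 and 2 are made up for illustration; the prologue and epilogue are the same as in the hello world program.

```asm
# locals.s: use the reserved stack bytes as local variables (illustrative sketch)
        .data
fmt:    .asciz "a = %ld, b = %ld, a + b = %ld\n"   # hypothetical format string

        .text
        .globl _main
_main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $32, %rsp             # reserve 32 bytes: -32(%rbp) up to -1(%rbp)

        movq    $40, -8(%rbp)         # local a: the 8 bytes just below the saved %rbp
        movq    $2, -16(%rbp)         # local b: the next 8 bytes down

        movq    -8(%rbp), %rsi        # 2nd printf argument: a
        movq    -16(%rbp), %rdx       # 3rd printf argument: b
        movq    %rsi, %rcx
        addq    %rdx, %rcx            # 4th printf argument: a + b
        leaq    fmt(%rip), %rdi       # 1st printf argument: format string
        xorl    %eax, %eax            # variadic call: no vector registers used
        callq   _printf

        xorl    %eax, %eax            # return 0 from main
        addq    $32, %rsp             # release the reserved bytes
        popq    %rbp
        retq
```

This should print a = 40, b = 2, a + b = 42. The offsets are negative because the locals live below the address held in %rbp.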
Now we display the Hello World string:
leaq str(%rip), %rdi
callq _printf
At the moment I don't have much information on lea, but the documentation says:
lea
The lea instruction places the address specified by source into the register specified by destination.
So leaq str(%rip), %rdi copies an address, not a value, into %rdi. str(%rip) can be read as [%rip + str]: the assembler and linker work out the distance from the next instruction (where %rip points) to the str label and encode it as the offset, so the result is the address of our str variable. It is not the number of bytes consumed by str. On 64-bit macOS, global variables are normally addressed relative to %rip like this. After the instruction, %rdi holds the address of the string, not the string itself. Why are we copying the address to %rdi? I tried copying it into another register and it failed. The reason is the calling convention: on x86-64 macOS the first integer or pointer argument of a function is passed in %rdi, and printf expects the address of its format string there. The register documentation describes the older, string-oriented role of this register:
%rdi
Used for string, memory array copying and setting and for far pointer addressing with ES
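The difference between lea and mov is easiest to see side by side. The following is a sketch under the same build assumptions as the hello world program; the file name lea_vs_mov.s and the labels num and fmt are hypothetical. lea computes the address of num without touching memory, while mov with the same operand reads the 8 bytes stored there.

```asm
# lea_vs_mov.s: address of a variable vs. its value (illustrative sketch)
        .data
num:    .quad 42
fmt:    .asciz "address = %p, value = %ld\n"   # hypothetical format string

        .text
        .globl _main
_main:
        pushq   %rbp
        movq    %rsp, %rbp

        leaq    num(%rip), %rsi       # lea: the address of num, no memory read
        movq    num(%rip), %rdx       # mov: the 8 bytes stored at that address
        leaq    fmt(%rip), %rdi       # 1st printf argument: format string
        xorl    %eax, %eax            # variadic call: no vector registers used
        callq   _printf

        xorl    %eax, %eax            # return 0 from main
        popq    %rbp
        retq
```

The printed address will differ from run to run and machine to machine, while the value should always be 42.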
callq _printf calls the printf function from the C library. On the Mac, C symbols are prefixed with an underscore, unlike on Linux (as I mentioned, most code samples on the web are for Linux). printf reads the address of its format string from %rdi and prints the string stored there. Again, I apologise for the lack of depth here.
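The same convention extends to further arguments: under the System V AMD64 calling convention, which macOS follows on x86-64, the first three integer or pointer arguments go in %rdi, %rsi and %rdx. For a variadic function such as printf, %al should also hold the number of vector registers used, which is zero here. The sketch below is an illustration only; the file name args.s, the labels fmt and msg and the values are made up.

```asm
# args.s: passing three arguments to printf (illustrative sketch)
        .data
fmt:    .asciz "%s has %d characters\n"   # hypothetical format string
msg:    .asciz "Hello World"

        .text
        .globl _main
_main:
        pushq   %rbp
        movq    %rsp, %rbp

        leaq    fmt(%rip), %rdi       # 1st argument: format string
        leaq    msg(%rip), %rsi       # 2nd argument: pointer consumed by %s
        movl    $11, %edx             # 3rd argument: integer consumed by %d
        xorl    %eax, %eax            # %al = 0: no vector registers used
        callq   _printf

        xorl    %eax, %eax            # return 0 from main
        popq    %rbp
        retq
```

This should print Hello World has 11 characters. Swapping the registers around changes which argument printf sees in which position.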
We clean up our stack at the end of the instructions:
addq $32, %rsp
popq %rbp
addq $32, %rsp releases the 32 bytes that we reserved earlier.
popq %rbp pops the saved base pointer off the stack and restores it into %rbp for the caller.
Time to assemble and link our program:
as hello.s -o helloas.o
ld -e _main -macosx_version_min 10.10 -arch x86_64 helloas.o -o helloas
-e tells the linker the start/main entry point
-macosx_version_min specifies the minimum macOS version (if not provided, the current OS version is used)
-arch names the architecture you are targeting
-o saves the executable under a meaningful/custom name
I got an error:
Undefined symbols for architecture x86_64
"_printf", referenced from:
_main in hello2as.o
ld: symbol(s) not found for architecture x86_64
Reading the error, it looks like the printf function is not found, which makes sense because it is not part of our assembly. We need to link the C library too.
ld -macosx_version_min 10.10 -arch x86_64 /usr/lib/libc.dylib helloas.o -o helloas
We are now able to create and run our executable:
./helloas
I apologise if this post seems to lack depth. There are a lot of areas that I am still discovering. I am learning assembly in small steps and want to share with others who are also looking to dive in.
References:
Writing 64 bit assembly on mac os x - idryman.org
x86 Registers - eecg.toronto
mac OS X assembly - fabiensanglard.net
Understanding C by learning assembly - recurse.com
Intel and AT&T assembly syntax - imada.sdu.dk
Guide to x86 - cs.virginia.edu