Linux System Calls

How a read() becomes a kernel transition and back.

At a Glance

x86-64 SYSCALL ABI

Where each piece of the call lives. The syscall ABI overlaps the C calling convention but differs in two places: arg 4 moves from rcx to r10, and the CPU clobbers rcx and r11 on entry.

| Register | Role | C ABI role (for comparison) |
|---|---|---|
| rax | syscall number (from syscall_64.tbl); return value on exit | return value |
| rdi | arg 1 | arg 1 |
| rsi | arg 2 | arg 2 |
| rdx | arg 3 | arg 3 |
| r10 | arg 4 — not rcx | (rcx is arg 4 in C) |
| r8 | arg 5 | arg 5 |
| r9 | arg 6 | arg 6 |
| rcx | clobbered — holds saved user RIP | arg 4 |
| r11 | clobbered — holds saved user RFLAGS | caller-saved scratch |
| r12–r15, rbx, rbp, rsp | preserved across the syscall | callee-saved |

Minimal call from hand-written assembly

; write(1, "hi\n", 3)
mov rax, 1          ; __NR_write
mov rdi, 1          ; fd = stdout
lea rsi, [rel msg]  ; buf
mov rdx, 3          ; count
syscall             ; rax = return value or -errno
; rcx and r11 now clobbered

msg: db "hi", 0x0a  ; the 3 bytes referenced above

End-to-End Sequence

What happens between the user's SYSCALL and the instruction right after it, for a typical read(fd, buf, n).

sequenceDiagram
    participant U as User code
    participant C as CPU
    participant K as Kernel (entry_SYSCALL_64)
    participant H as sys_read handler
    participant F as VFS / fs / driver
    U->>C: SYSCALL
    C->>C: load RIP from MSR_LSTAR (kernel entry)
    C->>C: save user RIP to RCX, RFLAGS to R11
    C->>C: mask RFLAGS via MSR_FMASK
    C->>C: set CS/SS from MSR_STAR, CPL=0
    C->>K: jump to entry_SYSCALL_64
    K->>K: swapgs (per-CPU data)
    K->>K: switch to task's kernel stack
    K->>K: push user regs (pt_regs)
    K->>K: sys_call_table[rax] = sys_read
    K->>H: call sys_read(fd, buf, count)
    H->>F: vfs_read → file->f_op->read_iter → driver
    F-->>H: bytes read or -EIO / -EAGAIN / ...
    H-->>K: return in rax (positive = n, negative = -errno)
    K->>K: syscall_exit_work: signals? resched? notify?
    K->>K: restore user regs, swapgs
    K->>C: SYSRET (or IRET if work was pending)
    C->>U: resume at user RIP, one past SYSCALL

What SYSCALL Does (Hardware Level)

SYSCALL is a one-instruction privilege transition. It's fast because it bypasses segmentation descriptor loads; all the per-CPU state it needs lives in model-specific registers programmed once at boot.

| MSR | Set at boot to | SYSCALL reads it to… |
|---|---|---|
| MSR_LSTAR | entry_SYSCALL_64 | …load new RIP |
| MSR_STAR | kernel CS/SS + user CS/SS | …set CS/SS selectors and CPL=0 |
| MSR_FMASK | bits to mask out of RFLAGS | …disable IF/DF/TF/AC before entry |
| MSR_GS_BASE / MSR_KERNEL_GS_BASE | per-CPU area base | …swapgs flips these so %gs points to per-CPU data |

Effect in one bullet list:

  - RIP ← MSR_LSTAR; the old user RIP is saved in rcx.
  - RFLAGS is saved in r11, then masked with MSR_FMASK (IF cleared: entry runs with interrupts off).
  - CS/SS are loaded from MSR_STAR; CPL drops to 0.
  - No stack switch: the kernel entry code must load its own stack before pushing anything.

Kernel Entry Path

Simplified view of what entry_SYSCALL_64 does before handing off to a C-level handler. The real assembly lives in arch/x86/entry/entry_64.S.

entry_SYSCALL_64:
    swapgs                          ; %gs now points at per-CPU area
    mov    %rsp, PER_CPU_VAR(cpu_tss + TSS_sp2)  ; save user %rsp
    mov    PER_CPU_VAR(cpu_current_top_of_stack), %rsp  ; kernel stack
    pushq  $__USER_DS                                ; build pt_regs
    pushq  PER_CPU_VAR(cpu_tss + TSS_sp2)            ; user %rsp
    pushq  %r11                                      ; saved RFLAGS
    pushq  $__USER_CS                                ; user CS
    pushq  %rcx                                      ; saved RIP
    pushq  %rax                                      ; syscall nr
    pushq  %rdi ... %r15                             ; all GPRs
    mov    %rsp, %rdi                                ; pt_regs *
    call   do_syscall_64                             ; C land
    ; ... exit work (signals, resched) ...
    sysretq                                          ; or iretq

do_syscall_64(regs, nr) dispatches via the sys_call_table:

// kernel/entry/common.c, simplified
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    if (likely((unsigned int)nr < NR_syscalls)) {  // unsigned: rejects negative nr too
        nr = array_index_nospec(nr, NR_syscalls);  // Spectre-v1 gadget guard
        regs->ax = sys_call_table[nr](regs);       // e.g. __x64_sys_read
    } else {
        regs->ax = -ENOSYS;
    }
    syscall_exit_to_user_mode(regs);
}

Why r10, Not rcx

A subtle trap when calling syscalls directly.

The CPU uses rcx to save the user RIP, and r11 for RFLAGS. So the syscall ABI can't pass the 4th argument in rcx the way the C ABI does — it would be overwritten the instant SYSCALL executes. The Linux ABI shifts arg 4 to r10 instead. A glibc wrapper for a 4+ arg syscall therefore looks like:

mov    %rcx, %r10     ; slot the C-ABI arg-4 into the syscall-ABI arg-4 register
mov    $SYS_foo, %eax
syscall
; rcx is garbage (== RIP), r11 is garbage (== RFLAGS), all other callee-saved regs intact

The libc Wrapper

Why read() in your C program is a function call, not an instruction. Everything below is glibc/musl boilerplate; Go emits syscalls inline.

  1. Copy args into the syscall registers (with the rcx → r10 dance if needed).
  2. Execute SYSCALL.
  3. Check if the return is in [−4095, −1]. If so, negate and store in __errno_location(), return -1. Otherwise return the value as-is.
  4. If the thread is in a cancellation region (POSIX cancellation points), test pthread_cancel flags before and after the call.
  5. On some syscalls, fall back to a legacy number on old kernels (e.g. statx → stat).

For a syscall with no wrapper (or to bypass one), use the generic syscall(2):

#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

long ret = syscall(SYS_gettid);
if (ret == -1) perror("gettid");

The errno Convention

There is no separate error flag or errno register. The value of rax is the error flag: a return in [−4095, −1] means −errno, anything else is success, and the libc wrapper translates that band into errno plus a −1 return.

vDSO: Syscalls That Aren't Syscalls

For a few hot calls the kernel publishes user-callable code + data pages; the call happens entirely in userspace.

$ cat /proc/self/maps | grep vdso
7fff14b88000-7fff14b8a000 r-xp 00000000 00:00 0   [vdso]

Returning to Userspace

The fast path out of the kernel is SYSRET; any pending thread-flag work forces a slow IRET.

flowchart TB
    A["sys_call handler returns"] --> B["syscall_exit_to_user_mode()"]
    B --> C{"any TIF_* work flags set?"}
    C -- no --> D["SYSRETQ (fast)"]
    C -- yes --> E["exit_to_user_mode_loop()"]
    E --> F["TIF_SIGPENDING: run signal handler setup"]
    E --> G["TIF_NEED_RESCHED: schedule()"]
    E --> H["TIF_NOTIFY_RESUME: rseq, uprobes, etc."]
    F --> I["IRETQ (slow)"]
    G --> I
    H --> I
    D --> Z["back in user code"]
    I --> Z

Other Architectures

Same shape, different letters. The syscall number is also per-arch: __NR_read is 0 on x86-64, 63 on arm64, 3 on x86-32. Sticking to SYS_read from <sys/syscall.h> keeps code portable.

| Arch | Enter | Number in | Args in | Return in | Clobbers |
|---|---|---|---|---|---|
| x86-64 | SYSCALL | rax | rdi, rsi, rdx, r10, r8, r9 | rax | rcx, r11 |
| arm64 (AArch64) | SVC #0 | x8 | x0, x1, x2, x3, x4, x5 | x0 | — (args other than x0 preserved on return) |
| x86-32 (i386) | int 0x80 or sysenter | eax | ebx, ecx, edx, esi, edi, ebp | eax | — (int 0x80 preserves all GPRs) |
| RISC-V | ecall | a7 | a0, a1, a2, a3, a4, a5 | a0 | — |
| ppc64le | sc | r0 | r3, r4, r5, r6, r7, r8 | r3 | (CR0.SO = error flag) |

Observing and Filtering

Every syscall is a tracepoint. Between ptrace, tracepoints, and BPF, nothing slips past.

| Tool | Mechanism | Typical use |
|---|---|---|
| strace | ptrace(PTRACE_SYSCALL) — stop the target on syscall entry + exit, decode registers | Debugging: which syscall failed with what errno, with timing |
| perf trace | raw_syscalls:sys_enter / sys_exit tracepoints via perf ring buffer | Lower-overhead replacement for strace; aggregation across a whole host |
| bpftrace / bcc | eBPF attached to the same tracepoints, with in-kernel filtering | Production: count, histogram, or selectively capture with negligible overhead |
| seccomp-bpf | A BPF program attached to the task; runs on every syscall and returns allow / errno / kill / trap / user_notify | Sandboxing: Docker/containerd, systemd units, browsers, CRIU |
| audit | Kernel auditing subsystem, rules configured via auditctl | Compliance logging; more heavyweight than tracepoints |

Performance

A syscall is cheap but not free; in tight loops, avoiding or batching them matters.

References