Linux Process States
R, S, D, Z — what the letters in ps and top actually mean.
At a Glance
- Two fields, one story: a task's state lives in task_struct->__state (runtime) and task_struct->exit_state (post-do_exit). Tools like ps collapse both into a single letter.
- The letters: R running/runnable, S interruptible sleep, D uninterruptible sleep, I idle, T stopped, t traced, Z zombie, X dead. Plus flag suffixes: < high-prio, N low-prio, s session leader, l multi-threaded, + foreground process group.
- R is "runnable," not "on CPU": it means "on a runqueue." A task shown as R may not be executing right now; it's just eligible. ps reports the scheduler's view, not the CPU's.
- S vs D: both are blocked. S wakes on a signal or event; D ignores signals entirely and waits on a specific completion (usually a storage driver or NFS server). D is why kill -9 sometimes "doesn't work."
- K (TASK_KILLABLE): the sane cousin of D. Ordinary signals still don't wake it, but a fatal signal such as SIGKILL does. Used by NFS and FUSE so stuck mounts don't become unkillable. ps still prints D.
- I (TASK_IDLE): added in 4.2. Behaves like D for signals but doesn't count toward load average. Kernel threads waiting idly for work (NFS nfsd, XFS workers) use it to stop spuriously inflating the load average that uptime reports.
- Load average counts R + D: this is why a box with nothing on-CPU but a stuck NFS client can report a load of 50. The number is "runnable + uninterruptibly sleeping," not "CPU busy."
- Z is not a leak: a zombie holds only its task_struct and exit code. It's waiting for the parent to wait(). The file descriptors, memory, and threads are already gone. The problem is PID-space pressure, not RAM.
- Inspect via /proc: /proc/PID/status gives the letter and name; /proc/PID/stat is the one-line machine-readable form; /proc/PID/wchan names the kernel function the task is sleeping in; /proc/PID/stack dumps the in-kernel stack.
The Model
What a Linux kernel actually stores, and why tooling sometimes disagrees.
struct task_struct {
...
unsigned int __state; /* TASK_RUNNING, TASK_INTERRUPTIBLE, ... */
int exit_state; /* EXIT_ZOMBIE, EXIT_DEAD */
...
};
/* include/linux/sched.h */
#define TASK_RUNNING 0x00000000
#define TASK_INTERRUPTIBLE 0x00000001
#define TASK_UNINTERRUPTIBLE 0x00000002
#define __TASK_STOPPED 0x00000004
#define __TASK_TRACED 0x00000008
#define TASK_DEAD 0x00000080
#define TASK_WAKEKILL 0x00000100 /* combined with D to form "killable" */
#define TASK_NOLOAD 0x00000400 /* combined with D to form TASK_IDLE */
#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_IDLE (TASK_NOLOAD | TASK_UNINTERRUPTIBLE)
#define EXIT_ZOMBIE 0x00000020
#define EXIT_DEAD 0x00000010

- Runtime vs exit: while the task is alive, __state is authoritative. Once do_exit() runs, the outcome is recorded in exit_state (EXIT_ZOMBIE, then EXIT_DEAD) and the task won't be scheduled again.
- Composable flags: KILLABLE and IDLE are not separate states; they're UNINTERRUPTIBLE plus a flag that changes wake-up and accounting behaviour. That's why ps still prints D for TASK_KILLABLE, and why TASK_IDLE only later gained its own letter, I.
- ps codes are not kernel names: ps translates kernel constants into single letters (and sometimes flattens distinctions). For the raw truth, read the State: line of /proc/PID/status.
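To make the composable-flags point concrete, here is a minimal Python sketch that maps a (__state, exit_state) pair to the letter a tool would display. It is a simplified model written for illustration, not the kernel's or procps's actual reporting code: the helper name state_letter is hypothetical, and the constants are copied from the listing above.

```python
# Simplified model of how state bits collapse into one letter.
# Constants are copied from the include/linux/sched.h listing above;
# state_letter() is a hypothetical helper, not a kernel or procps function.
TASK_RUNNING = 0x000
TASK_INTERRUPTIBLE = 0x001
TASK_UNINTERRUPTIBLE = 0x002
TASK_STOPPED = 0x004   # __TASK_STOPPED
TASK_TRACED = 0x008    # __TASK_TRACED
EXIT_DEAD = 0x010
EXIT_ZOMBIE = 0x020
TASK_WAKEKILL = 0x100
TASK_NOLOAD = 0x400
TASK_KILLABLE = TASK_WAKEKILL | TASK_UNINTERRUPTIBLE
TASK_IDLE = TASK_NOLOAD | TASK_UNINTERRUPTIBLE


def state_letter(state, exit_state=0):
    """Return the single letter for a (__state, exit_state) pair."""
    # exit_state takes over once the task has gone through do_exit().
    if exit_state & EXIT_ZOMBIE:
        return "Z"
    if exit_state & EXIT_DEAD:
        return "X"
    # TASK_IDLE is the one flag combination that has its own letter.
    if (state & TASK_IDLE) == TASK_IDLE:
        return "I"
    # Otherwise only the base bits matter; WAKEKILL is invisible here.
    if state & TASK_TRACED:
        return "t"
    if state & TASK_STOPPED:
        return "T"
    if state & TASK_UNINTERRUPTIBLE:
        return "D"
    if state & TASK_INTERRUPTIBLE:
        return "S"
    return "R"


if __name__ == "__main__":
    for name, bits in [("TASK_UNINTERRUPTIBLE", TASK_UNINTERRUPTIBLE),
                       ("TASK_KILLABLE", TASK_KILLABLE),
                       ("TASK_IDLE", TASK_IDLE)]:
        print(f"{name:22} {bits:#06x} -> {state_letter(bits)}")
```

In this model, plain TASK_UNINTERRUPTIBLE and TASK_KILLABLE both come out as D, which is the flattening described above.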
The States
Every letter you'll see, its kernel-side name, and what it means for scheduling, signals, and load.
| Letter | Kernel constant | Meaning | Wakes on | Load avg? |
|---|---|---|---|---|
| R | TASK_RUNNING | On a CPU or on a runqueue waiting for one. | already runnable | Yes |
| S | TASK_INTERRUPTIBLE | Sleeping; most idle processes live here (read on a socket, epoll_wait, futex). | event or any unblocked signal | No |
| D | TASK_UNINTERRUPTIBLE | Blocked on a specific completion, typically disk I/O or a driver. Signals are ignored. | only the awaited event | Yes |
| D (K) | TASK_KILLABLE | D plus: SIGKILL can wake it. Used by NFS, FUSE, and anything that used to trap users in D forever. | awaited event or SIGKILL | Yes |
| I | TASK_IDLE | D plus NOLOAD: excluded from load-average calculation. Used by kernel worker threads. | awaited event | No |
| T | TASK_STOPPED | Paused by SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU. Resumes on SIGCONT. | SIGCONT | No |
| t | TASK_TRACED | Stopped by a tracer (ptrace, gdb, strace) at a syscall, signal, or breakpoint. | tracer PTRACE_CONT | No |
| Z | EXIT_ZOMBIE | Exited; task_struct kept around so the parent can read exit status. | parent wait() | No |
| X | EXIT_DEAD / TASK_DEAD | Being reaped; the task_struct is on its way out. Rarely seen by tooling. | n/a | No |
State Diagram
The common transitions a userspace task goes through.
R — Running or Runnable
R means "the scheduler would pick this task if given the chance," not "this task is executing now."
- Runqueue membership: each CPU has a struct rq with red-black trees (for CFS/EEVDF) or FIFO lists (for SCHED_FIFO/SCHED_RR). A task in any of these is TASK_RUNNING.
- On-CPU is a subset: at most one task per CPU is actually on-CPU at any moment, so the number of R tasks can exceed the number of CPUs. ps and top's running count report every R task, whether it is executing or still waiting its turn.
- No timeout: CFS/EEVDF preempts by virtual runtime, not a fixed timeslice. A task can stay R indefinitely while being periodically descheduled; it never transitions to S unless it blocks.
S vs D — The Two Sleeps
Both are off-CPU and blocked. The difference is what it takes to wake them.
| | S: Interruptible | D: Uninterruptible |
|---|---|---|
| Kernel helper | wait_event_interruptible() | wait_event() / io_schedule() |
| Wakes on signal? | Yes; syscall returns -EINTR | No; signal stays pending |
| Typical callers | sockets, pipes, futex, epoll, sleep(), wait() | block-layer I/O, page fault on disk, NFS, direct disk read |
| Counts toward load | No | Yes |
| Can kill -9? | Yes | No; only SIGKILL + TASK_KILLABLE (see below) |
| Why it exists | Most things. Default for well-written drivers. | The task holds a kernel-allocated resource (buffer, lock, reference) that a signal handler cannot safely release mid-flight. |
TASK_KILLABLE — The Fix for Stuck D
Added in 2.6.25 specifically to escape "D forever on NFS."
The problem: an NFS server goes away while a client task is blocked on read(2). Inside the kernel, the task is in wait_event() with TASK_UNINTERRUPTIBLE. A SIGTERM (or even SIGKILL) cannot wake it because the wait is uninterruptible. The task stays stuck until the server answers, which may be never; meanwhile its PID stays allocated and the mount stays busy.
The fix: wait_event_killable(), which sleeps in TASK_UNINTERRUPTIBLE | TASK_WAKEKILL. Only a fatal signal wakes it: SIGKILL, or another signal whose unhandled default action terminates the process. Signals that would merely run a handler are still left pending. Callers preserve the "don't return -EINTR from a random syscall" guarantee while still letting the admin reap a truly stuck process.
ps reports D for both plain TASK_UNINTERRUPTIBLE and TASK_KILLABLE; you can't tell them apart from userspace without reading /proc/PID/stack and recognising the wait function.
TASK_IDLE — Why Your Load Average Doesn't Spike
A late addition (4.2) to stop kernel threads from inflating load.
Before 4.2, kernel worker threads like nfsd, loop*, and various XFS workers used TASK_UNINTERRUPTIBLE while waiting for work. That meant an idle file server with 16 NFS threads reported a load average of 16. "Load average" in Linux is runnable + uninterruptibly-sleeping, a heritage from when D was a rare, short-lived state.
TASK_IDLE = TASK_UNINTERRUPTIBLE | TASK_NOLOAD. The NOLOAD flag excludes the task from the load-average tick. Signal behaviour is unchanged: still uninterruptible, still not killable. New code should use wait_event_idle() / schedule_timeout_idle() for "I'm a kernel thread waiting patiently for work."
Z — Zombie
The task is dead. The PID is not yet freed.
- What's left: just task_struct and exit info (exit code, resource usage for rusage, signal that killed it). Memory, open FDs, signal handlers, and threads are all already freed in do_exit().
- Why it exists: so the parent can call wait() / waitpid() / waitid() and read the exit status. Without this, a fast-exiting child could disappear before the parent has a chance to look.
- How it's reaped: the parent calls one of the wait family. Alternatively, a parent that sets SA_NOCLDWAIT on SIGCHLD, or sets SIGCHLD to SIG_IGN, tells the kernel not to leave zombies for children that exit from then on. Either way the kernel moves the task to EXIT_DEAD and releases the task_struct.
- Orphan handling: if the parent dies first, the child is reparented (to the nearest PR_SET_CHILD_SUBREAPER ancestor, or PID 1). That new parent is responsible for reaping. This is why init/systemd quietly reaps an unending stream of zombies.
- When it's a bug: a long-running parent that forks children and never wait()s. Zombies accumulate, the PID space fills (/proc/sys/kernel/pid_max; the kernel default is 32768, though many 64-bit systems raise it to the 4194304 maximum), and eventually fork() starts failing with EAGAIN.
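The zombie lifecycle above can be observed from userspace. The following is a minimal Python sketch, assuming a Linux system with /proc mounted; read_state and zombie_demo are hypothetical helper names. It uses waitid() with WNOWAIT to block until the child has exited without reaping it, so the zombie can be inspected before the final waitpid().

```python
import os


def read_state(pid):
    """Return the state letter from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat", errors="replace") as f:
        data = f.read()
    # The command name is parenthesised and may contain spaces or ')',
    # so cut at the last ')' before taking the state field.
    return data.rpartition(")")[2].split()[0]


def zombie_demo():
    """Fork a child, let it exit, inspect it, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit at once with a recognisable status

    # Wait until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_reap = read_state(pid)

    # Reap: this is the step a buggy parent never performs.
    _, status = os.waitpid(pid, 0)
    return state_before_reap, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(zombie_demo())
```

Between the waitid() and the waitpid() the child should show up as Z; a parent that never reaches the second call is exactly the bug described in the last bullet.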
T and t — Stopped and Traced
Two related states with different causes.
| | T (TASK_STOPPED) | t (TASK_TRACED) |
|---|---|---|
| Cause | Job-control signal: SIGSTOP, SIGTSTP (^Z), SIGTTIN, SIGTTOU | Tracer attached (ptrace): stopped at a syscall boundary, signal, or breakpoint |
| Resume | SIGCONT | Tracer issues PTRACE_CONT, PTRACE_SYSCALL, etc. |
| Who can resume | anyone who can signal it | only the tracer |
| Typical tools | shell job control (fg, bg) | gdb, strace, ltrace |
| ptrace interaction | a T task can still be attached by a tracer | cannot be signalled through normal kill except SIGKILL |
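The T column can be reproduced with a few signals. This is a minimal Python sketch, assuming Linux with /proc mounted; read_state and stop_demo are hypothetical names. It stops a sleeping child with SIGSTOP, uses waitpid() with WUNTRACED and WCONTINUED to synchronise with each transition, and always kills and reaps the child at the end.

```python
import os
import signal
import time


def read_state(pid):
    """Return the state letter from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat", errors="replace") as f:
        data = f.read()
    return data.rpartition(")")[2].split()[0]


def stop_demo():
    """Stop and continue a child, sampling its state after each step."""
    pid = os.fork()
    if pid == 0:
        try:
            time.sleep(60)  # child: sleep until the parent kills it
        finally:
            os._exit(0)

    try:
        os.kill(pid, signal.SIGSTOP)
        os.waitpid(pid, os.WUNTRACED)    # returns once the child has stopped
        stopped = read_state(pid)

        os.kill(pid, signal.SIGCONT)
        os.waitpid(pid, os.WCONTINUED)   # returns once it has been continued
        resumed = read_state(pid)        # typically S or R, depending on timing
    finally:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)               # reap so no zombie is left behind
    return stopped, resumed


if __name__ == "__main__":
    print(stop_demo())
```

The t state is not covered here because it requires a ptrace attachment, which the Python standard library does not expose.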
Load Average — What Actually Counts
The number is not "CPU busy %." It's nr_running + nr_uninterruptible, averaged with three exponential decays (1, 5, 15 minutes).
- R contributes: every task on any CPU's runqueue at the sampling tick (every 5 seconds).
- D contributes: every task in TASK_UNINTERRUPTIBLE that isn't flagged NOLOAD.
- S does not contribute: ordinary sleeping tasks are invisible to load.
- Why D is in there: the metric predates TASK_IDLE and was originally meant to capture "work queued up." Heavy disk I/O is work even if no CPU is busy, so Linux counts it.
- Consequence: on a system with a failed SAN or a slow NFS server, load can be huge while CPU is 100% idle. Cross-reference with mpstat, iostat, or vmstat before concluding you need more CPU.
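The exponential decay behind the three numbers can be sketched in a few lines of Python. This is an illustrative floating-point model, not the kernel's fixed-point arithmetic; the function names are hypothetical, and the only inputs taken from the text are the 5-second sampling interval and the 1-, 5-, and 15-minute windows.

```python
import math

SAMPLE_SECONDS = 5  # sampling interval described above


def next_load(load, active, window_seconds):
    """One decay step: move the average toward the current R + D count."""
    decay = math.exp(-SAMPLE_SECONDS / window_seconds)
    return load * decay + active * (1.0 - decay)


def simulate(active, seconds, window_seconds=60, start=0.0):
    """Hold `active` tasks in R or D for `seconds` and return the average."""
    load = start
    for _ in range(seconds // SAMPLE_SECONDS):
        load = next_load(load, active, window_seconds)
    return load


if __name__ == "__main__":
    # 16 tasks stuck in D, zero CPU use: the 1-minute figure climbs toward 16.
    for minutes in (1, 5, 15):
        print(minutes, "min:", round(simulate(16, minutes * 60), 2))
```

With 16 tasks stuck in D, this model reaches roughly 63% of 16 after one minute and is close to 16 within a few minutes, even though no CPU time is being used.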
How to Inspect
Every answer ultimately comes from /proc/PID.
| Source | Gives you | Example |
|---|---|---|
| /proc/PID/status | Human-readable; State: S (sleeping) | grep State /proc/1234/status |
| /proc/PID/stat | One-line, tool-parseable; state letter is field 3 | awk '{print $3}' /proc/1234/stat |
| /proc/PID/wchan | Kernel function the task is sleeping in | cat /proc/1234/wchan → futex_wait_queue |
| /proc/PID/stack | Full in-kernel stack (needs CONFIG_STACKTRACE) | cat /proc/1234/stack |
| ps -eo pid,stat,wchan,cmd | State + flags + sleep point in one line | see all D tasks: ps -eo stat,pid,cmd \| awk '$1 ~ /^D/' |
| top / htop | Live view; column S is the state letter | press t in htop to toggle task tree |
| bpftrace / perf sched | Tracks transitions (sched_switch, sched_wakeup) with nanosecond timestamps | bpftrace -e 'tracepoint:sched:sched_switch { @[args->prev_state] = count(); }' |
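When reading /proc/PID/stat from a script, note that field 2 (the command name) is wrapped in parentheses and may itself contain spaces, so whitespace splitting like the awk example can pick the wrong field. A minimal Python sketch of a more robust parser follows; parse_state and tally_states are hypothetical names, and the tally covers processes listed under /proc, not individual threads.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Extract the state letter from one /proc/PID/stat line."""
    # Cut at the LAST ')' so spaces or parentheses inside the command
    # name cannot shift the field positions.
    after_comm = stat_line.rpartition(")")[2]
    return after_comm.split()[0]


def tally_states(proc_root="/proc"):
    """Count processes per state letter by scanning numeric /proc entries."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                counts[parse_state(f.read())] += 1
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited between listdir() and open()
    return counts


if __name__ == "__main__":
    for letter, n in sorted(tally_states().items()):
        print(letter, n)
```

Each file is read once and parsed in memory, so every count comes from a single snapshot of that process.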
ps State Suffix Flags
The state letter is often followed by one or more flag characters.
| Suffix | Meaning |
|---|---|
| < | High-priority (negative nice) |
| N | Low-priority (positive nice) |
| L | Has pages locked into memory (mlock) |
| s | Session leader |
| l | Multi-threaded (uses CLONE_THREAD) |
| + | In the foreground process group of its tty |
So Ssl+ = interruptible sleep, session leader, multi-threaded, foreground. A very typical shell-launched server.
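Decoding a STAT string mechanically is a table lookup. The sketch below is a small Python illustration using only the letters and suffixes listed in this article; decode_stat is a hypothetical helper, and any character it does not know is reported rather than guessed.

```python
STATE_LETTERS = {
    "R": "running/runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "I": "idle",
    "T": "stopped",
    "t": "traced",
    "Z": "zombie",
    "X": "dead",
}

FLAG_SUFFIXES = {
    "<": "high-priority",
    "N": "low-priority",
    "L": "pages locked in memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def decode_stat(stat):
    """Turn a ps STAT string such as 'Ssl+' into a list of descriptions."""
    if not stat or stat[0] not in STATE_LETTERS:
        raise ValueError(f"unrecognised state in {stat!r}")
    # The first character is the state; everything after it is a flag.
    parts = [STATE_LETTERS[stat[0]]]
    for ch in stat[1:]:
        parts.append(FLAG_SUFFIXES.get(ch, f"unknown flag {ch!r}"))
    return parts


if __name__ == "__main__":
    print(", ".join(decode_stat("Ssl+")))
```

The first character is always looked up in the state table and the rest in the flag table, which is why s (session leader) and S (sleep) never collide.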
Gotchas
- "I sent SIGKILL and it's still there." The task is in plain D (not killable). Check /proc/PID/wchan: if it's a network FS function, the server is gone. Your options are waiting for the server to come back or rebooting.
- "Load is 80 but CPU is idle." Count your D tasks (ps -eo stat | grep -c '^D'). If that roughly matches load, it's disk/NFS, not CPU.
- "Zombies piling up." The parent isn't reaping. Options: fix the parent to call wait(); set SIGCHLD to SIG_IGN; use SA_NOCLDWAIT; or make a subreaper via prctl(PR_SET_CHILD_SUBREAPER) so something else reaps.
- "ps shows R but top shows 0% CPU." R is "runnable." If it's not getting scheduled, it's waiting behind other runnable tasks or pinned off a CPU that's saturated. Look at /proc/PID/schedstat for run-delay.
- "Traced process won't respond to kill." A t task can only be resumed by its tracer. If the tracer crashed without detaching, the tracee is stuck. kill -9 still works; ordinary signals are held until detach.
- "Process state changes mid-read of /proc." /proc/PID files are snapshots taken at read time, not atomic. Between two reads you can see R→S→R. For accurate cross-field views, read a single file once and parse it.
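For the "R but 0% CPU" case, the run-delay figure in /proc/PID/schedstat can be turned into a ratio. The Python sketch below assumes the file holds three whitespace-separated counters, in order: nanoseconds spent on a CPU, nanoseconds spent waiting on a runqueue, and the number of timeslices run. It also assumes the kernel was built with scheduler statistics. The function names are hypothetical.

```python
def parse_schedstat(text):
    """Split the contents of /proc/PID/schedstat into named counters."""
    on_cpu_ns, run_delay_ns, timeslices = (int(x) for x in text.split()[:3])
    return {
        "on_cpu_ns": on_cpu_ns,
        "run_delay_ns": run_delay_ns,
        "timeslices": timeslices,
    }


def run_delay_share(before, after):
    """Share of runnable time spent waiting for a CPU between two samples."""
    a, b = parse_schedstat(before), parse_schedstat(after)
    ran = b["on_cpu_ns"] - a["on_cpu_ns"]
    waited = b["run_delay_ns"] - a["run_delay_ns"]
    total = ran + waited
    # No runnable time at all in the interval: the task was blocked, not starved.
    return waited / total if total else 0.0


if __name__ == "__main__":
    import time
    with open("/proc/self/schedstat") as f:
        first = f.read()
    time.sleep(1)
    with open("/proc/self/schedstat") as f:
        second = f.read()
    print(f"run-delay share: {run_delay_share(first, second):.1%}")
```

A share close to 1 means the task was runnable but mostly waiting behind other tasks, which matches the symptom of R in ps with almost no CPU in top.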
References
- ps(1): see the PROCESS STATE CODES section for the full letter inventory.
- proc(5): fields of /proc/PID/stat, /proc/PID/status, /proc/loadavg.
- include/linux/sched.h: authoritative state definitions.
- LWN: TASK_KILLABLE, the original article introducing killable uninterruptible sleep.
- LWN: TASK_IDLE, Peter Zijlstra's patch stopping kernel threads from inflating load.
- Brendan Gregg: Linux Load Averages, archaeology of why D is in the load metric.
- wait(2): how zombies get reaped.