An OS process and DB session which won’t die: a systematic troubleshooting approach

This week, I lent a hand to a friend who had a session stuck in the database. By the time he got in touch, he’d already terminated the session in both the database and the Operating System. He mentioned that neither the database session nor the OS process had actually died.

The database session was marked as “KILLED,” but in the OS, he couldn’t quite pinpoint it, though the process was still running.

He explained that this process was tied to an RMAN backup that was in the middle of running when the NFS server disconnected due to a network issue.

While others thought the process was stuck waiting for an I/O state (which was likely the case initially since the NFS got disconnected) suggesting a server reboot, I decided to take a systematic approach to find the real problem at that moment.

The issue was reported in a WhatsApp group at 16:14, and I jumped in at 16:31. By 16:44, we had identified the root cause. The troubleshooting was happening asynchronously, with me guiding the person and responding quickly. So, 13 minutes is actually a pretty good pace for that. No need for a reboot or anything else drastic, just pure troubleshooting to find the root cause.

Here is the process we followed:

Check the OS process state:

#> ps -o pid,ppid,stat,wchan,etime,cmd -p 26974
PID      PPID STAT WCHAN      ELAPSED CMD
26974       1 ts   ptrace  6-07:02:56 oracleorcl (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

Check the OS process stack:

#> cat /proc/26974/stack
[<ffffffffad0abb71>] ptrace_stop+0x171/0x295
[<ffffffffad0abd38>] ptrace_do_notify+0x98/0xbc
[<ffffffffad0ad53b>] ptrace_notify+0x5b/0x8b
[<ffffffffad0a06d6>] do_exit+0x9e6/0xc22
[<ffffffffad0a09df>] do_group_exit+0x3f/0xa1
[<ffffffffad0ad714>] get_signal+0x1a4/0x868
[<ffffffffad02fd17>] do_signal+0x37/0x68d
[<ffffffffad0034ff>] exit_to_usermode_loop+0xa0/0x116
[<ffffffffad003dc2>] do_syscall_64+0x192/0x1b9
[<ffffffffada001b1>] entry_SYSCALL_64_after_hwframe+0x16d/0x0
[<ffffffffffffffff>] 0xffffffffffffffff

The first line is the step the process is at:

[<ffffffffad0abb71>] ptrace_stop+0x171/0x295

Check the OS process status:

#> cat /proc/26974/status
Name:   oracle_26974_orcl
Umask:  0022
State:  t (tracing stop)
Tgid:   26974
Ngid:   0
Pid:    26974
PPid:   1
TracerPid:      25569
Uid:    1001    1001    1001    1001
Gid:    1001    1004    1004    1004
FDSize: 64
Groups: 1001 1002 1003 1006
NStgid: 26974
NSpid:  26974
NSpgid: 26974
NSsid:  26974
VmPeak: 268706624 kB
VmSize: 268706624 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     31680 kB
VmRSS:     31680 kB
RssAnon:            5944 kB
RssFile:           25736 kB
RssShmem:              0 kB
VmData:    16052 kB
VmStk:       232 kB
VmExe:    386612 kB
VmLib:     28164 kB
VmPTE:       624 kB
VmPMD:       308 kB
VmSwap:        0 kB
HugetlbPages:      10240 kB
Threads:        1
SigQ:   3/2056876
SigPnd: 0000000000080800
ShdPnd: 0000000000000100
SigBlk: 0000000000000004
SigIgn: 0000000016000207
SigCgt: 00000003c9c87cf8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:      0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        6065
nonvoluntary_ctxt_switches:     3

We could easily identify the “TracerPid: 25569”:

UID        PID  PPID  C STIME TTY          TIME CMD
root     25569 13775  0 16:05 pts/3    00:00:00 strace -p 26974

Well, we can see it is an strace process (PID 25569) tracing that PID 26974. That strace process had the Parent PID 13775 which was the DBA itself who initiated the strace to try to figure out why RMAN was stuck.

In the end we concluded that the RMAN process was hung due to the NFS disconnection, but it was the strace initiated by the DBA which was keeping the process from dying in the OS. Once strace PID was killed everything else was released.

Hope you liked it!

Franky