Kubelet Streaming Server Port Closed Unexpectedly

Posted by Hao Liang's Blog on Saturday, July 13, 2024

1. Description

Kernel version: 5.4.241

kubelet version: 1.22.5

nvidia driver version: 535.161.08 and 535.154.05

After the kubelet process starts on the node, it listens on a random port (46127 in this case) within the ip_local_port_range:

ss -lntpe |grep kubelet
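For readers unfamiliar with how such a "random" port is chosen, here is a minimal sketch. It assumes the listener binds to port 0, which is what makes the kernel pick a port from ip_local_port_range. The function names are hypothetical and are not taken from kubelet.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP listener to 127.0.0.1 port 0 so the kernel picks the port.
 * Stores the listening fd in *fd_out and returns the chosen port, or -1. */
int listen_on_ephemeral_port(int *fd_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 = let the kernel choose */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *fd_out = fd;
    return ntohs(addr.sin_port);
}

/* Read the ephemeral port range from procfs. Returns 0 on success. */
int read_local_port_range(int *low, int *high)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "%d %d", low, high);
    fclose(f);
    return n == 2 ? 0 : -1;
}
```

The port returned by listen_on_ephemeral_port() should fall between the two values returned by read_local_port_range(), which is why the kubelet port differs from node to node and from restart to restart.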

code snippets:

After running for a while, the listening port suddenly disappeared.

The corresponding fd (fd=13) was also closed, although the kubelet process was still running.

2. Analysis

From the corresponding kubelet code snippet, we can see that the streaming server is started in a separate goroutine.

We used pprof to inspect the goroutine call stacks. The streaming server goroutine still existed, which means that the server had not exited and, as far as kubelet was concerned, was still listening on the port.

Under normal circumstances, in the Linux kernel, a socket fd belonging to process A can only be closed by process A itself calling close(). Process B cannot close process A's fd short of killing process A.
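The per-process nature of fd tables can be illustrated with a small sketch. It assumes a plain fork() (no shared fd table) and uses a hypothetical function name: the child closes its inherited copy of a socket fd, and the parent then checks that its own fd is still valid.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's socket fd is still open after a forked child
 * closed its own copy, 0 if it is not, and -1 on setup errors. */
int parent_fd_survives_child_close(void)
{
    int status;
    int still_open;
    pid_t pid;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: closing here only affects the child's fd table. */
        close(fd);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }

    /* F_GETFD fails with EBADF if the fd is no longer open in this process. */
    still_open = (fcntl(fd, F_GETFD) != -1);
    close(fd);
    return still_open;
}
```

This is the behavior one would normally expect, and it is exactly why a listening fd vanishing from a live kubelet process is suspicious.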

In reality, however, the kubelet process never closed the fd itself, yet the fd was closed unexpectedly. So we began to suspect a problem in the kernel itself.

To locate a kernel problem, the first approach that comes to mind is to load a custom kernel module (.ko) or to hook specific kernel functions with eBPF, and print the call stack for analysis. We chose eBPF, which is simpler to implement, with filp_close as the kprobe attach point. Whenever the fd corresponding to the kubelet process's listening port is closed, its call stack is printed.

SEC("kprobe/filp_close")
int trace_filp_close(struct pt_regs *ctx)
{
    struct task_struct *cur = (struct task_struct *)bpf_get_current_task();
    u32 tgid = BPF_CORE_READ(cur, tgid);
    struct event event = {};
    u32 zero_key = 0;
    struct bpf_arg *bpf_arg;
    struct file *file;
    struct inode *f_inode;
    unsigned long i_ino;
    umode_t mode;

    /* Only trace the configured process (tgid == 0 means any process). */
    bpf_arg = bpf_map_lookup_elem(&bpf_arg_map, &zero_key);
    if (!bpf_arg || (bpf_arg->tgid != 0 && bpf_arg->tgid != tgid))
        return 0;

    /* The first argument of filp_close() is the struct file being closed. */
    file = (struct file *)PT_REGS_PARM1(ctx);
    if (!file)
        return 0;
    f_inode = BPF_CORE_READ(file, f_inode);
    if (!f_inode)
        return 0;

    /* Only trace the configured inode (ino == 0 means any inode). */
    i_ino = BPF_CORE_READ(f_inode, i_ino);
    if (bpf_arg->ino != 0 && i_ino != bpf_arg->ino)
        return 0;

    /* Ignore anything that is not a socket. */
    mode = BPF_CORE_READ(f_inode, i_mode);
    if (!S_ISSOCK(mode))
        return 0;

    /* Record who closed the file, plus user and kernel stack ids. */
    BPF_CORE_READ_STR_INTO(&event.task, cur, comm);
    event.pid = bpf_get_current_pid_tgid();
    event.tgid = tgid;
    event.ino = i_ino;
    event.u_stack_id = bpf_get_stackid(ctx, &user_stackmap, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
    event.k_stack_id = bpf_get_stackid(ctx, &kernel_stackmap, KERN_STACKID_FLAGS | BPF_F_REUSE_STACKID);

    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));

    return 0;
}

When the problem recurred, the call stack was captured:

3. Interim Conclusion

As the call stack shows, the nvidia driver’s os_nv_cap_close_fd call is triggered when kubelet’s cadvisor periodically accesses the /proc/$pid/fd directory of a container process. This call ends up invoking the kernel’s filp_close, which closes the fd.

Background:

kubelet’s cadvisor collects two metrics, container_file_descriptors and container_sockets, and doing so requires a getdents64 call on the /proc/$pid/fd directory of the container process.

To put it simply, when kubelet’s cadvisor collects container metrics, the getdents64 system call triggers the os_nv_cap_close_fd function of the nvidia.ko kernel module, which ultimately closes the fd.

Since the nvidia driver is not fully open source, the problem could only be reported to NVIDIA. Their feedback so far: other users reported a similar phenomenon a few months ago, but NVIDIA has not yet found a reliable way to reproduce it or a fix. For the time being, as a workaround, you can comment out the collection logic for the container_file_descriptors and container_sockets metrics in kubelet’s cadvisor.