NVMe Performance on Zen TR4 Architecture

April 2, 2019    development nvme

The original objective was to observe and understand the performance characteristics of NVMe on early Linux kernels. We'll be investigating the new polling mechanism, as opposed to the interrupt model.


4 x NVMe Random Write 4K - Samsung Pro

1.4 million IOPS @ sub-millisecond latency

NVMe-4-way Random Write 4K

1 x NVMe Sequential Write 4K - Samsung Pro

1.2 GB/s with 4K record size

NVMe-1-way Sequential Write 4K


I couldn't even get that far. My priorities recently were fixing some KVM and LXC issues that would allow us to get off of an early Xen architecture on SPARC. Unfortunately, I'll have to come back to this in a Part 2 to explain the kernel changes required to stop NVMe devices from driving up sys time with MSI-X interrupts.

For block access in the milliseconds of service time, polling would be wasteful: the majority of polling events would find the I/O still outstanding (nothing yet in the completion queue). On the other hand, if service time is fairly low, say under 10us, polling avoids a context switch that can cost around 2us.

This is interesting for very fast Flash SSDs and emerging NVM devices where device access latency is near or lower than context switch cost.

Hardware Considerations

PCIe Bandwidth Calculation - How we calculate PCIe bandwidth

Samsung 970 Pro - Datasheet

Patch up

If you're going to do NVMe kernel hacking, I highly recommend starting from an inert changeset: the gentoo patchset.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b35aff09f70..575f1ef7b159 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1500,6 +1500,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
 extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
+extern u64 get_cpu_rq_switches(int cpu);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4a8e7207cafa..1a76f0e97c2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1143,6 +1143,12 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+u64 get_cpu_rq_switches(int cpu)
+{
+	return cpu_rq(cpu)->nr_switches;
+}
+EXPORT_SYMBOL_GPL(get_cpu_rq_switches);
+
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)

It seems the blk-mq layer will only poll from the CPU that originated the request. It's important to note that the IRQ isn't disabled; rather, when the interrupt handler does execute, it sees the completion-queue slot as already processed.

static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
{
	struct nvme_queue *nvmeq = hctx->driver_data;

	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
		spin_lock_irq(&nvmeq->q_lock);
		__nvme_process_cq(nvmeq, &tag);
		spin_unlock_irq(&nvmeq->q_lock);

		if (tag == -1)
			return 1;
	}

	return 0;
}
