An introduction to timekeeping in Linux VMs

Let's take a second to get up to speed on timekeeping in Linux VMs.

Image by: Matteo Ianeselli. Modified by Opensource.com. CC-BY-3.0.

Keeping time in Linux is not simple, and virtualization adds further challenges as well as opportunities. In this article, I'll review the timekeeping techniques used with KVM, Xen, and Hyper-V and the corresponding parts of the Linux kernel.

Timekeeping is the process or activity of recording how long something takes. We need "instruments" to measure time. The Linux kernel has several abstractions to represent such devices:

A clocksource is a device that can give a timestamp whenever you need it. In other words, a clocksource is any ticking counter whose value you can read.
A clockevent device is an alarm clock: you ask the device to signal a time in the future (e.g., "wake me up in 1ms") and when the alarm is triggered, you get the signal.
The sched_clock() function is similar to a clocksource, but it must be "cheap" to read (meaning that one can get its value fast), as sched_clock() is used for task-scheduling purposes and scheduling happens often. We're ready to sacrifice accuracy and other characteristics for speed.

Imagine you're writing an application and you need to get current time-of-day to do timestamping, for example. You may come up with something like this:

#include <stdio.h>
#include <time.h>

int timestamp_function(void)
{
	struct timespec tp;
	int res = clock_gettime(CLOCK_REALTIME, &tp);

	if (res != 0)
		return res;	/* clock_gettime() failed; errno says why */

	/* … do something with the timestamp in tp … */
	return 0;
}

What is CLOCK_REALTIME and what other clocks do we have? man 2 clock_gettime gives the answer:

CLOCK_REALTIME clock gives the time passed since 00:00:00 UTC on January 1, 1970 (the Unix epoch). This clock is affected by NTP adjustments and can jump forward and backward when a system administrator adjusts system time.
CLOCK_MONOTONIC clock gives the time since a fixed starting point, usually the moment you booted the system. This clock is affected by NTP's gradual rate adjustments, but it can't jump backward.
CLOCK_MONOTONIC_RAW clock gives the same time as CLOCK_MONOTONIC, but this clock is not affected by NTP adjustments.
CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE are faster but less-accurate variants of CLOCK_REALTIME and CLOCK_MONOTONIC.

There are a couple of other clocks that relate to the running process or thread, but let's skip them for now.

Some applications do timestamping frequently, thousands of times per second, and for these applications we must be sure clock_gettime() is fast. How does it work in Linux? The algorithm is:

Call clock_gettime() from vDSO (virtual dynamic shared object). vDSO is a shared library provided to every application by the kernel, and it contains code that can run from userspace without switching to the kernel.
For CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE, give the immediate answer by reading the appropriate timekeeper struct, which the kernel provides to userspace for reading only.
For CLOCK_REALTIME and CLOCK_MONOTONIC, check whether the current clocksource in use can be read from userspace and, if it can, use its reading to extrapolate the appropriate timekeeper value.
If the clocksource in use can't be read from userspace, switch to the kernel by doing a system call and let the kernel read the current clocksource and extrapolate the appropriate timekeeper value.

The vDSO optimization, where the clocksource is read directly from userspace, is important. Here are my testing results for a test program that does 100 million clock_gettime() reads. These tests were performed on a KVM guest with and without the vDSO optimization enabled:

kvmclock without vDSO:
# time ./clock_gettime_many 

real	0m15.606s
user	0m2.684s
sys	0m12.916s

kvmclock with vDSO:
# time ./clock_gettime_many 

real	0m2.365s
user	0m2.362s
sys	0m0.001s

The pure userspace method is more than six times faster than doing a syscall (2.4s versus 15.6s). We definitely want this. But what makes a clocksource suitable for such fast reads?

Clocksources

Generally we require the clocksource to:

Never go backward.
Never stop.
Avoid "jumps".
Have good resolution (frequency).
Be fast to read.
Be available for userspace code.

PC hardware has a number of legacy timekeeping devices, but these lack the above-mentioned characteristics. Namely:

PIT: Suitable for counting jiffies (system timer interrupts) only; low resolution.
CMOS RTC: Low resolution (1s) time-of-day clock, optional 32768Hz timer; can't be read from userspace.
ACPI (PM) timer: Frequent overflows; slow to read; can't be accessed from userspace.
HPET: Not always present; not necessarily fast to read.
LAPIC timer: Unknown frequency; can't be read from userspace.

All modern x86 hypervisors virtualize this hardware, but virtualization costs for all accesses are too high to use any of these devices as a reliable clocksource in Linux.

TSC

On bare x86 hardware the most commonly used clocksource today is the TSC (Time Stamp Counter). The TSC is a special auto-incremented CPU register that has a number of advantages over the legacy hardware mentioned previously. It has high precision and it can be read with one assembly instruction (rdtsc), even from userspace. The TSC, however, has its own issues, including:

Its frequency is unknown, and it needs to be measured with PIT, CMOS, or ACPI timer.
The register is writable and the reading can differ on different CPUs.
TSC can stop in some low-power C states of the processor. This doesn't usually happen on modern hardware.
TSCs have been observed to get out of sync across CPUs on some big NUMA systems in the past. Luckily, the number of such systems was limited.
SMI handlers may reset the counter.

Virtualization brings additional challenges. When a virtual machine migrates to another host, the TSC value on the new host differs, so the guest sees a "jump" in the value. Moreover, the frequency we measured at boot is no longer the actual TSC frequency. Two similar methods were introduced to cope with these issues: pvclock (paravirtualized clock) for Xen and KVM guests, and TSC page for Hyper-V guests. Both present a fixed-frequency clock to the guest, so in addition to reading the TSC value, we need to do some math to get the reading.

pvclock

The Xen and KVM hypervisors came up with the so-called pvclock protocol to enhance the TSC and make it suitable for virtualized guests. The protocol is based on a simple per-CPU structure that is shared between the host and the guest:

struct pvclock_vcpu_time_info {
	u32   version;
	u32   pad0;
	u64   tsc_timestamp;
	u64   system_time;
	u32   tsc_to_system_mul;
	s8    tsc_shift;
	u8    flags;
	u8    pad[2];
};

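To get the current time, guests must do the following math (the shift is to the left when tsc_shift is positive and to the right by -tsc_shift when it is negative; tsc_to_system_mul is a 32-bit fixed-point fraction, hence the final shift by 32):

PerCPUTime = ((((RDTSC() - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32) + system_time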

The flags field indicates whether we can trust the reading to keep the monotonicity promise even when we do subsequent calls on different CPUs, and this determines our ability to use the clocksource from vDSO. In case monotonicity is not guaranteed, Linux needs to keep track of the last reading to make sure no application will see time going backward even when migrated from one CPU to another. Luckily, this doesn't happen often on modern hardware and our readings are fast.

Hyper-V TSC page

Microsoft reinvented the pvclock protocol with its own TSC page protocol, which is similar to pvclock but with one significant difference: the TSC page is a single structure per virtual machine, not per CPU, so it can't compensate for the case when the TSC gets out of sync across CPUs. We don't know for sure, but the guess is that in this case the hypervisor will either try to synchronize the TSCs or disable the TSC page mechanism altogether.

The protocol for the TSC page is:

struct ms_hyperv_tsc_page {
        volatile u32 tsc_sequence;
        u32 reserved1;
        volatile u64 tsc_scale;
        volatile s64 tsc_offset;
        u64 reserved2[509];
};

To get the current time (Hyper-V counts it in 100ns units), guests must do the following math:

PerVMTime = ((VirtualTsc * tsc_scale) >> 64) + tsc_offset

Special value 0 in the tsc_sequence field indicates that the method is disabled and we should fall back to reading the value from another virtualized MSR (model-specific register) that Hyper-V provides. This is impossible from userspace code and generally is much slower.

Hardware extensions for virtualizing TSC

Since the early days of hardware-assisted virtualization, Intel has offered TSC offsetting for virtual guests in hardware: a guest's rdtsc returns the host's TSC value plus an offset. Unfortunately, this wasn't enough to support migration between different hosts because the TSC frequency may differ, so the pvclock and TSC page protocols were introduced. In late 2015, Intel introduced the TSC scaling feature (which had already been present in AMD processors for several years) and, in theory, this is a game changer that makes the pvclock and TSC page protocols redundant. However, an immediate switch to plain TSC as a clocksource for virtualized guests seems impractical: one must be sure that all potential migration recipient hosts support the feature, and it is not yet widely available. Extensive testing must also be performed to make sure there are no drawbacks to switching from paravirtualized protocols.

Host-wide time synchronization

So we're running a virtualized guest on KVM and using kvmclock (which implements the pvclock protocol), or we're running a Hyper-V guest and have the Hyper-V TSC page as a clocksource. Is our time-of-day in sync between the guest and the host (or between different guests on the same host)? We're reading the same TSC value, so the resulting time should be the same, right? Well, not exactly. Both the host's and the guest's CLOCK_REALTIME clocks are affected by NTP adjustments and may diverge over time.

To solve the problem, a solution was introduced in Linux 4.11: PTP devices for KVM and Hyper-V. These devices are not actually related to the PTP time synchronization protocol and don't work with network devices, but they present themselves as PTP (/dev/ptp*) devices, so they're consumable by the existing time synchronization software.

To enable time synchronization with the host, we must do the following:

For a KVM guest, we need to load the ptp_kvm module. To make it load after reboot on Fedora/RHEL 7, we can do something like:
```
# echo ptp_kvm > /etc/modules-load.d/ptp_kvm.conf
```
This is not required for Hyper-V guests, as the module implementing the device loads automatically.
Add /dev/ptp0 as a reference clock to the NTP daemon configuration. With chrony, that would be:
```
# echo "refclock PHC /dev/ptp0 poll 3 dpoll -2 offset 0" >> /etc/chrony.conf
```
Restart the NTP daemon:
```
# systemctl restart chronyd
```
Check time synchronization status:
```
# chronyc sources | grep PHC0
```

KVM guests are known to produce better results than Hyper-V guests because the mechanism behind the device is very different. Whereas a Hyper-V host sends its guests time samples every five seconds, KVM guests can make a direct hypercall to the hypervisor to get its time.

Testing results on KVM (idle host, single guest):

# for f in `seq 1 5`; do chronyc sources | grep PHC0 ; sleep 10s; done
#* PHC0            0   3   377     4    -24ns[ -166ns] +/-   37ns
#* PHC0            0   3   377     6    +13ns[  +49ns] +/-   32ns
#* PHC0            0   3   377     8    +49ns[ +182ns] +/-   28ns
#* PHC0            0   3   377    11    -43ns[ -113ns] +/-   24ns
#* PHC0            0   3   377     4    +60ns[ +152ns] +/-   18ns

Testing results on Hyper-V (idle host, single guest):

# for f in `seq 1 5`; do chronyc sources | grep PHC0 ; sleep 10s; done
#* PHC0            0   3   377     7   +287ns[+2659ns] +/-  131ns
#* PHC0            0   3   377     7   +119ns[-3852ns] +/-  130ns
#* PHC0            0   3   377     9  +1648ns[+2449ns] +/-  156ns
#* PHC0            0   3   377     5   +898ns[ +613ns] +/-  142ns
#* PHC0            0   3   377     7   +288ns[ -403ns] +/-   98ns

Although the Hyper-V PTP device is less accurate than KVM's, it is still very accurate compared to NTP. As you can see from the above, the guest's system time usually stays within 10us of the host's, which is a good result.

Vitaly Kuznetsov will be talking about timekeeping in Linux VMs at LinuxCon ContainerCon CloudOpen China in Beijing on June 20, 2017.