CPU serial numbers - Protected Processor Identification Number (PPIN)

Apparently Intel processors have had unique serial numbers embedded for a while now, and AMD is planning a similar effort.

We need to investigate whether this is accessible to guests by testing KVM and other hypervisors with the mcelog tool, which is documented as the "only means" to access this info.
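One quick check that could be run inside a guest, as a rough sketch: look for a PPIN feature flag in /proc/cpuinfo. This assumes the kernel in use reports the feature as intel_ppin or amd_ppin in the flags line, which older kernels may not do, so an empty result does not prove the PPIN is hidden. The function name ppin_flags is made up for this example.

```shell
#!/bin/sh
# Print any PPIN-related CPU feature flags found in a cpuinfo-style file.
# Assumption: the kernel lists PPIN support as "intel_ppin" or "amd_ppin"
# in the "flags" line; kernels that predate those flags print nothing.
ppin_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first flags line, split it into one token per line,
    # and keep only the PPIN tokens.
    grep -m1 '^flags' "$cpuinfo" \
        | tr ' \t' '\n\n' \
        | grep -E '^(intel|amd)_ppin$' \
        | sort -u
}
```

Running ppin_flags on the host and again inside the VM would at least show whether the hypervisor passes the flag through.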

It would help if anyone with an Intel machine volunteers to test.

I’ve asked the libvirt/qemu devs for comment:
https://www.redhat.com/archives/libvirt-users/2020-March/msg00062.html


EDIT by Patrick:

CPU serial numbers - Protected Processor Identification Number (PPIN) (this forum topic) is not to be confused with something related, CPUID.

CPUID:

2 Likes

This could use some instructions on how to install mcelog. It’s no longer available in Debian buster:

But:

Is rasdaemon the replacement?

How can you help testing?
I have several machines on Intel.

2 Likes

Indeed. Good catch.

Hi, thanks for volunteering. You would install the rasdaemon package and then run sudo ras-mc-ctl --summary. If I understand correctly, it should display the cpuid only if a hardware error is being generated, so I am not sure what happens with healthy machines. The EDAC drivers requirement is for checking the RAM, I think, and is irrelevant here.

1 Like

We could disable MCE entirely, which would probably hide this, but security-misc uses MCE to make the kernel panic on uncorrected errors for security.
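To see what a given machine currently boots with, here is a small sketch that prints any mce= token from the kernel command line. It assumes the setting is passed as a kernel parameter (mce=off is the documented switch for disabling machine checks); whether security-misc configures its panic behaviour through the same parameter is an assumption that should be checked against the package. The name mce_params is made up for this example.

```shell
#!/bin/sh
# Print every mce= token from a kernel command line file, one per line.
# Prints nothing if no mce= parameter is present.
mce_params() {
    cmdline="${1:-/proc/cmdline}"
    # Split the command line on spaces and keep only mce= tokens.
    tr ' ' '\n' < "$cmdline" | grep '^mce=' || true
}
```

Calling mce_params with no argument reads /proc/cmdline of the running system.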

mcelog is deprecated and requires CONFIG_X86_MCELOG_LEGACY, which is disabled by default. Is mcelog the only way to get it, or can rasdaemon get it too?
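For anyone who wants to confirm how their own kernel was built, here is a minimal sketch that looks the option up in the kernel config file. It assumes the config is installed at /boot/config-$(uname -r), as Debian does; other distros may keep it elsewhere. The name kconfig_state is made up for this example.

```shell
#!/bin/sh
# Report how a kernel config option is set in a given config file.
# Prints "enabled", "module", or "not set".
# Assumption: the config lives at /boot/config-<release> unless a path is given.
kconfig_state() {
    option="$1"
    config="${2:-/boot/config-$(uname -r)}"
    if grep -q "^${option}=y" "$config"; then
        echo enabled
    elif grep -q "^${option}=m" "$config"; then
        echo module
    else
        echo "not set"
    fi
}
```

Usage would be kconfig_state CONFIG_X86_MCELOG_LEGACY; if it reports "not set", mcelog cannot work on that kernel.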

1 Like

Kernel panic on uncorrected errors is more important than CPU serial numbers.

2 Likes

Rasdaemon works with the latest implementation of the hardware error infrastructure.

2 Likes

Here is a printout for rasdaemon on a healthy Intel ThinkPad:

sudo ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.
No MCE errors.

sudo ras-mc-ctl --mainboard
ras-mc-ctl : mainboard: LENOVO model XXXXXXXXXX
(this is just a generic model number that is common to all thinkpads in this category; example: all x220s are the same, all t580s are the same, all p53 are the same, and so on. It’s a model number basically)

sudo ras-mc-ctl --status
ras-mc-ctl: drivers not loaded

sudo ras-mc-ctl --print-labels
ras-mc-ctl: Error: no dimm labels for LENOVO model xxxxxxxxxx

sudo ras-mc-ctl --layout
ras-mc-ctl: Error: no memories found at via edac

sudo ras-mc-ctl --errors
(same output as --summary command)

The rasdaemon package includes two systemd services: rasdaemon.service and ras-mc-ctl.service. They are enabled to start on boot, but you can change that with sudo systemctl disable rasdaemon.service ras-mc-ctl.service and then start them manually whenever you want to test.

Interesting sidenote: sometimes a “machine check exception” is logged as a hardware error when your CPU goes above a certain distro-specific thermal trip point. This is quite harmless because the distro very often keeps the trip point far lower than what would actually hurt the chipset. According to documentation, many Intels can run rather hot, above 80C for bursts, and not sustain damage until 95C, but some distros have a trip point of 60C, determined by several factors: thermald, for example, and also tlp and related programs if installed. So if the machine passes 60C for any reason, an MCE is logged.

3 Likes

I won’t be able to access my few machines until the weekend.
Then I will send the results of the same commands you gave:
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --mainboard
sudo ras-mc-ctl --status
sudo ras-mc-ctl --print-labels
sudo ras-mc-ctl --layout
sudo ras-mc-ctl --errors

Should I do another one?

2 Likes

Thanks for testing. So one can induce a log entry with this data by running stress-ng in the background I guess?

2 Likes

You know, that’s a good question, and I would imagine the answer is a solid yes. The reason is that yesterday I got curious after playing around with it and purposely compiled a small program from source to raise the CPU temperature a bit. Sure enough, it fluctuated up and down, the fans came on, and the highest was 70C. When I checked journalctl, there was a machine check exception error printout. I ran rasdaemon again and it had caught it in the log. The rasdaemon output was the same as the journal, probably because both read the same MCE log somewhere in the board: “CPU0 package temperature above threshold…clock throttles (total events =2)…” Same for every core. Obviously not really an “error,” but a good workout for rasdaemon and MCE anyway. So I’m sure stress-ng or something similar would do the same.
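If someone wants to repeat this, here is a rough sketch for counting such events in the kernel log. It assumes the kernel message contains the phrase "temperature above threshold", as in the printout quoted above; the exact wording may differ between kernel versions. The name count_thermal_events is made up for this example.

```shell
#!/bin/sh
# Count log lines that report a CPU temperature threshold event.
# Reads log text on stdin so it can be fed from journalctl or a saved file.
# Assumption: the kernel words the event as "temperature above threshold".
count_thermal_events() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c -i 'temperature above threshold' || true
}
```

For example, sudo journalctl -k | count_thermal_events could be run before and after a stress-ng session to see whether the count went up.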

2 Likes

A couple more observations about ras-mc-ctl:
The script gets its information from /sys, specifically the /sys/class/dmi/id folder. In that folder there are entries for product_serial, board_serial, chassis_serial, and product_uuid, and a lot more too. Every entry appears to be owned by root. I suspect each manufacturer has its own specific values and that some are probably common to all machines. On the test machine, each of those files had a value in it. product_serial and chassis_serial held the same value: the serial number of the machine. board_serial was the motherboard serial number. I looked up what the product_uuid value represents using dmidecode. It turned out to be the ID of the entire machine, not of a single board or component.
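To compare machines, here is a minimal sketch that lists the four identifier files named above with their permission bits and whether the current user can read them. It assumes GNU stat is available; the name dmi_id_report is made up for this example.

```shell
#!/bin/sh
# List identifier files in a DMI sysfs directory with their octal permission
# bits and whether the current user can read them.
# Assumption: GNU coreutils stat (for the -c option) is installed.
dmi_id_report() {
    dir="${1:-/sys/class/dmi/id}"
    for name in product_serial board_serial chassis_serial product_uuid; do
        file="$dir/$name"
        # Skip entries this machine does not provide.
        [ -e "$file" ] || continue
        if [ -r "$file" ]; then access=readable; else access=denied; fi
        printf '%s %s %s\n' "$name" "$(stat -c '%a' "$file")" "$access"
    done
}
```

Running dmi_id_report as a normal user and again with sudo would show which of these values are restricted to root.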

2 Likes

That’s only for motherboard information. I don’t know where it gets MCE errors from. /sys/devices/system/edac/mc is used for something in the script, but I don’t know Perl, so I can’t tell if it’s for MCE errors or something else. AFAIK the EDAC subsystem in Linux is what handles MCE errors, so it could be this.

apparmor-profile-everything doesn’t allow access to /sys/devices/system/edac so that might solve this.

2 Likes

Apparently this is how it’s done:

2 Likes

Hi everyone
Sorry for my delay but I was in the hospital :frowning:
Covid-19 defeated! :slight_smile:
The results of my test on 3 machines:

sudo ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.

sudo ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Dell Inc. model XXXXXXXXXX (generic model number common to the series D630)

sudo ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

sudo ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for Dell Inc. model XXXXXXXXXX

sudo ras-mc-ctl --layout
ras-mc-ctl: Error: No memories found at via edac.

sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.

3 Likes

Nice, glad you’re feeling better. These look like very standard results; I had the same. It seems nobody has DIMM labels or a functioning EDAC driver (at least a readable one).

1 Like

Perhaps because we use decent old machines without spyware (Intel ME, UEFI, etc.) :slight_smile:

Neither of those is proven to be spyware.

Here’s some new info about the edac drivers; I did some research on edac and found this:
sudo dmesg | grep -E -i 'edac|northbridge'
It returned:
EDAC MC: Ver: 3.0.0
So the controller exists, but it was not parsed by rasdaemon, probably because it is not in use. My theory is that since my machine does not use ECC memory, no errors are reported: not that there necessarily are none, just that corrected errors are not possible without ECC.
Looking in the /sys/devices/system/edac directory, there are a bunch of entries, among them “runtime_status”. It says “unsupported” though.
And for the DIMM labels, I cannot find them anywhere, only very broad descriptions: “ChannelA-DIMM0” and “ChannelB-DIMM0”.
I have been messing around in “dmidecode”, “cpuid”, and the usual suspects.
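One way to check whether any EDAC memory controller is actually registered, as a small sketch: count the mcN directories under /sys/devices/system/edac/mc. A count of zero would be consistent with the "drivers not loaded" and "no memories found" messages from ras-mc-ctl, though that link is my reading and not something confirmed in the script. The name edac_mc_count is made up for this example.

```shell
#!/bin/sh
# Count memory controllers registered under an EDAC sysfs directory.
# Controllers appear as subdirectories named mc0, mc1, and so on.
edac_mc_count() {
    dir="${1:-/sys/devices/system/edac/mc}"
    count=0
    for mc in "$dir"/mc[0-9]*; do
        # An unmatched glob stays literal and fails the -d test.
        if [ -d "$mc" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling edac_mc_count with no argument checks the running system.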

https://lore.kernel.org/lkml/20200311044409.2717587-1-wei.huang2@amd.com/