CPU serial numbers - Protected Processor Identification Number (PPIN)

HulaHoop · March 29, 2020, 2:47pm

Apparently Intel processors have had unique serial numbers embedded for a while now, and AMD plans to follow with a similar effort.

We need to investigate whether this is accessible to guests by testing KVM and other hypervisors with the mcelog tool, which is documented as the "only means" of accessing this info.

It would help if anyone with an Intel machine volunteers to test.
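
One cheap first check inside a guest, before involving mcelog at all, is to see whether the kernel advertises a PPIN feature flag. Below is a minimal sketch in Python. It assumes the guest kernel is new enough to list such a flag in /proc/cpuinfo (I believe the names are intel_ppin and amd_ppin, but treat that as an assumption), and ppin_flags is just a helper name made up for this example. A missing flag would not prove the PPIN is unreachable, only that the kernel is not reporting it.

```python
from pathlib import Path


def ppin_flags(cpuinfo_text: str) -> set:
    """Return every CPU feature flag whose name contains 'ppin'."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" lines list CPU features; other keys are skipped.
        if sep and key.strip() == "flags":
            found.update(flag for flag in value.split() if "ppin" in flag)
    return found


if __name__ == "__main__":
    try:
        # Usual location on Linux; run this inside the guest being tested.
        text = Path("/proc/cpuinfo").read_text()
    except OSError:
        text = ""
    flags = ppin_flags(text)
    print("PPIN-related flags:", sorted(flags) if flags else "none advertised")
```

Comparing the output on the host and inside the guest would show whether the hypervisor passes the flag through.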

I’ve asked the libvirt/qemu devs for comment:
https://www.redhat.com/archives/libvirt-users/2020-March/msg00062.html

EDIT by Patrick:

CPU serial numbers - Protected Processor Identification Number (PPIN) (this forum topic) is not to be confused with something related, CPUID.

CPUID:

Patrick · March 29, 2020, 6:09pm

This could use some instructions on how to install mcelog. It’s no longer available in Debian buster:

Debian -- Details of package mcelog in stretch

But:

mcelog replaced with rasdaemon

rasdaemon the replacement?

Debian -- Details of package rasdaemon in buster

Helia · March 30, 2020, 12:27am

How can I help with testing?
I have several machines on Intel.

HulaHoop · March 30, 2020, 1:53am

Indeed. Good catch.

Hi, thanks for volunteering. You would install the rasdaemon package and then run sudo ras-mc-ctl --summary. If I understand correctly, it should display the cpuid only if a hardware error is being generated, so I am not sure what happens with healthy machines. The EDAC drivers requirement is, I think, for checking the RAM and is irrelevant here.

madaidan · March 30, 2020, 6:46pm

We could disable MCE entirely, which would probably hide this, but security-misc uses MCE to make the kernel panic on uncorrected errors for security.

mcelog is deprecated and requires CONFIG_X86_MCELOG_LEGACY, which is disabled by default. Is mcelog the only way to get it, or can rasdaemon get it too?

Patrick · March 30, 2020, 6:58pm

Kernel panic on uncorrected errors is more important than CPU serial numbers.

HulaHoop · March 30, 2020, 10:08pm

Rasdaemon works with the latest implementation of the hardware error infrastructure.

anontor · March 31, 2020, 12:54am

Here is a printout for rasdaemon on a healthy Intel ThinkPad:

sudo ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.
No MCE errors.

sudo ras-mc-ctl --mainboard
ras-mc-ctl : mainboard: LENOVO model XXXXXXXXXX
(this is just a generic model number that is common to all thinkpads in this category; example: all x220s are the same, all t580s are the same, all p53 are the same, and so on. It’s a model number basically)

sudo ras-mc-ctl --status
ras-mc-ctl: drivers not loaded

sudo ras-mc-ctl --print-labels
ras-mc-ctl: Error: no dimm labels for LENOVO model xxxxxxxxxx

sudo ras-mc-ctl --layout
ras-mc-ctl: Error: no memories found at via edac

sudo ras-mc-ctl --errors
(same output as --summary command)

The rasdaemon package includes two systemd services: rasdaemon.service and ras-mc-ctl.service. They are set to start on boot, but you can change that with sudo systemctl disable rasdaemon.service ras-mc-ctl.service and then start them manually whenever you want to test.

Interesting sidenote: sometimes the “machine check exception” registers a hardware error when your CPU goes above a certain distro-specific thermal trip point. This is quite harmless, because the distro very often keeps the trip point well below what would actually hurt the chipset. According to documentation, many Intel CPUs can run rather hot, above 80C, for bursts and not sustain damage until 95C, but some distros have a trip point of 60C. The trip point is determined by several factors: thermald, for example, and also tlp and related programs if installed. So if the machine goes above 60C for any reason, a machine check exception is logged.

Helia · March 31, 2020, 12:09pm

I won’t be able to access my few machines until the weekend.
Then I will send the results of the same commands you gave:
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --mainboard
sudo ras-mc-ctl --status
sudo ras-mc-ctl --print-labels
sudo ras-mc-ctl --layout
sudo ras-mc-ctl --errors

Should I do another one?

HulaHoop · March 31, 2020, 3:26pm

Thanks for testing. So one can induce a log entry with this data by running stress-ng in the background, I guess?

anontor · March 31, 2020, 10:00pm

You know, that’s a good question and I would imagine the answer is a solid yes. The reason is that yesterday I got kind of curious after playing around with it and purposely compiled a small program from source to raise the CPU temperature a bit. Sure enough, it fluctuated up and down, the fans went on and the highest reading was 70C. As soon as I checked journalctl, there was a machine check exception error printout. Running rasdaemon again showed that it had caught the event in its log too. The rasdaemon output was the same as the journal, probably because they both read the same MCE log somewhere in the board: “CPU0 package temperature above threshold…clock throttles (total events =2)…” It was the same for every core. Obviously not really an “error,” but a good workout for rasdaemon and MCE anyway. So I’m sure stress-ng or something similar would do the same.

anontor · April 1, 2020, 12:59am

A couple more observations about ras-mc-ctl:
The script gets its information from /sys, specifically the /sys/class/dmi/id folder. In that folder there are entries for product_serial, board_serial, chassis_serial and product_uuid, and a lot more too. Every entry appears to be owned by root. I suspect each manufacturer has their own specific values, and some are probably common to all machines. On the test machine, each of those files had values in them. product_serial and chassis_serial held the same value: the serial number of the machine. board_serial was the motherboard serial number. What the product_uuid value represents was found by using dmidecode. It turned out to be the ID of the entire machine, not a single board or component.
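
For anyone who wants to compare these values across machines, here is a minimal sketch in Python that reads the four files named above. The helper name read_dmi_ids and the FIELDS tuple are made up for this example, and the base path is a parameter so the function can be pointed at a copy of the directory. Entries that are missing or not readable (the serials normally need root) come back as None rather than raising.

```python
from pathlib import Path

# File names mentioned in the post above; the directory holds more entries.
FIELDS = ("product_serial", "board_serial", "chassis_serial", "product_uuid")


def read_dmi_ids(base="/sys/class/dmi/id", fields=FIELDS):
    """Map each field name to its value, or None if missing or unreadable."""
    result = {}
    for name in fields:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Covers both a missing file and permission denied.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_dmi_ids().items():
        print(f"{name}: {value if value is not None else '<not readable>'}")
```

Running it once as a normal user and once with sudo should show which of the identifiers are restricted to root on a given machine.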

madaidan · April 1, 2020, 2:38pm

That’s only for motherboard information. I don’t know where it gets MCE errors from. /sys/devices/system/edac/mc is used for something in the script, but I don’t know Perl so I can’t tell if it’s for MCE errors or something else. AFAIK the EDAC subsystem in Linux is what handles MCE errors, so it could be this.

apparmor-profile-everything doesn’t allow access to /sys/devices/system/edac so that might solve this.

HulaHoop · April 1, 2020, 4:34pm

Apparently this is how it’s done:

Helia · May 5, 2020, 4:24pm

Hi everyone
Sorry for my delay but I was in the hospital
Covid-19 defeated!
The results of my test on 3 machines:

sudo ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.

sudo ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Dell Inc. model XXXXXXXXXX (generic model number common to the series D630)

sudo ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

sudo ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for Dell Inc. model XXXXXXXXXX

sudo ras-mc-ctl --layout
ras-mc-ctl: Error: No memories found at via edac.

sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.

anontor · May 6, 2020, 11:24pm

Nice, glad you’re feeling better. Looks like very standard results; I had the same. Seems nobody has DIMM labels or a functioning EDAC driver (at least a readable one).

Helia · May 11, 2020, 2:58pm

Perhaps because we use decent old machines without spyware (Intel ME, UEFI, etc.)

madaidan · May 11, 2020, 6:03pm

Neither of those is proven to be spyware.

anontor · May 11, 2020, 9:50pm

Here’s some new info about the EDAC drivers; I did some research on EDAC and found this:
sudo dmesg | grep -E -i 'edac|northbridge'
It returned:
EDAC MC: Ver: 3.0.0
So the controller exists, but it was not parsed by rasdaemon, probably because it is not in use. My theory is that since my machine does not use ECC memory, no errors are reported. That does not necessarily mean there are none, just that corrected errors are not possible without ECC.
Looking in the /sys/devices/system/edac directory, there are a bunch of entries, among them “runtime_status”. It says “unsupported” though.
As for the DIMM labels, I cannot find them anywhere. There are only very broad-stroke descriptions: “ChannelA-DIMM0” and “ChannelB-DIMM0”.
That is from messing around in “dmidecode”, “cpuid” and the usual suspects.
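
A quick way to confirm whether any EDAC memory controller is actually registered is to look for mcN directories under /sys/devices/system/edac/mc, the path mentioned earlier in the thread. This is a minimal sketch in Python. The function name edac_memory_controllers is made up for this example, and an empty list is only a hint that no EDAC driver has bound to the memory controller, not proof of it.

```python
import re
from pathlib import Path


def edac_memory_controllers(base="/sys/devices/system/edac/mc"):
    """Return the names of registered EDAC memory controllers (mc0, mc1, ...)."""
    root = Path(base)
    if not root.is_dir():
        # EDAC core not loaded, or the path does not exist on this system.
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"mc\d+", entry.name)
    )


if __name__ == "__main__":
    controllers = edac_memory_controllers()
    print("EDAC memory controllers:", controllers or "none registered")
```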

Patrick · September 27, 2022, 3:34pm

https://lore.kernel.org/lkml/20200311044409.2717587-1-wei.huang2@amd.com/