Machine-check exception

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

The nature and causes of MCEs vary by architecture and system generation. In some designs, an MCE is always an unrecoverable error that halts the machine and requires a reboot. In other architectures, some MCEs are non-fatal, for example single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs, such as an invalid memory access, can cause MCEs. On others, such as x86, MCEs typically originate only from hardware.

Reporting

IBM mainframe operating systems

IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name and for operating systems not derived from OS/360.[1]

OS/360

In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.

z/OS

In z/OS the installation can either use an ERDS or can define a z/OS System Logger log stream[2] to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and handles many more failure modes than the OS/360 MCH.

Microsoft Windows

On Microsoft Windows platforms, an unrecoverable MCE causes the system to generate a BugCheck, also called a STOP error or a Blue Screen of Death.

More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE.[3] Example:

   STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000) 

Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION.[4] Example:

   STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA) 
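
The two STOP formats above can be told apart programmatically. The following is a minimal sketch in Python, assuming the log line follows the "STOP: code (p1, p2, p3, p4)" layout of the examples; the regular expression and the function names `parse_stop_line` and `is_machine_check` are hypothetical and not part of any Windows tool.

```python
import re

# Bug check codes named in the text above; the mapping is limited to those two.
MCE_STOP_CODES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x9C: "MACHINE_CHECK_EXCEPTION",
}

# Hypothetical pattern for the "STOP: code (p1, p2, p3, p4)" layout shown above.
STOP_LINE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (code, params, name) for a STOP line, or None if it does not match."""
    match = STOP_LINE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params, MCE_STOP_CODES.get(code)


def is_machine_check(line):
    """True if the STOP line reports a machine check, per the rules in the text."""
    parsed = parse_stop_line(line)
    if parsed is None:
        return False
    code, params, _name = parsed
    if code == 0x9C:
        return True
    # For 0x124, the text states that the first parameter is 0x0 for an MCE.
    return code == 0x124 and bool(params) and params[0] == 0x0
```

The sketch only classifies the line; interpreting the remaining parameters requires the Microsoft bug check documentation cited above.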

Linux

On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.[5]

Example:

   CPU 0: Machine Check Exception: 0000000000000004
   Bank 2: f200200000000863
   Kernel panic: CPU context corrupt
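
Messages of this form can be picked apart with a short script. The sketch below is in Python and rests on two assumptions: that the log text matches the layout of the example above, and that the value after "Machine Check Exception:" is the global status register (IA32_MCG_STATUS), as in older x86 kernels. The patterns and function names are hypothetical, and newer kernels print MCEs in a different format.

```python
import re

# Hypothetical patterns matching only the message layout in the example above.
MCG_LINE = re.compile(r"CPU (\d+): Machine Check Exception:\s*([0-9a-fA-F]+)")
BANK_LINE = re.compile(r"Bank (\d+):\s*([0-9a-fA-F]+)")


def parse_mce_log(text):
    """Extract CPU number, global status, and per-bank status words from log text."""
    report = {"cpu": None, "mcg_status": None, "banks": {}}
    header = MCG_LINE.search(text)
    if header:
        report["cpu"] = int(header.group(1))
        report["mcg_status"] = int(header.group(2), 16)
    for bank, status in BANK_LINE.findall(text):
        report["banks"][int(bank)] = int(status, 16)
    return report


def decode_mcg_status(value):
    """Decode the three low flag bits of IA32_MCG_STATUS (assumed register)."""
    return {
        "RIPV": bool(value & 0x1),  # restart instruction pointer valid
        "EIPV": bool(value & 0x2),  # error instruction pointer valid
        "MCIP": bool(value & 0x4),  # machine check in progress
    }
```

Applied to the example, the global status value 0x4 has only the MCIP (machine check in progress) bit set, and bank 2 carries the status word that a decoder would examine next.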

Problem types

Some of the main hardware problems that cause MCEs include:

Possible causes

Machine checks generally indicate a hardware problem rather than a software problem. They are often the result of overclocking or overheating; in some cases, the CPU shuts itself off once it passes a thermal limit to avoid permanent damage. They can also be caused by bus errors introduced by other failing components, such as memory or I/O devices. Possible causes include:

  • Poor CPU cooling due to a CPU heatsink, case fans, or filters that are clogged with dust or have come loose.
  • Overclocking beyond the highest clock rate at which the CPU is still reliable.
  • Failing motherboard.
  • Failing processor.
  • Failing memory.
  • Failing I/O controllers, on either the motherboard or separate cards.
  • Failing I/O devices.
  • Inadequate or failing power supply.

Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a functioning part. Memory can be checked by booting to a diagnostic tool such as memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging them, if possible, or disabling them to see whether the problem disappears. If failures occur either shortly after the OS boots or else not at all for days, a power supply issue may be the cause: with a failing or inadequate power supply, the failure often occurs when power demand peaks as the OS starts up external devices.

Decoding MCEs

For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual[6] Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.[7]
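
As an illustration of what such decoding involves, the following Python sketch extracts the architectural flag bits and error-code fields from a 64-bit per-bank status word (IA32_MCi_STATUS). The bit positions follow the layout described in the Intel manual; the function name `decode_mci_status` and the dictionary layout are hypothetical, and the sketch does not interpret the error code itself or any model-specific bits.

```python
# Flag bit positions in IA32_MCi_STATUS, per the Intel manual's register layout.
STATUS_FLAGS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # IA32_MCi_MISC holds additional information
    "ADDRV": 58,  # IA32_MCi_ADDR holds the error address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
```

For the bank status f200200000000863 from the Linux example, this reports VAL, OVER, UC, EN and PCC set, with MCA error code 0x0863; the PCC flag is consistent with the "CPU context corrupt" panic message.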

Programs to decode Intel and AMD MCEs

  • rasdaemon[8] is a RAS (reliability, availability and serviceability) logging tool for Linux. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem that handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as ARM, also exist. rasdaemon is the recommended tool for gathering MCE information on Linux systems because mcelog has been deprecated as of 2017.[9][10][11][12]
  • mcelog[13] is a Linux daemon by Andi Kleen to handle MCEs for x86 processors. mcelog can also decode machine checks. mcelog is considered functionally obsolete as of 2017.[11][12] The replacement of mcelog for Linux systems is rasdaemon.[9][10]
  • parsemce[14] is a Linux program by Dave Jones to decode MCEs from AMD K7 processors.
  • mced[15] (mcedaemon) is a Linux program by Tim Hockin to gather MCEs from the kernel and alert interested applications. It does not try to interpret the MCE data; it simply alerts other programs.
  • mcat is a Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
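
Independently of these tools, the EDAC subsystem mentioned above exposes per-memory-controller error counters through sysfs. The Python sketch below reads them, assuming the usual layout of `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` (corrected and uncorrected error counts); the function name `read_edac_counts` is hypothetical, and the paths may be absent if no EDAC driver is loaded.

```python
from pathlib import Path

# Assumed default location of EDAC memory-controller counters in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mc* directory."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc_dir.name] = entry
    return counts
```

The `root` parameter exists so the function can be pointed at a copy of the directory tree; it is not a substitute for rasdaemon, which also records the details of each event.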

References

  1. "Chapter 1. Introducing EREP". Environmental Record Editing and Printing Program (EREP) 3.5 - User's Guide (PDF). IBM. September 30, 2021. p. 1. GC35-0151-50. Retrieved February 20, 2023.
  2. System Programmer's Guide to: z/OS System Logger (PDF) (Second ed.). IBM. July 2007. SG24-6898-01. Retrieved February 20, 2023.
  3. "Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR". Microsoft. 2022-11-03. Retrieved 2022-12-11.
  4. "Bug Check 0x9C: MACHINE_CHECK_EXCEPTION". Microsoft. 2021-12-14. Retrieved 2022-12-11.
  5. "mcelog not working with AMD processor family 16 and above on SLES11 SP3". SuSE. 2022-09-27. Retrieved 2022-12-11.
  6. "Machine Check Architecture". Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2. Intel Corporation. November 2018.
  7. "Stop error message in Windows XP that you may receive: "0x0000009C (0x00000004, 0x00000000, 0xb2000000, 0x00020151)"". MSDN. 2015-12-07. Retrieved 2017-07-13.
  8. Mauro Carvalho Chehab (mchehab) (2023-02-20). "rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool". github.com. Retrieved 2023-02-20.
  9. "Machine-check exception". wiki.archlinux.org. 2021-05-08. Retrieved 2023-02-21.
  10. "ECC RAM". wiki.gentoo.org. 2022-12-30. Retrieved 2023-02-21.
  11. "x86/mce: Factor out and deprecate the /dev/mcelog driver". git.kernel.org. 2017-03-28. Retrieved 2023-02-21.
  12. "x86/mce: Factor out and deprecate the /dev/mcelog driver". github.com/torvalds/linux/. 2017-03-28. Retrieved 2023-02-21.
  13. "mcelog: Advanced hardware error handling for x86 Linux". 2015-04-20. Retrieved 2017-07-13.
  14. "parsemce: Linux Machine check exception handler parser". 2003-07-22. Retrieved 2017-07-13.
  15. mcedaemon on GitHub