2016-06-29

Working with more than 64 CPUs in Powershell

Wrote this several months ago but was too busy to publish :-/

As noted in one of the previous blog posts, I will use the following terminology:
  • "Processor" is a piece of hardware you connect to a socket on the motherboard.
  • "Physical Core" is a physical computing unit built into the "Processor".
  • "Virtual Core" is a virtual computing unit built on top of "Physical Core" (i.e. HT is ON).
  • "CPU" is a computing unit inside the "Processor", either physical or virtual.

After a series of blogs on Windows performance counters, and after releasing version 0.9RC of the sysb.ps1 testing/benchmarking framework (dbt2-0.37.50.10), I set out to eliminate some unknowns from the testing. First to tackle was the Kernel scheduler: I wanted to run processes, from inside the Powershell script, on a controlled subset of CPUs, much like TASKSET does on Linux. It is also worth noting that proximity rocks, on occasion: you can get up to 20% better results when the workload is distributed perfectly. However, this is hard to achieve, so I am mostly after consistency in the test environment.
This posed quite a few challenges: knowing the details of the hardware and the NUMA node assignments, finding and evaluating the various ways of controlling CPU pinning, and calculating the CPU affinity mask for more than 64 CPUs.
One interesting challenge was calculating the CPU indexes for the MySQL Cluster thread config.
As a first step, I had to find out as much as possible about my hardware.

Know your hardware:

PS> Get-CimInstance Win32_BIOS
SMBIOSBIOSVersion : 11018100   
Manufacturer      : American Megatrends Inc.
Name              : Default System BIOS
SerialNumber      : 1207FMA00C            
Version           : SUN    - 20151001

PS> Get-CimInstance Win32_ComputerSystem | FL *
Status                      : OK
Name                        : HEL01
Roles                       : {LM_Workstation, LM_Server, NT, Server_NT}
AutomaticManagedPagefile    : False
DomainRole                  : 3
HypervisorPresent           : False
Manufacturer                : Oracle Corporation 
Model                       : Sun Fire X4800
NetworkServerModeEnabled    : True
NumberOfLogicalProcessors   : 96
NumberOfProcessors          : 8
PartOfDomain                : True
SystemType                  : x64-based PC
TotalPhysicalMemory         : 549746266112

PS> Get-CimInstance Win32_ComputerSystemProcessor | FL *
GroupComponent        : Win32_ComputerSystem (Name = "HEL01")
PartComponent         : Win32_Processor (DeviceID = "CPU0")
CimClass              : root/cimv2:Win32_ComputerSystemProcessor
CimInstanceProperties : {GroupComponent, PartComponent}
...
PartComponent         : Win32_Processor (DeviceID = "CPU1")
PartComponent         : Win32_Processor (DeviceID = "CPU2")
PartComponent         : Win32_Processor (DeviceID = "CPU3")
PartComponent         : Win32_Processor (DeviceID = "CPU4")
PartComponent         : Win32_Processor (DeviceID = "CPU5")
PartComponent         : Win32_Processor (DeviceID = "CPU6")
PartComponent         : Win32_Processor (DeviceID = "CPU7")

PS> Get-CimInstance Win32_PerfFormattedData_PerfOS_NUMANodeMemory
Name                      : 0
AvailableMBytes           : 64530
FreeAndZeroPageListMBytes : 63989
StandbyListMBytes         : 541
TotalMBytes               : 65526
...
Name                      : 7
AvailableMBytes           : 64600
FreeAndZeroPageListMBytes : 64387
StandbyListMBytes         : 213
TotalMBytes               : 65536

PS> Get-CimInstance Win32_SystemSlot
SlotDesignation : EM00 PCIExp
Tag             : System Slot 0
SupportsHotPlug : True
Status          : OK
Shared          : True
PMESignal       : True
MaxDataWidth    : 8
...
SlotDesignation : EM01 PCIExp
Tag             : System Slot 1
SlotDesignation : EM30 PCIExp
Tag             : System Slot 2
SlotDesignation : EM31 PCIExp
Tag             : System Slot 3
SlotDesignation : EM10 PCIExp
Tag             : System Slot 4
SlotDesignation : EM11 PCIExp
Tag             : System Slot 5
SlotDesignation : EM20 PCIExp
Tag             : System Slot 6
SlotDesignation : EM21 PCIExp
Tag             : System Slot 7

PS> Get-CimInstance Win32_PerfFormattedData_Counters_ProcessorInformation
Name                        : 0,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 69
ProcessorFrequency          : 2001
...
Name                        : 0,11
---
Name                        : 7,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 72
ProcessorFrequency          : 2001
...
Name                        : 7,11
Or, in short, my test box has 2 Processor groups with 48 CPUs each. This makes for Max. CPU affinity mask of 281474976710655d (or 111111111111111111111111111111111111111111111111b). The total number of CPUs is 96, total number of sockets and NUMA nodes is 8.

Note: There are exactly 48 "1"s in the maximum CPU affinity mask, which is the number of CPUs in each Processor group. This implies you can set the process affinity mask only on a per Processor group basis, not machine-wide! The limitation comes from the CPU affinity mask being 64 bits long.
Group and NUMA node assignments are not chiseled in stone. Please see MSDN for details on how to manipulate these settings.
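
A minimal sketch of how the maximum mask follows from the group size, assuming 48 CPUs per Processor group as on this box (the variable names are mine, not taken from sysb.ps1):

```powershell
# Assumption: 48 logical CPUs per Processor group, as on the box described above.
$cpusPerGroup = 48

# Cast to [long] first: shifting a plain [int] only uses the low 5 bits of the shift count.
# Valid for fewer than 64 CPUs per group; a full 64-CPU group would need all 64 bits set.
$maxMask = ([long]1 -shl $cpusPerGroup) - 1

$maxMask                                  # 281474976710655
[Convert]::ToString($maxMask, 2)          # binary form, one "1" per CPU
[Convert]::ToString($maxMask, 2).Length   # 48
```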

Once done playing with WMI, you can turn to coreinfo from the Sysinternals suite, as it is extremely informative:
Intel(R) Xeon(R) CPU           E7540  @ 2.00GHz
Intel64 Family 6 Model 46 Stepping 6, GenuineIntel
Microcode signature: 00000009
HTT        * Hyperthreading enabled
HYPERVISOR - Hypervisor is present
VMX        * Supports Intel hardware-assisted virtualization
SVM        - Supports AMD hardware-assisted virtualization
X64        * Supports 64-bit mode

SMX        - Supports Intel trusted execution
SKINIT     - Supports AMD SKINIT
...
It is important to notice that, in my configuration, Sockets map to NUMA nodes 1:1:
Logical Processor to Socket Map:                  Logical Processor to NUMA Node Map:
Socket 0:                                         NUMA Node 0:
************------------------------------------  ************------------------------------------
------------------------------------------------  ------------------------------------------------  
Socket 1:                                         NUMA Node 1:
------------------------------------------------  ------------------------------------------------
************------------------------------------  ************------------------------------------
Socket 2:                                         NUMA Node 2:
------------************------------------------  ------------************------------------------
------------------------------------------------  ------------------------------------------------
Socket 3:                                         NUMA Node 3:
------------------------------------------------  ------------------------------------------------
------------************------------------------  ------------************------------------------
Socket 4:                                         NUMA Node 4:
------------------------************------------  ------------------------************------------
------------------------------------------------  ------------------------------------------------
Socket 5:                                         NUMA Node 5:
------------------------------------------------  ------------------------------------------------
------------------------************------------  ------------------------************------------
Socket 6:                                         NUMA Node 6:
------------------------------------************  ------------------------------------************
------------------------------------------------  ------------------------------------------------
Socket 7:                                         NUMA Node 7:
------------------------------------------------  ------------------------------------------------
------------------------------------************  ------------------------------------************
so I can use Processor/Socket/NUMA node as though they are synonyms. Also, notice that the even-numbered NUMA nodes/Sockets (0, 2, 4 and 6) are in Processor group 0 while the odd-numbered ones are in Processor group 1. Here is what CPU utilization looks like in the Task manager/Performance tab when just Processor group 0 is used:

Logical Processor to Group Map:
Group 0:                                          Group 1:
************************************************  ------------------------------------------------
------------------------------------------------  ************************************************
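
The socket and group maps above can be condensed into a small lookup. This is only a sketch of the layout of this particular box (12 CPUs per NUMA node, even nodes in group 0, odd nodes in group 1); Get-NodePlacement is a hypothetical helper name, and other machines may be laid out differently:

```powershell
# Assumptions (true for this box only): 12 CPUs per NUMA node, even-numbered
# nodes live in Processor group 0 and odd-numbered nodes in Processor group 1.
function Get-NodePlacement {
    param([int]$Node, [int]$CpusPerNode = 12)

    $group = $Node % 2                 # 0 for even nodes, 1 for odd nodes
    $slot  = ($Node - $group) / 2      # position of the node inside its group (0..3)
    [pscustomobject]@{
        Node     = $Node
        Group    = $group
        FirstCpu = $slot * $CpusPerNode            # lowest in-group CPU index of the node
        LastCpu  = ($slot + 1) * $CpusPerNode - 1  # highest in-group CPU index of the node
    }
}

0..7 | ForEach-Object { Get-NodePlacement -Node $_ } | Format-Table -AutoSize
```

For example, node 5 comes out as group 1, in-group CPUs 24-35, which matches the third block of 12 in the second row of the coreinfo map.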
Note: Coreinfo provides NUMA nodes latency too:
Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01  02  03  04  05  06  07
00: 1.4 1.7 2.1 1.7 1.7 2.1 2.2 2.1
01: 1.7 1.4 1.7 2.1 2.1 1.7 2.0 1.3
02: 2.1 1.7 1.4 1.7 2.1 2.1 1.6 1.2
03: 1.8 2.1 1.7 1.4 2.1 2.1 2.0 1.1
04: 1.7 2.1 2.1 2.1 1.4 1.7 1.7 1.4
05: 2.1 1.7 2.1 2.1 1.7 1.4 2.0 1.0
06: 2.1 2.1 1.7 2.1 1.7 2.1 1.4 1.3
07: 2.1 2.1 2.1 1.7 2.1 1.7 1.6 1.0

The software:

The primary tool used is the sysb.ps1 Powershell script, version 1.0 (not available for download atm). Version 0.9x RC is available for download and can be found in the dbt2-0.37.50.10.tar.gz\dbt2-0.37.50.10.tar\dbt2-0.37.50.10\windows_scripts\sysb-script\ directory.

OS details:
PS:518 [HEL01]> Get-CimInstance Win32_OperatingSystem | FL *
Status                                    : OK
Name                                      : Microsoft Windows Server 2012 R2 Standard
FreePhysicalMemory                        : 528660256
FreeSpaceInPagingFiles                    : 8388608
FreeVirtualMemory                         : 537242324
Distributed                               : False
MaxNumberOfProcesses                      : 4294967295
MaxProcessMemorySize                      : 137438953344
OSType                                    : 18
SizeStoredInPagingFiles                   : 8388608
TotalSwapSpaceSize                        : 
TotalVirtualMemorySize                    : 545250196
TotalVisibleMemorySize                    : 536861588
Version                                   : 6.3.9600
BootDevice                                : \Device\HarddiskVolume1
BuildNumber                               : 9600
BuildType                                 : Multiprocessor Free
CodeSet                                   : 1252
DataExecutionPrevention_32BitApplications : True
DataExecutionPrevention_Available         : True
DataExecutionPrevention_Drivers           : True
DataExecutionPrevention_SupportPolicy     : 3
Debug                                     : False
ForegroundApplicationBoost                : 2
LargeSystemCache                          : 
Manufacturer                              : Microsoft Corporation
OperatingSystemSKU                        : 7
OSArchitecture                            : 64-bit
PAEEnabled                                : 
ServicePackMajorVersion                   : 0
ServicePackMinorVersion                   : 0

So how does Windows work?

A Process is just a container for the threads doing the work, providing you with a fancy name, PID etc. This effectively means you can not calculate "System load" like on Linux. It also explains why there is no ProcessorGroup member attached to the Process class while there is one for Threads, and it causes all sorts of problems regarding CPU utilization, as described in previous blogs here and here.
A Processor group is a collection of up to 64 CPUs, as explained here and here.
A Thread is the basic unit of execution. Setting the Thread affinity will influence the Process class and dictate what you can do with it. There is a great paper on this that you can download from MSDN. The focus of this blog is on scripting.


Know the OS pitfalls:

The setup: I have a script acting as a testing/benchmarking framework. The script controls the way processes are launched, collects data from running processes and generally helps me do part of my job of identifying performance issues and testing solutions.
The problem: Windows is a thread-based OS and I can not control the threads inside a binary from within the script.
Next, the .NET System.Diagnostics.Process class does not expose the Processor group. This means there is no way to control the Processor group and thus no way to guarantee that the Kernel scheduler will start all of your processes inside the Processor group you want :-/ I consider this a bug, not just a deficiency in Windows, because of the following scenario:
   "ProcessA" is pinned, by scheduler, to Processor group 0 with ability to run on all CPUs within that group.
   "ProcessB" is pinned, by scheduler, to Processor group 1 with ability to run on all CPUs within that group.
   ProcessorAffinity member of System.Diagnostics.Process class is the same in both cases!
  $procA = Get-Process -Name ProcessA
  $procA.ProcessorAffinity
  281474976710655 #For my 48 CPUs in each Processor group.

  $procB = Get-Process -Name ProcessB
  $procB.ProcessorAffinity
  281474976710655 #For my 48 CPUs in each Processor group.
This leads you to believe that both processes run in the same Processor group, which might not be true; the information is ambiguous. I have set up mysqld to run on the 1st NUMA node of a group and part of the second (12 + 8 CPUs). At the same time, Sysbench is pinned to the last 4 CPUs of NUMA node 0. When the scheduler decides to run mysqld in Processor group 1, the CPU load distribution is like this:
NUMA #0, last 4 CPUs lit up by Sysbench. NUMA #1 and part of 3, lit up by mysqld.

Using the same(!) Process.ProcessorAffinity for mysqld in a subsequent run, but this time the scheduler decides to run mysqld in Processor group 0:
NUMA #0, last 4 CPUs lit up by Sysbench and mysqld.
NUMA #2 in part lit up by mysqld.

It is obvious that the latter case will most likely produce much lower results since mysqld is competing with Sysbench (on the last 4 CPUs of NUMA node 0) and Windows (first 2 CPUs of NUMA node 0). This is indicative of 2 things:
  a) Microsoft rushed the solution for big boxes (> 64 CPUs); it is not mature, nor will it scale.
  b) You can not trust the Kernel scheduler to do the right thing on its own as it has no clue what your next move will be.
I might add here that even Task manager lacks the ability to display CPU load per Processor group...

Before you send me to RTFM and tell me to do this the "proper" way, please notice that the CPU usage pattern for NUMA nodes 5 and 7 is the same in both runs. This is because our Cluster knows how to pin threads to CPUs "properly". Alas, I do not think this is possible from Powershell.
Also notice the lack of a ProcessorGroup member in the System.Diagnostics.Process class. I expected at least a getter for ProcessorGroup (if not a complete getter/setter) so I can break the run if the scheduler makes a choice I'm not happy with.
The last problem to mention is the late binding of the Affinity mask :-/. The code might look like this:

    $sb_psi = New-Object System.Diagnostics.ProcessStartInfo
    $sb_psi.CreateNoWindow = $true
    $sb_psi.UseShellExecute = $false
    $sb_psi.RedirectStandardOutput = $true
    $sb_psi.RedirectStandardError = $true
    #No trailing space after the executable name.
    $sb_psi.FileName = "$PathToSB" + '\sysbench.exe'
    #Arguments is a single string, not an array.
    $sb_psi.Arguments = "$sbArgList"

    $sb_process = $null
    $sb_process = New-Object System.Diagnostics.Process
    $sb_process.StartInfo = $sb_psi
    [void]$sb_process.Start()   # <<<< The process is already running at this point.
    #Only now can you set the Affinity mask:
    $sb_process.ProcessorAffinity = $SCRIPT:SP_BENCHMARK_CPU
    $sb_process.WaitForExit()
IMO, ProcessorAffinity should be a member of System.Diagnostics.ProcessStartInfo.
I can't help but wonder what will happen if Intel decides to release a single processor with 64+ CPUs.


What are our options in Powershell then?

Essentially, you can use 3 techniques to start a process in Powershell and bind it to CPUs, but bear in mind that this is not what Microsoft expects you to do, so each approach has its pros and cons:
1) Using START in cmd.exe (start /HIGH /NODE 2 /AFFINITY 0xF00 /B /WAIT E:\test\...\sysbench.exe --test=oltp...)
Settings:
 sysbench.conf:
  BENCHMARK_NODE=5
  BENCHMARK_CPU="111100000000" # Xeon E7540 has 12 CPUs per socket so I'm running on the LAST 4 (the 9th to 12th, i.e. 0-based indexes 8-11).
These options allow the user to run Sysbench on a certain NUMA node as well as on certain CPUs within that NUMA node.

 autobench.conf:
  SERVER_NUMA_NODE=3
  SERVER_CPU="111111111" #(Or, 000111111111) Running on first 9 CPUs.
 It is not necessary to set CPUs to run on if you're running on entire dedicated NUMA node.

Pros: Works.
Cons: The process you're starting is not the expected one (say, benchmark) but rather cmd.exe START.
      Cumbersome.
      Not really "Powershell-way".
      Process is bound to just one NUMA node which is fine if it's not hungry for more CPU power.

2) Using .NET System.Diagnostics.Process (PS, C#):
 $process = Start-Process E:\test\mysql-cluster-7.5.0-winx64\bin\mysqld.exe `
   -ArgumentList "--standalone --console --initialize-insecure" `
   -WindowStyle Hidden -PassThru `
   -RedirectStandardOutput e:\test\stdout.txt -RedirectStandardError e:\test\stderr.txt
 #No -Wait on Start-Process: the process must still be running when the mask is assigned.
 $process.ProcessorAffinity = 70368739983360
 $process.WaitForExit()

 With this Affinity mask, mysqld runs on the first 10 CPUs of NUMA node 7, all of NUMA node 5 and
 the last 2 CPUs of NUMA node 3 (0-based index) IF ProcessorGroup is set to 1 by the Kernel scheduler:
 001111111111111111111111110000000000000000000000 = 70368739983360
 |___________________48 CPUs____________________|
 |__________||__________||__________||__________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1

Settings:
 Autobench.conf:
  SP_SERVER_CPU=70368739983360

 Sysbench.conf:
  SP_BENCHMARK_CPU=211106232532992
  #Run on NUMA node 7, last 2 CPUs, 110000000000000000000000000000000000000000000000b

Pros: Real "Powershell-way" of doing things.
      Process can span over more than 1 NUMA node.
      Good control of the process (HasExited, ExitTime, Kill, ID (PID) ...).
Cons: Late binding; i.e. process has to be up and running for you to pin it to CPUs. This presents a problem with processes
      that start running immediately.
      No way to control Processor group meaning there is no way to guarantee the kernel scheduler will start all of your
      processes inside the desired Processor group.
Note: Using -PassThru ensures you get the Process object back; otherwise, the Start-Process cmdlet has no output. Alternatively, you can start the process and then use Get-Process -Name... to accomplish the same.
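
To see which CPUs a decimal mask really covers, it can be expanded bit by bit. The sketch below assumes the layout of this box (12 CPUs per NUMA node, 4 nodes per Processor group, odd nodes in group 1); Expand-AffinityMask is a hypothetical helper, and the group still has to be supplied by hand because Process.ProcessorAffinity does not carry it:

```powershell
# Assumptions: 12 CPUs per NUMA node, 4 nodes per Processor group, and node
# numbering as on this box (group 0 holds nodes 0,2,4,6; group 1 holds 1,3,5,7).
function Expand-AffinityMask {
    param([long]$Mask, [int]$Group, [int]$CpusPerNode = 12, [int]$NodesPerGroup = 4)

    for ($slot = 0; $slot -lt $NodesPerGroup; $slot++) {
        # Collect the CPUs (0-based, relative to the node) whose bit is set in the mask.
        $cpus = @(for ($cpu = 0; $cpu -lt $CpusPerNode; $cpu++) {
            $bit = $slot * $CpusPerNode + $cpu
            if ($Mask -band ([long]1 -shl $bit)) { $cpu }
        })
        if ($cpus.Count -gt 0) {
            [pscustomobject]@{
                Node = 2 * $slot + $Group     # slot 0..3 inside the group -> NUMA node number
                Cpus = $cpus -join ','
            }
        }
    }
}

Expand-AffinityMask -Mask 70368739983360 -Group 1
```

For this mask and group 1 it reports CPUs 10,11 of node 3, CPUs 0-11 of node 5 and CPUs 0-9 of node 7. Passing -Group 0 maps the very same mask onto nodes 2, 4 and 6, which is exactly the ambiguity discussed earlier.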

Not available in Powershell AFAIK but important to understand if using MySQL Cluster:
3) Hook the threads to CPUs. Since this is not available from the "outside", I will use the Cluster code to do the work for me:
config.ini
----------
NoOfFragmentLogParts=10
ThreadConfig=ldm={count=10,cpubind=88-91,100-105},tc={count=4,cpubind=94-95,106-107},send={count=2,cpubind=92-93},
recv={count=2,cpubind=98,99},main={count=1,cpubind=109},rep={count=1,cpubind=109}

sysbench.conf
-------------
#NUMA node to run sysbench on.
BENCHMARK_NODE=0
#Zero based index.
#CPUs inside selected NUMA node to run sysbench on.
BENCHMARK_CPU="111100000000"
 000000001111
 |__________|
 |_12 CPUs__|
   NUMA #0
CPU0   CPU11

autobench.conf
--------------
SP_SERVER_CPU=1048575
 For reference, the Affinity mask from approach 2) means mysqld runs on NUMA node 7, 5 and part of 3 (0-based index) IF ProcessorGroup is 1:
 001111111111111111111111110000000000000000000000 = 70368739983360d
 |___________________48 CPUs____________________|
 |__________||__________||__________||__________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1

The test image shows the mask used in this run:
 000000000000000000000000000011111111111111111111 = 1048575d
 |___________________48 CPUs____________________|
 |__________||__________||__________||__________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1
Sysbench is running on NUMA #0, last 4 CPUs.
MySQLd is running on NUMA #1 and on 8 CPUs of NUMA #3 (in-group CPU indexes 12-19).
4 LDM threads are running on the first 4 CPUs of node #5 together with 2 TC, 2 SEND and 2 RECV threads.
6 LDM threads are running on the first 6 CPUs of node #7 together with 2 TC threads and the MAIN and REP
threads (sharing one CPU), with CPUs 108, 110 and 111 (the last one) not being used.


Calculating the ProcessorAffinity mask for a process differs depending on the function accepting the input.
1) For cmd.exe START, the actual number passed is in HEX notation. The binary mask is composed so that the highest-index CPU comes first:
BENCHMARK_CPU="111100000000"
 000000001111
 |__________|
 |_12 CPUs__|
   NUMA #0
CPU0     CPU11
It is more convenient to provide the mask in binary, so I convert the setting to a Hex value inside the script.
The NUMA node to run on is specified as a decimal integer.
If you have provided the NUMA node # for the process to run on, not specifying the ProcessorAffinity mask means "run on all CPUs within the specified node".
If you provide a wrong mask, the process will fail to start. For example, I have 12 CPUs per NUMA node (socket), so providing a 14-bit mask like "11111111111000" will fail.
The approach works only within one NUMA node.
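
The binary-to-hex conversion mentioned above could look roughly like this. It is a sketch, not the code from sysb.ps1: ConvertTo-StartAffinity is a hypothetical name, and the width check assumes 12 CPUs per NUMA node as on this box:

```powershell
# Converts a binary mask string (highest-index CPU first) to the hex form passed to START /AFFINITY.
function ConvertTo-StartAffinity {
    param([string]$BinaryMask, [int]$CpusPerNode = 12)

    if ($BinaryMask -notmatch '^[01]+$') {
        throw "Mask must contain only 0 and 1: $BinaryMask"
    }
    # A mask wider than the NUMA node is rejected here instead of letting START fail later.
    if ($BinaryMask.Length -gt $CpusPerNode) {
        throw "Mask is wider than $CpusPerNode CPUs: $BinaryMask"
    }

    $value = [Convert]::ToInt64($BinaryMask, 2)
    '0x{0:X}' -f $value
}

ConvertTo-StartAffinity -BinaryMask '111100000000'   # 0xF00 (last 4 CPUs of the node)
ConvertTo-StartAffinity -BinaryMask '111111111'      # 0x1FF (first 9 CPUs of the node)
```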

2) Start-Process (System.Diagnostics.Process) expects a decimal integer for the mask. The rightmost "1" indicates usage of CPU #0 within the Processor group assigned by the Kernel scheduler in Round-Robin manner.
 000000000000000000000000000011111111111111111111 = 1048575d
 |___________________48 CPUs____________________|
 |__________||__________||__________||__________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1
or, should the scheduler pick Processor group 0:
   NUMA #6      NUMA #4    NUMA #2      NUMA #0
Process.ProcessorAffinity takes (and returns) a decimal value.
It uses late binding, so the Process has to be up and running before you assign the Affinity mask to it.
You have no control over the Processor group, meaning the Kernel scheduler is free to pick any group, and with it the set of NUMA nodes, in Round-Robin fashion.
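
Going the other way, a decimal mask can be built from a list of in-group CPU indexes. A minimal sketch, assuming 48 CPUs per Processor group; New-AffinityMask is a hypothetical helper name:

```powershell
# Builds a decimal affinity mask from 0-based CPU indexes inside one Processor group.
function New-AffinityMask {
    param([int[]]$CpuIndex, [int]$CpusPerGroup = 48)

    [long]$mask = 0
    foreach ($i in $CpuIndex) {
        if ($i -lt 0 -or $i -ge $CpusPerGroup) {
            throw "CPU index $i is outside the Processor group"
        }
        $mask = $mask -bor ([long]1 -shl $i)   # set the bit of this CPU
    }
    $mask
}

New-AffinityMask -CpuIndex (0..19)    # 1048575          (12 + 8 CPUs)
New-AffinityMask -CpuIndex (22..45)   # 70368739983360
New-AffinityMask -CpuIndex (46, 47)   # 211106232532992  (last 2 CPUs of the group)
```

The result can be assigned to ProcessorAffinity as is; which Processor group it applies to is still up to the scheduler.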

3) Doing things "properly" (binding threads to CPUs). Or, how to calculate ThreadConfig for MySQL Cluster:
ThreadConfig=ldm={count=10,cpubind=88-91,100-105},tc={count=4,cpubind=94-95,106-107},send={count=2,cpubind=92-93},recv={count=2,cpubind=98,99},main={count=1,cpubind=109},rep={count=1,cpubind=109}
This line shows CPU indexes above the total number of CPUs available on my test system (2x48=96). The reason is the maximum capacity of a Processor group, which is 64. The designer of this functionality treats each Processor group found on the system as full, meaning it occupies 64 places in the CPU index. This makes sense: if you move from a box with 48 CPUs per group (like mine) to a box with 64 CPUs per group, your ThreadConfig line will continue to work as expected. However, it requires some math to arrive at the CPU indexes: the index of a CPU is 64 x (Processor group number) + (index of the CPU inside its group), so the 48 CPUs of my Processor group 1 are numbered 64-111 while indexes 48-63 and 112-127 are reserved:

Processor group 0                                              |Processor group 1
CPU#0                                     CPU#47          CPU#63CPU#64                                   CPU#111         CPU#127
|                  AVAILABLE                   ||   RESERVED   ||                  AVAILABLE                   ||   RESERVED   |
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRRRRRRRRRRRRRRRXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRRRRRRRRRRRRRRR
Now my ThreadConfig line makes sense:
4 LDM threads are running on the first 4 CPUs of node #5 (88-91) together with 2 TC (94,95), 2 SEND (92,93) and 2 RECV (98,99) threads.
6 LDM threads are running on the first 6 CPUs of node #7 (100-105) together with 2 TC (106,107) threads and the MAIN and REP threads (both on 109),
with CPUs 108, 110 and 111 (the last one) not being used.
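
The index math can be sketched as follows, assuming the layout of this box (12 CPUs per NUMA node, even nodes in Processor group 0, odd nodes in group 1) and 64 index slots per group. Get-ClusterCpuIndex is a hypothetical helper, not part of MySQL Cluster:

```powershell
# Translates (NUMA node, CPU inside the node) to the CPU index used in ThreadConfig cpubind.
function Get-ClusterCpuIndex {
    param([int]$Node, [int]$CpuInNode, [int]$CpusPerNode = 12)

    $group = $Node % 2               # even nodes: group 0, odd nodes: group 1 (this box only)
    $slot  = ($Node - $group) / 2    # position of the node inside its group
    # Every Processor group occupies 64 index slots, whether or not it is full.
    $group * 64 + $slot * $CpusPerNode + $CpuInNode
}

Get-ClusterCpuIndex -Node 5 -CpuInNode 0     # 88
Get-ClusterCpuIndex -Node 7 -CpuInNode 0     # 100
Get-ClusterCpuIndex -Node 7 -CpuInNode 11    # 111
```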


Conclusion:

o Windows uses the notion of a Processor group. Machines with 64 or fewer CPUs have 1 Processor group, so your application runs exactly as before.
  o Bug 1: The Affinity mask is only 64 bits wide, so there is no way to have a continuous index of CPUs inside a big box such as mine.
o .NET System.Diagnostics.Process has no get/set for the Processor group. I expected at least a getter, i.e. a member of System.Diagnostics.Process disclosing this information.
  o Bug 2: The information on the CPU Affinity mask obtained from .NET System.Diagnostics.Process is ambiguous.
o 1 + 2, bug 3: I found no complete way to script pinning to individual CPUs.
  o Feature request 1: .NET System.Diagnostics.Process allows only late binding of the Affinity mask. Move Affinity to .NET System.Diagnostics.ProcessStartInfo.
o Feature request 2, consolidate: The various approaches taken by Microsoft seem uncoordinated and incomplete. Even the START command requires a decimal number for the NUMA node index and a hexadecimal number for the Affinity mask. cmd.exe START and the creation of thread objects allow early binding of the CPU Affinity mask, while .NET System.Diagnostics.Process allows only late binding. And so on.
o Feature request 3, give us a TASKSET counterpart: Given all of the above, it is impossible to script a replacement for Linux TASKSET.
o What will happen once single processors with more than 64 CPUs are available?
o MySQL Cluster counts CPUs as if every existing Processor group is complete (has 64 CPUs).



