SCOM performance data on Grafana dashboards

  1. Background
  2. Schema
  3. Development of metrics
  4. SCOM adjustments
  5. Setup SQL Exporter
  6. VictoriaMetrics agent
  7. Final result
  8. Conclusion

The purpose of this post is to outline the technical considerations behind a solution that visualizes performance data collected by Microsoft System Center Operations Manager (SCOM) agents on Grafana dashboards.

There are already several solutions for displaying SCOM data in Grafana. All of them use a direct connection to the SCOM databases hosted on MS SQL instances as the data source for Grafana.

However, I was challenged with a different case.

Background

The company found SCOM performance views very inconvenient. Together with SCOM reporting, they are a nightmare for anyone unwilling to dive into the classes and instances of monitored objects. They were also unhappy with the approximated data on charts, caused by the default tolerance setting in performance collection rules. Another annoyance was the performance of the SCOM databases, which held hundreds of gigabytes of data and served thousands of agents. Generally, it is the usual story.

At that moment the company was already using VictoriaMetrics with Grafana to display data from some network devices. Grafana is one of the best tools for visualizing performance data stored as metrics in a Prometheus-like database: it can show large amounts of data simultaneously and perform transformations on the fly, with impressive performance. So the plan was to use the VictoriaMetrics agent (VMagent) to push data into the VictoriaMetrics database, using the authorization provided by VMagent.

OK, challenge accepted.

Schema

I selected SQL Exporter for Prometheus as the tool to transform performance data collected by SCOM agents into metrics. SQL Exporter is a configuration-driven exporter that exposes metrics gathered from databases. In my case, SQL Exporter had to expose data stored in the SCOM data warehouse database. It is also possible to get the data from the SCOM operational database; however, the warehouse database contains equally fresh performance data, because most SCOM performance collection rules have a second write action that publishes data into the warehouse. Reading from the warehouse therefore avoids any impact on the performance of the operational database.

So, the data flow should look like this:

Windows → SCOM agent → SCOM warehouse database → SQL Exporter for Prometheus → VictoriaMetrics agent → VictoriaMetrics database → Grafana dashboard.

Development of metrics

It is important to follow the naming rules defined in the Data model and Metric and label naming articles of the Prometheus documentation.
SCOM can provide data “ready to use” by doing calculations on the agents; this allows us to avoid math on the Grafana dashboard and to use the GAUGE metric type:

| Metric | Type | SCOM Object/Counter |
|---|---|---|
| windows_cpu_processor_performance_ratio | gauge | Processor Information/% Processor Time |
| windows_system_processor_queue_length | gauge | System/Processor Queue Length |
| windows_memory_physical_used_ratio | gauge | Memory/PercentMemoryUsed |
| windows_memory_committed_used_ratio | gauge | Memory/% Committed Bytes In Use |
| windows_logical_disk_free_ratio | gauge | LogicalDisk/% Free Space |
| windows_logical_disk_requests_queued | gauge | LogicalDisk/Current Disk Queue Length |
| windows_logical_disk_free_mbytes | gauge | LogicalDisk/Free Megabytes |
| windows_net_bytes_received_per_second | gauge | Network Adapter/Bytes Received/sec |
| windows_net_bytes_sent_per_second | gauge | Network Adapter/Bytes Sent/sec |
| windows_net_bytes_total_per_second | gauge | Network Adapter/Bytes Total/sec |
| windows_net_current_bandwidth_ratio | gauge | Network Adapter/PercentBandwidthUsedTotal |
| windows_system_uptime_total | counter | System/Uptime |

The second step is to define labels:

| Label | Description |
|---|---|
| Path | The name of the computer containing a monitored object. |
| Object | A monitored object. |
| Instance | The instance of a monitored object (when there are several instances on the same computer). |
| Counter | A performance counter or calculated field. |
| Domain | The Active Directory domain (calculated from the Path field). |

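With these metrics and labels in place, a typical Grafana panel query needs no additional math. A couple of PromQL examples (the label values are hypothetical):

# Disk free space, in percent, for every logical disk of one computer
windows_logical_disk_free_ratio{Path="servername1.domain1.name"}

# The same counter averaged per Active Directory domain
avg by (Domain) (windows_logical_disk_free_ratio)
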
SCOM adjustments

First, the performance collection rules need to be tuned (via overrides) so that all data points arrive at the desired interval. This is achieved by disabling tolerance and adjusting the frequency:

| Rule | Object/Counter | Frequency | Maximum Sample Separation | Tolerance |
|---|---|---|---|---|
| Processor % Processor Time Total Windows Server 2016 and above | Processor Information/% Processor Time | 60 | 1 | 0 |
| System Processor Queue Length Windows Server 2016 and above | System/Processor Queue Length | 60 | 1 | 0 |
| Memory % Committed Bytes in Use Windows Server 2016 and above | Memory/% Committed Bytes In Use | 300 | 1 | 0 |
| Percent Memory Used | Memory/PercentMemoryUsed | 60 | 1 | 0 |
| % Logical Disk Free Space Windows Server 2016 and above | LogicalDisk/% Free Space | 300 | 1 | 0 |
| Collection rule for Current Disk Queue Length Windows Server 2016 and above | LogicalDisk/Current Disk Queue Length | 300 | 1 | 0 |
| Logical Disk Free Megabytes Windows Server 2016 and above | LogicalDisk/Free Megabytes | 300 | 1 | 0 |
| Network Adapter Bytes Received per Second Windows Server 2016 and above | Network Adapter/Bytes Received/sec | 60 | 1 | 0 |
| Network Adapter Bytes Sent per Second Windows Server 2016 and above | Network Adapter/Bytes Sent/sec | 60 | 1 | 0 |
| Percent Bandwidth Used Total | Network Adapter/PercentBandwidthUsedTotal | 60 | | |
| Network Adapter Bytes Total per Second Windows Server 2016 and above | Network Adapter/Bytes Total/sec | 60 | 1 | 0 |
| System Uptime Collection Rule | System/Uptime | 600 | | |

Of course, we need to keep in mind that “zero tolerance” in performance collection rules makes the databases grow faster, so proper capacity management is needed here.
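
For reference, overrides like those in the table can also be created programmatically with the OperationsManager PowerShell module. Below is a minimal sketch for a single rule; the unsealed management pack name, the target class, and the override element names are assumptions for illustration:

Import-Module OperationsManager

# An unsealed management pack that stores the overrides (assumed to exist)
$mp     = Get-SCOMManagementPack -DisplayName 'Grafana Overrides' | Where-Object { -not $_.Sealed }
$rule   = Get-SCOMRule -DisplayName 'Processor % Processor Time Total Windows Server 2016 and above'
$target = Get-SCOMClass -DisplayName 'Windows Server 2016 and above Operating System'

# Override Frequency/Tolerance/MaximumSampleSeparation as in the table above
foreach ($param in @{ Frequency = '60'; Tolerance = '0'; MaximumSampleSeparation = '1' }.GetEnumerator())
{
    $override = New-Object Microsoft.EnterpriseManagement.Configuration.ManagementPackRuleConfigurationOverride($mp, "Override.$($param.Key).$($rule.Name)")
    $override.Rule        = $rule
    $override.Parameter   = $param.Key
    $override.Value       = $param.Value
    $override.Context     = $target
    $override.DisplayName = "Grafana override: $($param.Key)"
}
$mp.Verify()
$mp.AcceptChanges()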

Then I created a special view in the data warehouse database to unify the data source for SQL Exporter: [OperationsManagerDW].[dbo].[vGrafanaCommonDataset].

The SQL view's definition (GitHub link):

WITH BaseSet AS (
    SELECT 
        PerformanceRuleInstanceRowId, 
        ManagedEntityRowId, 
        MAX(DateTime) AS DateTime
    FROM 
        Perf.vPerfRaw
    WHERE 
        DateTime > DATEADD(MINUTE, -10, GETDATE())
    GROUP BY 
        PerformanceRuleInstanceRowId, ManagedEntityRowId
)
SELECT
    vme.Path, 
    SUBSTRING(vme.Path, CHARINDEX('.', vme.Path) + 1, LEN(vme.Path)) AS Domain,
    vpr.ObjectName AS Object, 
    vpr.CounterName AS Counter, 
    vpri.InstanceName AS Instance, 
    ROUND(pvpr.SampleValue, 0) AS Value
FROM 
    Perf.vPerfRaw AS pvpr
    INNER JOIN dbo.vManagedEntity AS vme 
        ON pvpr.ManagedEntityRowId = vme.ManagedEntityRowId
    INNER JOIN dbo.vPerformanceRuleInstance AS vpri 
        ON pvpr.PerformanceRuleInstanceRowId = vpri.PerformanceRuleInstanceRowId
    INNER JOIN dbo.vPerformanceRule AS vpr 
        ON vpr.RuleRowId = vpri.RuleRowId
    INNER JOIN BaseSet 
        ON pvpr.ManagedEntityRowId = BaseSet.ManagedEntityRowId 
        AND pvpr.PerformanceRuleInstanceRowId = BaseSet.PerformanceRuleInstanceRowId 
        AND pvpr.DateTime = BaseSet.DateTime
WHERE 
    (
        vpr.ObjectName = 'Processor Information' 
        OR vpr.ObjectName = 'System'
        OR vpr.ObjectName = 'Memory'
        OR vpr.ObjectName = 'LogicalDisk'
        OR vpr.ObjectName = 'Network Adapter'
    )
    AND pvpr.DateTime > DATEADD(MINUTE, -10, GETDATE())
	

The CTE (Common Table Expression) named BaseSet represents the latest sample per counter and managed entity over the last 10 minutes. This CTE compensates for delays and for the irregular copying of data from the staging table Perf.PerformanceStage into the Perf.PerfRaw_* tables.
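
The view can be sanity-checked directly in SQL Server Management Studio before wiring up the exporter (servername1 is a hypothetical computer name):

SELECT TOP (20) [Path], [Domain], [Object], [Counter], [Instance], [Value]
FROM [OperationsManagerDW].[dbo].[vGrafanaCommonDataset]
WHERE [Path] LIKE 'servername1%'
ORDER BY [Object], [Counter];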

Setup SQL Exporter

SQL Exporter can be set up on any computer that has a connection to the SCOM data warehouse server. I placed the service right on the “SCOMDWSRV” server:

New-Service -name 'SqlExporterSvc' -BinaryPathName '"C:\Program Files\sql_exporter\sql_exporter.exe" --config.file "C:\Program Files\sql_exporter\sql_exporter.yml"' -StartupType 'Automatic' -DisplayName 'Prometheus SQL Exporter'

Be sure the user account of the ‘SqlExporterSvc’ service has the “db_datareader” role in the “OperationsManagerDW” database. I prefer to use a Group Managed Service Account (gMSA) for any Windows service whenever possible. In this case I also configured a gMSA for the ‘SqlExporterSvc’ service; however, describing that setup is outside the scope of this article. The only thing I want to mention is that I registered Service Principal Names for that account:

HTTP/SCOMDWSRV:9399
HTTP/SCOMDWSRV.contoso.com:9399
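
For reference, a sketch of the SPN registration and the database grant, assuming a hypothetical gMSA named CONTOSO\gmsa-sqlexp$ that already has a login on the SQL instance:

# Register the SPNs listed above for the service account
$Account = 'CONTOSO\gmsa-sqlexp$'
setspn -S HTTP/SCOMDWSRV:9399 $Account
setspn -S HTTP/SCOMDWSRV.contoso.com:9399 $Account

# Map the login into the warehouse database and grant read access
# (requires the SqlServer PowerShell module)
Invoke-Sqlcmd -ServerInstance 'SCOMDWSRV' -Query @"
USE [OperationsManagerDW];
CREATE USER [$Account] FOR LOGIN [$Account];
ALTER ROLE [db_datareader] ADD MEMBER [$Account];
"@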

Configuration file for the service sql_exporter.yml (GitHub link):

# Global defaults.
global:
  # If scrape_timeout <= 0, no timeout is set unless Prometheus provides one. The default is 10s.
  scrape_timeout: 40s
  # Subtracted from Prometheus' scrape_timeout to give us some headroom and prevent Prometheus from timing out first.
  scrape_timeout_offset: 500ms
  # Minimum interval between collector runs: by default (0s) collectors are executed on every scrape.
  min_interval: 0s
  # Maximum number of open connections to any one target. Metric queries will run concurrently on multiple connections,
  # as will concurrent scrapes.
  max_connections: 3
  # Maximum number of idle connections to any one target. Unless you use very long collection intervals, this should
  # always be the same as max_connections.
  max_idle_connections: 3
  # Maximum amount of time a connection may be reused. Expired connections may be closed lazily before reuse.
  # If 0, connections are not closed due to a connection's age.
  max_connection_lifetime: 5m

# The target to monitor and the collectors to execute on it.
target:
  # Data source name always has a URI schema that matches the driver name. In some cases (e.g. MySQL)
  # the schema gets dropped or replaced to match the driver expected DSN format.
  data_source_name: 'sqlserver://SCOMDWSRV:1433'

  # Collectors (referenced by name) to execute on the target.
  # Glob patterns are supported (see <https://pkg.go.dev/path/filepath#Match> for syntax).
  collectors: [mssql_*,scom_database]

# Collector files specifies a list of globs. One collector definition is read from each matching file.
# Glob patterns are supported (see <https://pkg.go.dev/path/filepath#Match> for syntax).
collector_files: 
  - "*.collector.yml"

mssql_standard.collector.yml (GitHub link). I left it at its defaults, just for statistics:

# A collector defining standard metrics for Microsoft SQL Server.
#
# It is required that the SQL Server user has the following permissions:
#
#   GRANT VIEW ANY DEFINITION TO
#   GRANT VIEW SERVER STATE TO
#
collector_name: mssql_standard

# Similar to global.min_interval, but applies to the queries defined by this collector only.
#min_interval: 0s

metrics:
  - metric_name: mssql_local_time_seconds
    type: gauge
    help: 'Local time in seconds since epoch (Unix time).'
    values: [unix_time]
    query: |
      SELECT DATEDIFF(second, '19700101', GETUTCDATE()) AS unix_time

  - metric_name: mssql_connections
    type: gauge
    help: 'Number of active connections.'
    key_labels:
      - db
    values: [count]
    query: |
      SELECT DB_NAME(sp.dbid) AS db, COUNT(sp.spid) AS count
      FROM sys.sysprocesses sp
      GROUP BY DB_NAME(sp.dbid)

The part of scom_database.collector.yml (the full version of the file is available on GitHub):

# A collector defining SCOM performance metrics exposed through the
# [OperationsManagerDW].[dbo].[vGrafanaCommonDataset] view.
#
# The SQL Server user needs the db_datareader role in the OperationsManagerDW database.
#
collector_name: scom_database

# Similar to global.min_interval, but applies to the queries defined by this collector only.
#min_interval: 0s

metrics:
  - metric_name: windows_cpu_processor_performance_ratio
    type: gauge
    help: '% Processor Time'
    key_labels:
      - Path
      - Domain
      - Object
      - Counter
    values:
      - Value
    query: |
      SELECT [Path],[Domain],[Object],[Counter],[Value]
      FROM [OperationsManagerDW].[dbo].[vGrafanaCommonDataset]
      where Counter = '% Processor Time'

# the full version of the file is available on GitHub: https://github.com/degotkov/scomgrafanaproject/blob/main/scom_database.collector.yml

After this, SQL Exporter is available at http://SCOMDWSRV:9399/metrics.
Note that the exporter also supports TLS for enhanced security.
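
A quick way to verify the endpoint from PowerShell (a sketch; any HTTP client will do):

# Fetch the metrics page and show the first few SCOM-derived series
$Response = Invoke-WebRequest -Uri 'http://SCOMDWSRV:9399/metrics' -UseBasicParsing
$Response.Content -split "`n" | Select-String -Pattern '^windows_' | Select-Object -First 5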

At this point the performance data can already be scraped by Prometheus. As noted above, in my case VictoriaMetrics had to be used to scrape data from the exporter.

VictoriaMetrics agent

VictoriaMetrics does not provide a native Windows service installer or built-in functionality to run directly as a Windows service; however, its components can be run as Windows services with third-party tools. To run VMagent as a Windows service I used Windows Service Wrapper (WinSW). For the VMagent instance I reused the existing sql_exporter directory.

The installation procedure:

  • Download the latest version of the agent, vmutils-windows-amd64-vX.XXX.X.zip, from GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics/releases
  • Extract vmagent-windows-amd64-prod.exe and place it under C:\Program Files\sql_exporter
  • Download WinSW and place it in the same directory as C:\Program Files\sql_exporter\VMAgent.exe (WinSW expects the wrapper executable and its XML configuration to share the same base name)
  • Create the C:\Program Files\sql_exporter\VMAgent.xml (GitHub link) configuration file and fill it with your specific parameters:
<service>
  <id>VMAgent</id>
  <name>VMAgent</name>
  <description>VictoriaMetrics agent (VMagent) pushing SCOM metrics to VictoriaMetrics</description>
  <executable>%BASE%\vmagent-windows-amd64-prod.exe</executable>
  <onfailure action="restart" delay="10 sec"/>
  <onfailure action="restart" delay="20 sec"/>
  <resetfailure>1 hour</resetfailure>
  <arguments>-envflag.enable</arguments>
  <priority>Normal</priority>
  <stoptimeout>15 sec</stoptimeout>
  <stopparentprocessfirst>true</stopparentprocessfirst>
  <startmode>Automatic</startmode>
  <waithint>15 sec</waithint>
  <sleeptime>1 sec</sleeptime>
  <logpath>%BASE%\logs</logpath>
  <log mode="roll">
    <sizeThreshold>10240</sizeThreshold>
    <keepFiles>8</keepFiles>
  </log>
  <env name="loggerFormat" value="json" />
  <env name="loggerOutput" value="stderr" />
  <env name="remoteWrite_oauth2_tokenUrl" value="https://<your_victoria_metrics_host>/oauth/token" />
  <env name="remoteWrite_oauth2_clientID" value="<your_oauth2_clientID>" />
  <env name="remoteWrite_oauth2_clientSecret" value="your_oauth2_clientSecret" />
  <env name="remoteWrite_oauth2_scopes" value="metrics.write" />
  <env name="remoteWrite_url" value="https://<your_victoria_metrics_host>/insert/prometheus" />
  <env name="remoteWrite_maxDiskUsagePerURL" value="52428800" />
  <env name="promscrape_suppressScrapeErrors" value="false" />
  <env name="promscrape_config" value="C:\Program Files\sql_exporter\promscrape.yml" />
 </service>
& "C:\Program Files\sql_exporter\VMAgent.exe" install VMAgent.xml
Get-Service VMAgent | Start-Service
  • Since the XML file contains secrets, remove Users and TrustedInstaller access from the directory (GitHub link):
# Restrict folder access to system administrators to ensure users cannot read secrets
$Folder = "C:\Program Files\sql_exporter"
# Break permissions inheritance, converting inherited rules into explicit ones
$Acl = Get-Acl -Path $Folder
$Acl.SetAccessRuleProtection($True, $True)
Set-Acl -Path $Folder -AclObject $Acl
# Remove all rules belonging to the listed identities; RemoveAccessRuleAll matches
# on identity and access type, so the "Read" right here is just a placeholder
$Acl = Get-Acl -Path $Folder
$Users = "BUILTIN\Users", "NT SERVICE\TrustedInstaller"
foreach ($User in $Users)
{
    $Rule = New-Object System.Security.AccessControl.FileSystemAccessRule($User, "Read", "Allow")
    $Acl.RemoveAccessRuleAll($Rule)
}
Set-Acl -Path $Folder -AclObject $Acl
  • Create a scrape config for VMagent, promscrape.yml (GitHub link):
global:
  scrape_interval: 60s
  scrape_timeout: 40s
  external_labels: 
    source: SCOM
scrape_configs:
  - job_name: scom_exporter
    static_configs:
      - targets: ['SCOMDWSRV:9399']
        labels:
          environment: TEST
  • Start VMagent:
Get-Service VMAgent | Start-Service

VictoriaMetrics Agent can now scrape data from the SQL exporter and populate the database with metrics.
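
Once the service is running, a quick sanity check can be done against VMagent's own HTTP endpoint (a sketch, assuming the default -httpListenAddr of :8429):

# Confirm the service state
Get-Service VMAgent

# VMagent serves diagnostics over HTTP; /targets lists the scrape targets and their health
Invoke-WebRequest -Uri 'http://localhost:8429/targets' -UseBasicParsing |
    Select-Object -ExpandProperty Content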

The setup of the VictoriaMetrics host and its database, along with the Grafana integration, is beyond the scope of this article. As mentioned earlier, the focus is on illustrating the workflow for transforming SCOM performance data into Prometheus-compatible metrics.

Final result

Here’s what we have in the end.

SQL exporter folder structure:

Metrics on http://SCOMDWSRV:9399/metrics page:

# HELP mssql_connections Number of active connections.
# TYPE mssql_connections gauge
mssql_connections{db="OperationsManagerDW"} 29
mssql_connections{db="ReportServer"} 4
mssql_connections{db="master"} 38
mssql_connections{db="msdb"} 4
# HELP mssql_local_time_seconds Local time in seconds since epoch (Unix time).
# TYPE mssql_local_time_seconds gauge
mssql_local_time_seconds 1.764456287e+09
# HELP scrape_errors_total Total number of scrape errors per job, target, collector and query
# TYPE scrape_errors_total counter
scrape_errors_total{collector="mssql_standard",job="",query="mssql_connections",target=""} 35
scrape_errors_total{collector="mssql_standard",job="",query="mssql_local_time_seconds",target=""} 87
scrape_errors_total{collector="scom_database",job="",query="windows_cpu_processor_performance_ratio",target=""} 145
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_free_mbytes",target=""} 187
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_free_ratio",target=""} 166
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_requests_queued",target=""} 139
scrape_errors_total{collector="scom_database",job="",query="windows_memory_committed_used_ratio",target=""} 147
scrape_errors_total{collector="scom_database",job="",query="windows_memory_physical_used_ratio",target=""} 145
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_received_per_second",target=""} 152
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_sent_per_second",target=""} 161
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_total_per_second",target=""} 173
scrape_errors_total{collector="scom_database",job="",query="windows_net_current_bandwidth_ratio",target=""} 159
scrape_errors_total{collector="scom_database",job="",query="windows_system_processor_queue_length",target=""} 133
scrape_errors_total{collector="scom_database",job="",query="windows_system_uptime_total",target=""} 49
# HELP windows_cpu_processor_performance_ratio % Processor Time
# TYPE windows_cpu_processor_performance_ratio gauge
windows_cpu_processor_performance_ratio{Counter="% Processor Time",Domain="domain1.name",Object="Processor Information",Path="servername1.domain1.name"} 8
windows_cpu_processor_performance_ratio{Counter="% Processor Time",Domain="domain2.name",Object="Processor Information",Partner="",Path="servername2.domain2.name"} 4
...
windows_logical_disk_free_mbytes{Counter="Free Megabytes",Domain="domain1.name",Instance="R:",Object="LogicalDisk",Path="servername1.domain1.name"} 5052
windows_logical_disk_free_mbytes{Counter="Free Megabytes",Domain="domain2.name",Instance="C:",Object="LogicalDisk",Path="servername2.domain2.name"} 11790
...
windows_system_uptime_total{Counter="Uptime",Domain="domain3.name",Object="System",Path="servername1.domain3.name"} 1.085932e+06

Grafana panel:

Conclusion

This article has highlighted some features of the ETL process that gathers performance data collected by Microsoft System Center Operations Manager (SCOM) agents and puts it into a Prometheus-like database to be shown on Grafana dashboards:

  • tuning of SCOM performance collection rules;
  • development of metrics;
  • preparing data in SQL database with buffering;
  • configuration of Windows services.

No doubt some steps can be optimized or even omitted; ultimately, it depends on the specific case.
