SCOM performance data on Grafana dashboards

  1. Background
  2. Schema
  3. Development of metrics
  4. SCOM adjustments
  5. Setup SQL Exporter
  6. VictoriaMetrics agent
  7. Final result
  8. Conclusion

The purpose of this post is to outline the technical considerations behind a solution that visualizes performance data collected by Microsoft System Center Operations Manager (SCOM) agents on Grafana dashboards.

There are already several solutions for displaying SCOM data in Grafana. All of them use a direct connection to the SCOM databases hosted on MS SQL instances as the data source for Grafana.

However, I was challenged with a different case.

Background

The company found SCOM performance views very inconvenient. Together with SCOM reporting, they are a nightmare for anyone unwilling to dive into the classes and instances of monitored objects. They were also unhappy with the approximated data on charts, caused by the default tolerance setting in performance collection rules. Another annoyance was the performance of the SCOM databases, which held hundreds of gigabytes of data and served thousands of agents. Generally, it is the usual story.

At that moment the company was already using VictoriaMetrics with Grafana to display data from some network devices. Grafana is one of the best tools for visualizing performance data stored as metrics in a Prometheus-like database: it can show large amounts of data simultaneously and perform transformations on the fly, with impressive performance. So the plan was to use the VictoriaMetrics agent (VMagent) to push data into the VictoriaMetrics database, using the authorization provided by VMagent.

OK, challenge accepted.

Schema

I selected SQL Exporter for Prometheus as the tool to transform performance data collected by SCOM agents into metrics. SQL Exporter is a configuration-driven exporter that exposes metrics gathered from databases. In my case, SQL Exporter had to expose data stored in the SCOM data warehouse database. It is also possible to get the data from the SCOM operational database; however, the warehouse database contains equally fresh performance data, because most SCOM performance collection rules have a second write action that publishes data into the warehouse. Reading from the warehouse therefore avoids any impact on the performance of the operational database.

So, the data flow should look like this:

Windows → SCOM agent → SCOM warehouse database → SQL Exporter for Prometheus → VictoriaMetrics agent → VictoriaMetrics database → Grafana dashboard.

Development of metrics

It is important to follow the naming rules defined in the Data model and Metric and label naming articles of the Prometheus documentation.
SCOM can provide data “ready to use” by doing calculations on the agents; this allows us to avoid math on the Grafana dashboard and to use the GAUGE metric type:

| Metric | Type | SCOM Object/Counter |
|---|---|---|
| windows_cpu_processor_performance_ratio | gauge | Processor Information/% Processor Time |
| windows_system_processor_queue_length | gauge | System/Processor Queue Length |
| windows_memory_physical_used_ratio | gauge | Memory/PercentMemoryUsed |
| windows_memory_committed_used_ratio | gauge | Memory/% Committed Bytes In Use |
| windows_logical_disk_free_ratio | gauge | LogicalDisk/% Free Space |
| windows_logical_disk_requests_queued | gauge | LogicalDisk/Current Disk Queue Length |
| windows_logical_disk_free_mbytes | gauge | LogicalDisk/Free Megabytes |
| windows_net_bytes_received_per_second | gauge | Network Adapter/Bytes Received/sec |
| windows_net_bytes_sent_per_second | gauge | Network Adapter/Bytes Sent/sec |
| windows_net_bytes_total_per_second | gauge | Network Adapter/Bytes Total/sec |
| windows_net_current_bandwidth_ratio | gauge | Network Adapter/PercentBandwidthUsedTotal |
| windows_system_uptime_total | counter | System/Uptime |

The second step is to define labels:

| Label | Description |
|---|---|
| Path | The name of the computer containing a monitored object. |
| Object | A monitored object. |
| Instance | The instance of a monitored object (when there are several instances on the same computer). |
| Counter | A performance counter or calculated field. |
| Domain | The Active Directory domain (calculated from the Path field). |

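With these metrics and labels in place, a typical Grafana panel query needs no additional math. A couple of PromQL examples (the label values are hypothetical):

# Disk free space, in percent, for every logical disk of one computer
windows_logical_disk_free_ratio{Path="servername1.domain1.name"}

# The same counter averaged per Active Directory domain
avg by (Domain) (windows_logical_disk_free_ratio)
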
SCOM adjustments

First, the performance collection rules need to be tuned (via overrides) so that all data points arrive at the desired interval. This is achieved by disabling tolerance and adjusting the frequency:

| Rule | Object/Counter | Frequency | Maximum Sample Separation | Tolerance |
|---|---|---|---|---|
| Processor % Processor Time Total Windows Server 2016 and above | Processor Information/% Processor Time | 60 | 1 | 0 |
| System Processor Queue Length Windows Server 2016 and above | System/Processor Queue Length | 60 | 1 | 0 |
| Memory % Committed Bytes in Use Windows Server 2016 and above | Memory/% Committed Bytes In Use | 300 | 1 | 0 |
| Percent Memory Used | Memory/PercentMemoryUsed | 60 | 1 | 0 |
| % Logical Disk Free Space Windows Server 2016 and above | LogicalDisk/% Free Space | 300 | 1 | 0 |
| Collection rule for Current Disk Queue Length Windows Server 2016 and above | LogicalDisk/Current Disk Queue Length | 300 | 1 | 0 |
| Logical Disk Free Megabytes Windows Server 2016 and above | LogicalDisk/Free Megabytes | 300 | 1 | 0 |
| Network Adapter Bytes Received per Second Windows Server 2016 and above | Network Adapter/Bytes Received/sec | 60 | 1 | 0 |
| Network Adapter Bytes Sent per Second Windows Server 2016 and above | Network Adapter/Bytes Sent/sec | 60 | 1 | 0 |
| Percent Bandwidth Used Total | Network Adapter/PercentBandwidthUsedTotal | 60 | | |
| Network Adapter Bytes Total per Second Windows Server 2016 and above | Network Adapter/Bytes Total/sec | 60 | 1 | 0 |
| System Uptime Collection Rule | System/Uptime | 600 | | |

Of course, we need to keep in mind that “zero tolerance” in performance collection rules makes the databases grow faster, so proper capacity management is needed here.
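
For reference, overrides like those in the table can also be created programmatically with the OperationsManager PowerShell module. Below is a minimal sketch for a single rule; the unsealed management pack name, the target class, and the override element names are assumptions for illustration:

Import-Module OperationsManager

# An unsealed management pack that stores the overrides (assumed to exist)
$mp     = Get-SCOMManagementPack -DisplayName 'Grafana Overrides' | Where-Object { -not $_.Sealed }
$rule   = Get-SCOMRule -DisplayName 'Processor % Processor Time Total Windows Server 2016 and above'
$target = Get-SCOMClass -DisplayName 'Windows Server 2016 and above Operating System'

# Override Frequency/Tolerance/MaximumSampleSeparation as in the table above
foreach ($param in @{ Frequency = '60'; Tolerance = '0'; MaximumSampleSeparation = '1' }.GetEnumerator())
{
    $override = New-Object Microsoft.EnterpriseManagement.Configuration.ManagementPackRuleConfigurationOverride($mp, "Override.$($param.Key).$($rule.Name)")
    $override.Rule        = $rule
    $override.Parameter   = $param.Key
    $override.Value       = $param.Value
    $override.Context     = $target
    $override.DisplayName = "Grafana override: $($param.Key)"
}
$mp.Verify()
$mp.AcceptChanges()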

Then I created a special view in the data warehouse database to unify the data source for SQL Exporter: [OperationsManagerDW].[dbo].[vGrafanaCommonDataset].

The SQL view's definition (GitHub link):

WITH BaseSet AS (
    SELECT 
        PerformanceRuleInstanceRowId, 
        ManagedEntityRowId, 
        MAX(DateTime) AS DateTime
    FROM 
        Perf.vPerfRaw
    WHERE 
        DateTime > DATEADD(MINUTE, -10, GETDATE())
    GROUP BY 
        PerformanceRuleInstanceRowId, ManagedEntityRowId
)
SELECT
    vme.Path, 
    SUBSTRING(vme.Path, CHARINDEX('.', vme.Path) + 1, LEN(vme.Path)) AS Domain,
    vpr.ObjectName AS Object, 
    vpr.CounterName AS Counter, 
    vpri.InstanceName AS Instance, 
    ROUND(pvpr.SampleValue, 0) AS Value
FROM 
    Perf.vPerfRaw AS pvpr
    INNER JOIN dbo.vManagedEntity AS vme 
        ON pvpr.ManagedEntityRowId = vme.ManagedEntityRowId
    INNER JOIN dbo.vPerformanceRuleInstance AS vpri 
        ON pvpr.PerformanceRuleInstanceRowId = vpri.PerformanceRuleInstanceRowId
    INNER JOIN dbo.vPerformanceRule AS vpr 
        ON vpr.RuleRowId = vpri.RuleRowId
    INNER JOIN BaseSet 
        ON pvpr.ManagedEntityRowId = BaseSet.ManagedEntityRowId 
        AND pvpr.PerformanceRuleInstanceRowId = BaseSet.PerformanceRuleInstanceRowId 
        AND pvpr.DateTime = BaseSet.DateTime
WHERE 
    (
        vpr.ObjectName = 'Processor Information' 
        OR vpr.ObjectName = 'System'
        OR vpr.ObjectName = 'Memory'
        OR vpr.ObjectName = 'LogicalDisk'
        OR vpr.ObjectName = 'Network Adapter'
    )
    AND pvpr.DateTime > DATEADD(MINUTE, -10, GETDATE())
	

The CTE (Common Table Expression) named BaseSet represents the latest sample per counter and managed entity over the last 10 minutes. This CTE compensates for delays and for the irregular copying of data from the staging table Perf.PerformanceStage into the Perf.PerfRaw_* tables.
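
The view can be sanity-checked directly in SQL Server Management Studio before wiring up the exporter (servername1 is a hypothetical computer name):

SELECT TOP (20) [Path], [Domain], [Object], [Counter], [Instance], [Value]
FROM [OperationsManagerDW].[dbo].[vGrafanaCommonDataset]
WHERE [Path] LIKE 'servername1%'
ORDER BY [Object], [Counter];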

Setup SQL Exporter

SQL Exporter can be set up on any computer that has a connection to the SCOM data warehouse server. I placed the service right on the “SCOMDWSRV” server:

New-Service -name 'SqlExporterSvc' -BinaryPathName '"C:\Program Files\sql_exporter\sql_exporter.exe" --config.file "C:\Program Files\sql_exporter\sql_exporter.yml"' -StartupType 'Automatic' -DisplayName 'Prometheus SQL Exporter'

Be sure the user account of the ‘SqlExporterSvc’ service has the “db_datareader” role in the “OperationsManagerDW” database. I prefer to use a Group Managed Service Account (gMSA) for any Windows service whenever possible. In this case I also configured a gMSA for the ‘SqlExporterSvc’ service; however, describing that setup is outside the scope of this article. The only thing I want to mention is that I registered Service Principal Names for that account:

HTTP/SCOMDWSRV:9399
HTTP/SCOMDWSRV.contoso.com:9399
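
For reference, a sketch of the SPN registration and the database grant, assuming a hypothetical gMSA named CONTOSO\gmsa-sqlexp$ that already has a login on the SQL instance:

# Register the SPNs listed above for the service account
$Account = 'CONTOSO\gmsa-sqlexp$'
setspn -S HTTP/SCOMDWSRV:9399 $Account
setspn -S HTTP/SCOMDWSRV.contoso.com:9399 $Account

# Map the login into the warehouse database and grant read access
# (requires the SqlServer PowerShell module)
Invoke-Sqlcmd -ServerInstance 'SCOMDWSRV' -Query @"
USE [OperationsManagerDW];
CREATE USER [$Account] FOR LOGIN [$Account];
ALTER ROLE [db_datareader] ADD MEMBER [$Account];
"@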

Configuration file for the service sql_exporter.yml (GitHub link):

# Global defaults.
global:
  # If scrape_timeout <= 0, no timeout is set unless Prometheus provides one. The default is 10s.
  scrape_timeout: 40s
  # Subtracted from Prometheus' scrape_timeout to give us some headroom and prevent Prometheus from timing out first.
  scrape_timeout_offset: 500ms
  # Minimum interval between collector runs: by default (0s) collectors are executed on every scrape.
  min_interval: 0s
  # Maximum number of open connections to any one target. Metric queries will run concurrently on multiple connections,
  # as will concurrent scrapes.
  max_connections: 3
  # Maximum number of idle connections to any one target. Unless you use very long collection intervals, this should
  # always be the same as max_connections.
  max_idle_connections: 3
  # Maximum amount of time a connection may be reused. Expired connections may be closed lazily before reuse.
  # If 0, connections are not closed due to a connection's age.
  max_connection_lifetime: 5m

# The target to monitor and the collectors to execute on it.
target:
  # Data source name always has a URI schema that matches the driver name. In some cases (e.g. MySQL)
  # the schema gets dropped or replaced to match the driver expected DSN format.
  data_source_name: 'sqlserver://SCOMDWSRV:1433'

  # Collectors (referenced by name) to execute on the target.
  # Glob patterns are supported (see <https://pkg.go.dev/path/filepath#Match> for syntax).
  collectors: [mssql_*,scom_database]

# Collector files specifies a list of globs. One collector definition is read from each matching file.
# Glob patterns are supported (see <https://pkg.go.dev/path/filepath#Match> for syntax).
collector_files: 
  - "*.collector.yml"

mssql_standard.collector.yml (GitHub link). I left it at its defaults, just for statistics:

# A collector defining standard metrics for Microsoft SQL Server.
#
# It is required that the SQL Server user has the following permissions:
#
#   GRANT VIEW ANY DEFINITION TO
#   GRANT VIEW SERVER STATE TO
#
collector_name: mssql_standard

# Similar to global.min_interval, but applies to the queries defined by this collector only.
#min_interval: 0s

metrics:
  - metric_name: mssql_local_time_seconds
    type: gauge
    help: 'Local time in seconds since epoch (Unix time).'
    values: [unix_time]
    query: |
      SELECT DATEDIFF(second, '19700101', GETUTCDATE()) AS unix_time

  - metric_name: mssql_connections
    type: gauge
    help: 'Number of active connections.'
    key_labels:
      - db
    values: [count]
    query: |
      SELECT DB_NAME(sp.dbid) AS db, COUNT(sp.spid) AS count
      FROM sys.sysprocesses sp
      GROUP BY DB_NAME(sp.dbid)

The part of scom_database.collector.yml (the full version of the file is available on GitHub):

# A collector defining SCOM performance metrics exposed through the
# [OperationsManagerDW].[dbo].[vGrafanaCommonDataset] view.
#
# The SQL Server user needs the db_datareader role in the OperationsManagerDW database.
#
collector_name: scom_database

# Similar to global.min_interval, but applies to the queries defined by this collector only.
#min_interval: 0s

metrics:
  - metric_name: windows_cpu_processor_performance_ratio
    type: gauge
    help: '% Processor Time'
    key_labels:
      - Path
      - Domain
      - Object
      - Counter
    values:
      - Value
    query: |
      SELECT [Path],[Domain],[Object],[Counter],[Value]
      FROM [OperationsManagerDW].[dbo].[vGrafanaCommonDataset]
      where Counter = '% Processor Time'

# the full version of the file is available on GitHub: https://github.com/degotkov/scomgrafanaproject/blob/main/scom_database.collector.yml

After this, SQL Exporter is available at http://SCOMDWSRV:9399/metrics.
Note that the exporter also supports TLS for enhanced security.
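
A quick way to verify the endpoint from PowerShell (a sketch; any HTTP client will do):

# Fetch the metrics page and show the first few SCOM-derived series
$Response = Invoke-WebRequest -Uri 'http://SCOMDWSRV:9399/metrics' -UseBasicParsing
$Response.Content -split "`n" | Select-String -Pattern '^windows_' | Select-Object -First 5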

At this point the performance data can already be scraped by Prometheus. As noted above, in my case VictoriaMetrics had to be used to scrape data from the exporter.

VictoriaMetrics agent

VictoriaMetrics does not provide a native Windows service installer or built-in functionality to run directly as a Windows service; however, its components can be run as Windows services with third-party tools. To run VMagent as a Windows service I used Windows Service Wrapper (WinSW). For the VMagent instance I reused the existing sql_exporter directory.

The installation procedure:

  • Download the latest version of the agent, vmutils-windows-amd64-vX.XXX.X.zip, from GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics/releases
  • Extract vmagent-windows-amd64-prod.exe and place it under C:\Program Files\sql_exporter
  • Download WinSW and place it in the same directory as C:\Program Files\sql_exporter\VMAgent.exe (WinSW expects the wrapper executable and its XML configuration to share the same base name)
  • Create the C:\Program Files\sql_exporter\VMAgent.xml (GitHub link) configuration file and fill it with your specific parameters:
<service>
  <id>VMAgent</id>
  <name>VMAgent</name>
  <description>VictoriaMetrics agent (VMagent) pushing SCOM metrics to VictoriaMetrics</description>
  <executable>%BASE%\vmagent-windows-amd64-prod.exe</executable>
  <onfailure action="restart" delay="10 sec"/>
  <onfailure action="restart" delay="20 sec"/>
  <resetfailure>1 hour</resetfailure>
  <arguments>-envflag.enable</arguments>
  <priority>Normal</priority>
  <stoptimeout>15 sec</stoptimeout>
  <stopparentprocessfirst>true</stopparentprocessfirst>
  <startmode>Automatic</startmode>
  <waithint>15 sec</waithint>
  <sleeptime>1 sec</sleeptime>
  <logpath>%BASE%\logs</logpath>
  <log mode="roll">
    <sizeThreshold>10240</sizeThreshold>
    <keepFiles>8</keepFiles>
  </log>
  <env name="loggerFormat" value="json" />
  <env name="loggerOutput" value="stderr" />
  <env name="remoteWrite_oauth2_tokenUrl" value="https://<your_victoria_metrics_host>/oauth/token" />
  <env name="remoteWrite_oauth2_clientID" value="<your_oauth2_clientID>" />
  <env name="remoteWrite_oauth2_clientSecret" value="your_oauth2_clientSecret" />
  <env name="remoteWrite_oauth2_scopes" value="metrics.write" />
  <env name="remoteWrite_url" value="https://<your_victoria_metrics_host>/insert/prometheus" />
  <env name="remoteWrite_maxDiskUsagePerURL" value="52428800" />
  <env name="promscrape_suppressScrapeErrors" value="false" />
  <env name="promscrape_config" value="C:\Program Files\sql_exporter\promscrape.yml" />
 </service>
& "C:\Program Files\sql_exporter\VMAgent.exe" install VMAgent.xml
Get-Service VMAgent | Start-Service
  • Since the XML file contains secrets, remove Users and TrustedInstaller access from the directory (GitHub link):
# Restrict folder access to system administrators to ensure users cannot read secrets
$Folder = "C:\Program Files\sql_exporter"
# Break permissions inheritance, converting inherited rules into explicit ones
$Acl = Get-Acl -Path $Folder
$Acl.SetAccessRuleProtection($True, $True)
Set-Acl -Path $Folder -AclObject $Acl
# Remove all rules belonging to the listed identities; RemoveAccessRuleAll matches
# on identity and access type, so the "Read" right here is just a placeholder
$Acl = Get-Acl -Path $Folder
$Users = "BUILTIN\Users", "NT SERVICE\TrustedInstaller"
foreach ($User in $Users)
{
    $Rule = New-Object System.Security.AccessControl.FileSystemAccessRule($User, "Read", "Allow")
    $Acl.RemoveAccessRuleAll($Rule)
}
Set-Acl -Path $Folder -AclObject $Acl
  • Create a scrape config for VMagent, promscrape.yml (GitHub link):
global:
  scrape_interval: 60s
  scrape_timeout: 40s
  external_labels: 
    source: SCOM
scrape_configs:
  - job_name: scom_exporter
    static_configs:
      - targets: ['SCOMDWSRV:9399']
        labels:
          environment: TEST
  • Start VMagent:
Get-Service VMAgent | Start-Service

VictoriaMetrics Agent can now scrape data from the SQL exporter and populate the database with metrics.
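
Once the service is running, a quick sanity check can be done against VMagent's own HTTP endpoint (a sketch, assuming the default -httpListenAddr of :8429):

# Confirm the service state
Get-Service VMAgent

# VMagent serves diagnostics over HTTP; /targets lists the scrape targets and their health
Invoke-WebRequest -Uri 'http://localhost:8429/targets' -UseBasicParsing |
    Select-Object -ExpandProperty Content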

The setup of the VictoriaMetrics host and its database, along with the Grafana integration, is beyond the scope of this article. As mentioned earlier, the focus is on illustrating the workflow for transforming SCOM performance data into Prometheus-compatible metrics.

Final result

Here’s what we have in the end.

SQL exporter folder structure:

Metrics on http://SCOMDWSRV:9399/metrics page:

# HELP mssql_connections Number of active connections.
# TYPE mssql_connections gauge
mssql_connections{db="OperationsManagerDW"} 29
mssql_connections{db="ReportServer"} 4
mssql_connections{db="master"} 38
mssql_connections{db="msdb"} 4
# HELP mssql_local_time_seconds Local time in seconds since epoch (Unix time).
# TYPE mssql_local_time_seconds gauge
mssql_local_time_seconds 1.764456287e+09
# HELP scrape_errors_total Total number of scrape errors per job, target, collector and query
# TYPE scrape_errors_total counter
scrape_errors_total{collector="mssql_standard",job="",query="mssql_connections",target=""} 35
scrape_errors_total{collector="mssql_standard",job="",query="mssql_local_time_seconds",target=""} 87
scrape_errors_total{collector="scom_database",job="",query="windows_cpu_processor_performance_ratio",target=""} 145
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_free_mbytes",target=""} 187
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_free_ratio",target=""} 166
scrape_errors_total{collector="scom_database",job="",query="windows_logical_disk_requests_queued",target=""} 139
scrape_errors_total{collector="scom_database",job="",query="windows_memory_committed_used_ratio",target=""} 147
scrape_errors_total{collector="scom_database",job="",query="windows_memory_physical_used_ratio",target=""} 145
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_received_per_second",target=""} 152
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_sent_per_second",target=""} 161
scrape_errors_total{collector="scom_database",job="",query="windows_net_bytes_total_per_second",target=""} 173
scrape_errors_total{collector="scom_database",job="",query="windows_net_current_bandwidth_ratio",target=""} 159
scrape_errors_total{collector="scom_database",job="",query="windows_system_processor_queue_length",target=""} 133
scrape_errors_total{collector="scom_database",job="",query="windows_system_uptime_total",target=""} 49
# HELP windows_cpu_processor_performance_ratio % Processor Time
# TYPE windows_cpu_processor_performance_ratio gauge
windows_cpu_processor_performance_ratio{Counter="% Processor Time",Domain="domain1.name",Object="Processor Information",Path="servername1.domain1.name"} 8
windows_cpu_processor_performance_ratio{Counter="% Processor Time",Domain="domain2.name",Object="Processor Information",Partner="",Path="servername2.domain2.name"} 4
...
windows_logical_disk_free_mbytes{Counter="Free Megabytes",Domain="domain1.name",Instance="R:",Object="LogicalDisk",Path="servername1.domain1.name"} 5052
windows_logical_disk_free_mbytes{Counter="Free Megabytes",Domain="domain2.name",Instance="C:",Object="LogicalDisk",Path="servername2.domain2.name"} 11790
...
windows_system_uptime_total{Counter="Uptime",Domain="domain3.name",Object="System",Path="servername1.domain3.name"} 1.085932e+06

Grafana panel:

Conclusion

This article has highlighted some features of the ETL process that gathers performance data collected by Microsoft System Center Operations Manager (SCOM) agents and puts it into a Prometheus-like database to be shown on Grafana dashboards:

  • tuning of SCOM performance collection rules;
  • development of metrics;
  • preparing data in SQL database with buffering;
  • configuration of Windows services.

No doubt some steps can be optimized or even omitted; ultimately, it depends on the specific case.
