Continuing on the theme of log standardization, let’s dive into Splunk’s Common Information Model or CIM.
What is CIM?
As discussed in previous posts on logging, there are several standards to support the formatting, structure and transportation of logs. However, as with all standards, they are only as valuable as their ubiquity. Sadly, most data source vendors, such as Microsoft, Cisco and everyone in between, do not follow the same standard. As a result, logs from different data sources describing the same event or activity look different enough that machines and humans alike struggle to interpret the data at scale. Splunk, a market leader in the log management and SIEM space, decided to create (yet another) logging standard to improve usability and performance within their products. Let’s discuss.
Data Models and Data Sets
There are a few logging format and structure standards out there, such as Common Event Format (CEF) or Elastic Common Schema (ECS), which define hundreds of data fields in a single schema. Splunk’s CIM takes a more sophisticated approach, deciding that the “Earth isn’t flat” and neither should its schema be. Instead of one massive schema, CIM defines several “data models” (schemas) based on data source category, and provides further granularity through “data sets,” which focus on the sub-categories of a log source category.
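As a rough sketch of that layered structure, consider the mapping below. The model names are real CIM data models, but the field lists are an abbreviated, illustrative subset rather than the full schema:

```python
# Illustrative subset of CIM data models and the normalized fields
# they define. Model names are real CIM data models; field lists
# are abbreviated for illustration, not the complete schema.
CIM_DATA_MODELS = {
    "Network Traffic": ["action", "bytes", "dest", "dest_port",
                        "src", "src_port", "transport"],
    "Authentication": ["action", "app", "dest", "src", "user"],
    "Web": ["action", "http_method", "status", "url", "src", "dest"],
}

def fields_for(model: str) -> list:
    """Return the (illustrative) field list for a CIM data model."""
    return CIM_DATA_MODELS.get(model, [])
```

Note that `src` and `dest` appear in several models: a field means the same thing everywhere it occurs, which is what makes cross-source correlation possible.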
Normalizing Fields Between Disparate Sources
Like all logging standards, one of the problems CIM tries to solve is normalizing field names between different data sources. The most common example is how different brands of firewalls will log the same event with slightly different nomenclature and formats. For example, one vendor will refer to the IP which initiated a connection as the “source_ip”, whereas another vendor will refer to the same value as “srcip” or, even worse, “sip”! While we humans can translate this in our heads line by line, our computer companions are not as intelligent.
CIM forces the standardization of field names, and of the value types each field can contain, in order to support the features below.
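A toy normalizer makes the idea concrete. The alias table below is hypothetical, not a real vendor survey; `src` and `dest` are the field names CIM actually uses for the originating and destination hosts:

```python
# Map hypothetical vendor-specific field names onto CIM-style names.
# "src"/"dest" are genuine CIM fields; the vendor spellings on the
# left are illustrative examples, not an exhaustive list.
ALIASES = {
    "source_ip": "src",       # vendor A
    "srcip": "src",           # vendor B
    "sip": "src",             # vendor C
    "destination_ip": "dest",
    "dstip": "dest",
}

def normalize(event: dict) -> dict:
    """Rename known vendor fields to their CIM equivalents,
    leaving unknown fields untouched."""
    return {ALIASES.get(key, key): value for key, value in event.items()}
```

After normalization, `{"srcip": "10.0.0.1"}` and `{"sip": "10.0.0.1"}` both become `{"src": "10.0.0.1"}`, so downstream searches and dashboards only ever need to know one name. (In Splunk itself this is typically done with field aliases and extractions in a CIM-compliant add-on, not application code.)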
Easier and Faster Searching
Splunk’s claim to fame was fast “free text” and “parse on query” searching against large amounts of unstructured data. In the early days this was novel and worked well, a welcome change from most data systems, which required data to be structured before it was searchable. However, as data sets grew larger and larger, the limitations of the unstructured data world became obvious.
By standardizing field names regardless of the brand of the data source, paired with a SQL-like query language, users can create queries in minutes rather than potentially spending hours describing every variation of what the desired logs could look like in extremely ugly query statements. Regex, anyone?
The second benefit is performance. While the back-end database within Splunk is still unstructured, the documents within it that leverage CIM have structure, which greatly improves performance, particularly across multiple data nodes.
Splunk Apps, Dashboards and Correlations
Lastly, Splunk’s power goes well beyond searching large sets of data: it can aggregate, correlate and visualize them in dashboards, reports and alerts. Splunk provides excellent capabilities for building custom dashboards, but more importantly offers a marketplace of “apps” whose integrations provide near-turnkey abilities to convert your data into intelligence.
Dashboards and apps rely heavily on data being ingested and stored in the CIM format.
For more information you can read the documentation here: https://docs.splunk.com/Documentation/CIM/4.19.0/User/Overview