The format of a log dictates its quality, and therefore its value for operational purposes such as troubleshooting, debugging, security and performance monitoring. In the early days, and sadly in many cases still today, log formats were a dog’s breakfast determined entirely by whoever was developing the software. This resulted in lackluster logs and in most IT professionals not really caring about logs. Fast forward to today: while there is no one standard to rule them all, many standards have emerged to support consistent, standardized logging structures and formats within the IT space. Here are 5 standards to know about.
Syslog is at once a logging facility built into the operating system, a transport protocol and a message formatting standard. Arguably the oldest logging standard, Syslog was first documented in RFC 3164 (2001) and later formally standardized in RFC 5424 (2009), which covers the internal mechanics, deployment scenarios, the transport layer and, most importantly, the message format.
The Syslog RFC defines several optional fields, with a particular focus on standardizing how time is expressed, internal mechanics such as log facility and severity, and details on what or who generated the log. The RFC permits several ways to express time, which has arguably resulted in a variety of date formats and headaches.
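To make those header fields concrete, here is a minimal sketch that splits an RFC 5424 style message into its header fields and decodes the priority value into facility and severity (priority = facility × 8 + severity). The sample line is modeled on the example in the RFC; the names `parse_syslog_header`, `HEADER` and `SAMPLE` are illustrative assumptions, and the structured data and message body are deliberately left unparsed.

```python
import re
from datetime import datetime

# Header layout assumed here: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID rest
HEADER = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)$",
    re.DOTALL,
)

# Severity names indexed by their numeric code (0 to 7).
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def parse_syslog_header(line):
    """Return the header fields of an RFC 5424 style line as a dict."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("not an RFC 5424 style message")
    fields = match.groupdict()

    # The priority encodes two values: facility * 8 + severity.
    facility, severity = divmod(int(fields.pop("pri")), 8)
    fields["facility"] = facility
    fields["severity"] = SEVERITIES[severity]

    # "-" is the nil value for an optional field that was not supplied.
    for key in ("timestamp", "hostname", "app_name", "procid", "msgid"):
        if fields[key] == "-":
            fields[key] = None

    # Simplification: handles common ISO 8601 style timestamps only.
    if fields["timestamp"] is not None:
        fields["timestamp"] = datetime.fromisoformat(
            fields["timestamp"].replace("Z", "+00:00")
        )
    return fields


SAMPLE = (
    "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - "
    "'su root' failed for lonvick on /dev/pts/8"
)
parsed = parse_syslog_header(SAMPLE)
print(parsed["facility"], parsed["severity"], parsed["hostname"])
```

Older Syslog messages that follow the RFC 3164 timestamp style would not match this pattern, which is a small illustration of the date format headaches mentioned above.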
Many technologies have leveraged internal databases to collect, store and search through logs. Good examples are Microsoft Windows Event Viewer, McAfee ePO, Check Point firewalls and many other technologies. There are pros and cons to this model. On the upside, a defined schema means data structure and formatting are predefined and easy to search through. On the downside, there will likely be capacity and retention limitations, plus additional complications when sending logs to an external system such as a SIEM.
The road to hell is paved with good intentions, and the intent to standardize logging makes a lot of sense as long as we can all agree. However (see the XKCD comic), we are not very good at standardizing standards.
Semi-structured logging standards focus on standardizing what values are captured and how they are loosely structured within a single log message. Good examples of these standards are Common Log Format, W3C Extended Log Format and a few semi-proprietary semi-structured formats such as Graylog Extended Log Format. All of these standards are flexible about what data is included and are mostly focused on structuring the message in a recognizable format for easier parsing. These semi-structured formats typically separate data values with delimiters such as commas, spaces or quotes.
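As an illustration of how delimiters make such a message parseable, the sketch below pulls the fields out of a Common Log Format line with a regular expression. The pattern, the `parse_clf` function and the sample line are assumptions made for this example, not a complete or official parser.

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status size
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line):
    """Return the fields of one Common Log Format line, or None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no bytes were returned to the client.
    fields["size"] = None if fields["size"] == "-" else int(fields["size"])
    return fields


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
entry = parse_clf(sample)
print(entry["host"], entry["request"], entry["status"])
```

Spaces separate most values, while brackets and quotes protect the two fields that themselves contain spaces. The timestamp is kept as a string here to keep the sketch short.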
Structured logging standards try to bring the best of structured database logging into a flexible semi-structured format through the use of key-value pairs. ArcSight, a trailblazer in the logging/SIEM industry, set out to create a white paper standard called the Common Event Format (CEF), which could be used across any technology and was intended to make life better. And for a time it was good. However, Splunk, a leader in the logging/SIEM space, did not want to use a competitor’s standard and created its own Common Information Model (CIM), which was almost identical to CEF in concept. Yet another standard, see XKCD comic.
And more recently, Elastic, an up-and-coming competitor to Splunk, came up with yet another log format standard called Elastic Common Schema (ECS), which is more or less conceptually identical to CEF and CIM. #facepalm.
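To show what the key-value approach looks like in practice, here is a rough sketch of parsing a CEF message: a pipe-delimited header followed by an extension made of key=value pairs. The function `parse_cef`, the snake_case header names and the sample event are assumptions for illustration, and the handling of escaped characters is simplified.

```python
import re

# The seven pipe-delimited header fields of a CEF message, in order.
HEADER_FIELDS = [
    "version", "device_vendor", "device_product", "device_version",
    "device_event_class_id", "name", "severity",
]


def parse_cef(line):
    """Split a CEF line into its header fields and a dict of extension pairs."""
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF message")
    # Split on pipes that are not escaped with a backslash (simplified).
    parts = re.split(r"(?<!\\)\|", line[4:], maxsplit=7)
    if len(parts) < 7:
        raise ValueError("incomplete CEF header")
    event = dict(zip(HEADER_FIELDS, parts[:7]))
    extension = parts[7] if len(parts) > 7 else ""
    # A value runs until the next " key=" or the end of the line,
    # so values may contain spaces.
    pairs = re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", extension)
    event["extension"] = dict(pairs)
    return event


sample = (
    "CEF:0|Security|threatmanager|1.0|100|worm successfully stopped|10|"
    "src=10.0.0.1 dst=2.1.2.2 spt=1232"
)
event = parse_cef(sample)
print(event["name"], event["extension"]["src"])
```

Because every value travels with its key, a parser does not need to know the field order in advance, which is the main appeal of this family of formats.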
Last but not least, and becoming a de facto log structure standard, is writing logs in JSON format. Building on the power and benefits of XML, JSON has become a popular option for two reasons. First, many log management/SIEM solutions have moved to a NoSQL database backend, which makes ingesting JSON documents into their database easy and efficient. Second, JSON is hierarchical and object oriented, which makes it great for storing relational data, and it is very developer friendly.
However, there are some downsides to using JSON, at least on its own. NoSQL databases became popular because you could write data without knowing the schema. That made things easy to ingest, but more challenging to query. When writing logs in JSON format without aligning to a formatting standard/schema, developers have the freedom to define the key or name as they please. This means that a source IP address field could be called sourceip, src, sip or any other string for that matter. Inconsistent keys, names and fields make it hard to write queries that return complete results and to use more advanced features such as visualization, ML and automation.
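One common way to tame this is to map the ad hoc keys onto a single schema at ingest time. The sketch below renames the three spellings mentioned above to `source.ip`, the field name ECS uses for a source address. The `ALIASES` table and the `normalize` function are hypothetical, and a real pipeline would need a much larger mapping.

```python
import json

# Hypothetical alias table: ad hoc key (lowercase) -> schema field name.
ALIASES = {
    "sourceip": "source.ip",
    "src": "source.ip",
    "sip": "source.ip",
}


def normalize(raw_line):
    """Parse one JSON log line and rename known aliases to the schema name."""
    record = json.loads(raw_line)
    return {ALIASES.get(key.lower(), key): value for key, value in record.items()}


lines = [
    '{"sourceip": "10.0.0.1", "action": "deny"}',
    '{"src": "10.0.0.1", "action": "deny"}',
    '{"SIP": "10.0.0.1", "action": "deny"}',
]
normalized = [normalize(line) for line in lines]

# A single query on "source.ip" now matches all three records.
hits = [record for record in normalized if record.get("source.ip") == "10.0.0.1"]
print(len(hits))
```

Without the mapping step, a search on any one of the original key names would have found only one of the three events.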