How Machine Learning is Helping Combat Insider Threats
The volume of data that federal agencies must collect, store and analyze to manage insider threat programs is growing so fast that it threatens to outpace their ability to analyze it, which could undermine those programs’ effectiveness. Sophisticated tools can help, but putting them to work requires skilled people – those with years of experience not only with the tools but with the analytics those tools enable.
Sorting through the data in a reasonable timeframe will require new ways of organizing it, as well as new tools for accelerating search. These new tools incorporate machine learning techniques to identify anomalies and patterns in broad data sets, speeding up investigations and threat detection, industry experts say.
Some of the federal government’s most damaging security breaches over the past decade have involved insiders – cleared government employees and contractors with direct internal access to government facilities, systems and information. In each case, these insiders downloaded massive amounts of classified data.
In the wake of those data breaches and other similar incidents, new standards for insider data collection require agencies to collect and analyze detailed information about the comings, goings and digital behavior of employees and contractors – essentially anyone with access to government systems and facilities. Security standards have now expanded the range of data collected to include physical access information, foreign contact information, foreign travel information, personnel security information and financial disclosure information.
From Gigabytes to Petabytes
All of that information translates into staggering volumes of data. Agencies capture every single log on, log off, file action (open, edit, print, share and more), permission change (especially escalation) and addition or deletion of users. “We’re talking 50,000 events per second today rising to 100,000 events per second before too long,” says David Sarmanian, an enterprise architect with General Dynamics Information Technology (GDIT).
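To put those event rates in daily terms, the arithmetic can be sketched as follows. The average event size used here is purely an assumption for illustration; the article gives rates, not sizes, and real log records vary widely.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Assumption for illustration only: the source does not state an event size.
AVG_EVENT_BYTES = 500


def events_per_day(events_per_second: int) -> int:
    """Convert a sustained per-second event rate into a daily event count."""
    return events_per_second * SECONDS_PER_DAY


def bytes_per_day(events_per_second: int, avg_event_bytes: int) -> int:
    """Estimate raw daily volume for a given rate and average event size."""
    return events_per_day(events_per_second) * avg_event_bytes


for rate in (50_000, 100_000):
    daily_events = events_per_day(rate)
    daily_tb = bytes_per_day(rate, AVG_EVENT_BYTES) / 1e12
    print(f"{rate:,} events/s -> {daily_events:,} events/day, ~{daily_tb:.2f} TB/day")
```

The estimate scales linearly with the assumed event size, so richer records or additional collection points raise the total proportionally.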
Security Information and Event Management (SIEM) systems are the tried-and-true tools for capturing and tracking all that data. Systems such as ArcSight have a proven track record of providing real-time analysis of security alerts generated by network hardware and applications. Built on a relational database backend, ArcSight allows security operations teams to gather events from applications, devices, servers and networks, then load the information into an SQL database where it can be analyzed.
But the exponential growth, over just the past few years, in the amount of data collected has such systems now struggling to keep up, some experts say.
“When the sum of all that data was measured in gigabytes of data per day, that was a good approach,” says Michael Paquette, director of products for security markets with Elastic. “The amount of security data is exploding so much today that large organizations, especially in the government, are not talking about terabytes per day, they are talking about petabytes per day.”
Elastic develops Elasticsearch, a distributed search and analytics engine that is used in a wide range of big data applications, from Netflix on the consumer side to security in the IT world.
“With ArcSight, analysts had to create a rule that might say something like ‘if this IP performs a certain action, notify me or if this user goes to this subnet or computer, notify me,’” says GDIT’s Sarmanian. “But now you have petabytes of data. How do you write a rule that is constantly looking for anomalies? You will crush the system because there is not enough RAM to run the system and do these types of queries.”
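The kind of hand-written rule Sarmanian describes might look like the sketch below. It is not ArcSight syntax; the `Event` fields, rule names, IP addresses and user names are all hypothetical, chosen only to show why every rule has to be spelled out in advance and checked against every event.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Event:
    """A simplified security event (hypothetical schema)."""
    user: str
    src_ip: str
    dst_ip: str
    action: str


def rule_watched_ip_action(event, watched_ip="10.0.0.5", action="file_download"):
    """'If this IP performs a certain action, notify me.'"""
    return event.src_ip == watched_ip and event.action == action


def rule_user_enters_subnet(event, user="jdoe", subnet="10.20.0.0/16"):
    """'If this user goes to this subnet, notify me.'"""
    return event.user == user and ip_address(event.dst_ip) in ip_network(subnet)


RULES = {
    "watched_ip_action": rule_watched_ip_action,
    "user_enters_subnet": rule_user_enters_subnet,
}


def evaluate(events):
    """Check every event against every rule: cost grows with events x rules."""
    alerts = []
    for event in events:
        for name, rule in RULES.items():
            if rule(event):
                alerts.append((name, event))
    return alerts


events = [
    Event("asmith", "10.0.0.5", "10.1.1.1", "file_download"),
    Event("jdoe", "10.0.0.9", "10.20.3.4", "login"),
    Event("jdoe", "10.0.0.9", "10.30.0.1", "login"),
]
alerts = evaluate(events)
for name, event in alerts:
    print(name, event.user, event.dst_ip)
```

Each rule catches only the exact condition someone thought to write down, which is the limitation the quote points to.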
Applying Machine Learning
Security analysts can analyze data through visualization, traditional security rules or machine learning (ML), which can be built on top of fast search engine capability. ML, a branch of artificial intelligence, allows computers to learn or become more proficient at a given task over time. This learning can be either supervised, where analysts help train the system, or unsupervised, where the system learns on its own.
Elastic’s engineers understood that the analysis problem associated with big data is fundamentally a search problem, Paquette explains. Security teams usually perform analysis using a portion of data extracted by search technology.
Elasticsearch uses unsupervised learning in the form of an ML job, Paquette says. The job runs against an index of data or selected fields within that data. Analysts can instruct the ML tool to model the behavior of that data and to notify the security team whenever something unusual occurs. The ML job then runs automatically in the background without any human participation, generating an alert or email each time it detects an anomaly.
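The general idea of modeling a field’s normal behavior and alerting on departures from it can be sketched in a few lines. This is not Elastic’s algorithm or API; it is a minimal stand-in that keeps a running mean and standard deviation (Welford’s method) and flags values far from the baseline. The class name, threshold and warm-up length are assumptions.

```python
import math


class RunningBaseline:
    """Learns a baseline from a stream of values, with no labeled examples."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # how many standard deviations count as unusual
        self.warmup = warmup        # observations to collect before alerting
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def observe(self, value):
        """Return True if value is anomalous relative to the baseline so far."""
        anomalous = False
        if self.n >= self.warmup:
            std = self.std()
            anomalous = std > 0 and abs(value - self.mean) > self.threshold * std
        if not anomalous:
            # Welford update; anomalies are left out so they do not shift the baseline.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        return anomalous


baseline = RunningBaseline()
hourly_logins = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 500, 100]
flags = [baseline.observe(v) for v in hourly_logins]
for value, flagged in zip(hourly_logins, flags):
    if flagged:
        print(f"ALERT: {value} logins in one hour is outside the learned baseline")
```

A production system would also account for seasonality, such as weekday versus weekend patterns, which this sketch ignores.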
Splunk Inc., a San Francisco provider of systems and tools for managing vast amounts of machine data, is another leading player in this market. The Splunk platform performs anomaly detection, adaptive thresholding and predictive analytics by using pre-packaged or custom algorithms to forecast future events.
“Splunk provides a comprehensive answer to one of the biggest challenges facing modern organizations: How does it harness diverse and increasingly profuse amounts of data to gain valuable business insights,” analyst Jason Stamper writes in a report on Splunk and machine learning.
Splunk software can also pull data from the Hadoop Distributed File System (HDFS), traditional databases and any other relevant data source via APIs to provide additional context for the data, according to the report. The software also creates correlations between disparate data sources and normalizes different data types at search time. For instance, a person might appear in log data as an employee number, but appear as a full name in a human resource system. Splunk software helps normalize the ways the data is represented, allowing analysts to take full advantage of the software’s statistical analysis capabilities, which can help them monitor for activities that are statistical outliers at a variety of levels of standard deviation.
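The two steps described above, mapping different representations of a person to one identity and then looking for statistical outliers, can be sketched as follows. This is not Splunk’s search language or API; the directory, employee numbers, names and counts are invented for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical HR directory: employee number -> full name.
HR_DIRECTORY = {
    "E1001": "Alice Smith", "E1002": "Bob Jones", "E1003": "Carol Diaz",
    "E1004": "Dan Wu", "E1005": "Eve Patel", "E1006": "Frank Osei",
}
NAME_TO_ID = {name.lower(): emp_id for emp_id, name in HR_DIRECTORY.items()}


def canonical_id(identifier):
    """Map an employee number or a full name to one canonical employee number."""
    ident = identifier.strip()
    if ident.upper() in HR_DIRECTORY:
        return ident.upper()
    return NAME_TO_ID.get(ident.lower())  # None if the identity is unknown


def activity_counts(records):
    """Sum activity per person across sources that identify people differently."""
    counts = Counter()
    for identifier, count in records:
        emp_id = canonical_id(identifier)
        if emp_id is not None:
            counts[emp_id] += count
    return counts


def outliers(counts, sigmas):
    """Return people whose count is at least `sigmas` standard deviations from the mean."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}
    return {e: (v - mu) / sigma for e, v in counts.items() if abs(v - mu) / sigma >= sigmas}


# File downloads per person, drawn from two differently keyed sources.
records = [
    ("E1001", 4), ("Alice Smith", 6), ("E1002", 12), ("Carol Diaz", 11),
    ("E1004", 9), ("e1005", 10), ("Frank Osei", 150), ("E1006", 50),
]
counts = activity_counts(records)
for level in (2, 3):
    print(f"{level} sigma:", sorted(outliers(counts, level)))
```

Raising or lowering `sigmas` trades sensitivity against false alarms, which is the "variety of levels of standard deviation" mentioned above.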
GDIT security experts use both Splunk and Elastic ML capabilities in their work with federal agencies to establish a baseline of behavior for anomaly and pattern detection, according to GDIT’s Sarmanian.
Hunting for Threats
Another tool gaining traction in the federal space is Sqrrl’s Threat Hunting Platform, built on a foundation of Apache Hadoop and Accumulo, an open source database originally developed at the National Security Agency, Sarmanian notes.
Sqrrl’s threat hunting capabilities are grounded in the concept of a behavior graph combining linked data with various types of advanced algorithms including ML, Bayesian statistics and graph algorithms. Sqrrl goes beyond simple log search and histograms. It provides linked data analysis and visualization capabilities, which involve the fusion of disparate data sources. These use defined ontologies to better enable ad hoc data interrogation, greater contextual awareness, faster search and more intuitive visualization, Sqrrl officials say.
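The notion of linked data can be illustrated with a toy graph in which records from separate sources become edges between typed entities. This is not Sqrrl’s implementation or data model; the entity names, relationship labels and the hop-limited search are assumptions meant only to show how fusing sources lets an analyst pivot from one entity to related ones.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected graph from (entity, relationship, entity) triples."""
    graph = defaultdict(set)
    for a, relation, b in edges:
        graph[a].add((relation, b))
        graph[b].add((relation, a))
    return graph


def neighborhood(graph, start, max_hops):
    """Breadth-first search: every entity within max_hops of start, with its distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for _, other in graph[node]:
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return distance


# Hypothetical records from three separate sources, expressed as triples.
edges = [
    ("user:jdoe", "logged_into", "host:ws-17"),      # authentication logs
    ("user:asmith", "logged_into", "host:ws-17"),    # authentication logs
    ("host:ws-17", "read", "file:plans.pdf"),        # file access logs
    ("user:jdoe", "badged_into", "door:lab-2"),      # physical access system
]
graph = build_graph(edges)
linked = neighborhood(graph, "user:jdoe", max_hops=2)
for entity, hops in sorted(linked.items(), key=lambda item: item[1]):
    print(hops, entity)
```

Starting from one user, two hops are enough to surface the host, the file read on that host and another user who shares it, none of which appear together in any single source.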
Machine Learning Savvy
Insider threats require different types of analytics, says Chris Meenan, director of QRadar Management and Strategy with IBM Security.
“It is not about a specific vulnerability being exploited or a specific malware that has a signature associated with it,” Meenan points out. “It is about someone who has gone rogue, who uses entitlement to do their job, but starts to use that in a malicious way. So that requires analytics that is more behavior-based.”
IBM’s QRadar is a SIEM platform that detects anomalies and uncovers advanced threats by consolidating log events and network flow data from thousands of devices, endpoints and applications distributed throughout a network. ML extends the predictive modeling capabilities of QRadar and its User Behavior Analytics application.
ML helps analysts build models of user behavior. One single threshold cannot be applied across an organization because different groups of people have different job functions, requirements and work habits. A marketing staffer might travel often and log into the network from remote locations multiple times per week, while an administrative employee might only log in from a desktop computer at work. Analysts can build models for these employee profiles that anticipate what kinds of work they do, what types of files and apps they access and whether they are authorized to use privileged accounts. Such model-driven security is becoming more common, Meenan says.
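The point about per-group thresholds can be made concrete with a small sketch. It is not QRadar’s method; the groups, the weekly remote-login counts and the minimum-spread floor are assumptions. The same observed value is normal for one group and unusual for another.

```python
from statistics import mean, stdev


def fit_group_models(history):
    """Fit a (mean, standard deviation) baseline per employee group."""
    return {group: (mean(values), stdev(values)) for group, values in history.items()}


def is_unusual(models, group, value, sigmas=3.0, min_std=0.5):
    """Compare a value to its own group's baseline rather than a global threshold.

    min_std keeps a very uniform group from flagging trivial variation.
    """
    mu, sd = models[group]
    return abs(value - mu) > sigmas * max(sd, min_std)


# Hypothetical remote logins per week for members of each group.
history = {
    "marketing": [4, 5, 6, 5, 4, 6],
    "administrative": [0, 0, 1, 0, 0, 0],
}
models = fit_group_models(history)

for group in history:
    print(group, "5 remote logins unusual?", is_unusual(models, group, 5))
```

Five remote logins in a week fits the marketing baseline but stands out sharply for the administrative group, which is why a single organization-wide threshold would either miss the second case or drown analysts in alerts about the first.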
Sven Krasser, chief scientist with CrowdStrike, a developer of ML-based endpoint security tools, says it’s relatively easy to build a machine learning model for such uses. The hard part is the integration of that machine learning model into a larger infrastructure. “You need to collect and normalize the data,” he says. “You need to have a data strategy where you are collecting data that is useful to machine learning and does not cause breakage.”
Finally, as valuable as ML technology is, it’s important to recognize that it’s a tool – not magic. ML is only effective in situations where there’s enough data generated to provide insight into IT network traffic.
“These are complex problems,” says GDIT’s Sarmanian. “You need to have the right data, you need to understand the complexity of the problem and you need to match the tools to the challenge in such a way that it can scale and adapt over time. Finally, you need to do all that within a finite budget. Get any of those wrong and things can spiral out of control in a heartbeat. This is one of those areas where experience makes a really big difference.”