15.6. Automated Analysis: The Digital Immune System
In the previous sections, I detailed the basic principles of manual malicious code analysis. This chapter would not be complete without a discussion of automated code analysis techniques, such as the Digital Immune System operated by Symantec. DIS was developed by IBM Research starting around 199518. There are three major analyzer components of the system, supporting DOS viruses, macro viruses, and Win32 viruses.
DIS supports automated definition delivery to newly emerging threats via the Internet, end-to-end. Figure 15.27 shows a high-level data flow of DIS.
Figure 15.27. A high-level view of the Digital Immune System.
There are a number of inputs to the system from the customer side to the vendor side via the cluster of customer gateways. Obviously, there are a number of firewalls built in on both the customer side and the vendor side, but these are not shown to simplify the picture19. The system developed by IBM can handle close to 100,000 submissions per day.
The input to the system is a suspicious sample, such as a possibly infected file, which is collected by heuristics built into antivirus clients. The output is a definition that is delivered to the client who submitted the suspicious object for analysis.
Several clients can communicate with a quarantine server at corporate customer sides. The quarantine server synchronizes definitions with the vendor and pushes the new definitions to the clients. Individual end users also can submit submissions to the system via their built-in AV quarantine interface. Suspicious samples also can be delivered from attack quarantine honeypot systems9.
The automated analysis center processes the submission and creates definitions that can be used to detect and disinfect new threats. Alternatively, submissions are referred to manual analysis, which is handled by a group of researchers.
The heart of the automated analysis center is based on the use of an automated computer virus replication system. In late 1993, Ferenc Leitold and I realized the need for a system to replicate computer viruses automatically. When we attempted to create a collection of properly replicated samples from a large collection of virus-infected sample sets, we observed that computer virus replication is simply the most time-consuming operation in the process of computer virus analysis20.
A replicator system can run a virus in a controlled way until it infects new objects, such as goat files. The infected objects are collected automatically and stored for future analysis. This kind of controlled replication system was also developed by Marko Helenius at the University of Tampere for the purpose of automated antivirus testing21.
On the other hand, IBM built on the groundwork of replication systems that used virtual machines, such as Bochs (http://bochs.sourceforge.net), in modified forms using the principles of generic disinfection. IBM researchers realized that heuristic generic disinfection (discussed in Chapter 11 "Antivirus Defense Techniques,") was essential to achieving automated definition generation. The principle of generic disinfection is simple: If you know how to disinfect an object, you can detect and disinfect the virus in an automated way.
Figure 15.28 shows the process of automated virus detection and repair definition generation. The input of the system is a sample of malicious code. The output is either an automated definition or a referral to manual analysis, which results in a definition if needed.
Figure 15.28. The automated definition-generation process in DIS.
In the first step, the sample arrives at a Threat Classifier module22. In this step, the filtering process takes place first, analyzing the format of the possibly malicious code and referring it accordingly to a controller module. Unrecognized objects go to manual analysis. The filtering process involves steps that were previously discussed as part of the manual analysis process. It is important to understand that multiple analysis processes can take place simultaneously.
In the second step, a replication controller runs a number of replication sessions. The replicator fires up a set of virtual machines, or alternatively, real systems to test replicate computer viruses. For example, documents containing macros are loaded into an environment in which Microsoft Office products are available. The replication process uses modules loaded into the system that run the viruses. The virtual machines run monitoring tools that track file and Registry changes, as well as network activity, and save such information for further analysis. The replicator loads and runs more than one environment by starting with a clean state each time until a predefined number of steps or until the virus is successfully replicated.
If insufficient information is collected about the computer virus in any of the test environments, the controller sends the samples to manual analysis. Otherwise, the controller passes information to the analyzer module. In turn, the analyzer checks the data, such as the infected goat files, and attempts to extract detection strings23 from them (or uses alternative methods). If this step fails, for example if the virus is metamorphic, the replicated sample set will be forwarded to manual analysis.
If the analyzer can create definitions to detect and disinfect the virus, it passes the definition to a builder module. The builder takes the source code of the definition and compiles it to new binary definitions. At this point, a temporary name is assigned to the new viral threat automatically. The temporary name is later changed based on classification by a researcher.
Finally, the builder passes the compiled definitions to a tester module. The tester module double-checks the correctness of definition and tests it for false positives. If a problem is detected in any of the previous steps, the sample set is forwarded to manual analysis. Otherwise, the definition is ready and is forwarded to the definition server and then to the system that submitted the sample.
For example, the W32/Swen.A@mm worm was automatically handled by DIS as Worm.Automat.AHB. There is nothing more fascinating when there are no humans required to respond to an outbreak.