11.2. Second-Generation Scanners
11.2.1. Smart Scanning
Smart scanning was introduced as computer virus mutator kits appeared. Such kits typically worked with Assembly source files and tried to insert junk instructions, such as do-nothing NOP instructions, into the source files. The recompiled virus looked very different from its original because many offsets could change in the virus.
Smart scanning skipped instructions like NOP in the host program and did not store such instructions in the virus signature. An effort was made to select an area of the virus body that had no references to data or other subroutines. This enhanced the likelihood of detecting a closely related variant of the virus.
This technique is also useful in dealing with computer viruses that appear in textual form, such as script and macro viruses. These computer viruses can change easily because of extra white space (such as the Space, CR/LF, and TAB characters). Smart scanning can drop these characters from the scanned buffers, which greatly enhances the scanner's detection capabilities.
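The idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular product: the function names are hypothetical, and the NOP filter naively drops every 0x90 byte, whereas a real scanner would have to decode instructions so that 0x90 bytes used as operands or data are left alone.

```python
NOP = 0x90  # single-byte x86 NOP opcode


def strip_nops(buf: bytes) -> bytes:
    """Drop every 0x90 byte (naive: no instruction decoding is done)."""
    return bytes(b for b in buf if b != NOP)


def strip_whitespace(text: str) -> str:
    """Drop spaces, tabs, CR and LF from a script or macro source."""
    return "".join(text.split())


def smart_match(buf: bytes, signature: bytes) -> bool:
    """Search for the signature after removing NOPs from both sides."""
    return strip_nops(signature) in strip_nops(buf)


# Hypothetical byte string standing in for a virus signature.
signature = bytes([0xB8, 0x01, 0x02, 0xCD, 0x13])
# The same code after a mutator inserted junk NOP instructions.
mutated = bytes([0xB8, 0x01, 0x02, 0x90, 0x90, 0xCD, 0x13])
```

A plain substring search misses the mutated sample, while the normalized search still finds it.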
11.2.2. Skeleton Detection
Skeleton detection was invented by Eugene Kaspersky. It is especially useful in detecting macro virus families. Rather than selecting a simple string or a checksum of the set of macros, the scanner parses the macro statements line by line and drops all nonessential statements, as well as the aforementioned white spaces. The result is a skeleton of the macro body that contains only the essential macro code that commonly appears in macro viruses. The scanner uses this information to detect the viruses, enhancing variant detection within the same family.
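A rough sketch of the skeleton idea follows. The keyword list that decides which statements count as "essential" is purely hypothetical, as are the sample macros; an actual scanner would rely on a proper parser of the macro language rather than substring tests.

```python
# Hypothetical list of statement keywords treated as essential.
ESSENTIAL = ("macrocopy", "organizercopy", "export", "import")


def skeleton(source: str) -> str:
    """Reduce macro source to its essential statements, one per line."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        # Drop blank lines and comment lines.
        if not stripped or stripped.startswith("'"):
            continue
        if stripped.lower().startswith("rem "):
            continue
        # Drop all white space and normalize case.
        stmt = "".join(stripped.split()).lower()
        # Keep only statements that contain an essential keyword.
        if any(keyword in stmt for keyword in ESSENTIAL):
            kept.append(stmt)
    return "\n".join(kept)


variant_a = 'Sub AutoOpen\n  \' infect\n  MacroCopy "a", "b"\nEnd Sub\n'
variant_b = 'Sub AutoOpen\r\n\tx = 1\r\n\tMacroCopy   "a","b"\r\nEnd Sub\r\n'
```

Both variants reduce to the same skeleton even though their comments, junk statements, and spacing differ, so a single detection entry can cover both.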
11.2.3. Nearly Exact Identification
Nearly exact identification is used to detect computer viruses more accurately. For example, instead of one string, double-string detection is used for each virus. The following secondary string could be selected from offset 0x7CFC in the previous disassembly to detect Stoned nearly exactly:
0700 BA80 00CD 13EB 4990 B903 00BA 0001
If only one of the strings is found, the scanner can report a Stoned variant but refuse to disinfect it, because it might be an unknown variant that would not be disinfected correctly. Whenever both strings are found, the virus is nearly exactly identified. It could still be a virus variant, but at least the repair of the virus is more likely to be proper. This method is especially safe when combined with additional bookmarks.
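The decision logic of double-string detection can be sketched as follows. The secondary string is the one quoted above; the primary string is passed in as a parameter because it is not repeated in this section, and the names Verdict and classify are hypothetical.

```python
from enum import Enum

# Secondary search string for Stoned, as quoted in the text.
STONED_SECONDARY = bytes.fromhex("0700 BA80 00CD 13EB 4990 B903 00BA 0001")


class Verdict(Enum):
    CLEAN = "no string found"
    POSSIBLE_VARIANT = "one string found: report, but refuse disinfection"
    NEARLY_EXACT = "both strings found: disinfection allowed"


def classify(sector: bytes, primary: bytes,
             secondary: bytes = STONED_SECONDARY) -> Verdict:
    """Classify a sector by how many of the two strings it contains."""
    hits = (primary in sector) + (secondary in sector)
    if hits == 2:
        return Verdict.NEARLY_EXACT
    if hits == 1:
        return Verdict.POSSIBLE_VARIANT
    return Verdict.CLEAN


# Hypothetical placeholder for the primary string, for demonstration only.
demo_primary = bytes.fromhex("DEADBEEF")
both = bytes(10) + demo_primary + bytes(20) + STONED_SECONDARY
only_one = bytes(10) + demo_primary + bytes(36)
```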
Another method of nearly exact identification is based on the use of a checksum (such as a CRC32) of a range that is selected from the virus body. Typically, a disinfection-specific area of the virus body is chosen, and the checksum of the bytes in that range is calculated. The advantage of this method is better accuracy: a longer area of the virus body can be selected, and the relevant information can still be stored without overloading the antivirus database, because the number of bytes to be stored in the database will often be the same for a large range as for a smaller one. Obviously, this is not the case with strings, because longer strings consume more disk space and memory.
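The storage argument can be made concrete with a small sketch. The 8-byte record layout below (offset, length, CRC32) is an assumption made for illustration, not the format of any real antivirus database.

```python
import struct
import zlib


def range_crc(body: bytes, offset: int, length: int) -> int:
    """CRC32 of the selected range of the body."""
    return zlib.crc32(body[offset:offset + length])


def make_entry(body: bytes, offset: int, length: int) -> bytes:
    """Pack offset (2 bytes), length (2 bytes) and CRC32 (4 bytes)."""
    return struct.pack("<HHI", offset, length,
                       range_crc(body, offset, length))


def matches(entry: bytes, candidate: bytes) -> bool:
    """Recompute the CRC32 over the stored range and compare."""
    offset, length, expected = struct.unpack("<HHI", entry)
    return range_crc(candidate, offset, length) == expected


# Hypothetical 512-byte sample standing in for a virus body.
sample = bytes(range(256)) * 2
small = make_entry(sample, 0x10, 16)    # covers 0x10-0x1F
large = make_entry(sample, 0x10, 400)   # covers 0x10-0x19F

# A variant that differs only at offset 0x100.
variant = bytearray(sample)
variant[0x100] ^= 0xFF
variant = bytes(variant)
```

Both entries occupy 8 bytes, yet only the entry with the larger range notices the change at offset 0x100.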
Second-generation scanners can also achieve nearly exact identification without using search strings of any kind, relying only on cryptographic checksums [8] or some sort of hash function.
To make the scanning engine faster, most scanners use some sort of hash. This led to the realization that a hash of the code can replace search string-based detection, provided that a safe hash in the virus can be found. For example, Icelander Fridrik Skulason's antivirus scanner, F-PROT [9], uses a hash function with bookmarks to detect viruses.
Other second-generation scanners, such as the Russian KAV, do not use any search strings. The algorithm of KAV was invented by Eugene Kaspersky. Instead of using strings, the scanner typically relies on two cryptographic checksums, which are calculated at two preset positions and lengths within an object. The virus scanner interprets the database of cryptographic checksums, fetches data into scan buffers according to the object formats, and matches the cryptographic checksums in the fetched data. For example, a buffer might contain the entry-point code of an executable. In that case, the scanner calculates a first and a second cryptographic checksum over the buffer for each database entry that corresponds to entry-point code detections. If only one of the checksums matches, KAV displays a warning about a possible variant of malicious code. If both cryptographic checksums match, the scanner reports the virus with nearly exact identification. The first checksum range is typically optimized to be a small range of the virus body. The second range is larger, to cover the virus body nearly exactly.
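The two-checksum scheme described above might look roughly like the sketch below. It is not KAV's actual algorithm or database format: the Record layout and function names are hypothetical, and zlib.crc32 is used only as a convenient stand-in, since CRC32 is not a cryptographic checksum.

```python
import zlib
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    name: str
    off1: int   # first (small) range
    len1: int
    sum1: int
    off2: int   # second (large) range
    len2: int
    sum2: int


def checksum(buf: bytes, off: int, length: int) -> int:
    # Stand-in for a cryptographic checksum (assumption, see lead-in).
    return zlib.crc32(buf[off:off + length])


def make_record(name: str, sample: bytes, off1: int, len1: int,
                off2: int, len2: int) -> Record:
    return Record(name, off1, len1, checksum(sample, off1, len1),
                  off2, len2, checksum(sample, off2, len2))


def scan(buf: bytes, db: List[Record]) -> Optional[str]:
    """Return a report string, or None if nothing matches."""
    for rec in db:
        first = checksum(buf, rec.off1, rec.len1) == rec.sum1
        second = checksum(buf, rec.off2, rec.len2) == rec.sum2
        if first and second:
            return f"{rec.name} (nearly exact identification)"
        if first or second:
            return f"possible variant of {rec.name}"
    return None


# Hypothetical entry-point buffer standing in for a known virus sample.
known = bytes(range(256))
db = [make_record("Demo.A", known, 0, 16, 0, 200)]

patched_late = bytearray(known)
patched_late[100] ^= 0x01      # inside the second range only
patched_early = bytearray(known)
patched_early[5] ^= 0x01       # inside both ranges
```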
11.2.4. Exact Identification
Exact identification [9] is the only way to guarantee that the scanner precisely identifies virus variants. This method is usually combined with first-generation techniques. Unlike nearly exact identification, which uses the checksum of a single range of constant bytes in the virus body, exact identification uses as many ranges as necessary to calculate a checksum of all constant bits of the virus body. To achieve this level of accuracy, the variable bytes of the virus body must be eliminated to create a map of all constant bytes. Constant data bytes can be used in the map, but variable data can hurt the checksum.
Consider the code and data selected from the top of the Stoned virus shown in Figure 11.5. At the front of the code, at byte zero of the virus body, there are two jump instructions that finally lead the execution flow to the real start of the virus code.
Figure 11.5. Variable data of the Stoned virus.
Right after the second jump instruction is the data area of the virus. The variables are flag, int13off, int13seg, and virusseg. These are true variables whose values can change according to the environment of the virus. The constants are jumpstart, bootoff, and bootseg; these values will not change, just like the rest of the virus code.
Because the variable bytes are all identified, there is only one more important item remaining to be checked: the size of the virus code. We know that Stoned fits in a single sector; however, the virus copies itself into existing boot and master boot sectors. To find the real virus body size, you need to look for the code that copies the virus to the virus segment, which can be found in the disassembly shown in Figure 11.6.
Figure 11.6. Locating the size of the virus body (440 bytes) in Stoned.
Indeed, the size of the virus is 440 (0x1B8) bytes. After the virus has copied its code to the allocated memory area, the virus code jumps into the allocated block. To do so, the virus uses a constant jumpstart offset and the previously saved virus segment pointed to by virusseg in the data area at CS:0Dh (0x7C0D). Thus we have all the information we need to calculate the map of the virus.
The actual map will include the following ranges: 0x0-0x7, 0xD-0xE, 0x11-0x1B7, with a possible checksum of 0x3523D929. Thus the variable bytes of the virus are precisely eliminated, and the virus is identified.
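A map-based checksum over these ranges can be sketched as follows. The ranges are the ones given in the text, treated as inclusive; the use of CRC32 is an assumption, because the text does not name the algorithm behind the checksum 0x3523D929, so this sketch is not expected to reproduce that value.

```python
import zlib

# Inclusive constant-byte ranges of the Stoned map, from the text.
STONED_MAP = [(0x0, 0x7), (0xD, 0xE), (0x11, 0x1B7)]


def map_checksum(body: bytes, ranges=STONED_MAP) -> int:
    """Running CRC32 over the constant ranges only."""
    crc = 0
    for first, last in ranges:
        crc = zlib.crc32(body[first:last + 1], crc)
    return crc


def covered_bytes(ranges=STONED_MAP) -> int:
    """Number of bytes that contribute to the checksum."""
    return sum(last - first + 1 for first, last in ranges)


# Hypothetical 440-byte buffer standing in for the virus body.
body = bytearray((i * 7) & 0xFF for i in range(0x1B8))
reference = map_checksum(bytes(body))

# Changing variable bytes (0x8-0xC and 0xF-0x10) leaves the checksum alone.
body[0x8] ^= 0xFF
body[0x10] ^= 0xFF
after_variable_change = map_checksum(bytes(body))

# Changing a constant byte alters it.
body[0x20] ^= 0x01
after_constant_change = map_checksum(bytes(body))
```

The map covers 433 of the 440 bytes; the remaining 7 bytes are the variable data at 0x8-0xC and 0xF-0x10.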
To illustrate exact identification better, consider the data snippets of two minor variants of the Stoned virus, A and B, shown in Listing 11.1 and Listing 11.2, respectively. These two variants have the same map, so their code and constant data ranges match. However, the checksums of the two minor variants are different. This is because the virus author changed only a few bytes in the message and textual area of the virus body. The three changed bytes result in different checksums.
Listing 11.1. The Map of the Stoned.A Virus
Virus Name: Stoned.A
Virus Map: 0x0-0x7 0xD-0xE 0x11-0x1B7
Checksum: 0x3523D929
0000:0180 0333DBFEC1CD13EB C507596F75722050 ..........Your P
0000:0190 43206973206E6F77 2053746F6E656421 C is now Stoned!
0000:01A0 070D0A0A004C4547 414C495345204D41 .....LEGALISE MA
0000:01B0 52494A55414E4121 0000000000000000 RIJUANA!........
Listing 11.2. The Map of the Stoned.B Virus
Virus Name: Stoned.B
Virus Map: 0x0-0x7 0xD-0xE 0x11-0x1B7
Checksum: 0x3523C769
0000:0180 0333DBFEC1CD13EB C507596F75722050 ..........Your P
0000:0190 43206973206E6F77 2073746F6E656421 C is now stoned!
0000:01A0 070D0A0A004C4547 414C495A45004D41 .....LEGALIZE.MA
0000:01B0 52494A55414E4121 0000000000000000 RIJUANA!........
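The difference between the two listings can be checked mechanically. The sketch below copies the 64 dumped bytes (offsets 0x180 to 0x1BF) from each listing and locates the bytes that differ. CRC32 is used only to show that the checksums diverge; it is an assumption and is not claimed to be the algorithm that produced the checksum values printed in the listings.

```python
import zlib

DUMP_BASE = 0x180  # offset of the first dumped byte in both listings

# Hex dump rows copied from Listing 11.1 (Stoned.A).
dump_a = bytes.fromhex(
    "0333DBFEC1CD13EB C507596F75722050"
    "43206973206E6F77 2053746F6E656421"
    "070D0A0A004C4547 414C495345204D41"
    "52494A55414E4121 0000000000000000"
)

# Hex dump rows copied from Listing 11.2 (Stoned.B).
dump_b = bytes.fromhex(
    "0333DBFEC1CD13EB C507596F75722050"
    "43206973206E6F77 2073746F6E656421"
    "070D0A0A004C4547 414C495A45004D41"
    "52494A55414E4121 0000000000000000"
)

# Offsets (within the virus body) at which the two variants differ.
diff_offsets = [DUMP_BASE + i
                for i, (a, b) in enumerate(zip(dump_a, dump_b))
                if a != b]

# Checksums over the dumped area only, using CRC32 as a stand-in.
crc_a = zlib.crc32(dump_a)
crc_b = zlib.crc32(dump_b)
```

The three differing offsets, 0x199, 0x1AB, and 0x1AD, all fall inside the 0x11-0x1B7 range of the map, which is why the two variants share a map but not a checksum.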
Exact identification can differentiate precisely between variants. Such a level of differentiation can be found only in a few products, such as F-PROT [9]. Exact identification has many benefits for both end users and researchers. On the downside, exact identification scanners are usually a bit slower than simple scanners when scanning an infected system (when their exact identification algorithms are actually invoked).