11.1. First-Generation Scanners
Most computer books discuss virus detection at a fairly simple level. Even newer books describe antivirus scanners as simple programs that look for sequences of bytes extracted from computer viruses in files and in memory to detect them. This is, of course, one of the most popular methods to detect computer viruses, and it is reasonably effective. Nowadays, state-of-the-art antivirus software uses a lot more attractive features to detect complex viruses, which cannot be handled using first-generation scanners alone. The next few sections show examples of detection and identification methods that can be applied to detect computer viruses.
Not all the techniques can be applied to all computer viruses. However, doing so is not a requirement. It is enough to have an arsenal of techniques, one of which will be a good solution to block, detect, or disinfect a particular computer virus. This fact is often overlooked by security professionals and researchers, who might argue that if one technique cannot be used all the time, it is completely ineffective.
11.1.1. String Scanning
String scanning is the simplest approach to detecting computer viruses. It uses an extracted sequence of bytes (strings) that is typical of the virus but not likely to be found in clean programs. The sequences extracted from computer viruses are then organized in databases, which the virus scanning engines use to search predefined areas of files and system areas systematically to detect the viruses in the limited time allowed for the scanning. Indeed, one of the most challenging tasks of the antivirus scanning engine is to use this limited time (typically no more than a couple of seconds per file) wisely enough to succeed.
Consider the code snippet shown in Figure 11.2, selected from a variant of the Stoned boot virus using IDA (the interactive disassembler).
Figure 11.2. A code snippet of the Stoned virus loaded to IDA.
The actual code reads the boot sector of the diskettes up to four times and resets the disk between each try.
The virus needs to call the original INT 13 disk handler because the virus monitors the same interrupt to infect diskettes whenever they are accessed. Thus the virus makes a call to CS:, right into the data areas in the front of the virus code at 0:7C09, where the original handler was previously stored. Indeed, there are a few bytes of data in the front section of the virus code, but the rest of the virus code remains constant.
This is a typical code sequence of the virus. The four attempts to read the first sector are necessary because older diskette drives were too slow to speed up. The virus uses the PUSH CS, POP ES instruction pair to set the disk buffer to the virus segment.
Notice that the code style appears to be a failing optimization attempt to set the contents of CX and DX registers, which are both parameters of the disk interrupt handler call.
Thus a possible pattern extracted from the Stoned virus is the following 16 bytes, which is the search string that was published in Virus Bulletin magazine:
0400 B801 020E 07BB 0002 33C9 8BD1 419C
Sixteen unique bytes is usually a long enough string to detect 16-bit malicious code safely, without false positives. Not surprisingly, computer virus research journals such as the Virus Bulletin published sequences that were typically 16 bytes long to detect boot and DOS viruses. Longer strings might be necessary to detect 32-bit virus code safely, especially if the malicious code is written in a high-level language.
The previous code sequence could appear in other Stoned virus variants. In fact, several minor variants of Stoned, such as A, B, and C, can be detected with the preceding string. Furthermore, this string might be able to detect closely related viruses that belong to a different family. On the one hand, this is an advantage because the scanner using the string is able to detect more viruses. On the other hand, this could be a serious disadvantage to the user because a completely different virus might be incorrectly detected as Stoned. Thus the user might think that the actual virus is relatively harmless, but the misidentified virus is probably much more harmful.
Identification problems might lead to harmful consequences. This can happen easily if the virus scanner also attempts to disinfect the detected-but-not-identified virus code. Because the disinfections of two different virus familiesor even two minor variants of the same virusare usually different, the disinfector can easily cause problems.
For example, some minor variants of Stoned store the original master boot sector on sector 7, but other variants use sector 2. If the antivirus program does not check carefully, at least in its disinfection code, whether the repair routine is compatible with the actual variant in question, the antivirus might make the system unbootable when it disinfects the system.
Several techniques exist to avoid such problems. For example, some of the simple disinfectors use bookmarks in the repair code to ensure that the disinfection code is proper for the virus code in question.
0400 B801 020E 07BB ??02 %3 33C9 8BD1 419C
The preceding string would be interpreted the following way:
Wildcard strings are often supported for nibble bytes, which allow more precise matches of instruction groups. Some early-generation encrypted and even polymorphic viruses can be detected easily with wildcard-based strings.
Evidently, even the use of Boyer-Moore algorithm5 alone is not efficient enough for string scanners. This algorithm was developed for fast string searching and takes advantage of backward string matching. Consider the following two words of equivalent length:
CONVENED and CONVENER
If the two strings are compared from the front, it takes seven matches to find the first mismatch. Starting from the end of the string, the first attempt is a mismatch, which significantly reduces the number of matches that must be performed because a number of match positions can be ignored automatically.
Interestingly, Boyer-Moore algorithm does not work very well in network IDS systems because the backward comparison can force out-of-packet comparison overhead.
Similar success, however, can be achieved based on bookmark techniques explained later. Furthermore, with the use of filtering6 and hashing algorithms, scanning speed can become virtually independent of the number of scan strings that need to be matched.
Mismatches in strings were invented for the IBM Antivirus. Mismatches allow N number of bytes in the string to be any value, regardless of their position in the string. For example, the 01 02 03 04 05 07 08 09 string with the mismatch value of 2 would match any of the following patterns, as shown in Figure 11.3.
Figure 11.3. A set of strings that differ in 2 mismatches.
01 02 AA 04 05 06 BB 08 09 0A 0B 0C 0D 0E 0F 10 01 02 03 CC DD 06 07 08 09 0A 0B 0C 0D 0E 0F 10 01 EE 03 04 05 06 07 FF 09 0A 0B 0C 0D 0E 0F 10
Mismatches are especially useful in creating better generic detections for a family of computer viruses. The downside of this technique is that it is a rather slow scanning algorithm.
11.1.4. Generic Detection
Generic detection scans for several or all known variants of a family of computer viruses using a simple string (and in some cases an algorithmic detection that requires some special code besides standard scanning). When more than one variant of a computer virus is known, the set of variants is compared to find common areas of code. A simple search string is selected that is available in as many variants as possible. Typically, a generic string contains both wildcards and mismatches.
Hashing is a common term for techniques that speed up searching algorithms. Hashing might be done on the first byte or 16-bit and 32-bit words of the scan string. This allows further bytes to contain wildcards. Virus researchers can control hashing even better by being selective about what start bytes the string will contain. For example, it is a good idea to avoid first bytes that are common in normal files, such as zeros. With further efforts, the researcher can select strings that typically start with the same common bytes, reducing the number of necessary matches.
To be extremely fast, some string scanners do not support any wildcards. For example, the Australian antivirus VET uses an invention of Roger Riordan7, which is based on the use of 16-byte scan strings (with no wildcards allowed) based on a 64KB hash table and an 8-bit shift register. The algorithm uses each 16-bit word of the string as an index into the hash table.
A powerful hashing was developed by Frans Veldman in TBSCAN. This algorithm allows wildcards in strings but uses two hash tables and a corresponding linked list of strings. The first hash table contains index bits to the second hash table. The algorithm is based on the use of four constant 16-bit or 32-bit words of the scan strings that do not have wildcards in them.
Bookmarks (also called check bytes) are a simple way to guarantee more accurate detections and disinfections. Usually, a distance in bytes between the start of the virus body (often called the zero byte of the body) and the detection string is calculated and stored separately in the virus detection record.
Good bookmarks are specific to the virus disinfection. For example, in the case of boot viruses, someone might prefer to select a set of bookmarks that point to references of the locations of the stored boot sectors. Staying with the previous example string for Stoned, the distance between the start of the virus body and the string is 0x41 (65) bytes. Now, look at the snippet of Stoned shown in Figure 11.4. The code reads the stored boot sector according to a flag. In case of the hard disk, the stored boot sector is loaded and executed from head 0, track 0, and sector 7 from drive C:. In case of the diskettes, the end of the root directory sector is loaded from head 0, track 3, and sector 1 from drive A:.
Figure 11.4. Another code snippet of the Stoned virus loaded to IDA.
The following could be a good set of bookmarks:
You can find these bytes at offset 0x7CFC and 0x7D07 in the preceding disassembly. Remember that the virus body is loaded to offset 0x7C00.
In the case of file viruses, it is a good practice to choose bookmarks that point to an offset to the stored original host program header bytes. Additionally, the size of the virus body stored in the virus is also a very useful bookmark.
You can safely avoid incorrectly repairing the virus by combining the string and the detection of the bookmarks. In practice, it is often safe to repair the virus based on this much information. However, exact and nearly exact identification further refine the accuracy of such detection.
11.1.7. Top-and-Tail Scanning
Top-and-tail scanning is used to speed up virus detection by scanning only the top and the tail of a file, rather than the entire file. For example, the first and last 2, 4, or even 8KB of the file is scanned for each possible position. This is a slightly better algorithm than those used in early scanner implementations, which worked very similarly to GREP programs that search the content of the entire file for matching strings. As modern CPUs became faster, scanning speed typically became I/O bound. Thus to optimize the scanning speed, developers of antivirus programs looked for methods to reduce the number of disk reads. Because the majority of early computer viruses prefixed, appended, or replaced host objects, top-and-tail scanning became a fairly popular technique.
11.1.8. Entry-Point and Fixed-Point Scanning
Entry-point and fixed-point scanners made antivirus scanners even faster. Such scanners take advantage of the entry point of objects, such as those available via the headers of executable files. In structureless binary executables such as DOS COM files, such scanners follow the various instructions that transfer control (such as jump and call instructions) and start scanning at the location to which such instructions point.
Because this location is a common target of computer viruses, such scanners have major advantages. Other scanning methods, such as top-and-tail scanning, must mask the strings (or hashes of strings) to each scanned position of the scanned area, but entry-point scanners typically have a single position to mask their scan strings: the entry point itself.
Consider a 1KB-long size for a buffer called B. The number of positions to start a string match in B is 1,024S, where S is the size of the shortest string to match. Even if the hashing algorithm of the scanner is so efficient that the scanner needs to perform a complete string search at a given position only 1% of the time, the number of computations could increase quickly, according to the number of strings. For example, with 1,000 strings, the scanner might need to make 10 complete matches for each possible position. Thus (1,024S)x10 is a possible number of minimum matches required. Indeed, the 1,024S multiplier can be dropped using fixed-point scanning with a single match position at the entry point. This is a very significant difference.
If the entry point does not have good enough strings, fixed-point scanning can come to the rescue. Fixed-point scanning uses a match position with each string. Thus it is possible to set a start position M (for example, the main entry point of the file) and then match each string (or hash) at positions M+X bytes away from this fixed point. Again, the number of necessary computations is reduced because X is typically 0. As a bonus, such scanners also can reduce significantly the disk I/O.
I used this technique in my own antivirus program. Each string of Pasteur required only a single, fixed start and ending byte, as well as a constant size. Wildcards were supported but only in a limited way. The strings were sorted into several tables according to object types. String matching picked the first byte of the entry point and checked whether there were any such start bytes for any strings using a hash vector. If there were no such first bytes, the next entry point was selected until there were no more entry points.
Because the size of each string was constant, the algorithm could also check whether the last byte of the string matched the corresponding location in the file being scanned. If the last byte of the string matched, only then was a complete string match performed. However, this rarely happened in practice. This trick is somewhat similar to the idea of the Boyer-Moore algorithm combined with simple hashing.
11.1.9. Hyperfast Disk Access
Hyperfast disk access was another useful technique in early scanner implementations. This was used by TBSCAN, as well as the Hungarian scanner VIRKILL, based on my inspiration. These scanners optimize scanning by bypassing operating systemlevel APIs to read the disk directly with the BIOS. Because MS-DOS was especially slow in handling FAT file systems, a ten-times-faster file I/O could be achieved using direct BIOS reads instead of DOS function calls. In addition, this method was often useful as an antistealth technique. Because file infector stealth viruses typically bypassed only DOS-level file access, the file changes could be seen via BIOS access in most, but not all, cases. Other scanners and integrity checkers even talked directly to the disk controllers for reasons of speed and security.
Unfortunately, nowadays these methods cannot be used easily (or at all) on all operating systems. Not only are there too many file systems that must be recognized and supported, there are also a variety of disk controllers, making such a task almost impossible.