What is a Computer Virus in the Modern Sense? Part 1

Let's define a computer virus as code that makes copies of itself. Like its biological namesake, it needs a host: a file that works now and keeps working after infection, so that it can give life to the next generation of the virus.

It needs a fertile environment for reproduction: plenty of tasty executable files and plenty of careless, active users to run them. So the name "virus" is not just a pretty label for a malicious program; a computer virus in the classical sense is an entity very close to its biological counterpart.

Humanity, as has been proven repeatedly, is capable of very sophisticated solutions, especially when it comes to creating something that harms other people.

A long, long time ago, when DOS came to the people, every developer had his own little universe, there was a single address space, and file permissions were always rwx, the question arose: "Can a program copy itself?" "Of course it can!" said the programmer, and wrote code that copies its own executable file.

The next question was "Can two programs be combined into one?" "Of course they can!" said the developer, and wrote the first infector. "But why?" he wondered, and that was the beginning of the era of computer viruses. As it turned out, making a mess of someone's computer while trying to avoid detection is great fun, and writing viruses is very interesting from a systems developer's point of view. In addition, the antiviruses that appeared on the market posed a serious professional challenge to virus writers.

Enough lyricism for one article; let's get down to business. I want to talk about the classic virus: its structure, basic concepts, detection methods, and the algorithms both sides use to win.

Anatomy of the Virus

We will talk about viruses that live in executable files in the PE and ELF formats, that is, viruses whose body is executable code for the x86 platform. In addition, let's assume our virus does not destroy the original file: it fully preserves the file's functionality and correctly infects any suitable executable. Breaking the file would be much easier, but we agreed to talk about proper viruses. To keep the material relevant, I will not spend time on infectors of the old COM format, although the first advanced techniques for working with executable code were honed on it.

The main parts of the virus code are the infector and the payload.

The infector is the code that searches for files suitable for infection and injects the virus into them, trying to hide the fact of the injection as well as possible without damaging the file's functionality.

The payload is the code that does whatever the virus developer actually wants: for example, it sends spam, takes part in a DoS attack on somebody, or simply leaves a text file on the machine saying "The virus was here". We do not care at all what is inside the payload; what matters is that the virus developer tries in every possible way to hide its contents.

Let's start with the properties of the virus code. To make the code easier to inject, you do not want code and data to be separate, so data is usually embedded directly in the executable code. For example, like this:

        jmp message
    the_back:
        mov eax, 0x4
        mov ebx, 0x1
        pop ecx            ; the address of "Hello, World!" is taken from the stack
        mov edx, 0xF
        int 0x80
        ...
    message:
        call the_back      ; after this executes, the return address, i.e. the address of "Hello, World!\r\n", lies on the stack
        db "Hello, World!", 0Dh, 0Ah

Or like this:

        push 0x68732f2f    ; "hs//"
        push 0x6e69622f    ; "nib/"
        mov ebx, esp       ; EBX now holds the address of the string "/bin//sh"
        mov al, 11
        int 0x80

Under certain conditions, any of these code variants can simply be copied into memory and started with a JMP to its first instruction. If such code is written correctly, with attention to offsets, system calls, the state of the stack before and after execution, and so on, it can be embedded inside a buffer containing someone else's code.

Suppose the virus developer can write virus code in this style and now needs to inject it into an existing executable. He needs to take care of two things: where in the file to place the virus code, and how to transfer control to it.

Let's look closely at injection into the file. The modern executable formats for the x86 platform on Windows and Linux are PE (Portable Executable) and ELF (Executable and Linkable Format). Their specifications are easy to find in the system documentation, and if you work on protecting executable code, do not skip them. Executable formats and the system loader (the operating system code that launches an executable file) are among the "elephants" on which the operating system stands. Launching an .exe file is an algorithmically complex process with many nuances; you can read about it in a dozen articles that you will find yourself if the topic interests you. I will limit myself to a simplified account that is sufficient for a basic understanding of the startup process. By "compiler" I will mean the whole toolchain that turns source code into a finished executable file, that is, in effect, compiler + linker.

The executable file (PE or ELF) consists of a header and a set of sections. Sections are aligned (see below) buffers of code or data. When the file is launched, memory is allocated for the sections and they are copied into it, and the amount of memory is not necessarily equal to the space they occupied on disk. The header contains the section layout: it tells the loader where the sections are located in the file on disk and how to place them in memory before passing control to the code inside the file. We are interested in three key parameters of each section: psize, vsize, and flags. Psize (physical size) is the size of the section on disk. Vsize (virtual size) is the size of the section in memory after the file is loaded. Flags are the access attributes of the section (rwx). Psize and vsize can differ significantly: for example, if a developer declares an array of a million elements but fills it only at run time, the compiler will not increase psize (the contents of the array need not be stored on disk before the program runs), but vsize will grow by the size of those million elements (there must be enough memory for the array at run time).

Flags (access attributes) are assigned to the memory pages onto which the section is mapped. For example, a section with executable code has the attributes r_x (read, execute), and a data section has the attributes rw_ (read, write). When the processor tries to execute code on a page without the execute flag, it raises an exception; the same goes for an attempt to write to a page without the w attribute. So when placing the virus code, the virus developer must take into account the attributes of the memory pages in which it will end up. Standard areas of uninitialized data (for example, the program stack) until recently had the attributes rwx (read, write, execute), which made it possible to copy code directly onto the stack and execute it there. Now this is considered unfashionable and unsafe, and in recent operating systems the stack area is for data only. Of course, the program itself can change the attributes of a memory page at run time, but that complicates the implementation.
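To make psize, vsize, and flags concrete, here is a minimal Python sketch that decodes one 40-byte entry of the PE section table. The field layout and the three Characteristics bits follow the PE/COFF specification; the `Section` tuple, the function name, and the sample values are illustrative assumptions, and a real parser would first have to locate the section table through the file headers.

```python
import struct
from typing import NamedTuple

# Characteristics bits from the PE/COFF specification.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

# One section-table entry: 40 bytes, little-endian.
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


class Section(NamedTuple):
    name: str
    vsize: int  # VirtualSize: bytes occupied in memory
    psize: int  # SizeOfRawData: bytes occupied on disk
    flags: str  # access attributes in the article's rwx notation


def parse_section_header(raw: bytes) -> Section:
    """Decode one 40-byte PE section header into name, sizes and rwx flags."""
    (name, vsize, _vaddr, psize, _raw_ptr,
     _reloc_ptr, _line_ptr, _n_reloc, _n_line, chars) = SECTION_HEADER.unpack(raw)
    flags = "".join(
        letter if chars & bit else "_"
        for letter, bit in (("r", IMAGE_SCN_MEM_READ),
                            ("w", IMAGE_SCN_MEM_WRITE),
                            ("x", IMAGE_SCN_MEM_EXECUTE))
    )
    return Section(name.rstrip(b"\0").decode("ascii", "replace"), vsize, psize, flags)


# Synthetic header with made-up values: a data section holding a large
# array that is filled at run time, so vsize is far larger than psize.
example = SECTION_HEADER.pack(b".data", 4_000_000, 0x3000, 0x200, 0x800,
                              0, 0, 0, 0,
                              IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE)
section = parse_section_header(example)
print(section)
```

The synthetic section mirrors the million-element array from the text: a few hundred bytes on disk, megabytes in memory, and no execute permission.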

The header also holds the Entry Point: the address of the first instruction, from which execution of the file begins.
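As an illustration of where this field lives, the sketch below reads the entry point (`e_entry`) from the start of an ELF file. The offsets follow the ELF header layout; the function name and the synthetic header with its made-up address are assumptions for the example. In PE files the analogous field is `AddressOfEntryPoint` in the optional header, which is not handled here.

```python
import struct


def elf_entry_point(header: bytes) -> int:
    """Return e_entry from the first bytes of an ELF file (32- or 64-bit)."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                # EI_CLASS: 1 = ELF32, 2 = ELF64
    endian = "<" if header[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    fmt = endian + ("Q" if is_64bit else "I")
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4).
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


# Synthetic 32-bit little-endian header with a made-up entry address.
fake_header = (b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
               + struct.pack("<HHII", 2, 3, 1, 0x08048080))
print(hex(elf_entry_point(fake_header)))
```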

It is also necessary to mention a property of executables that is important to virus developers: alignment. So that the file can be read from disk and mapped into memory efficiently, the sections in executable files are aligned on boundaries that are multiples of powers of two, and the free space left over from alignment (padding) is filled with something at the compiler's discretion. For example, it is logical to align sections to the size of a memory page; then it is convenient to copy them into memory whole and assign attributes. I won't even try to list all these alignments: wherever a more or less standard piece of data or code lies, it is aligned (any developer knows that there are exactly 1024 meters in a kilometer). And the descriptions of the Portable Executable (PE) and Executable and Linkable Format (ELF) standards are desk references for anyone working on methods of protecting executable code.
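The arithmetic behind alignment is short enough to show. This is a generic sketch with hypothetical function names, not taken from any particular loader or linker: it rounds a size up to a power-of-two boundary and reports how many padding bytes that leaves.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def padding_after(content_size: int, alignment: int) -> int:
    """Number of filler bytes between the end of the content and the aligned end."""
    return align_up(content_size, alignment) - content_size


# 0x1234 bytes of content aligned to a 0x1000-byte page and to a 0x200-byte disk block.
print(hex(align_up(0x1234, 0x1000)), hex(padding_after(0x1234, 0x200)))
```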

Since the addresses inside all these sections are interrelated, simply shoving a piece of code into the middle of a section and "stitching" it in with JMPs will not work: the original file will break. Therefore, the most popular places to inject virus code are:

  1. The main code section (the virus overwrites the beginning of the executable code, starting directly at the Entry Point).
  2. Padding between the end of the header and the first section. There is nothing there, so a small virus (or its loader) can fit without breaking the file.
  3. A new section, which can be added to the header and placed in the file after all the others. In this case no internal offsets break, and space is no problem either. True, an executable section at the very end of the file naturally attracts the attention of heuristics.
  4. Padding between the end of a section's contents and its aligned end. This is much harder, because the "end" must first be found, and there is no guarantee that there will be enough room. But for some compilers this place can be found simply by its characteristic bytes.

There are subtler ways as well; I will describe some of them in the second article.

Now about control transfer. For the virus to work, its code must somehow get control. The most obvious way: the virus gets control first and then, after it has done its work, the host program. This is the simplest approach, but other options also have a right to exist: the virus may get control after the host finishes, for example, or in the middle of execution, "replacing" the execution of some function. Here are some control transfer techniques (the term Entry Point, or EP, used below means the address to which the system loader transfers control after preparing the executable file for launch):

  1. A JMP to the virus body replaces the first bytes at the file's Entry Point. The virus saves the overwritten bytes in its own body and, when it has finished its work, restores them and transfers control to the beginning of the restored buffer.
  2. The method is similar to the previous one, but instead of raw bytes the virus saves several complete machine instructions from the Entry Point. Then, without restoring anything (only cleaning up the stack correctly), it can execute them after finishing its own work and transfer control to the address of the instruction following the "stolen" ones.
  3. As with injection, there are subtler ways, but we will look at them below or postpone them until the next article.

All of these are ways to correctly insert a buffer of code into an executable file. Items 2 and 3 imply functionality that can tell which bytes are instructions and where the boundaries between instructions lie. After all, we cannot "break" an instruction in half; everything would fall apart. Thus we move smoothly to disassemblers in viruses. We will need the idea of how disassemblers work for every serious technique of working with executable code, so it is fine to describe it right now.

If we insert our code exactly between two instructions, we can save the context (stack, flags), execute the virus code, restore everything, and return control to the host program. Of course, this too can run into problems if the host uses code integrity checks, anti-debugging, and so on, but that is also a topic for the second article. To find such a position, we need the following:

This is the minimum functionality needed to avoid landing in the middle of an instruction. A function that takes a pointer to a byte string and returns the length of the instruction at that address is called a length disassembler.

For example, the infection algorithm can look like this:

This is a perfectly proper virus: it can get into an executable file without breaking anything, run its code covertly, and return control to the host program. Now, let's catch it.

Anatomy of the Detector

Suddenly, out of nowhere, a knight appears on a white computer, with a debugger in his left hand and a disassembler in his right. Where did he come from? You have guessed, of course. Most likely he came from the "adjacent area". Antivirus development is highly respected by those in the know, because these guys have to deal with very sophisticated algorithms under rather cramped conditions. Judge for yourself: on the input you have hundreds of thousands of samples of all kinds of infection plus an executable file, you have to work almost in real time, and the cost of an error is very high.

For an antivirus, as for any automaton that makes a binary yes/no decision (infected/healthy), there are two types of error: false positive and false negative (a file mistakenly flagged as infected, and an infected file mistakenly missed). Clearly the total number of errors has to be reduced in any case, but for an antivirus a false negative is much more unpleasant than a false positive. "After downloading the torrent, disable the antivirus before installing the game" - sound familiar? That is a false positive: crack.exe, which writes something into an executable .exe file, looks like a virus to a reasonably clever heuristic analyzer (see below).

I don't think it is necessary to describe the components of an ordinary antivirus; they all revolve around one piece of functionality, the antivirus detector. A monitor that checks files on the fly, disk scanning, scanning of e-mail attachments, quarantine, and the store of already-checked files are all wrappers around the main detection core. The second key component of an antivirus is the regularly updated signature database, without which the antivirus cannot be kept up to date. The third component, rather important but deserving a separate series of articles, monitors the system for suspicious activity.

So (we are considering classic viruses), we have an executable file as input and, possibly, one of hundreds of thousands of potential viruses inside it. Let's detect it.

Consider this piece of the virus's executable code:

    XX XX XX XX XX XX                      ; the beginning of the virus, N bytes long
    68 2F 2F 73 68      push 0x68732f2f    ; "hs//"
    68 2F 62 69 6E      push 0x6e69622f    ; "nib/"
    8B DC               mov ebx, esp       ; EBX now holds the address of the string "/bin//sh"
    B0 0B               mov al, 11
    CD 80               int 0x80
    XX XX XX XX                            ; the end of the virus, M bytes long

It is tempting to simply take the pack of opcodes (68 2F 2F 73 68 68 2F 62 69 6E 8B DC B0 0B CD 80) and look for this byte string in the file. If it is found: got you, reptile. But alas, it turns out that the same string of bytes also occurs in other files (who knows what else invokes a command interpreter), and there are more than a million such strings to search for; if you look for each of them, no optimization will help. The only quick and correct way to check for such a string in a file is to test for it at a FIXED offset. Where do we get the offset?
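The difference between the two checks can be sketched in a few lines of Python. The 4-byte pattern, the offset 16, and both sample buffers are made up for the illustration; the point is only that a search over the whole file fires on a clean file that happens to contain the bytes, while the fixed-offset comparison reads a handful of bytes and does not.

```python
PATTERN = bytes.fromhex("DE AD BE EF")  # hypothetical signature bytes
EXPECTED_OFFSET = 16                    # hypothetical fixed offset


def found_anywhere(data: bytes, pattern: bytes) -> bool:
    """Naive scan: may have to walk the whole file for every pattern."""
    return pattern in data


def found_at(data: bytes, pattern: bytes, offset: int) -> bool:
    """Fixed-offset check: compares only len(pattern) bytes."""
    return data[offset:offset + len(pattern)] == pattern


# A clean file that merely happens to contain the same bytes somewhere else.
clean = bytes(100) + PATTERN + bytes(20)
# An infected file with the pattern exactly where the signature expects it.
infected = bytes(EXPECTED_OFFSET) + PATTERN + bytes(104)

print(found_anywhere(clean, PATTERN), found_at(clean, PATTERN, EXPECTED_OFFSET))
print(found_at(infected, PATTERN, EXPECTED_OFFSET))
```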

Let's recall the "adjacent area", in particular the part about where the virus places itself and how control is transferred:

And now the part about control transfer:

I'm tired of writing "byte string"; it has no fixed length, and storing it in the database is inconvenient and entirely unnecessary, so instead of the byte string we will use its length plus its CRC32. Such a record is very short and comparison is fast, since CRC32 is not a slow algorithm. There is no point in striving for resistance to checksum collisions, since the probability of a collision at a fixed offset is negligible. Besides, even in the event of a collision the error will be of the false positive kind, which is not so terrible. Summing up all of the above, here is an example structure of a record in the antivirus database:
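A minimal Python sketch of what such a record might hold, based only on what the text names: flags that say where the offset is counted from, the offset itself, the signature length (Lsig), and its CRC32. The class name, the field names, and the `make_record` helper are hypothetical; the checksum comes from the standard `zlib.crc32`.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SignatureRecord:
    name: str    # label reported when the signature matches
    base: str    # flag: what the offset is counted from, e.g. "entry_point"
    offset: int  # distance from the base to the first signature byte
    length: int  # Lsig, the number of bytes covered by the checksum
    crc32: int   # CRC32 of those bytes


def make_record(name: str, base: str, offset: int, sample: bytes) -> SignatureRecord:
    """Build a database record from the bytes an analyst extracted from a sample."""
    return SignatureRecord(name, base, offset, len(sample), zlib.crc32(sample))


record = make_record("Demo.A", "entry_point", 6, b"123456789")
print(record)
```

However long the byte string is, the stored record stays the same size: two small integers and a 32-bit checksum.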

Let's optimize the input (keeping only the signatures that "fit" this file, and preparing the set of required offsets from the header in advance), and then:

    {
      # for all eligible entries
      - based on the flags, calculate the base offset in the file (the beginning of the code section, the entry point, etc.)
      - add the offset to it
      - read Lsig bytes
      - compute their CRC32
      - if it matches, we have caught the virus
    }
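The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not a real engine: the `Sig` tuple, the `bases` mapping (file offsets assumed to be prepared from the header beforehand), and the sample data are all hypothetical.

```python
import zlib
from typing import Iterable, Mapping, NamedTuple, Optional


class Sig(NamedTuple):
    name: str
    base: str    # flag selecting the base offset, e.g. "entry_point"
    offset: int
    length: int  # Lsig
    crc32: int


def scan(data: bytes, bases: Mapping[str, int], sigs: Iterable[Sig]) -> Optional[str]:
    """Return the name of the first matching signature, or None."""
    for sig in sigs:
        base = bases.get(sig.base)
        if base is None:
            continue  # this signature does not apply to this file
        start = base + sig.offset
        if start < 0:
            continue
        chunk = data[start:start + sig.length]  # read Lsig bytes
        if len(chunk) == sig.length and zlib.crc32(chunk) == sig.crc32:
            return sig.name                     # caught
    return None


MARKER = b"hypothetical virus bytes"
database = [
    Sig("Other.B", "new_section", 0, 8, 0x12345678),
    Sig("Demo.A", "entry_point", 4, len(MARKER), zlib.crc32(MARKER)),
]
infected = bytes(60) + b"XXXX" + MARKER + bytes(16)
print(scan(infected, {"entry_point": 60}, database))
print(scan(bytes(200), {"entry_point": 60}, database))
```

Each candidate costs one short read and one CRC32 over Lsig bytes, which is what lets a large database be checked quickly.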

Hurray, here is our first antivirus. It is pretty cool: with a reasonably complete signature database, well-chosen flags, and good optimization, this detector can catch a certain share of all infections very quickly. Next come the games of "who updates the signature database faster" and "who gets sent a fresh sample of some nasty stuff first".

Collecting and cataloging this "nasty stuff" is a far from trivial task, but it is absolutely necessary for proper testing of the detector. Building a reference collection of executable files is not easy: you try to find all specimens of infected files (several specimens each for complex cases), catalog them, mix them with "clean" files, and regularly run the detector over the collection to find detection errors.
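A regular run over such a labeled collection boils down to counting the two kinds of error. The sketch below assumes a hypothetical `detector` callable that returns True for "infected" and a corpus of (file bytes, known verdict) pairs; the toy detector and the four samples are made up.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def evaluate(detector: Callable[[bytes], bool],
             corpus: Iterable[Tuple[bytes, bool]]) -> Counter:
    """Run the detector over labeled files and count each kind of outcome."""
    counts = Counter()
    for data, is_infected in corpus:
        flagged = detector(data)
        if flagged and is_infected:
            counts["true_positive"] += 1
        elif flagged:
            counts["false_positive"] += 1  # clean file reported as infected
        elif is_infected:
            counts["false_negative"] += 1  # infected file missed
        else:
            counts["true_negative"] += 1
    return counts


def toy_detector(data: bytes) -> bool:
    """Stand-in detector: flags any file that contains the bytes b'MARK'."""
    return b"MARK" in data


reference_corpus = [
    (b"....MARK....", True),                       # infected and caught
    (b"plain program", False),                     # clean and passed
    (b"a text editor that mentions MARK", False),  # clean but flagged
    (b"mutated M4RK sample", True),                # infected but missed
]
results = evaluate(toy_detector, reference_corpus)
print(dict(results))
```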

Heuristic Analyzer

What a terrible phrase, "heuristic analyzer"; you will not see it in antivirus interfaces nowadays (it probably frightens users). This is one of the most interesting parts of an antivirus, since everything that does not fit into one of the engines (neither the signature engine nor the emulator) gets shoved into it. It resembles a doctor who sees that the patient is coughing and sneezing but cannot identify the particular disease. It is code that checks the file for characteristic symptoms of infection. Examples of such signs:

In addition to indicating infection, heuristics can help decide whether to launch a "heavier" analysis of the file. Each sign has its own weight, from "somewhat suspicious" to "I don't know what it is, but the file is definitely infected". These signs account for the majority of false positives. Do not forget, either, that heuristics are what can supply the antivirus company with samples of potential viruses. The heuristics fired, but nothing concrete was found? Then the file is a prime candidate for sending to the antivirus company.
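Weighted signs can be sketched as a simple score. Everything in this Python example is an assumption made for illustration: the first sign echoes the earlier remark that an executable last section attracts the attention of heuristics, the other two signs are hypothetical, and the weights and thresholds are arbitrary.

```python
from typing import Iterable, Tuple

# Hypothetical signs and weights; a real analyzer would have many more.
SIGN_WEIGHTS = {
    "executable_last_section": 30,
    "entry_point_in_last_section": 40,
    "writable_and_executable_section": 30,
}
SUSPICIOUS_AT = 30  # enough to schedule a heavier analysis
INFECTED_AT = 70    # "I don't know what it is, but the file is infected"


def heuristic_verdict(observed: Iterable[str]) -> Tuple[int, str]:
    """Sum the weights of the observed signs and map the score to a verdict."""
    score = sum(SIGN_WEIGHTS.get(sign, 0) for sign in set(observed))
    if score >= INFECTED_AT:
        return score, "infected"
    if score >= SUSPICIOUS_AT:
        return score, "suspicious"
    return score, "clean"


print(heuristic_verdict(["executable_last_section"]))
print(heuristic_verdict(["executable_last_section", "entry_point_in_last_section"]))
```

A "suspicious" result is the natural trigger both for a heavier analysis and for sending the file to the antivirus company as a candidate sample.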

Interspecific Interaction and Evolution

As we have seen, for a fast and accurate comparison the detector needs the bytes of the signature and its offset. In other words, the contents of the code and the address of its location in the host file. It is therefore clear why the ideas for concealing the executable code of viruses evolved in two directions:

Hiding the code of the virus led to the appearance of polymorphic engines: engines that allow the virus to change its code in each new generation. In every newly infected file the virus body mutates in an attempt to hinder detection. This makes it difficult to define the contents of the signature.

Hiding the entry point (Entry Point Obscuring) in turn spurred the appearance of automatic disassemblers in virus engines, used, at a minimum, to locate control transfer instructions. The virus tries to hide the place from which control passes to its code by using whatever in the file eventually results in a jump: JMPs, CALLs, RETs of all kinds, address tables, and so on. This makes it difficult to specify the offset of the signature.

We will look in detail at some algorithms of such engines and detectors in the second article, which I plan to write in the near future.

In parallel with the development of virus engines and the detectors opposing them, commercial protection of executable files was developing actively. A huge number of small commercial programs appeared, and developers needed engines that could take an EXE file and wrap it in an "envelope" that can securely generate a valid serial number. And who knows how to hide executable code and inject it into executable files without breaking them? That's right, the same developers from the "adjacent area". Therefore writing a good polymorphic virus and writing a wrapper-style protector for an executable file are very similar tasks that use the same algorithms and tools. Likewise, analyzing viruses to create signatures and cracking commercial software are similar processes. In both cases you need to get to the true code and either create a signature or extract from it the algorithm for generating the serial number.

On the Internet there are a number of pages on the "classification of computer viruses". But we have agreed that a virus is something that can reproduce itself in the system and that needs a host file. Therefore all the Trojans, rootkits, and other malware are not viruses but kinds of payload code that a virus can carry. For the technologies described in this article there can be only one classification of computer viruses: polymorphic and non-polymorphic. That is, changing from generation to generation, or not.

The detector considered in this article can detect some non-polymorphic viruses (monomorphic, if you like). The transition to polymorphic viruses is an excellent reason to finally wrap up this article, with a promise to return to more interesting methods of hiding executable code in the second part.

May 08, 2017