Taint Analysis
Fundamentals
Taint analysis is a technique for tracking and analyzing how tainted information flows through a program. In vulnerability analysis, data of interest (usually data from the program's external inputs) is marked as tainted, and by tracking the flow of information derived from that data we can determine whether it affects security-critical operations, and thereby discover program vulnerabilities. In other words, the question of whether the program has a certain vulnerability is converted into the question of whether tainted information can reach the operations at a Sink point.
Definition
Taint analysis can be abstracted as a triple <sources, sinks, sanitizers>, where a source is a taint source, i.e., a point where untrusted data or confidential data is introduced directly into the system; a sink is a taint sink, i.e., a point where a security-sensitive operation is performed (violating data integrity) or where private data may be leaked to the outside world (violating data confidentiality); and a sanitizer performs harmless processing, meaning that through operations such as encryption or hazard removal the data no longer endangers the security of the software system. Taint analysis determines whether data introduced at a taint source can reach a taint sink without passing through a sanitizer. If it cannot, the system is secure with respect to this information flow; otherwise, the system has security problems such as leakage of private data or dangerous operations on untrusted data.
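To make the triple concrete, the following minimal Java sketch marks one source, one sanitizer, and one sink; the method names getRequestParameter, escapeSql, and executeQuery are illustrative assumptions rather than the API of any real framework:

```java
// Hypothetical illustration of the <sources, sinks, sanitizers> triple.
public class TaintTripleExample {
    // Source: untrusted data enters the system (stand-in for an HTTP parameter).
    static String getRequestParameter(String name) {
        return "alice' OR '1'='1";
    }

    // Sanitizer: harmless processing after which the taint mark can be dropped (simplified).
    static String escapeSql(String s) {
        return s.replace("'", "''");
    }

    // Sink: a security-sensitive operation (here, running a SQL query).
    static void executeQuery(String sql) {
        System.out.println("executing: " + sql);
    }

    public static void main(String[] args) {
        String user = getRequestParameter("user");                    // tainted at the source
        String safe = escapeSql(user);                                // taint removed by the sanitizer
        executeQuery("SELECT * FROM t WHERE name = '" + safe + "'");  // sink is reached only with sanitized data
    }
}
```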
Taint analysis typically consists of the following parts:
- Identify the taint sources in the program and mark the tainted information
- Track and analyze how the tainted information propagates through the program according to specific rules
- At certain key program points (Sink points), check whether critical operations are affected by the tainted information
The process of taint analysis can be divided into three stages (as shown in Figure 2):
(1) Identifying taint sources and sinks;
(2) Taint propagation analysis;
(3) Sanitization (harmless processing).
Identifying taint sources and sinks
Identifying taint sources and sinks is the prerequisite for taint analysis. At present, the methods for identifying sources and sinks differ from one application to another. The lack of a general method stems, on the one hand, from differences between system models and programming languages; on the other hand, the different kinds of security vulnerabilities a taint analysis targets also lead to different ways of collecting sources and sinks. Table 1 lists the taint sources used in Web application vulnerability detection.
Existing methods for identifying taint sources and sinks can be roughly divided into three categories:
(1) Using heuristic strategies, e.g., treating all data from the program's external inputs as "tainted", conservatively assuming that such data may contain malicious attack payloads (e.g., PHP Aspis);
(2) Manually marking sources and sinks according to the APIs or important data types used by a specific application (e.g., DroidSafe);
(3) Using statistical or machine-learning techniques to automatically identify and mark taint sources and sinks.
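As a hedged sketch of approach (2), a tool may keep a hand-maintained list of source and sink method signatures; the entries below are illustrative assumptions, not the configuration shipped by any particular tool:

```java
import java.util.Set;

// Hypothetical hand-written source/sink specification for a Java web application.
public class SourceSinkSpec {
    // Methods whose return values are treated as tainted (sources).
    static final Set<String> SOURCES = Set.of(
        "javax.servlet.ServletRequest#getParameter",
        "java.io.BufferedReader#readLine"
    );

    // Methods whose arguments must not be tainted (sinks).
    static final Set<String> SINKS = Set.of(
        "java.sql.Statement#executeQuery",
        "java.lang.Runtime#exec"
    );
}
```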
Taint propagation analysis
Taint propagation analysis analyzes the paths along which tainted data propagates through the program. Depending on which program dependencies are considered during the analysis, taint propagation analysis can be divided into explicit flow analysis and implicit flow analysis.
Explicit flow analysis
Explicit flow analysis tracks how taint marks propagate along the data dependencies between variables in the program.
void foo(){
    int a = source();   // line 2: a is tainted by the source
    int b = source();   // line 3: b is tainted by the source
    int x, y;
    x = a * 2;          // line 5: taint propagates from a (line 2) to x
    y = b + 4;          // line 6: taint propagates from b (line 3) to y
    sink(x);            // line 7: tainted x (line 5) reaches the sink
    sink(y);            // line 8: tainted y (line 6) reaches the sink
}
Variables a and b are marked as tainted by the predefined taint source function source(). Assume the taint marks assigned to a and b are taint_a and taint_b, respectively. Since variable x in line 5 is directly data-dependent on a, and variable y in line 6 is directly data-dependent on b, explicit flow analysis propagates taint_a to x in line 5 and taint_b to y in line 6. Because x and y then reach the taint sinks in lines 7 and 8 (identified by the predefined sink function sink()), the code has an information leakage problem. The main challenges of explicit flow analysis in taint propagation, and their solutions, are discussed later.
Implicit flow analysis
Implicit flow analysis tracks how taint marks propagate along control dependencies between variables in the program, that is, how taint propagates from conditional instructions to the statements they control.
void foo(){
    String X = source();
    String Y = "";
    for (int i = 0; i < X.length(); i++){
        int x = (int) X.charAt(i);
        int y = 0;
        for (int j = 0; j < x; j++){
            y = y + 1;
        }
        Y = Y + (char) y;
    }
    sink(Y);
}
Variable X is a tainted string variable. There is no direct or indirect data dependency (explicit flow) between variable Y and variable X, but the taint on X can propagate implicitly to Y through control dependencies.
Specifically, the outer loop controlled by the condition in line 4 takes each character of X in turn, converts it to an integer, and assigns it to x; the inner loop controlled by the condition in line 7 then adds x to y one step at a time; finally the outer loop appends the characters reconstructed from y to Y one by one. As a result, Y in line 12 ends up with the same value as X, and the program leaks information. If implicit flow taint propagation were not performed, however, Y in line 12 would not be marked as tainted and the information leak would be missed.
Implicit flow taint propagation has long been an important problem. As with explicit flow, if it is not handled correctly the results of taint analysis will be inaccurate. Variables that should be tainted but are left unmarked because implicit flows are handled improperly constitute the under-tainting problem; conversely, marking too many variables so that taint spreads excessively constitutes the over-tainting problem. Current research on implicit flows focuses on minimizing both under-tainting and over-tainting; how existing techniques address these problems is discussed later.
Program vulnerabilities that can be detected with taint analysis are generally called taint-style vulnerabilities, for example SQL injection vulnerabilities:
String user = getUser();
String pass = getPass();
String sqlQuery = "select * from login where user='" + user + "' and pass='" + pass + "'";
Statement stam = con.createStatement();
ResultSet rs = stam.executeQuery(sqlQuery);
if (rs.next())
success = true;
During taint analysis, the variables user and pass are marked as tainted. Because the value of sqlQuery is derived from user and pass, sqlQuery is also marked as tainted. The program then uses sqlQuery as the parameter when constructing and executing the SQL statement, so the program can be determined to have a SQL injection vulnerability.
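A common way to break this taint flow is to treat a parameterized query as the sanitizer. The following hedged sketch assumes a standard JDBC Connection con is available (it is not part of the original example):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeLogin {
    // With a parameterized query, the tainted user/pass values never become part of
    // the SQL text, so the attacker cannot change the structure of the query at the sink.
    static boolean login(Connection con, String user, String pass) throws SQLException {
        String sqlQuery = "select * from login where user = ? and pass = ?";
        try (PreparedStatement stmt = con.prepareStatement(sqlQuery)) {
            stmt.setString(1, user);
            stmt.setString(2, pass);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```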
- Data-flow-based taint analysis. Without considering implicit information flows, taint analysis can be treated as a data-flow analysis of the tainted data: taint information is tracked along program paths according to the taint propagation rules, marking which variables become tainted, and it is then checked whether the tainted information affects sensitive operations.
- Dependency-based taint analysis. Implicit information flows are considered. During the analysis, the dependencies between statements or instructions are used to check whether the sensitive operation at a Sink depends on the operation that receives tainted information at a Source.
Sanitization (harmless processing)
Tainted data may pass through a sanitization module while it propagates. Sanitization means that after the data has been processed by the module, it no longer carries sensitive information, or operations on it can no longer harm the system; in other words, after data carrying a taint mark passes through a sanitization module, the taint mark can be removed. Proper use of sanitization reduces the number of taint marks in the system, improves the efficiency of taint analysis, and avoids the inaccurate results caused by taint over-propagation.
In practice, sensitive data is usually encrypted to prevent it from being disclosed (to protect confidentiality); in this case the encryption library functions should be identified as sanitization modules. On the one hand, the encryption algorithms used in these library functions make it difficult for an attacker to compute the possible range of the plaintext; on the other hand, the encrypted data is no longer a threat, so continuing to propagate its taint mark is meaningless.
In addition, to prevent external data from harming critical parts of the system through dangerous operations (to protect integrity), input data is usually validated; in this case the input validation module should be recognized as a sanitization module.
For example, to prevent code injection vulnerabilities, the htmlentities function provided by PHP converts characters with special meaning in HTML into HTML entities (for example, '<' becomes '&lt;'). After such a conversion, the input string no longer carries any code that could cause harm and can be safely sent to users. Besides the input validation functions provided by the language itself, some systems provide additional input validation tools, such as ScriptGard, CSA, XSS Auditor, and Bek; these tools should also be recognized as sanitization modules.
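As a hedged Java illustration of input validation acting as a sanitizer (escapeHtml below is a simplified stand-in written for this sketch, not a library function):

```java
// Minimal HTML-escaping sanitizer: after this transformation the string can no longer
// introduce markup, so a taint analysis may drop the taint mark at this point.
public class HtmlSanitizer {
    static String escapeHtml(String input) {
        return input.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String tainted = "<script>alert(1)</script>";   // attacker-controlled input
        System.out.println(escapeHtml(tainted));        // safe to echo back to the user
    }
}
```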
Taint Analysis Based on Explicit Data Flow
In data-flow-based taint analysis, auxiliary analyses such as alias analysis and value analysis are often required to improve precision. The auxiliary analyses and the taint analysis are performed alternately; the flow of taint information is usually analyzed along program paths, checking whether the taint information received at a Source point can affect the sensitive operation at a Sink point.
In intraprocedural analysis, each statement or instruction in the procedure is analyzed in a certain order to determine how the taint information flows.
- Recording taint information. At the static analysis level, the main concern is whether program variables are tainted. To record this, a taint label is usually attached to each variable; the simplest label is a Boolean indicating whether the variable is tainted, while more elaborate labels can also record which source points a variable's taint comes from, or even which part of the data received at a source it depends on. Taint labels can also be avoided entirely by tracking tainted variables directly, for example with a stack or queue that records which variables are currently tainted.
- Analyzing program statements. After deciding how to record taint information, the program statements are analyzed statically. The statements of main interest are assignment statements, control transfer statements, and procedure call statements.

Assignment statements
- For a simple assignment statement such as a = b, the variable on the left-hand side is recorded as having the same taint state as the variable on the right-hand side. Constants in a program are generally considered untainted, so if a variable is assigned a constant (for example, a = 3), the variable is considered untainted after the assignment, provided implicit information flows are not considered.
- For an assignment statement with a binary operation, such as a = b + c, it is generally specified that the left-hand side is tainted if at least one operand on the right-hand side is tainted (unless the result of the right-hand side is a constant):

  b = source(); c = 1;
  a = b + c; // b is tainted, so a is tainted as well
  d = 1 + c; // neither operand is tainted, so d remains untainted
- For assignments involving array elements, if the value or range of the array index can be determined by static analysis, it is possible to determine precisely which element or elements of the array are tainted. Usually, however, static analysis cannot determine the index, so the whole array is conservatively treated as tainted.
- For assignment statements involving fields or pointer operations, the results of a points-to analysis are often required.
Control transfer statements
- When analyzing a conditional control transfer statement, first consider that the path condition may restrict the tainted data. In practice it is often necessary to recognize such restrictions and determine whether they are sufficient to guarantee that the program cannot be attacked; if the restriction imposed by the path condition is judged sufficient, the corresponding variable can be marked as untainted.
- For loop statements, it is generally required that the range of the loop variable cannot be affected by the input. For example, in the statement for (i = 1; i < k; i++) {}, it can be required that the loop upper bound k is not tainted.
Procedure call statements
- Interprocedural analysis can be performed, or a procedure summary can be applied directly. A procedure summary used for taint analysis mainly describes how the procedure changes the taint state of the variables it involves, and which variables need their taint state checked; these variables may be the procedure's parameters, fields of the parameters, or the procedure's return value. For example, in the statement flag = obj.method(str), if str is tainted, interprocedural analysis marks the field str of the object obj as tainted, while the variable flag that records the return value of method is marked as untainted.
- In a practical interprocedural analysis, a procedure summary can be built for each analyzed procedure. For the statement above, the summary could state that the taint state of the parameter of method determines the taint state of the instance field str of its receiver object, and that its return value is untainted; the next time the procedure is encountered, the summary can be used directly.
- Code traversal (a sketch follows this list). In general, flow-sensitive or path-sensitive methods are used to traverse and analyze the code within a procedure. With a flow-sensitive method, the analysis results on different paths can be collected to discover the data purification rules in the program. With a path-sensitive method, attention must be paid to the path conditions: if a path condition restricts the value of a tainted variable, the condition can be regarded as purifying the tainted data, and the restrictions that path conditions impose on tainted data can also be recorded. If these restrictions are sufficient to ensure that the data cannot be exploited by an attacker along a given program path, the corresponding variables can be marked as untainted on that path.
Interprocedural analysis is similar to interprocedural data-flow analysis: each procedure in the call graph is analyzed bottom-up, and the program is then analyzed as a whole.
Taint Analysis Based on Dependencies (Implicit Analysis)
When taint analysis is used to detect program vulnerabilities, the main objects of concern are vulnerabilities related to tainted data, such as SQL injection, command injection, and cross-site scripting vulnerabilities.
The following is an example of an ASP program with a SQL injection vulnerability:
<%
Set pwd = "bar"
Set sql1 = "SELECT companyname FROM " & Request.Cookies("hello")
Set sql2 = Request.QueryString("foo")
MySqlStuff pwd, sql1, sql2
Sub MySqlStuff(password, cmd1, cmd2)
Set conn = Server.CreateObject("ADODB.Connection")
conn.Provider = "Microsoft.Jet.OLEDB.4.0"
conn.Open "c:/webdata/foo.mdb", "foo", password
Set rs = conn.Execute(cmd2)
Set rs = Server.CreateObject("ADODB.recordset")
rs.Open cmd1, conn
End Sub
%>
First, the code is expressed as three-address code. For example, line 3 can be expressed as:
a = "SELECT companyname FROM "
b = "hello"
param0 Request
param1 b
call Cookies
return c
sql1 = a & c
After parsing, the control flow of the program code needs to be analyzed. There is only one call relationship (line 5).
Next, the Source points, Sink points, and the initially tainted data in the program need to be identified.
The specific analysis process is as follows:
- The result returned by the call Request.Cookies("hello") is tainted, so the variable sql1 is also tainted.
- The result sql2 returned by the call Request.QueryString("foo") is tainted.
- The function MySqlStuff is called with the tainted arguments sql1 and sql2, so the analysis turns to processing this function: according to the function declaration in line 6, its parameters cmd1 and cmd2 are marked as tainted.
- Line 10 is the Sink point of the program: conn.Execute performs the SQL operation and its argument cmd2 is tainted. Tainted data thus propagates from a Source point to a Sink point, so the program is judged to have a SQL injection vulnerability.
Dynamic taint analysis
Basic principles of dynamic taint analysis
Dynamic taint analysis monitors data flow or control flow while the program is running, in order to track the explicit propagation of data in memory and detect misuse of the data. The only difference between dynamic and static taint analysis is that static taint analysis does not actually run the program during detection: it propagates taint marks by simulating the program's execution, whereas dynamic taint analysis runs the program and propagates and checks taint marks in real time during execution.
Dynamic taint analysis can be divided into three parts:
- Taint data marking: the program's attack surface is the set of interfaces through which it accepts input data, generally consisting of program entry points and external function calls. In taint analysis, input data coming from outside is marked as tainted; depending on its origin, the input can be divided into three categories: network input, file input, and input-device input.
- Dynamic taint tracking: on the basis of the taint marks, the process is tracked and analyzed dynamically at instruction granularity, analyzing the effect of each instruction until the whole program run is covered and the data-flow propagation has been tracked. Dynamic taint tracking is usually based on one of three mechanisms:
  - Dynamic binary instrumentation: tracks the flow of tainted data within a single process by inserting analysis code into the analyzed program, which follows the tainted information flow inside that process.
  - Full-system emulation: analyzes the taint propagation of every instruction executed in an emulated system, thereby tracking the flow of tainted data through the operating system.
  - Virtual machine monitor: adds taint information flow analysis to the virtual machine monitor, tracking the flow of tainted data among the virtual machines of the entire guest.

  Dynamic taint tracking usually relies on shadow memory that mirrors the taint state of actual memory, recording whether each memory region and register is tainted. When each statement is analyzed, the taint tracker consults the shadow memory to decide whether taint information propagates, performs the propagation, and writes the result back to the shadow memory, thereby tracking the flow of tainted data.

  In general, both data movement instructions and arithmetic instructions cause explicit information flow. To track the explicit propagation of tainted data, every data movement and arithmetic instruction must be monitored before it executes; when the result of an instruction is tainted by one of its operands, the shadow memory entry for the result is set to point to the data structure of the tainting source operand.
- Taint misuse checking: after the tainted data has been marked correctly and its propagation is being tracked in real time, attacks must be detected correctly, that is, it must be checked whether the tainted data is used illegally.
Implementation of dynamic taint analysis
Taint data marking
Tainted data usually refers to the external input accepted by the software system. In the computer, such data may exist as temporary data in memory or be stored in files. When the program needs to use the data, it usually accesses and processes it through functions or system calls, so it suffices to monitor these key functions to learn what tainted information the program reads or writes. In addition, for network input, the network I/O functions also need to be monitored.
After the tainted data has been identified, it must be marked. The taint lifetime is the time range during which a taint is considered valid: it starts when the taint is created and a taint mark is generated, and ends when the taint is deleted and the mark is removed.
- Taint creation
  - When data from an untrusted source is assigned to a register or memory operand
  - When data already marked as tainted is assigned, via an operation, to a register or memory operand
- Taint removal
  - When untainted data is assigned to a register or memory operand that currently holds a taint
  - When tainted data is assigned to a register or memory address that already holds a taint, the original taint is deleted and a new taint is created
  - When certain arithmetic or logic operations that clear the value also remove the taint (for example, xor a, a)
Dynamic taint tracking
When tainted data moves from one location to another, taint propagation occurs. The taint propagation rules are:
Instruction type | Propagation rule | Example
---|---|---
Copy or move instruction | T(a) <- T(b) | mov a, b
Arithmetic instruction | T(a) <- T(b) | add a, b
Stack operation instruction | T(esp) <- T(a) | push a
Copy or move style function call | T(dst) <- T(src) | call memcpy
Zeroing instruction | T(a) <- false | xor a, a
T(x) takes the value true or false: when it is true, x is tainted; otherwise x is not tainted.
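A minimal sketch of how these propagation rules might be applied against a shadow map, written in Java purely for illustration; the location keys and the opcode dispatch are simplified assumptions rather than a real instrumentation framework:

```java
import java.util.HashMap;
import java.util.Map;

// Toy shadow memory: maps a register or memory location name to its taint flag T(x).
public class ShadowMemory {
    private final Map<String, Boolean> taint = new HashMap<>();

    boolean t(String loc) { return taint.getOrDefault(loc, false); }
    void set(String loc, boolean v) { taint.put(loc, v); }

    // Apply the propagation rules from the table above to one executed instruction.
    void propagate(String opcode, String a, String b) {
        switch (opcode) {
            case "mov", "add" -> set(a, t(b));          // T(a) <- T(b)
            case "push"       -> set("esp_slot", t(a)); // T(esp) <- T(a), stack slot simplified
            case "memcpy"     -> set(a, t(b));          // T(dst) <- T(src)
            case "xor_self"   -> set(a, false);         // xor a, a clears the taint
        }
    }

    public static void main(String[] args) {
        ShadowMemory shadow = new ShadowMemory();
        shadow.set("eax", true);                        // eax holds tainted input
        shadow.propagate("mov", "ebx", "eax");          // mov ebx, eax
        System.out.println(shadow.t("ebx"));            // true: taint propagated
        shadow.propagate("xor_self", "ebx", "ebx");     // xor ebx, ebx
        System.out.println(shadow.t("ebx"));            // false: taint removed
    }
}
```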
The direction of the taint information flow can be analyzed through taint tracking and function monitoring. However, because object-level information is missing, instruction-level information flow alone cannot fully describe the exact behavior of the analyzed software; it is therefore necessary to reconstruct higher-level views based on function monitoring, for example to obtain details of file objects and socket objects, to facilitate further analysis.
According to the practical needs of vulnerability analysis, taint analysis should provide two kinds of information:
- For any taint, how it propagates.
- All instruction information involved in processing tainted data, including the instruction address, opcode, operands, and the order in which these instructions execute during taint processing.
The implementation of dynamic taint tracking usually uses:
- Shadow memory: an image of the tainted data in real memory, used to store all taints that are valid at the current moment of program execution.
- Taint propagation tree: used to represent the propagation relationships between taints.
- Taint processing instruction chain: used to store, in chronological order, all instructions involved in processing tainted data.
When an instruction that causes taint propagation is encountered, each operand of the instruction is first mapped through a quick taint lookup to check whether a corresponding shadow taint exists in the shadow memory, i.e., whether the operand is tainted. Then, according to the taint propagation rules, the propagation result of the instruction is computed; any new taint produced by the propagation is added to the shadow memory and the taint propagation tree, and the shadow taints corresponding to invalidated taints are deleted. In addition, whether an instruction touches tainted data can only be determined dynamically during the analysis, so the instruction information must be recorded in the taint processing instruction chain.
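A brief, purely illustrative sketch of how the taint propagation tree and the taint processing instruction chain might be represented (the class and field names are assumptions made for this example):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative data structures for dynamic taint tracking.
public class TaintStructures {
    // A node in the taint propagation tree: one taint plus the taints derived from it.
    static class TaintNode {
        final String location;                                // register or memory address holding the taint
        final List<TaintNode> children = new ArrayList<>();   // taints propagated from this one
        TaintNode(String location) { this.location = location; }
    }

    // One entry of the taint processing instruction chain, recorded in execution order.
    record TaintInstr(long address, String opcode, String operands) {}

    public static void main(String[] args) {
        TaintNode root = new TaintNode("[ebp-0x2a]");          // taint created at the source
        root.children.add(new TaintNode("eax"));               // taint propagated into eax

        List<TaintInstr> chain = new ArrayList<>();
        chain.add(new TaintInstr(0x0804860cL, "push", "eax")); // instruction that touched tainted data
        System.out.println(root.location + " -> " + root.children.get(0).location
                + ", " + chain.size() + " taint-related instruction(s) recorded");
    }
}
```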
Taint misuse checking
Taint-sensitive points, i.e., Sink points, are the instructions or system call points at which tainted data may be misused. They mainly include:
- Jump addresses: check whether tainted data is used as a jump target, such as a return address, function pointer, or function pointer offset. Concretely, each jump-class instruction (such as call, ret, and jmp) is monitored and analyzed before it executes, to ensure that the jump target is not taken from tainted memory.
- Format strings: check whether tainted data is used as the format string argument of printf-family functions.
- System call arguments: check whether particular arguments of particular system calls are tainted.
- Flag bits: track whether flag bits are tainted and whether tainted flag bits are used to change the program's control flow.
- Addresses: check whether the address used by a data movement instruction is tainted.
Taint misuse checks usually follow certain vulnerability patterns: the manifestation of common vulnerabilities in binary code must first be made explicit and then refined into vulnerability patterns, so that automated security analysis can be guided more effectively.
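Continuing the shadow-memory sketch above, a jump-target misuse check at a Sink point might look like the following toy example (isTainted and the alert message are assumptions made for illustration):

```java
// Toy Sink-point check: before a call/ret/jmp executes, verify that the jump
// target was not loaded from tainted memory.
public class JumpTargetCheck {
    static boolean isTainted(String location) {
        // A real tool would consult the shadow memory here; hard-coded for the sketch.
        return location.equals("[ebp-0x2a]");
    }

    static void checkIndirectJump(String opcode, String targetLocation) {
        if (isTainted(targetLocation)) {
            System.out.println("ALERT: " + opcode + " uses a tainted jump target from " + targetLocation);
        }
    }

    public static void main(String[] args) {
        checkIndirectJump("ret", "[ebp-0x2a]");   // tainted return address -> alert
        checkIndirectJump("jmp", "eax");          // untainted target -> no alert
    }
}
```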
Example of Dynamic Taint Analysis
#include <stdio.h>
#include <string.h>

void fun(char *str){
    char temp[15];
    printf("in strncpy source: %s\n", str);
    strncpy(temp, str, strlen(str));   /* overflow: up to strlen(str) bytes copied into a 15-byte buffer */
}

int main(int argc, char *argv[]){
    char source[30];
    gets(source);                      /* Source point: reads external input */
    if (strlen(source) < 30)
        fun(source);
    else
        printf("too long string, %s\n", source);
    return 0;
}
The vulnerability is obvious: a buffer overflow can occur when strncpy is called, because up to strlen(str) bytes are copied into the 15-byte buffer temp.
The binary code where the program accepts the external input string is as follows:
0x08048609 <+51>: lea eax,[ebp-0x2a]
0x0804860c <+54>: push eax
0x0804860d <+55>: call 0x8048400 <gets@plt>
...
0x0804862c <+86>: lea eax,[ebp-0x2a]
0x0804862f <+89>: push eax
0x08048630 <+90>: call 0x8048566 <fun>
The binary code where the program calls the strncpy function is as follows:
0x080485a1 <+59>: push DWORD PTR [ebp-0x2c]
0x080485a4 <+62>: call 0x8048420 <strlen@plt>
0x080485a9 <+67>: add esp,0x10
0x080485ac <+70>: sub esp,0x4
0x080485af <+73>: push eax
0x080485b0 <+74>: push DWORD PTR [ebp-0x2c]
0x080485b3 <+77>: lea eax,[ebp-0x1b]
0x080485b6 <+80>: push eax
0x080485b7 <+81>: call 0x8048440 <strncpy@plt>
First, while scanning the program's binary code, the call at 0x0804860d <+55> call <gets@plt> can be identified; this function reads external input, so it is the program's attack point. After the attack point is determined, the tainted source data is identified and marked: the array at [ebp-0x2a] (that is, source in the source program) is marked as tainted data. As the program continues to execute, the taint mark is carried along with the propagation of this value. When the fun() function is entered, the taint mark is transferred to the parameter str through the mapping of arguments to formal parameters. Execution then reaches the Sink-point function strncpy(): both its second argument str and its third argument strlen(str) are tainted data. Finally, when strncpy() is executed, if the corresponding vulnerability rule has been configured (the destination array is smaller than the source data), the rule is triggered and the buffer overflow vulnerability is detected.