1. Introduction
Rice's Theorem
Static Analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties before running P.
Rice's Theorem: Any non-trivial property of the behavior of programs in a recursively enumerable language is undecidable.
我们可以把任意程序看成一个从输入到输出上的函数(输入输出对的集合),该函数描述了程序的行为关于该函数/集合的任何非平凡属性,都不存在可以检查该属性的通用算法
Regular languages \(\in\) Context-Free Languages \(\in\) Recursive Languages \(\in\) Recursively Enumerable Languages \(\in\) Non-Recursively Enumerable Languages
-
A language is recursively enumerable (r.e.) if it is the set of strings accepted by some TM.
-
A language is recursive if it is the set of strings accepted by some TM that halts on every input.
-
A property is trivial if either it is not satisfied by any r.e. language, or if it satisfies by all r.e. languages; otherwise it is non-trivial(平凡属性,要么对全体程序都为真,要么对全体程序都为假)
Examples of trivial properties:
- The programs are either the empty or they contain at least one word.
- The length of a program must be finite.
Approximate: Useful Static Analysis
Original decision problem: output Yes or No
Approximate solution to decision problem: Yes, No or Unknown
- Must Analysis, lower/under approximate: only output yes or unknown (outputs information that must be true)
- May Analysis, upper/over approximate: only output no or unknown (outputs information that may be true)
Objective: Answer Yes or No as much as possible, answer unknown as little as possible
假设正确答案是一个集合 S,must analysis 总是返回 S 的子集,may analysis 总是返回集合 S 的超集
Necessity of Soundness
Soundness (健壮性) in static-analysis refers to the property that if an analysis reports no errors, then the program is guaranteed to be free of the class of errors being checked. In other words, a sound static analysis will not miss any real bugs or defects in the code that it claims to detect. 可以理解为允许误报,但不能漏报。
Completeness (完备性) 可以理解为可能漏报,但不误报的情况。
Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed.
Abstraction + Over-approximation
Two Words to Conclude Static Analysis (for most analysis): Abstraction + Over-approximation
Abstraction
e.g: ints -> signs (+, -, 0, ⊤(unknown), ⊥(undefined))
Overapproximation
transfer function
In static analysis, transfer functions define how to evaluate different program statements on abstract values.
Transfer functions are defined according to “analysis problem” and the “semantics” of different program statements
e.g: '+' + '-' = '⊤'(unknown)
control flows
As it’s impossible to enumerate all paths in practice, flow merging (as a way of over-approximation) is taken for granted in most static analyses.
2. IR
一般基于 IR 做分析,而不是 AST。
AST
-
high-level and closed to grammar structure •
-
usually language dependent
-
suitable for fast type checking
-
lack of control flow information
IR
- low-level and closed to machine code
- usually language independent
- compact and uniform
- contains control flow information
- usually considered as the basic for static analysis
3-Address Code (3AC)
at most one operator on the right side of an instruction.
e.g
- t2 = a + b + 3 is not 3AC
- t1 = a + b is 3AC
each 3AC contains at most 3 addresses. Address can be one of the following:
- Name: such as a, b
- Constant: such as 3
- Compiler-generated temporary: such as t1, t2
Some Common 3AC Forms:
- x = y bop z ( bop: binary arithmetic or logical operation)
- x = uop y (uop: unary operation (minus, negation, casting) )
- x = y
- goto L
- if x goto L
- if x rop y goto L (rop: relational operator (>,<,==,>=,<=, etc))
Soot
Soot is the most pupular static analysis framework for java
Its IR is Jimple: typed 3-address code
Static Single Assignment (SSA) (Optional)
All assignments in SSA are to variables with distinct names
- Give each definition a fresh name
- Propagate fresh name to subsequent uses
- Every variable has exactly one definition
When a variable use is at control flow merges, a special merge operator, ∅ (called phi-function), is introduced to select the values at merge nodes
Why SSA
Flow information is indirectly incorporated into the unique variable names.
Define-and-Use pairs are explicit
Why not SSA
SSA may introduce too many variables and phi-functions
May introduce inefficiency problem when translating to machine code (due to copy operations)
Control Flow Analysis
Usually refer to building Control Flow Graph (CFG). CFG serves as the basic structure for static analysis.
The node in CFG can be an individual 3-address instruction, or (usually) a Basic Block (BB)
Basic Block
Basic blocks(BB) are maximal sequences of consecutive 3AC with the properties that
- it can be entered only at the beginning, i.e., the first instruction in the block.
- it can be exited only at the end, i.e., the last instruction in the block.
How to build Basic blocks
INPUT: A sequence of three-address instructions of P
OUTPUT: A list of basic blocks of P
METHOD:
- Determine the leaders in P
- The first instruction in P is a leader
- Any target instruction of a conditional or unconditional jump is a leader
- Any instruction that immediately follows a conditional or unconditional jump is a leader
- Build BBs for P
- A BB consists of a leader and all its subsequent instructions until the next leader
Control Flow Graph
The nodes of CFG are basic blocks
There is an edge from block A to block B if and only if
- There is a conditional or unconditional jump from the end of A to the beginning of B
- B immediately follows A in the original order of instructions and A does not end in an unconditional jump
We say that A is a predecessor of B, and B is a successor of A
Usually we add two nodes, Entry and Exit.
- They do not correspond to executable IR
- A edge from Entry to the BB containing the first instruction of IR
- A edge to Exit from any BB containing an instruction that could be the last instruction of IR
3. Data Flow Analysis
Each execution of an IR statement transforms an input state to a new output state,the input (output) state is associated with program point before (after) the statement
Data-flow analysis is to find a solution to a set of safe-approximation-directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.
- constraints based on semantics of statements (transfer functions)
- Forward Analysis: OUT[s] = f_s(IN[s])
- Backward Analysis: IN[s] = f_s(OUT[s])
- constraints based on the flows of control