CS224 Program Analysis@Shanghaitech 24 Fall Notes

标签：24 CS224 Notes analysis program Analysis basic block OUT

1. Introduction

Rice's Theorem

Static Analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties before running P.

Rice's Theorem: Any non-trivial property of the behavior of programs in a recursively enumerable language is undecidable.

我们可以把任意程序看成一个从输入到输出上的函数（输入输出对的集合），该函数描述了程序的行为关于该函数/集合的任何非平凡属性，都不存在可以检查该属性的通用算法

Regular languages \(\in\) Context-Free Languages \(\in\) Recursive Languages \(\in\) Recursively Enumerable Languages \(\in\) Non-Recursively Enumerable Languages

A language is recursively enumerable (r.e.) if it is the set of strings accepted by some TM.
A language is recursive if it is the set of strings accepted by some TM that halts on every input.
A property is trivial if either it is not satisfied by any r.e. language, or if it satisfies by all r.e. languages; otherwise it is non-trivial（平凡属性，要么对全体程序都为真，要么对全体程序都为假）

Examples of trivial properties:

The programs are either the empty or they contain at least one word.
The length of a program must be finite.

Approximate: Useful Static Analysis

Original decision problem: output Yes or No

Approximate solution to decision problem: Yes, No or Unknown

Must Analysis, lower/under approximate: only output yes or unknown （outputs information that must be true）
May Analysis, upper/over approximate: only output no or unknown （outputs information that may be true）

Objective: Answer Yes or No as much as possible, answer unknown as little as possible

假设正确答案是一个集合 S，must analysis 总是返回 S 的子集，may analysis 总是返回集合 S 的超集

Necessity of Soundness

Soundness (健壮性) in static-analysis refers to the property that if an analysis reports no errors, then the program is guaranteed to be free of the class of errors being checked. In other words, a sound static analysis will not miss any real bugs or defects in the code that it claims to detect. 可以理解为允许误报，但不能漏报。

Completeness (完备性) 可以理解为可能漏报，但不误报的情况。

Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed.

Abstraction + Over-approximation

Two Words to Conclude Static Analysis (for most analysis): Abstraction + Over-approximation

Abstraction

e.g: ints -> signs (+, -, 0, ⊤(unknown), ⊥(undefined))

Overapproximation

transfer function

In static analysis, transfer functions define how to evaluate different program statements on abstract values.

Transfer functions are defined according to “analysis problem” and the “semantics” of different program statements

e.g: '+' + '-' = '⊤'(unknown)

control flows

As it’s impossible to enumerate all paths in practice, flow merging (as a way of over-approximation) is taken for granted in most static analyses.

2. IR

一般基于 IR 做分析，而不是 AST。

AST

high-level and closed to grammar structure •
usually language dependent
suitable for fast type checking
lack of control flow information

low-level and closed to machine code
usually language independent
compact and uniform
contains control flow information
usually considered as the basic for static analysis

3-Address Code (3AC)

at most one operator on the right side of an instruction.

e.g

t2 = a + b + 3 is not 3AC
t1 = a + b is 3AC

each 3AC contains at most 3 addresses. Address can be one of the following:

Name: such as a, b
Constant: such as 3
Compiler-generated temporary: such as t1, t2

Some Common 3AC Forms:

x = y bop z ( bop: binary arithmetic or logical operation)
x = uop y (uop: unary operation (minus, negation, casting) )
x = y
goto L
if x goto L
if x rop y goto L (rop: relational operator (>,<,==,>=,<=, etc))

Soot

Soot is the most pupular static analysis framework for java

Its IR is Jimple: typed 3-address code

Static Single Assignment (SSA) (Optional)

All assignments in SSA are to variables with distinct names

Give each definition a fresh name
Propagate fresh name to subsequent uses
Every variable has exactly one definition

When a variable use is at control flow merges, a special merge operator, ∅ (called phi-function), is introduced to select the values at merge nodes

Why SSA

Flow information is indirectly incorporated into the unique variable names.

Define-and-Use pairs are explicit

Why not SSA

SSA may introduce too many variables and phi-functions

May introduce inefficiency problem when translating to machine code (due to copy operations)

Control Flow Analysis

Usually refer to building Control Flow Graph (CFG). CFG serves as the basic structure for static analysis.

The node in CFG can be an individual 3-address instruction, or (usually) a Basic Block (BB)

Basic Block

Basic blocks(BB) are maximal sequences of consecutive 3AC with the properties that

it can be entered only at the beginning, i.e., the first instruction in the block.
it can be exited only at the end, i.e., the last instruction in the block.

How to build Basic blocks

INPUT: A sequence of three-address instructions of P

OUTPUT: A list of basic blocks of P

METHOD:

Determine the leaders in P
- The first instruction in P is a leader
- Any target instruction of a conditional or unconditional jump is a leader
- Any instruction that immediately follows a conditional or unconditional jump is a leader
Build BBs for P
- A BB consists of a leader and all its subsequent instructions until the next leader

Control Flow Graph

The nodes of CFG are basic blocks

There is an edge from block A to block B if and only if

There is a conditional or unconditional jump from the end of A to the beginning of B
B immediately follows A in the original order of instructions and A does not end in an unconditional jump

We say that A is a predecessor of B, and B is a successor of A

Usually we add two nodes, Entry and Exit.

They do not correspond to executable IR
A edge from Entry to the BB containing the first instruction of IR
A edge to Exit from any BB containing an instruction that could be the last instruction of IR

3. Data Flow Analysis

Each execution of an IR statement transforms an input state to a new output state，the input (output) state is associated with program point before (after) the statement

Data-flow analysis is to find a solution to a set of safe-approximation-directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.

constraints based on semantics of statements (transfer functions)
- Forward Analysis: OUT[s] = f_s(IN[s])
- Backward Analysis: IN[s] = f_s(OUT[s])
constraints based on the flows of control
- Control flow with a BB
  - \(IN[
    标签：24,CS224,Notes,analysis,program,Analysis,basic,block,OUT
    From： https://www.cnblogs.com/YjmStr/p/18456001