1. Theano: A CPU and GPU Math Compiler in Python (2010, optional)
- Motivation
- Python is slow; one reason is that Python uses full-fledged Python objects on the heap to represent even simple numeric scalars.
- To reduce the overhead in numeric calculations, it is important to use array types such as NumPy's ndarray, so that a single Python object on the heap can stand for a multidimensional array of numeric scalars, each stored efficiently in the host processor's native format. NumPy's functions are implemented in C for use within Python programs. However, the composition of many such NumPy functions can be unnecessarily slow when each call is dominated by the cost of transferring memory rather than the cost of performing calculations.
- [numexpr] goes one step further by providing a loop fusion optimization that can glue several element-wise computations together. Unfortunately, numexpr requires an unusual syntax (the expression must be encoded as a string within the code).
- [Cython] and [scipy.weave] address Python's performance issue by offering a simple way to hand-write crucial segments of code in C (or a dialect of Python which can be easily compiled to C, in Cython's case).
- Theano, on the other hand, works on a symbolic representation of mathematical expressions, provided by the user in a NumPy-like syntax. Access to the full computational graph of an expression opens the door to advanced features such as symbolic differentiation of complex expressions, but more importantly allows Theano to perform local graph transformations that can correct many unnecessary, slow, or numerically unstable expression patterns. Once optimized, the same graph can be used to generate CPU as well as GPU implementations (the latter using CUDA) without requiring changes to user code.
- Work
- Four steps (canonicalization, stabilization, specialization, code generation)
- What kinds of work does Theano support?
- Theano's expression types cover much of the same functionality as NumPy, and include some of what can be found in SciPy.
- Compilation by theano.function
- Canonicalization
- For example, duplicate expressions are merged into a single expression. Two expressions are considered duplicates if they carry out the same operation and have the same inputs. Since Theano expressions are purely functional (i.e., they have no side effects, such as modifying a global variable), these expressions must return the same value and thus it is safe to perform the operation once and reuse the result.
- For another example, subexpressions involving only multiplication and division are put into a standard fraction form (e.g. a / (((a * b) / c) / d) -> (a * c * d) / (a * b) -> (c * d) / (b)).
- Some useless calculations are eliminated in this phase, for instance cancelling out uses of the a term in the previous example, but also reducing exp(log(x)) to x.
- Stabilization
- For instance, consider the function log(1 + exp(x)), which tends toward zero as x → −∞, and toward x as x → +∞. Due to limitations in the representation of double-precision numbers, the computation as written yields infinity for x > 709. The stabilization phase replaces patterns like this one with an expression that simply returns x when x is sufficiently large.
- Specialization
- The specialization transformation replaces expressions with faster ones.
- Expressions like pow(x, 2) become sqr(x).
- Expressions involving scalar-multiplied matrix additions and multiplications may become BLAS General Matrix Multiply (GEMM) nodes, and reshape, transpose, and subtensor expressions (which create copies by default) are replaced by constant-time versions that work by aliasing memory.
- Expression subgraphs involving element-wise operations are fused together (as in numexpr) in order to avoid the creation and use of unnecessary temporary variables.
- If the user desires to use the GPU, expressions with corresponding GPU implementations are substituted in, and transfer expressions are introduced where needed.
- Code Generation
- The code generation phase (i.e., C code generated from Python?) of the compilation process produces and loads dynamically-compiled Python modules with specialized implementations for the expressions in the computation graph. Not all expressions have C (technically C++) implementations, but many (roughly 80%) of Theano's expressions generate and compile C or CUDA code during theano.function. The majority of expressions that generate C code specialize the code based on the dtype, broadcasting pattern, and number of dimensions of their arguments. A few expressions, such as the small-filter convolution (conv2d), further specialize code based on the size the arguments will have.
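- A minimal usage sketch (my own, not from the paper; assumes Theano ~1.0): theano.function runs the four phases above on the symbolic graph, and the stabilization pass should rewrite log(1 + exp(x)) into a numerically stable softplus form.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')                    # symbolic input, no data attached yet
y = T.log(1 + T.exp(x))               # numerically unstable as written
gy = T.grad(y.sum(), x)               # symbolic differentiation on the same graph

f = theano.function([x], [y, gy])     # canonicalize -> stabilize -> specialize -> codegen
val, grad = f(np.array([0.0, 100.0, 1000.0]))
print(val, grad)                      # finite values expected (not inf), thanks to stabilization
# theano.printing.debugprint(f)       # inspect the optimized graph (should show softplus)
```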
2. Automatic differentiation in Machine Learning: a Survey (2018, optional)
- Motivation and work
- We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
- Introduction
- Manually working out derivatives and coding them is time-consuming and prone to error.
- numerical differentiation using finite difference approximations
- is simple to implement but can be highly inaccurate due to round-off and truncation errors
- scales poorly for gradients
- symbolic differentiation using expression manipulation in computer algebra systems such as Mathematica, Maxima, and Maple
- results in complex and cryptic expressions plagued with the problem of “expression swell”
- automatic differentiation
- What AD Is Not
- AD Is Not Numerical Differentiation
- Numerical differentiation is the finite difference approximation of derivatives using values of the original function evaluated at some sample points.
- Numerical approximations of derivatives are inherently ill-conditioned and unstable. This is due to the introduction of truncation and round-off errors inflicted by the limited precision of computations and the chosen value of the step size h. Truncation error tends to zero as h → 0. However, as h is decreased, round-off error increases and becomes dominant.
- With the center difference approximation, the first-order errors cancel and one effectively moves the truncation error from first-order to second-order in h. However, with increasing dimensionality, a trade-off between accuracy and performance is faced, where computing a Jacobian matrix of a function f : R^n → R^m requires 2mn evaluations.
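- A short numerical illustration (mine, not from the survey) of the truncation vs. round-off trade-off in the center difference approximation:

```python
import math

def central_diff(f, x, h):
    # (f(x+h) - f(x-h)) / (2h): truncation error is O(h^2), round-off error grows like eps/h
    return (f(x + h) - f(x - h)) / (2 * h)

exact = math.cos(1.0)
for h in (1e-1, 1e-4, 1e-6, 1e-9, 1e-12):
    err = abs(central_diff(math.sin, 1.0, h) - exact)
    print(f"h={h:.0e}  |error|={err:.2e}")
# the error first shrinks as h decreases, then grows again once round-off dominates
```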
- AD Is Not Symbolic Differentiation
- Symbolic differentiation is the automatic manipulation of expressions for obtaining derivative expressions (Grabmeier and Kaltofen, 2003) (Figure 2, center right), carried out by applying transformations representing rules of differentiation such as the sum and product rules.
- If we just proceeded to symbolically differentiate f(x) and plugged its derivative into the appropriate place, we would have nested duplications of any computation that appears in common between f(x) and d/dx f(x). Hence, careless symbolic differentiation can easily produce exponentially large symbolic expressions which take correspondingly long to evaluate. This problem is known as expression swell.
- It is in principle possible to significantly simplify computations by storing only the values of intermediate sub-expressions in memory. Moreover, for further efficiency, we can interleave as much as possible the differentiation and simplification steps. This interleaving idea forms the basis of AD and provides an account of its simplest form: apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function. This is AD in the forward accumulation mode.
- AD and Its Main Modes
- AD can be thought of as performing a non-standard interpretation of a computer program, where this interpretation involves augmenting the standard computation with the calculation of various derivatives.
An important point to note here is that AD can differentiate not only closed-form expressions in the classical sense, but also algorithms making use of control flow such as branching, loops, recursion, and procedure calls, giving it an important advantage over symbolic differentiation which severely limits such expressivity.
- Forward Mode
- This generalizes naturally to computing the Jacobian of a function f : R^n → R^m with n independent (input) variables xi and m dependent (output) variables yj. The full Jacobian can be computed in n evaluations.
- In practice, a function f coded in a programming language of choice would be fed into an AD tool, which would then augment it with corresponding extra code to handle the dual operations so that the function and its derivative are simultaneously computed. This can be implemented through calls to a specific library, in the form of source code transformation where a given source code will be automatically modified, or through operator overloading, making the process transparent to the user.
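- A toy forward-mode AD sketch using dual numbers and operator overloading (illustrative only; the Dual class and the sin helper are my own, not from the survey):

```python
import math

class Dual:
    """Pairs a value with its derivative (tangent); each operator applies the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# d/dx [x*sin(x) + 2x] at x = 1.5: seed the input tangent with 1.0
x = Dual(1.5, 1.0)
y = x * sin(x) + 2 * x
print(y.val, y.dot)   # value and derivative obtained in a single forward pass
```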
- Reverse Mode
- An important advantage of the reverse mode is that it is significantly less costly to evaluate (in terms of operation count) than the forward mode for functions with a large number of inputs (the typical deep learning setting). In the extreme case of f : R^n → R, only one application of the reverse mode is sufficient to compute the full gradient.
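- A stripped-down tape-based reverse-mode sketch in the OO style (illustrative only; Var and backward are invented names):

```python
class Var:
    def __init__(self, val, tape=None):
        self.val, self.grad, self.tape = val, 0.0, tape if tape is not None else []
    def __mul__(self, other):
        out = Var(self.val * other.val, self.tape)
        # record how to push the adjoint of `out` back to its inputs
        self.tape.append(lambda: (self.accum(out.grad * other.val),
                                  other.accum(out.grad * self.val)))
        return out
    def __add__(self, other):
        out = Var(self.val + other.val, self.tape)
        self.tape.append(lambda: (self.accum(out.grad), other.accum(out.grad)))
        return out
    def accum(self, g):
        self.grad += g

def backward(output):
    output.grad = 1.0
    for record in reversed(output.tape):   # walk the tape in reverse
        record()

tape = []
x, y = Var(2.0, tape), Var(3.0, tape)
z = x * y + x          # z = xy + x
backward(z)
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```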
- AD and Machine Learning
- Neural Networks, Deep Learning, Differentiable Programming
- In mainstream frameworks including Theano (Bastien et al., 2012), TensorFlow (Abadi et al., 2016), Caffe (Jia et al., 2014), and CNTK (Seide and Agarwal, 2016), the user first constructs a model as a computational graph using a domain-specific mini language, which then gets interpreted by the framework during execution. This approach has the advantage of enabling optimizations of the computational graph structure (e.g., as in Theano), but the disadvantages of having limited and unintuitive control flow and being difficult to debug.
- In contrast, the lineage of recent frameworks led by autograd (Maclaurin, 2016), Chainer (Tokui et al., 2015), and PyTorch (Paszke et al., 2017) provide truly general-purpose reverse mode AD of the type we outline in Section 3, where the user directly uses the host programming language (e.g., Python) to define the model as a regular program of the forward computation. This eliminates the need for an interpreter, allows arbitrary control flow statements, and makes debugging simple and intuitive.
- The terms define-and-run and static computational graph refer to Theano-like systems where a model is constructed, before execution, as a computational graph structure, which later gets executed with different inputs while remaining fixed. In contrast, the terms define-by-run and dynamic computational graph refer to the general-purpose AD capability available in newer PyTorch-like systems where a model is a regular program in the host programming language, whose execution dynamically constructs a computational graph on-the-fly that can freely change in each iteration. (Static vs. dynamic graphs: a static graph does not change with the input, so control flow such as if and loop must be expressed as operations inside the graph; a dynamic graph is rebuilt as the program runs, so different inputs can take different branches and produce different graphs.)
- Implementations
- A principal consideration in any AD implementation is the performance overhead introduced by the AD arithmetic and bookkeeping (the latter presumably referring to the memory needed to record intermediate values). In terms of computational complexity, AD guarantees that the amount of arithmetic goes up by no more than a small constant factor (Griewank and Walther, 2008).
- Using operator overloading may introduce method dispatches with attendant costs, which, compared to raw numerical computation of the original function, can easily amount to a slowdown of an order of magnitude.
- Translation of mathematics into computer code often requires attention to numeric issues (similar to the problems Theano addresses in its canonicalization/stabilization phases).
- Deep learning systems are generally compute-bound and spend a considerable amount of computation time in highly-optimized numerical kernels for matrix operations (Hadjis et al., 2015; Chetlur et al., 2014). This is a situation which is arguably amenable to operator-overloading-based AD implementations on high-level operations, as is commonly found in current machine learning frameworks. In contrast, numerical simulation workloads in traditional AD applications can be bandwidth-bound, making source code transformation and compiler optimization approaches more relevant (operator overloading vs. source code transformation).
- Another difference worth noting is that whereas high numerical precision is desirable in traditional application domains of AD such as computational fluid dynamics (Cohen and Molemaker, 2009), in deep learning lower precision is sufficient and even desirable in improving computational efficiency, thanks to the error resiliency of neural networks (Gupta et al., 2015; Courbariaux et al., 2015).
- Compilers and Source Code Transformation
- These implementations provide extensions to programming languages that automate the decomposition of algorithms into AD-enabled elementary operations. They are typically executed as preprocessors to transform the input in the extended language into the original language.
- Operator Overloading
- In modern programming languages with polymorphic features, operator overloading provides the most straightforward way of implementing AD, exploiting the capability of redefining elementary operation semantics.
3. TensorFlow: A system for large-scale machine learning (2016, required)
- Motivation
- Previous system: DistBelief
- Defining new layers. For efficiency, we implemented DistBelief layers as C++ classes. Using a separate, less familiar programming language for implementing layers is a barrier for machine learning researchers who seek to experiment with new layer architectures
- Refining the training algorithms. Researchers often want to experiment with new optimization methods, but doing that in DistBelief involves modifying the parameter server implementation
- Defining new training algorithms. This pattern works for training simple feed-forward neural networks, but fails for more advanced models, such as recurrent neural networks, which contain loops [39]; adversarial networks, in which two related networks are trained alternately; and reinforcement learning models, where the loss function is computed by some agent in a separate system, such as a video game emulator [54].
- Design principles
- Dataflow graphs of primitive operators
- TensorFlow represents individual mathematical operators (such as matrix multiplication, convolution, etc.) as nodes in the dataflow graph. Many optimization algorithms require each layer to have defined gradients, and building layers out of simple operators makes it easy to differentiate these models automatically.
- We represent mutable state (e.g., model parameters, i.e., Variables), and the operations that update it, as nodes in the dataflow graph, thus enabling experimentation with different update rules.
- Deferred execution
- A typical TensorFlow application has two distinct phases: the first phase defines the program (e.g., a neural network to be trained and the update rules) as a symbolic dataflow graph with placeholders for the input data and variables that represent the state; and the second phase executes an optimized version of the program on the set of available devices.
- TensorFlow can optimize the execution phase by using global information about the computation. For example, TensorFlow achieves high GPU utilization by using the graph's dependency structure to issue a sequence of kernels to the GPU without waiting for intermediate results (though in practice DL is compute-bound, so how much does this optimization gain?). While this design choice makes execution more efficient, we have had to push more complex features, such as dynamic control flow (§3.4), into the dataflow graph, so that models using these features enjoy the same optimizations.
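- A minimal define-then-run sketch against the TF 1.x API (assumes TensorFlow 1.x, or tf.compat.v1 with eager execution disabled):

```python
import tensorflow as tf  # TensorFlow 1.x style API

# Phase 1: define the program as a symbolic dataflow graph
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")   # input fed at run time
w = tf.Variable(tf.zeros([3, 1]), name="w")                  # mutable state as a graph node
y = tf.matmul(x, w)                                           # no computation happens yet

# Phase 2: execute an optimized version of the graph on the available devices
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1., 2., 3.]]}))         # feed inputs, fetch outputs
```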
- Common abstraction for heterogeneous accelerators
- At a minimum, a device (a hardware accelerator) must implement methods for (i) issuing a kernel for execution, (ii) allocating memory for inputs and outputs, and (iii) transferring buffers to and from host memory. Each operator (e.g., matrix multiplication) can have multiple specialized implementations for different devices. As a result, the same program can easily target GPUs, TPUs, or mobile CPUs as required for training, serving, and offline inference.
- TensorFlow execution model
- Dataflow graph elements
- each vertex represents a unit of local computation, and each edge represents the output from, or input to, a vertex. We refer to the computation at vertices as operations, and the values that flow along edges as tensors.
- Tensors
- We model all data as tensors (n-dimensional arrays).
- At the lowest level, all TensorFlow tensors are dense, for the reasons we discuss in Subsection 2.2 (to reduce memory overhead). TensorFlow offers two alternatives for representing sparse data: either encode the data into variable-length string elements of a dense tensor, or use a tuple of dense tensors (e.g., an n-D sparse tensor with m non-zero elements can be represented in coordinate-list format as an m × n matrix of coordinates and a length-m vector of values). The shape of a tensor can vary in one or more of its dimensions, which makes it possible to represent sparse tensors with differing numbers of elements.
- Operations
- An operation can be polymorphic and variadic at compile-time: its attributes determine both the expected types and arity of its inputs and outputs.
- Stateful operations: variables
- A Variable operation owns a mutable buffer that may be used to store the shared parameters of a model as it is trained. A Variable has no inputs, and produces a reference handle, which acts as a typed capability for reading and writing the buffer.
- Stateful operations: queues
- TensorFlow includes several queue implementations, which support more advanced forms of coordination. The simplest queue is FIFOQueue.
- Partial and concurrent execution
- The API for executing a graph allows the client to specify declaratively the subgraph that should be executed. The client selects zero or more edges to feed input tensors into the dataflow, and one or more edges to fetch output tensors from the dataflow; the runtime then prunes the graph to contain the necessary set of operations. Each invocation of the API is called a step, and TensorFlow supports multiple concurrent steps on the same graph. Stateful operations allow steps to share data and synchronize when necessary.
- Distributed execution
- Each operation resides on a particular device, such as a CPU or GPU in a particular task. A device is responsible for executing a kernel for each operation assigned to it.
- The TensorFlow runtime places operations on devices, subject to implicit or explicit constraints in the graph. The placement algorithm computes a feasible set of devices for each operation, calculates the sets of operations that must be colocated, and selects a satisfying device for each colocation group.
- Dynamic control flow
- TensorFlow supports advanced machine learning algorithms that contain conditional and iterative control flow. For example, a recurrent neural network (RNN) [39] such as an LSTM [32] can generate predictions from sequential data.
- Extensibility case studies
- Differentiation and optimization
- TensorFlow includes a user-level library that differentiates a symbolic expression for a loss function and produces a new symbolic expression representing the gradients. For example, given a neural network as a composition of layers and a loss function, the library will automatically derive the backpropagation code(Source code transformation).
- The differentiation algorithm performs breadth-first search to identify all of the backwards paths from the target operation (e.g., a loss function) to a set of parameters, and sums the partial gradients that each path contributes.
- We have extended the algorithm to differentiate conditional and iterative subcomputations (§3.4) by adding nodes to the graph that record the control flow decisions in the forward pass, and replaying those decisions in reverse during the backward pass (unlike TF Eager's tracing, the backward pass here replays the recorded forward control-flow decisions one-to-one).
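- A small TF 1.x sketch of graph-level differentiation: tf.gradients adds gradient ops to the same graph, which an optimizer then applies (assumes the TF 1.x API):

```python
import tensorflow as tf  # TensorFlow 1.x style API

x = tf.placeholder(tf.float32, shape=[None, 4])
w = tf.Variable(tf.random_normal([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Symbolic differentiation over the graph: new ops computing d(loss)/d(w)
grads = tf.gradients(loss, [w])
train_op = tf.train.GradientDescentOptimizer(0.1).apply_gradients(zip(grads, [w]))
```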
- Implementation
- The distributed master translates user requests into execution across a set of tasks. Given a graph and a step definition, it prunes (§3.2) and partitions (§3.3) the graph to obtain subgraphs for each participating device, and caches these subgraphs so that they may be re-used in subsequent steps. Since the master sees the overall computation for a step, it applies standard optimizations such as common subexpression elimination and constant folding; pruning is a form of dead code elimination. It then coordinates execution of the optimized subgraphs across a set of tasks.
- The dataflow executor in each task handles requests from the master, and schedules the execution of the kernels that comprise a local subgraph. We optimize the dataflow executor for running large graphs with low overhead.
- The runtime contains over 200 standard operations, including mathematical, array manipulation, control flow, and state management operations. Many of the operation kernels are implemented using Eigen::Tensor [36], which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs; however, we liberally use libraries like cuDNN [13] where a more efficient kernel implementation is possible. We have also implemented quantization, which enables faster inference in environments such as mobile devices and high-throughput datacenter applications, and use the gemmlowp low-precision matrix library [35] to accelerate quantized computation.
- If it is difficult or inefficient to represent a subcomputation as a composition of operations, users can register additional kernels that provide an efficient implementation written in C++. We have found it profitable to hand-implement fused kernels for some performance critical operations, such as the ReLU and Sigmoid activation functions and their corresponding gradients. We are currently investigating automatic kernel fusion using a compilation-based approach
4. TensorFlow Eager: A Multi-Stage (see multi-stage programming, from an IR perspective), Python-Embedded DSL for Machine Learning (2019, optional)
- INTRODUCTION
- DSLs for differentiable programming are usually embedded in a host language (for a reference on embedded DSLs, see Hudak, 1996), and they can be roughly classified as either imperative or declarative, in the programming languages sense.
- Programming in an imperative DSL for differentiable programming is like programming in an imperative programming language such as Python: the construction and execution of primitive operations are inextricably tied, with each operation returning concrete numerical data. Performance is bottlenecked on the interpreter, and serialization of models is difficult (two drawbacks of imperative DSLs).
- declarative DSLs separate the specification of models from their execution. These “define-before-run” libraries require users to stage their models as dataflow graphs, permitting compiler optimizations and the exploitation of parallelism, and simplifying deployment, distribution, and code generation
- Motivation: An ideal DSL would offer the flexibility and accessibility of imperative execution along with the many benefits of declarative programming, without either of their costs. It is with this motivation in mind that we present TensorFlow Eager, a Python-embedded DSL for differentiable programming that lets developers interpolate (i.e., switch) between imperative and staged computations in a single package.
- Contributions
- TensorFlow Eager can be viewed as a multi-stage front-end to TensorFlow. Imperative and staged TensorFlow Eager code share a single set of primitive operations, kernels, and user-visible APIs.
- While we are not the first in the differentiable programming community to recognize the value in bridging imperative and declarative programming, we are among the first to present this line of work in the context of multi-stage programming
- RELATED WORK
- In TensorFlow Eager, users must manually stage computations, which might require refactoring code (see §4.1). An ideal framework for differentiable programming would automatically stage computations, without programmer intervention.
- One way to accomplish this is to embed the framework in a compiled procedural language and implement graph extraction and automatic differentiation as compiler rewrites; this is what, e.g., DLVM, Swift for TensorFlow, and Zygote do. Python’s flexibility makes it difficult for DSLs embedded in it to use such an approach. Some projects, like AutoGraph (Moldovan et al., 2019) do operate on Python abstract syntax trees to rewrite imperative code to code that constructs dataflow graphs, but such techniques are out of the scope of this paper.
- An alternative to staging computations as graphs (i.e., moving computation into a graph) for performance is to implement fused kernels. For example, NVIDIA provides fused cuDNN kernels for popular recurrent neural network operations that are dramatically faster than non-fused implementations (Chetlur et al., 2014). This approach, while useful, is difficult to scale.
- TensorFlow Eager is not the first Python library to offer a multi-stage programming model. JAX (Frostig et al., 2018), a tracing-JIT compiler that generates code for heterogeneous devices via XLA (The XLA team, 2017), provides a similar programming paradigm, as do MXNet Gluon and PyTorch. (Why use tracing instead of translating imperative code directly into a dataflow graph? Because Python's dynamism makes a full translation too complex.)
- DESIGN PRINCIPLES
- Privilege imperative execution
- Seamlessly embed into Python
- Stage imperative code as dataflow graphs
- EXECUTION MODEL
- Multi-stage programming
- TensorFlow Eager provides two ways of executing operations: imperatively or as part of a static dataflow graph. Both execution models have access to the same set of operations and kernels, but they differ in how they dispatch kernels.
- Imperative execution resembles a NumPy-like library for hardware-accelerated numerical computation and machine learning.
- Staged execution
- While imperative execution simplifies prototyping, the overhead of going back and forth into the Python interpreter limits its performance; representing computations as dataflow graphs before executing them not only removes this bottleneck but also allows for inter-op parallelism and optimizations like constant-folding and buffer reuse. Thus, TensorFlow Eager provides a mechanism to stage computations as dataflow graphs. In particular, we provide a decorator, function, that traces the execution of a Python function, recording all TensorFlow operations and the tensors flowing between them in a dataflow graph. function can be thought of as an opt-in, JIT compiler that generates an optimized polymorphic function for a Python function, creating concrete functions backed by dataflow graphs via a straightforward binding-time analysis at runtime.
- Invoking a callable returned by function will execute a dataflow graph instead of the corresponding Python function.
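- In current TF 2.x the decorator described here is exposed as tf.function; a minimal staging sketch (assumes TF 2.x, eager by default):

```python
import tensorflow as tf

@tf.function   # traces the Python function into a graph, once per input signature
def dense(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 4])
w = tf.Variable(tf.random.normal([4, 2]))
b = tf.Variable(tf.zeros([2]))
y1 = dense(x, w, b)   # first call: trace the Python function, build a concrete graph function
y2 = dense(x, w, b)   # subsequent calls: execute the cached dataflow graph, not the Python body
```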
- A multi-stage workflow
- Many users will find the performance of imperative execution sufficient. Purely imperative TensorFlow Eager can match the performance of graph execution when training models with sufficiently expensive kernels, like ResNet-50 (He et al., 2016) (see §6). But when imperative performance disappoints, we recommend the following multi-stage workflow (i.e., when to use the multi-stage workflow):
- Implementation. Develop, debug, and test a single-stage imperative program.
- Analysis. Using any profiling tool the user is familiar with, identify performance-critical blocks of operations, and express these blocks as staging-friendly Python functions or callable objects
- Staging. Decorate the functions identified in the previous step with @function.
- Python functions that are amenable to staging are those that, when called in a graph-building context, generate a graph that encapsulates the computation of interest. This means that if a Python function executes non-TensorFlow code, then there might be semantic discrepancies
- np.random.randn(5, 5) will return the same value every time it is called, since the particular random offset generated by NumPy is inserted into the graph as a constant; the fix is to replace the call to np.random.randn with tf.random_normal.
- a Python function f has Python side-effects (e.g., every call to it increments a global Python counter)
- Conditionals that depend on the value of tensors will need to be written using tf.cond, and while loops that depend on tensor values will need to be rewritten in terms of tf.while_loop.
- Note that staging trades off imperative execution (and therefore interactivity) and Python integration (and therefore run-time dynamism) for performance. It is up to the programmer to decide when this trade-off is acceptable and to use staging annotations judiciously. This trade-off can be diminished by using tools like AutoGraph that operate on abstract syntax trees and rewrite Python control flow to dataflow control flow
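- A sketch of the semantic discrepancies above, in TF 2.x syntax (where the paper's tf.random_normal is tf.random.normal):

```python
import numpy as np
import tensorflow as tf

@tf.function
def bad_noise():
    return np.random.randn(2, 2)       # NumPy runs only at trace time: baked in as a constant

@tf.function
def good_noise():
    return tf.random.normal([2, 2])    # a graph op: fresh random values on every call

print(np.allclose(bad_noise(), bad_noise()))    # True: the same constant every call
print(np.allclose(good_noise(), good_noise()))  # almost surely False

@tf.function(autograph=False)
def clipped(x):
    # without AutoGraph, a Python `if x > 0:` on a tensor cannot be staged; use tf.cond
    return tf.cond(x > 0, lambda: x, lambda: tf.zeros_like(x))

print(clipped(tf.constant(-3.0)).numpy())       # 0.0
```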
- Automatic differentiation
- We implement a variant of tracing-based reverse-mode automatic differentiation (Baydin et al., 2018), with a few changes to better support partially staged computation. Our implementation is similar to the implementations of Chainer (Tokui et al., 2015), and PyTorch
- The main user-visible concept in the gradient API is a tape. If a tape watches a value, operations taking this value as an input will be recorded. Tapes are composable data structures: multiple tapes can be active simultaneously.
- The tape is tightly integrated with the logic responsible for staging code. The first time a graph function is called when a tape is both active and watching one of its inputs, we build a “forward” version of this function that returns any intermediate values needed for the backward step, in addition to its named outputs. Moreover, this ensures that if a computation was staged in the forward pass, its corresponding backward pass will also be staged. Note that gradient computation is itself expressed as a function which executes primitive operations, so it is possible to stage it or not.
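- A minimal tf.GradientTape example (TF 2.x; variables are watched automatically):

```python
import tensorflow as tf

w = tf.Variable(3.0)
x = tf.constant(2.0)

with tf.GradientTape() as tape:      # records ops on watched values (variables by default)
    loss = tf.square(w * x - 1.0)
grad = tape.gradient(loss, w)        # d/dw (wx - 1)^2 = 2(wx - 1)x = 20.0
print(grad.numpy())
```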
- State
- Like TensorFlow, TensorFlow Eager keeps program state in variables, restoring a variable’s value by assigning to it from a restore operation and periodically saving it to disk by sending its value to a save operation. Variables are useful when implementing models because accessing a variable’s value automatically watches it on all active tapes
- Devices
- Imperative and staged computations use the same underlying Device abstraction.
- Staging computations
- In this section, we discuss the implementation of function in detail.
- Polymorphism.
- All Python functions are polymorphic in their inputs. In contrast, graph functions are not polymorphic: they have a fixed number of inputs, which are statically typed. We bridge this semantic gap between Python functions and graph functions by implementing a trace cache, similar to the one described in JAX (Frostig et al., 2018). The object F = function(f) maintains a cache mapping from inferred input signatures to concrete graph functions. In particular, each time F is invoked, its inputs are processed and their signature is inferred: tensors are represented as abstract types (numerical type and shape tuples), while non-tensor values are encoded by object identity. This input signature, coupled with a small amount of metadata about the surrounding program state such as the requested device, becomes a key into a cache of graph functions. A cache miss triggers a trace of f on the given inputs, while a cache hit results in the reuse of a previously created graph function. In this sense, function provides ad hoc polymorphism (Strachey, 2000) or function overloading.
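- A small TF 2.x sketch of the trace cache: retracing happens per inferred input signature (dtype/shape), while cache hits reuse the existing graph function:

```python
import tensorflow as tf

@tf.function
def double(x):
    print("tracing:", x.dtype, x.shape)   # Python side effect: runs only when a new trace is made
    return x + x

double(tf.constant(1))           # cache miss: trace for an int32 scalar
double(tf.constant(2))           # cache hit: no new trace, the graph is reused
double(tf.constant(1.0))         # new dtype (float32): second trace
double(tf.constant([1.0, 2.0]))  # new shape: third trace

# a concrete (monomorphic) graph function for one particular signature:
cf = double.get_concrete_function(tf.TensorSpec([None], tf.float32))
```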
- Escaping staged computations
- For concreteness, say that we have a Python function that we wish to stage, and say that the function is almost entirely staging-friendly with the exception of a call to a data-dependent recursive Python function that performs some operations on tensors.
- we can stage the entire function while wrapping the recursive call in a py_func, an operation that takes a Python function as an attribute and executes it imperatively, even in the context of staged code
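- In TF 2.x the corresponding escape hatch is tf.py_function; a small sketch (the collatz helper is my own example of data-dependent Python):

```python
import tensorflow as tf

def collatz_steps(n):
    # arbitrary data-dependent Python (loops/recursion), executed imperatively
    n, steps = int(n), 0
    while n != 1:
        n, steps = (n // 2 if n % 2 == 0 else 3 * n + 1), steps + 1
    return steps

@tf.function
def staged(x):
    y = x * x                                             # staged as graph ops
    steps = tf.py_function(collatz_steps, [y], tf.int64)  # escapes back into Python at runtime
    return y, steps

print(staged(tf.constant(3)))
```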
5. JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs (2019, required)
- Work
- We present JANUS, a DL framework that achieves the best of both worlds by receiving an imperative DL program as input and creating symbolic graphs of the program accordingly with speculative program context assumptions. JANUS makes environment assumptions on the program context (e.g., constant variables and branches) based on past iterations to simplify the dynamic nature of the program and transform the program into a symbolic graph. These assumptions are speculative, because the context may change during execution; an incorrect assumption results in an invalidation of a symbolic graph, in which case JANUS falls back to imperative execution to guarantee correctness. For design (Section 4.3.1) and implementation (Section 4.3.2) reasons, JANUS converts only a subset of Python programs into efficient symbolic graphs, but the rest can still be executed imperatively, ensuring full Python coverage.
- Challenges and Proposed Solution
- Challenges
- Dynamic control flow (DCF)
- Dynamic types (DT). Python is a dynamically-typed language, i.e., the type of a Python expression can only be determined at program execution time. Furthermore, various non-numerical types of Python, such as lists, dictionaries, and arbitrary class instances, are even harder to convert into elements of a dataflow graph, whose vertices usually output numerical arrays.
- Impure functions (IF)
- Related Work
- Tracing-based graph generation approaches such as PyTorch's JIT compiler (torch.jit.trace) [32], MXNet Gluon [29], and the defun [44] functionality of TensorFlow Eager [41] execute the imperative program once and convert the single execution trace directly into a dataflow graph. This fails to capture the program's dynamic semantics.
- there exist other approaches that select a less-dynamic host language and therefore succeed in capturing the wider semantics of source programs. JAX [13] limits the Python syntax and supports converting only pure-and-statically-composed functions. S4TF [43] supports Swift, losing the merit of supporting Python. Moreover, since the graph conversion occurs before actually executing the program, these approaches can miss the opportunity to further optimize the graph with the information only obtainable during the program execution. For example, always converting a Python loop into control flow operations can be sub-optimal if the loop iteration count is known to be fixed.
- Concurrent works including the AutoGraph-enabled TensorFlow defun functionality [27] and the "scripting" mode of PyTorch JIT (torch.jit.script) [32] also have limitations. AutoGraph makes users explicitly provide the necessary information, or generates incorrect or suboptimal graphs in some cases, all of which could be avoided if sufficient information existed. For example, users must explicitly specify the types of Python lists, prohibiting dynamically typed or heterogeneous elements. For another example, for dynamic control flow statements, statements with non-tensor predicates are always unrolled, which is error-prone, and statements with tensor-typed predicates are always converted to control flow operations, which can be sub-optimal. In the "scripting" mode of PyTorch JIT, users must use TorchScript, a subset of Python which does not allow variables to have dynamic types. Further graph optimizations based on runtime information are also not possible.
- Proposed Solution: Speculative Graph Generation and Execution
- JANUS makes assumptions about the program's behavior based on the runtime profiling information, and generates a symbolic graph tailored for the assumptions. If the assumptions do not hold, JANUS builds a new dataflow graph based on different assumptions. Since a DL program comprises a number of iterations of an optimization procedure, the speculative approach is a good fit, since the interpreter is likely to execute specific code blocks of the program repeatedly. (However, the data differ at each step; if sequence lengths are not unified, loop iteration counts can differ across steps.)
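- JANUS itself has no public API that I can quote; the following is a hypothetical Python sketch (all names invented) of the speculate/validate/fall-back loop described above:

```python
# Hypothetical sketch of JANUS-style speculative execution; invented names, not the real system.
class SpeculativeRunner:
    def __init__(self, imperative_step, compile_fn, profile_iters=3):
        self.imperative_step = imperative_step   # always-correct imperative Python path
        self.compile_fn = compile_fn             # builds a fast callable under given assumptions
        self.profile_iters = profile_iters
        self.seen = {}                           # assumption key -> times observed while profiling
        self.cache = {}                          # assumption key -> compiled fast callable

    def run(self, x):
        key = (type(x).__name__, getattr(x, "shape", None))  # stand-in for richer context assumptions
        fast = self.cache.get(key)
        if fast is not None:
            try:
                return fast(x)                   # fast path; its asserts validate the assumptions
            except AssertionError:
                del self.cache[key]              # an assumption failed: invalidate the graph
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.profile_iters: # enough profiling evidence: speculate and compile
            self.cache[key] = self.compile_fn(key)
        return self.imperative_step(x)           # correct-by-construction fallback

runner = SpeculativeRunner(lambda x: x * 2, lambda key: (lambda x: x * 2))
print([runner.run(i) for i in range(5)])         # early calls run imperatively, later ones hit the cache
```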
- JANUS System Design
- Given an input program, JANUS extracts the main neural network computation part, over which the automatic differentiation is performed, and starts the speculative graph generation and execution process. The given DL program is automatically transformed into a corresponding graph representation without any interactions.
- Fast Path for Common Cases
- Runtime profiling. Once JANUS receives a DL program, the program is first executed imperatively, while the Profiler gathers runtime information required for making reasonable assumptions (Figure 2 (A)). Various information is collected, including control flow decisions on conditional branches, loop iteration counts for iterative loop constructs, variable type information, non-local variables, object attributes, and so on.
- Symbolic graph generation. After a sufficient amount of information has been collected, the Speculative Graph Generator tries to convert the program into a symbolic dataflow graph with the assumptions based on the runtime information (Figure 2 (B)). To avoid making any hasty generalizations, JANUS does not begin graph generation until the executor has profiled the program for a certain number of iterations (collecting several distinct cases up front, each with its own graph, to reduce assumption failures). First, JANUS traverses the abstract syntax tree (AST) of the DL program and generates the corresponding graph elements for each AST node, along with assertion operations (which validate the context assumptions at runtime). Since JANUS targets DL programs, operations for automatic differentiation and model parameter updates are also automatically inserted if necessary. Next, the generated graph is further optimized by the post-processor, applying optimizations that were not applicable to the original imperative DL program. Finally, the optimized graph and the assumptions used to generate it are saved into the graph cache.
- Graph execution. If a graph representation with correct assumptions regarding the program context is available, the Speculative Graph Executor executes the symbolic graph (Figure 2 (D)). Note that the same graph can be reused multiple times, given that the runtime context assumption holds for future invocations.
- Accurate Path for Rare Cases
- Assumption failure. If an assumption is proven to be wrong, JANUS falls back to the imperative executor (Figure 2 (E)) and resumes runtime profiling (running further imperative iterations) to make more relaxed assumptions for subsequent executions.
- Assumptions that can be validated before actually executing the associated graph, such as type assumptions on input arguments, are checked when retrieving the graph from the graph cache (Figure 2 ①). In the unfortunate case where such an assumption is wrong, JANUS regards this as a cache miss and falls back to imperative execution. On the other hand, for assumptions that can only be validated during graph execution (Figure 2 ②), it can be erroneous to simply abort the current execution to fall back to the imperative executor, because the global state may have been changed during the current execution. To solve this issue, JANUS defers state update operations until every assumption is validated (Section 4.2.3). This way, even if an assumption turns out to be wrong during computation, no state update operation has been triggered yet and thus no state has been mutated. Knowing this, the system can safely stop the current execution. In other words, states are updated in an all-or-nothing manner.
- Imperatively executed programs (some code remains imperative)
- However, the Speculative Graph Generator does not convert every single Python feature into a symbolic graph operation (Figure 2 (C)). For example, to ensure the all-or-nothing characteristic of state updates, programs that include invisible state mutations are not converted into symbolic graphs. Some complicated Python features such as coroutines and generators are also not converted, since they do not have any clear graph representations.
- Symbolic Graph Generation
- Graph Generation Basics
- Input parameters (x and y) are converted into graph input objects that require external inputs in order to execute the graph. In the case of TensorFlow, this corresponds to PlaceholderOp. At runtime, they are filled with the actual argument values. The return value of the return statement is marked as the computation target of the graph, so that we can retrieve the value after executing the graph. Python literals such as 0.5, 1.5 and 2 are simply converted into operations that output constant values – ConstantOp for TensorFlow. The conversion of mathematical operators is done by finding the corresponding mathematical graph operations and replacing them one to one. For standard Python operators such as + and **, JANUS places the appropriate primitive calculation operations in the graph, like AddOp and PowOp for TensorFlow.
- Dynamic Features
- Dynamic Control Flow
- Basic translation rules (is this what AutoGraph does? direct parsing and conversion)
- Python's conditional statement, the if statement, can be obtained by combining switch and merge primitives.
- The iterative statements of Python, while and for, are handled by using the switch and merge primitives together with loop context primitives that hold iteration frames. TensorFlow conveniently provides EnterOp, ExitOp, and NextIterationOp [50] for creating iteration frames and passing values over them.
- for function calls, a separate graph is generated for the callee function, and a function invocation operation that points to the generated graph is inserted in the position of the function calls.
- Speculative graph generation: unrolling and inlining
- If JANUS detects that only a single particular path is taken for a certain control flow statement during profiling, JANUS presumes that the control flow decision is actually fixed. The system replaces the control flow operation with an assertion operation that double-checks the assumption for this control flow decision, and proceeds with graph generation as if the control flow statement were unrolled.
- Dynamic Type
- Basic translation rules
- it is possible to infer the types of some expressions, given the types of other expressions; for example, it is clear that the variable c in c = a + b is an integer if a and b are integers.
- As a basic rule, JANUS converts numerical Python values such as scalars, list of numbers, and NumPy [48] arrays into corresponding tensors, and converts nonnumerical values, including arbitrary class instances, into integer-typed scalar tensors which hold pointers to the corresponding Python values. Next, JANUS infers the types of other expressions that are derived from expressions covered by the basic rule
- Speculative graph generation: specialization
- Expressions whose types cannot be inferred from other expressions require a different measure. For instance, it is impossible to identify the types of input parameters for functions, or Python object attribute accesses (obj.attr) without any external clues. Similarly, inferring the return types of recursive function calls is also challenging due to the circular dependencies. To make proper assumptions about the types of such expressions, Profiler observes the types of the expressions during imperative executions. Given these context assumptions, JANUS can finish inferring the types of remaining expressions, and construct a specialized dataflow graph accordingly.
- In addition, JANUS makes further assumptions about the expressions to apply more aggressive optimizations. For numerical expressions, we can try to specialize the shape of tensors before constructing the graph. Furthermore, if a Python expression always evaluates to the same value while profiling, JANUS converts it into a constant node in the dataflow graph. With statically determined shapes or values, the graph can be further optimized, or even be compiled to the efficient machine code [45]. Figure 4 shows an example hierarchy of shapes and values that a certain tensor may have. After profiling the first few runs, JANUS finds out that even though the values of the tensor are different every time, they all have the same shape, for example (4, 8), as in the figure
- Impure Functions
- Naïve translation rules
- A trivial solution is to use TensorFlow's PyFuncOp, which can execute arbitrary Python functions as graph operations. A function for reading and updating a certain global state can be created and inserted in the appropriate position within the graph. However, this trivial approach has clear limitations. First, since only one Python function can be executed at a time due to the global interpreter lock (GIL) (the state-access code runs as Python rather than as C++ kernels), the overall performance can be reduced when multiple operations should be executed in parallel. It also complicates the fallback mechanism of JANUS.
- Optimized graph generation: deferred state update
- To make things simpler and also faster, JANUS does not mutate global states in place on the fly. JANUS instead creates local copies of global states, and mutates only the local copies during symbolic graph execution
- Figure 5 shows the symbolic dataflow graph version of the program in Figure 1, which includes the object attribute expressions (self.state) that access and mutate the global states. We add new graph operations PyGetAttrOp and PySetAttrOp to represent Python attribute reads and writes. Each of them receives an object pointer (0xb84c) and the name of the attribute ("state") as inputs, and behaves as follows: (1) the PyGetAttrOp can access the Python heap to read the state unless a corresponding local copy exists; (2) when the PySetAttrOp wants to update the attribute, the new value is inserted into the local copy instead of directly updating the Python heap; (3) further read and write operations are redirected to the local copies. Note that JANUS inserts appropriate dependencies between PyGetAttrOps and PySetAttrOps if necessary to prevent any data hazards. (4) After the graph executor finishes this run, the local copies are written back to the Python heap.
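- A hypothetical Python sketch (invented names, not JANUS code) of all-or-nothing deferred state updates via local copies:

```python
# Hypothetical illustration of deferred, all-or-nothing state updates; not JANUS's actual code.
class DeferredWrites:
    def __init__(self, obj):
        self.obj, self.local = obj, {}           # local copies instead of in-place mutation
    def get(self, name):                         # PyGetAttrOp-like read
        return self.local.get(name, getattr(self.obj, name))
    def set(self, name, value):                  # PySetAttrOp-like buffered write
        self.local[name] = value
    def commit(self):                            # write-back only after all assumptions held
        for name, value in self.local.items():
            setattr(self.obj, name, value)

class Model:
    state = 0

m = Model()
tx = DeferredWrites(m)
tx.set("state", tx.get("state") + 1)
assert m.state == 0      # the Python heap is untouched while the "graph" runs
tx.commit()
assert m.state == 1      # committed only after a successful, assumption-valid run
```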
6. Automatic differentiation in ML: Where we are and where we should be going (2019, required, Google)
- Motivation and work
- We review the current state of automatic differentiation (AD) for array programming in machine learning (ML), including the different approaches such as operator overloading (OO) and source transformation (ST) used for AD, graph-based intermediate representations for programs, and source languages.
- Based on these insights, we introduce a new graph-based intermediate representation (IR) which specifically aims to efficiently support fully-general AD for array programming.
- our IR naturally supports function calls, higher-order functions and recursion, making ML models easier to implement.
- The ability to represent closures allows us to perform AD using ST without a tape, making the resulting derivative (adjoint) program amenable to ahead-of-time optimization using tools from functional language compilers, and enabling higher-order derivatives.
- Lastly, we introduce a proof of concept compiler toolchain called Myia which uses a subset of Python as a frontend.
- Background and prior work
- Automatic differentiation
- Automatic differentiation (AD, also called algorithmic differentiation) relies on the ability to decompose a program into a series of elementary operations (primitives) for which the derivatives are known and to which the chain rule can be applied.
- In reverse mode, the chain rule is evaluated in reverse order of the original program. This is a more complex program transformation: an adjoint program (the derivative program produced by AD) must be constructed whose control flow is the reverse of the original (or primal) program. First, the primal program is run to obtain the output, and then the adjoint program is run to compute the gradient, starting from that output and going backwards. In order to do so efficiently, each statement in the adjoint must have access to the intermediate variables of the original program. Hence, the AD transformation must guarantee that the intermediate variables are not destroyed or mutated.
- Operator overloading
- OO relies on a language's ability to redefine the meaning of functions and operators (the language must support overloading). All primitives (every operator) are overloaded so that they additionally perform a tracing operation: the primitive is logged onto a 'tape', along with its inputs, to ensure that those intermediate variables are kept alive. At the end of the function's execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. (Does each forward pass produce a new tape and a new derivative computation? Otherwise, since the trace can differ between executions, the derivative would also change.)
- The main advantage of OO is that it is straightforward to implement. Because the tracing passes through function calls and control flow, the AD logic is simplified.
- A significant downside is that a separate 'derivative interpreter' (which implements the corresponding derivative for each numerical operation on the tape) is needed for the adjoint program. Having an embedded interpreter inside of the host language can complicate debugging and performance analysis. Moreover, since the program is traced and reversed at runtime, OO incurs overhead on each function call, which can be particularly problematic if the primitives are fast to execute relative to the tracing operation (this explains when OO is slow). OO also does not allow for ahead-of-time optimizations on the adjoint program.
- Source transformation
- ST explicitly constructs the adjoint program. Unlike OO, ST needs to explicitly construct a program with a reversed control flow (the adjoint program exists as real code; with OO the adjoint is produced by tracing and interpreting and is never materialized), which means that it needs transformation rules for function calls and control flow statements such as loops and conditionals. Whereas OO operates within the language, ST requires tooling such as parsers (with rules implementing the corresponding derivative transformation for each construct), tools to manipulate intermediate representations, and unparsers.
- The advantage of ST is that the AD transformation is done only once per program (once, ahead of time) and hence does not incur overhead at runtime, which makes ST performant for a wider range of workloads. Moreover, the full adjoint program is available (it exists as real code) during compilation and can therefore be optimized ahead of time.
- Although ST does not have to deal with the AD transformation at runtime, it must still ensure that intermediate variables from the forward pass are accessible by the adjoint. There are a variety of approaches to deal with this.
- Tape-based
- Closure-based
- Dataflow programming
- Popular ML frameworks such as Theano, TensorFlow, and MXNet [10] follow the dataflow programming paradigm [21] and use computation graphs as their intermediate representation. These graph representations do not have scoping or recursive function calls, which means that AD is much easier to implement with ST.
- Since the adjoint program is part of the same dataflow graph, it can access the intermediate variables from the forward pass directly from the global scope, so neither tapes nor closures are required. Additionally, a simple liveness analysis makes it easy to keep intermediate values from the primal alive only for as long as required by the adjoint computation.
- Using dataflow graphs without function calls nor scoping introduces limitations. Some of these limitations are addressed by the use of metaprogramming, but others affect the end-user (e.g., the lack of recursion and higher-order functions reduces the expressiveness of the language) and the compiler pipeline (e.g., loops cannot be represented in a principled way, which complicates their implementation). An advantage of dataflow programming is that graphs are a natural representation for distributed computing [2]. This allows different operations to be easily distributed across different hosts, devices, and cores.
- Programming languages and compilers
- Theano was one of the first software packages to refer to itself as a 'linear algebra compiler'. Since then, more frameworks started approaching the definition and execution of ML models as a compiler problem. In the case of Theano and TensorFlow, they can be considered compilers of a custom language (a DSL) which must be metaprogrammed using Python as a metalanguage. The dataflow graph is an intermediate representation which is optimized using a series of compiler passes. The resulting program is compiled (e.g., XLA) and/or interpreted (e.g., the TensorFlow/Theano runtimes). Similarly, PyTorch has started optimizing its traced Python programs using just-in-time (JIT) compiler approaches.
- More recently, projects such as DLVM [42] and Swift for TensorFlow have attempted to extend existing compiler toolchains such as LLVM and Swift’s intermediate language (SIL) with array programming and AD in order to create frameworks better suited for ML workflow needs.
- Viewing ML frameworks as compiler toolchains raises several questions. For example, on what intermediate representations is it the easiest to apply AD and aggressive optimizations? IRs with closures (is conv2d a closure?) as first-class objects will be able to use closure-based approaches to AD, whereas traditional SSA-based representations (such as SIL) would need to use a tape-based approach. And which IRs are most suitable for the heavy use of parallelism and distributed computing in ML?
- Secondly, what should the source language be? The ML community is highly invested in Python, an interpreted, dynamically typed programming language which does not have built-in support for multidimensional arrays. More recently, frameworks have suggested using Swift (DLVM) or Julia (JuliaDiff, [30]), languages with static typing and built-in multidimensional arrays respectively. On the other hand, frameworks such as Theano and TensorFlow do not have an exposed source language but can only be metaprogrammed. In the AD community, there has been strong push away from traditional imperative languages such as Fortran and C to purely functional languages, since they simplify the implementation of AD and are easier to optimize.
- Graph-based direct intermediate representation (proposed in this paper)
- Graph based. Similar to Theano or TensorFlow, programs are represented as graphs. Graphs have the advantage of being easy to optimize and flexible about execution order, as operations that do not depend on each other in the graph may be executed in any order, or in parallel. Unlike Theano and TensorFlow, however, functions may be called recursively and they are first-class objects. Functions may be passed as parameters to other functions, or returned from a function and then called. A large variety of control flow constructs, ranging from simple loops to graph traversals, can be implemented using these capabilities. Other graph frameworks tend to implement only a few of these as specialized operators, such as Theano’s scan or TensorFlow’s while, leading to an IR which is both more complex and less powerful than the general one we are proposing. A general IR does require more work to transform and optimize in a provably correct way in the context of automatic differentiation, but this work only needs to be done once.
- Purely functional. Mutation and side effects are problematic for reverse mode AD, where the backward pass requires access to the unchanged intermediate variables from the forward pass. They also interact poorly with complex optimizations because of aliasing. Restricting our language to be purely functional therefore allows us to implement more robust AD and more advanced optimizations compared to imperative languages.
- Closure representation. AD on functional languages involves storing the primal’s intermediate results into closures which are then connected together to form the adjoint. It is therefore important to have a natural representation for closures.
- Strongly typed
- IR specification
- Concretely, our representation represents a function as a graph object with a list of parameter nodes and a single return node (multiple return values are supported through tuples). A node represents a function application and has an ordered list of incoming edges. The first incoming edge is a pointer to the function to apply, and the rest point to the arguments. Constants are represented as nodes with no incoming edges and a value field. Links between nodes are bidirectional, so that graphs can be traversed in either direction. Each non-constant node belongs to a single graph. See Figure 1 for a visual representation of the IR
- Source transformation
- AD can be implemented for this IR using ST with a closure-based method.
- The backpropagators of primitives are known, whereas the backpropagators of user-defined functions can be easily constructed by calling the backpropagators of the function calls in the body in reverse order.
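- An illustrative closure-based formulation in plain Python (each primitive returns its value plus a backpropagator closure; not Myia's actual code):

```python
import math

# Each primitive returns (value, backpropagator); the backpropagator closes over
# the intermediate values it needs, so no tape is required.
def mul(a, b):
    return a * b, lambda g: (g * b, g * a)

def sin(a):
    return math.sin(a), lambda g: (g * math.cos(a),)

def f(x, y):                      # primal: f(x, y) = sin(x * y)
    t, bp_mul = mul(x, y)
    z, bp_sin = sin(t)
    def backprop(g):              # adjoint: call the backpropagators in reverse order
        (dt,) = bp_sin(g)
        dx, dy = bp_mul(dt)
        return dx, dy
    return z, backprop

z, backprop = f(2.0, 3.0)
print(z, backprop(1.0))           # gradient: (cos(6)*3, cos(6)*2)
```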
- Myia
- Myia is a functioning proof of concept of a toolchain that uses the proposed graph representation