COMP3230 Principles of Operating Systems Programming Assignment Two
Due date: November 17, 2024, at 23:59
Total 12 points – Release Candidate Version 2
Programming Exercise – Accelerate LLM Inference using Multi-Threading
Objectives
- An assessment task related to ILO 4 [Practicability] – “demonstrate knowledge in applying system software and tools available in the modern operating system for software development”.
- A learning activity related to ILO 2.
- The goals of this programming exercise are:
- to have hands-on practice in designing and developing multi-threading programs;
- to learn how to use POSIX Pthreads (and Semaphore) libraries to create, manage, and coordinate multiple threads in a shared memory environment;
- to design and implement synchronization schemes for multithreaded processes using semaphores, or mutex locks and condition variables.
Tasks
Optimizing the throughput of GPTs is an important topic. Like other neural networks, GPT and its variations rely on matrix-vector multiplication (also called the fully-connected/linear layer in deep learning) and the attention mechanism to draw important information from history tokens. Thus, to accelerate GPT and get faster responses, it is critical to have faster matrix-vector multiplication and faster multi-head attention computation. Given the non-sequential property of these two algorithms, parallel computing based on multi-threading is usually considered helpful.
Following PA1, we use Llama3, an open-source variation of GPT, for this assignment. A single-thread C implementation of the inference program, named seq.c, is provided as the starting point of your
work. You need to use POSIX Pthreads with either semaphores or (mutex locks + condition variables) to implement a multi-threading version of the inference program, which parallelizes both the matrix-vector-multiplication and multi-head attention functions. This multi-threading version shall significantly accelerate the inference task of the Large Language Model. Moreover, to reduce the system workload of creating multiple threads, you need to reuse threads by organizing them into a thread pool.
Acknowledgement: The inference framework used in this assignment is based on the open-source project llama2.c by Andrej Karpathy. The LLM used by this assignment is based on SmolLM by HuggingfaceTB. Thanks, open source!
GPT-based Large Language Model
At a high level, GPT is a machine that generates words one by one based on previous words (also known as prompts), and Figure 1a illustrates the basic workflow of GPT generating “How are you”:
Figure 1. GPT Insight. a) GPT generates text token by token, and each output is the input of the next generation. b) GPT has four major components: the Tokenizer turns a word (string) into a vector, Softmax + Sample gives the next token, and each layer has Attention and FFN (Feed-Forward Network), consisting of many Matrix-Vector-Multiplications.
Figure 1b showcases the inference workflow for each word, like “You” in “How are you”: First, words are transformed into token embeddings using the tokenizer, which is essentially a (Python) dictionary that assigns a unique vector to each word. Embedding vectors then go through multiple layers, which involve three steps.
- The first step is Multi-Head Attention, where the model first calculates attention scores based on the cosine similarity between the current word's query embedding and the embeddings of previous words (keys). It then takes a weighted average of the value embeddings to form the output.
- The second step is a feed-forward network (FFN) that applies more learnable parameters through Matrix-Vector-Multiplication.
- The third step is positional embedding, which takes into account the ordering of words in natural language by adding positional information via RoPE (not required in this assignment).
After going through all the layers, the embeddings are classified to generate a specific word as the output. This involves using a softmax function to convert the embeddings into a probability distribution, and then randomly sampling a word from the distribution.
Task 1: Matrix-Vector-Multiplication
As shown in Figure 2, Matrix-Vector-Multiplication can be illustrated as two iterations:
For Each Row i,
    For Each Column j, accumulate mat[i][j] * vec[j] to out[i]
More specifically, a sample single-thread C implementation is shown below:
void mat_vec_mul(float* out, float* vec, float* mat, int col, int row) {
    for (int i = 0; i < row; i++) { // for each row i
        float val = 0.0f;
for (int j = 0; j < col; j++) // for each column j
val += mat[i * col + j] * vec[j]; // mat[i * col + j] := mat[i][j]
out[i] = val;
}
}
Your 1st task in this assignment is to parallelize the outer iteration (at the 2nd line) by allocating blocks of rows to threads. More specifically, for a matrix with d rows and n threads working on the computation, assuming that d is divisible by n, the k-th thread will handle the rows from [k × d/n] to [(k + 1) × d/n − 1]. To illustrate, if we have a 6-row matrix with 2 threads, the 0th thread will handle rows 0 to 2, while the 1st thread will handle rows 3 to 5. If d is not divisible by n, we can assign n − 1 threads ⌈d/n⌉ rows each, while the last thread is assigned the remaining rows. More explanation of this design can be found in Appendix a. Parallelism Checking.
In this assignment, the model used is quantized, so there is a slight difference from the above C code, but the workflow is still the same.
Task 2: Multi-Head Attention
Figure 3. Attention and Multi-Head Attention Algorithm. The computation of each head is separated.
As shown in Figure 3, Multi-Head Attention can be illustrated by the following iterations:
For Each Head h,
For Each Timestep t,
For head item i, accumulate q[h][i] * keys[h][t][i] to score[h][t]
SoftMax(score)
For Each Timestep t,
For head item i, out[h][i] += score[h][t] * values[h][t][i]
More specifically, a sample single-thread C implementation is shown below:
void multi_head_attn(float* out, float* q, float* key_cache, float* value_cache,
float* att, int seq_len, int n_heads, int head_size, int kv_dim, int kv_mul) {
for (int h = 0; h < n_heads; h++) {
// iterate over all heads; PARALLELIZE THIS LOOP
float* head_q = q + h * head_size;
// query vector for this head
float* head_att = att + h * seq_len; // attention scores for this head
for (int t = 0; t <= pos; t++) {
// iterate over all timesteps up to pos, the current position (defined outside this snippet)
// get the key vector for this head and at this timestep
float* head_k = key_cache + t * kv_dim + (h / kv_mul) * head_size;
float score = 0.0f;
for (int i = 0; i < head_size; i++)
score += head_q[i] * head_k[i]; // attention score := q dot k
score /= sqrtf(head_size); // scale by sqrt(head_size)
head_att[t] = score;
// save to the attention buffer
}
softmax(head_att, pos + 1); // THREAD-SAFE softmax to normalize scores into weights
float* head_out = out + h * head_size; // out vector for this head
memset(head_out, 0, head_size * sizeof(float)); // clear buffer
for (int t = 0; t <= pos; t++) {
// get the value vector for this head and at this timestep
float* head_v = value_cache + t * kv_dim + (h / kv_mul) * head_size;
float a = head_att[t]; // attention weight for this timestep
for (int i = 0; i < head_size; i++)
head_out[i] += a * head_v[i]; // accumulate the weighted sum to head out
}
}
}
Though it looks complicated, it is worth noticing that the computation for each head is completely independent of the other heads.
Your 2nd task in this assignment is to parallelize the head iteration (the 3rd line of the above sample code) by assigning blocks of heads to different threads. More specifically, consider a model with h heads and n threads. If h is divisible by n, the k-th thread will handle heads [k × h/n] to [(k + 1) × h/n − 1]. If h is not divisible by n, we can assign n − 1 threads ⌈h/n⌉ heads each, and the last thread handles the remaining heads. For example, our model has 9 heads; with 4 threads, they will handle heads 0–2 (1st thread), 3–5 (2nd thread), and 6–8 (3rd thread), and the last thread shall handle no heads.
Note: Due to MQA, the number of heads for the query might not equal the number of heads for the key/value, but this is already handled correctly within the inner loop; just stick to n_heads.
Task 3: Thread Pool
Moreover, to reduce the performance overhead that frequent thread creation and destruction imposes on the operating system, your 3rd task is to create one set of N threads and reuse them for all mat_vec_mul() and multi_head_attn() function calls, instead of creating N threads for each mat_vec_mul() or multi_head_attn() call. One popular method, based on synchronization using a thread pool, is shown in Figure 4.
Figure 4. Reference Synchronization Workflow, consisting of 3 functions: a) INIT_THR_POOL function: create N threads, each of which falls asleep immediately; b) MAT_VEC_MUL or MULTI_HEAD_ATTN function: assign new parameters and tasks, wake up all threads to work, and wait until the threads finish before returning; c) CLOSE_THR_POOL function: wake up the threads to collect system usage and exit, then wait until all threads exit and collect the usage of the terminated threads.
More specifically, the synchronization workflow in Figure 4 consists of 4 functions and a thread function:
- init_thr_pool(int thr_count): to be called by the main thread at the beginning of the program, shall:
- Create thr_count threads
- Let threads identify themselves, i.e., thread knows I am the i-th thread
- Let the created threads go to wait immediately
- void mat_vec_mul(float* out, float* vec, float* mat, ...): API exposed to do Matrix-Vector-Multiplication; the signature must be the same as the sequential version, shall:
- Assign the mat_vec_mul computation and the parameters (out, vec, mat, ...) to threads
- Wake up threads to do calculation
- Main thread waits until all threads complete the calculation
- void multi_head_attn(float* out, float* q, ...): API exposed to do Multi-Head Attention; the signature must be the same as the sequential version, shall:
- Assign the multi_head_attn computation and the parameters (out, q, ...) to threads
- Wake up threads to do calculation
- Main thread waits until all threads complete the calculation
- close_thr_pool(): to be called at the end of the program, shall:
- Wake up all threads and inform them to collect their system usage and terminate
- Wait until all threads exit, then collect and print the system usage of all terminated threads
- Collect and print the system usage of the main thread as well as the whole program
- Release all resources related to multi-threading
- void* thr_func(void* arg): the thread function, shall:
- Immediately wait for synchronization after initialization
- Be able to be woken up by the main thread to work on the assigned computation (i.e., based on the task and parameters)
- After finishing the current workload, inform the main thread and go back to waiting
- Be able to terminate and collect its own system usage
It is worth noticing that you should create one thread pool that can properly handle both mat_vec_mul and multi_head_attn, instead of two pools (one for each). Thus, you may also need to implement:
- mat_vec_mul_task_func(int id, ...): the function executed by each thread to perform (its portion of) matrix-vector multiplication. The first argument is the thread id in the pool (not the tid in the OS); the rest of the signature is left for you to design and implement.
- multi_head_attn_task_func(int id, ...): the function executed by each thread to perform (its portion of) multi-head attention. The first argument is the thread id in the pool (not the tid in the OS); the rest of the signature is left for you to design and implement.
More details and the reasons behind this design can be found in Appendix b. Context Design.
There might be other synchronization workflows, and we are open to your ideas. However, due to the large class size, we can only accept submissions following the above design.
Specifications
- Preparing Environment
Download the start code – Download start.zip from the course's Moodle and unzip it to a folder with:
start
├── common.h         # common and helper macro definitions, read through first
├── seq.c            # start point, single-thread inference engine
├── parallel_[UID].c # [your task] template including the synchronization functions above
├── Makefile         # makefile for the project, update [UID] on line 5
└── model.h          # GPT model definition, modification not allowed
Download the model files. Two files are required: model.bin for the model weights and tokenizer.bin for the tokenizer. Please use the following instructions to download them:
make prepare # will download only if the files do not exist
# or manually download via wget; this forces a repeated download, not recommended
wget -O model.bin https://huggingface.co/huangs0/smollm/resolve/main/model.bin
wget -O tokenizer.bin https://huggingface.co/huangs0/smollm/resolve/main/tokenizer.bin
Compile and run the inference program. The initial seq.c is a complete single-thread (sequential) C inference program that can be compiled as follows:
make -B seq # -B := --always-make, force rebuild
gcc -o seq seq.c -O2 -lm # or manually compile with gcc
Please use the -lm flag to link the math library and the -O2 flag to apply level-2 optimization. Please use -O2 and no other optimization flags, for fairness.
Run the compiled seq program. The program can be executed with an integer specifying the random seed and a quoted string specifying the prompt. For simplicity, we fixed it to a single prompt.
./seq <seed> <prompt> # prompt must be quoted with "", recommend using only one prompt
# examples, more to be found in bench.txt
./seq 42 "What’s Fibonacci Number?"
./seq 42 "Why didn't my parents invite me to their wedding?"
Upon invocation, the program will configure the random seed and begin sentence generation based on the provided prompt. The program calls the forward function to run the LLM and generate the next token, and uses printf() with fflush() to print each generated word to the terminal immediately. A pair of utility time-measurement functions, time_in_ms, measure the time with millisecond accuracy. Finally, when generation is finished, the length, average speed, and system usage will be printed:
$ ./seq 42 "What is Fibonacci Number?"
User: What is Fibonacci Number?
assistant
A Fibonacci sequence is a sequence of numbers in which each number is the sum of the two
preceding numbers (1, 1, 2, 3, 5, 8, 13, ...)
......
F(n) = F(n-1) + F(n-2) where F(n) is the nth Fibonacci number.
length: 266, speed (tok/s): 21.4759
main thread – user: 12.4495 s, system: 0.0240 s
By fixing the same machine (workbench2) and the same random seed, the generated text can be exactly replicated. For example, the above sample is from a test we conducted on workbench2 with random seed 42. Moreover, the achieved tok/s represents the average number of tokens generated per second, and we use it as the metric for speed measurement. Due to fluctuating system load, the generation speed will vary around some level from run to run.
- Implement the parallel mat_vec_mul and multi_head_attn by multi-threading
Open parallel_[UID].c and rename [UID] with your UID; open the Makefile and rename [UID] with your UID (make sure there is no blank space after it). Implement the thread pool with the workflow illustrated in Figure 4 by completing the required five functions, adding appropriate global variables or more functions if needed.
For multi-threading, please use pthread.h. For synchronization, please use either semaphores or (mutex locks and condition variables). You can only modify the code between the specified // YOUR CODE STARTS HERE at line 28 and // YOUR CODE ENDS HERE at line 88.
Here are some suggestions for the implementation:
- How to assign new parameters to threads and inform threads to terminate?
- Via variables in global space. Note that thr_func can access global variables.
- Via pointers to parameters. The main thread changes the pointers to the params before waking up the worker threads.
- How to assign new tasks (mat_vec_mul or multi_head_attn)? Via function pointers.
- Once the main thread invokes mat_vec_mul() or multi_head_attn(), it should wait for all computation to complete before returning from the function.
- For collecting system usage upon finish, please use getrusage.
make -B # applicable after renaming [UID]
gcc -o parallel parallel_[UID].c -O2 -lm -lpthread # or manually via gcc
./parallel 4 42 "What is Fibonacci Number?"
User: What is Fibonacci Number?
assistant
A Fibonacci sequence is a sequence of numbers in which each number is the sum of the two
preceding numbers (1, 1, 2, 3, 5, 8, 13, ...)
......
F(n) = F(n-1) + F(n-2) where F(n) is the nth Fibonacci number.
length: 266, speed (tok/s): 38.8889
Thread 0 has completed - user: 4.9396 s, system: 0.1620 s
Thread 1 has completed - user: 4.7195 s, system: 0.1806 s
Thread 2 has completed - user: 4.6274 s, system: 0.1843 s
Thread 3 has completed - user: 5.0763 s, system: 0.1702 s
main thread - user: 0.6361 s, system: 0.6993 s
Whole process - user: 20.0198 s, system: 1.3757 s
Your implementation shall be able to be compiled by the make/gcc commands shown above. The program accepts three arguments: (1) thr_count, (2) seed, and (3) the prompt (enclosed by ""). Code for reading the arguments has been provided in parallel_[UID].c. You can use thr_count to specify the number of threads to use.
./parallel <thr_count> <seed> <prompt>
If your implementation is correct, then under the same random seed the generated text shall be the same as the sequential version, but the generation will be faster. Moreover, you should report the system usage of each thread (including the main thread) and of the whole program. For example, the sample output above is for random seed 42 on workbench2 with 4 threads.
- Measure the performance and report your findings
Benchmark your implementation (tok/s) in your environment (Linux, WSL, Docker, or GitHub Codespaces) with different numbers of threads (0, 1, 2, 4, 6, 8, 10, 12, 16) and report the metrics in a table.
Regarding system usage (user time / system time), please report the usage of the whole process instead of each thread. Then, based on the table, briefly analyze the relation between performance and the number of threads, and explain the relationship. Submit the table, your analysis, and your reasoning in a one-page PDF document.
IMPORTANT: Due to the large number of students this year, please conduct the benchmark on your own computer instead of the workbench2 server. Grading of your report is based on your analysis and reasoning, not on the speed you achieved. When you are working on workbench2, please be reminded that you have a limited maximum number of threads (128) and processes (512), so please do not conduct benchmarking on the workbench2 server.
Submission
Submit your program to the Programming # 2 submission page on the course's Moodle website. Name the program parallel_[UID].c (replace [UID] with your HKU student number). As the Moodle site may not accept source code submissions, you can compress the files into zip format before uploading. Submission checklist:
- Your source code parallel_[UID].c (must be self-contained, with no dependencies other than model.h and common.h) and the Makefile.
- Your report must include the benchmark table, your analysis and reasoning.
- Your GenAI usage report, containing the GenAI models used (if any), your prompts, and the responses.
- Please do not compress and submit the model and tokenizer binary files. (make clean_bin)
Documentation
- At the head of the submitted source code, state the:
- File name
- Student’s Name and UID
- Development Platform (Please include compiler version by gcc -v)
- Remark – describe how much you have completed (See Grading Criteria)
- Inline comments (try to be detailed so that your code could be understood by others easily)
Computer Platform to Use
For this assignment, you can develop and test your program on any Linux platform, but you must
make sure that the program can correctly execute on the workbench2 Linux server (as the tutors
will use this platform to do the grading). Your program must be written in C and successfully
compiled with gcc on the server.
It’s worth noticing that the only server for COMP3230 is workbench2.cs.hku.hk, and please do not
use any CS department server, especially academy11 and academy21, as they are reserved for
other courses. In case you cannot login to workbench2, please contact tutor(s) for help.
Grading Criteria
- Your submission will be primarily tested on the workbench2 server. Make sure that your
program can be compiled without any errors. Otherwise, we have no way to test your
submission, and you will get a zero mark.
- As the tutor will check your source code, please write your program with good readability (i.e.,
with clear code and sufficient comments) so that you will not lose marks due to confusion.
- You can only use pthread.h and semaphore.h (if needed); using other external libraries like OpenMP, BLAS, or LAPACK will lead to a 0 mark.
Detailed Grading Criteria
- Documentation: −1 point if missing
- Include necessary documentation to explain the logic of the program
- Include required student’s info at the beginning of the program
- Report: 1 point
- Measure the performance of the sequential program and your parallel program on your
computer with various No. of threads (0, 1, 2, 4, 6, 8, 10, 12, 16).
- Briefly analyze the relation between performance and No. of threads, and reason the relation
- Implementation: 11 points evaluated progressively
- (+2 points = 3 points) Achieve correct result & use multi-threading. Correct means
generated text of multi-threading and sequential are identical with same random seed.
- (+3 points = 6 points total) All in 1., and achieve >10% acceleration by multi-threading
compared with sequential under 4 threads. Acceleration measurement is based on tok/s,
acceleration must result from multi-threading instead of others like compiler (-O3), etc.
- (+2 points = 8 points total) All in 2., and reuse threads in multi-threading. Reusing threads
means the number of threads created in the whole program must be constant and equal to thr_count.
- (+3 points = 11 points total) All in 3., and mat_vec_mul and multi_head_attn use the same
thread pool. Reuse same thread pool means there’s only one pool and one thread group.
Plagiarism
Plagiarism is a very serious offense. Students should understand what constitutes plagiarism, the
consequences of committing an offense of plagiarism, and how to avoid it. Please note that we may
request you to explain to us how your program is functioning as well as we may also make use of
software tools to detect software plagiarism.
GenAI Usage Guide
Following the course syllabus, you are allowed to use Generative AI to help complete the assignment;
please clearly state the GenAI usage in the GenAI Report, including:
- Which GenAI models you used
- Your conversations, including your prompts and the responses
Appendix
- Parallelism Checking
To parallelize via multi-threading, it is critical to verify that the computation is independent, so as to avoid race conditions and the potential need for locks. More specifically, we need to pay special attention to checking for, and avoiding, writes to the same memory location while preserving correctness.
For example, the 1st iteration (the outer for-loop) meets the independence requirement: the computation of each row does not affect the others, and the only two writes are to out[i] and val. Writes to the same out[i] can be avoided by separating the values of i between threads, and val can be implemented as a stack variable in each thread, so no two threads write to the same memory.
Quite the opposite, the 2nd iteration (the inner for-loop) is not a good candidate for multi-threading, even though the only write is to val. If val is implemented as a stack variable, then each thread holds only part of the correct answer. If val is implemented as a heap variable shared among threads, then val requires a lock for every write to avoid a data race, which is obviously costly.
- Design of Context
A straightforward solution to the above problem is to let thr_func (the thread function) do the computation and exit when finished, and to let the original mat_vec_mul or multi_head_attn create the threads and wait for them to exit via pthread_join. This provides the same synchronization.
However, this implementation is problematic because each call to mat_vec_mul or multi_head_attn creates N new threads. Unfortunately, to generate a sentence, GPTs call mat_vec_mul or multi_head_attn thousands of times, so thousands of threads would be created and destroyed, which leads to significant overhead for the operating system.
Note that all calls to mat_vec_mul (or multi_head_attn) perform the same task, and the only difference between calls is the parameters. Thus, a straightforward optimization is to reuse the threads. At a high level, we can create N threads in advance, and when mat_vec_mul or multi_head_attn is called, we assign the new parameters to the thread functions and let the threads work on the new parameters.
Moreover, it is worth noting that mat_vec_mul and multi_head_attn are only valid within the context, i.e., between init_thr_pool and close_thr_pool; outside it there are no threads other than the main one (not yet created, or already destroyed). This kind of context provides efficient and robust control over local variables and has been integrated into high-level languages, such as Python's with statement.