
CSCI-UA.0480-051: Parallel Computing

Posted: 2024-06-12 12:44:32

Final Exam (May 15th, 2023)

Total: 100 points

Problem 1

Suppose we have the following two DAGs. Each DAG represents a process. That is, DAG 1 is a process and DAG 2 is another process. The two DAGs are totally independent from each other. The table shows the time taken by each task in each process.

 

a. [8 points] What is the minimum time taken by each process if we execute it alone on one core, two cores, three cores, and four cores? That is, execute DAG 1 on one core, then on two cores, then on three cores, and then on four cores. Do the same for DAG 2. (Hint: You can put the result in a table with three columns: #cores, time for DAG 1, and time for DAG 2. Your table will have four rows, for one, two, three, and four cores.) Assume the cores do not have hyperthreading technology.

b. [10 points] Based on the results you calculated in part a above, which DAG benefited more from using more cores? The two DAGs look similar, yet one DAG benefited more from additional cores. What main characteristic of this DAG makes it benefit from more cores?

c. [10 points] Suppose DAG 1 is your media player and DAG 2 is your browser. You start these two processes at exactly the same time on a machine with eight cores (with no hyperthreading technology). How long does each DAG take to finish? Justify your answer.
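The DAG figure and timing table are not reproduced here, so the following is only a minimal sketch on a hypothetical task graph, not the exam's DAGs. It illustrates two quantities that bound the running time of any DAG: the total work (time on one core) and the span (longest dependency chain). The `Task` struct, `MAX_TASKS`, `MAX_PREDS`, and the function names are assumptions introduced for illustration. Note that `dag_lower_bound` returns a lower bound only; an actual schedule on a given number of cores may need more time.

```c
#include <assert.h>

#define MAX_TASKS 64   /* hypothetical limit on graph size */
#define MAX_PREDS 4    /* hypothetical limit on predecessors per task */

/* Tasks must be listed in topological order: every predecessor of
 * task i has an index smaller than i. */
typedef struct {
    int cost;               /* execution time of the task */
    int num_preds;          /* number of predecessor tasks */
    int preds[MAX_PREDS];   /* indices of the predecessor tasks */
} Task;

/* Work: sum of all task times, i.e. the time on a single core. */
int dag_work(const Task *tasks, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += tasks[i].cost;
    return sum;
}

/* Span: length of the longest dependency chain (critical path). */
int dag_span(const Task *tasks, int n) {
    assert(n >= 0 && n <= MAX_TASKS);
    int finish[MAX_TASKS];   /* earliest finish time of each task */
    int span = 0;
    for (int i = 0; i < n; i++) {
        int start = 0;       /* a task starts when its last predecessor ends */
        for (int k = 0; k < tasks[i].num_preds; k++) {
            int p = tasks[i].preds[k];
            assert(p >= 0 && p < i);
            if (finish[p] > start) start = finish[p];
        }
        finish[i] = start + tasks[i].cost;
        if (finish[i] > span) span = finish[i];
    }
    return span;
}

/* Lower bound on the time with `cores` cores:
 * max(ceil(work / cores), span). */
int dag_lower_bound(const Task *tasks, int n, int cores) {
    assert(cores > 0);
    int work = dag_work(tasks, n);
    int span = dag_span(tasks, n);
    int share = (work + cores - 1) / cores;
    return share > span ? share : span;
}
```

For a hypothetical four-task diamond with costs 2, 3, 4, and 1 (the first task feeds the two middle tasks, which both feed the last), the work is 10 and the span is 7, so the bound is 10 on one core and stays at 7 for two or more cores. Once the span dominates, adding cores no longer helps.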

Problem 2

Suppose we have the following MPI+OpenMP code. The whole program is not shown. The lines with dots (i.e., lines 2, 7, 10, and 12) contain other code not needed for this problem. Some variables are used but not declared (e.g., num, procID, n, and arrays A and B); assume they are declared in code not shown here. Assume this program runs on an eight-core processor. Each core supports four-way hyperthreading. Each core has its own private level 1 cache of 32KB and its own level 2 cache of 1MB. There is a shared level 3 cache of 8MB. The program is executed using the following command (progname is the name of the program's executable):

mpiexec -n 4 ./progname

1.   int main(int argc, char **argv){
2.       ...
3.       float finalresult, sigma;
4.       MPI_Init(&argc, &argv);
5.       MPI_Comm_size(MPI_COMM_WORLD, &num);
6.       MPI_Comm_rank(MPI_COMM_WORLD, &procID);
7.       ...
8.       sigma = procID / num;
9.       finalresult = findNum(A, B, n, sigma);
10.      ...
11.      MPI_Finalize();
12.      ...
13.  }
14.
15.  float findNum(float *A, float *B, int n, float sigma)
16.  {
17.      float result = 0, total = 0;
18.      int i;
19.      #pragma omp parallel for reduction(+:result) num_threads(4)
20.      for (i = 0; i < n; i++)
21.      {
22.          float factor;
23.          factor = A[i] * B[i];
24.          result += factor * sigma * i;
25.      }
26.      MPI_Allreduce(&result, &total, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
27.      return total;
28.  }
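One detail of line 8 is worth a small sketch. Because `num` and `procID` are passed by address to `MPI_Comm_size` and `MPI_Comm_rank`, they must be of type `int`, so `procID / num` is an integer division whose result is truncated before it is stored in the float `sigma`. Whether the exam intends this is not stated. The function names below are hypothetical and serve only to contrast the two behaviors.

```c
#include <assert.h>

/* Mirrors line 8 of the listing: both operands are int, so the division
 * truncates toward zero before the conversion to float. */
float sigma_truncated(int procID, int num) {
    assert(num > 0);
    return procID / num;
}

/* Hypothetical alternative (not in the exam code): cast one operand so
 * the division is carried out in floating point. */
float sigma_fractional(int procID, int num) {
    assert(num > 0);
    return (float)procID / num;
}
```

With four processes, ranks run from 0 to 3, so the truncating form yields 0 for every rank, while the cast form yields 0, 0.25, 0.5, and 0.75.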

a. [20 points] Fill in the right column of the following table with a short answer to the questions on the left. Please do not write any justification unless the question asks for it explicitly.

How many processes were created in the whole system?

 

How many threads are generated in total?

 

What is the maximum number of those threads (that you mentioned above) that can execute in parallel?

 

How many copies of the variable ‘sigma’ were created in the whole system?

 

How many copies of the variable ‘factor’ were created in the whole system?

 

How many threads will execute the code in line 12 (the dots ...)?

 

How many threads will execute the code in line 24?

 

How many threads will execute the code in line 26?

 

Is there a race condition for line 23?

 

Justify your answer to the above question in one line.

 

b. [8 points] Is there a situation where threads, in the above code, accessing the arrays A[] and B[] can cause the coherence protocol to start? If yes, what is the situation? If not, why not?

c. [8 points] Given the code above, how many virtual address spaces did the OS create? Justify your answer in 1-2 lines only.

d. [8 points] Is there a possibility that one process reaches line 27 before the other processes? If yes, will this cause a problem, and what is that problem? If not, why not?

e. [8 points] If one of the processes created for the above code crashed for some reason, do we risk having a deadlock? Justify.
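The counting questions in Problem 2 combine two simple products, sketched below. The function names are hypothetical, and the sketch rests on stated assumptions: all MPI processes run on the single eight-core machine described above, each process opens exactly one OpenMP team at a time, and the master thread of a process is itself one member of its team.

```c
#include <assert.h>

/* Hardware thread contexts: cores times hyperthreading ways per core. */
int hardware_contexts(int cores, int smt_ways) {
    assert(cores > 0 && smt_ways > 0);
    return cores * smt_ways;
}

/* Threads active inside the parallel region across all processes,
 * assuming the master thread counts as one member of each team. */
int team_threads(int processes, int threads_per_team) {
    assert(processes > 0 && threads_per_team > 0);
    return processes * threads_per_team;
}

/* Threads that can run simultaneously: limited by whichever is smaller,
 * the software threads or the hardware contexts. */
int max_concurrent(int software_threads, int contexts) {
    return software_threads < contexts ? software_threads : contexts;
}
```

Under these assumptions, `mpiexec -n 4` with `num_threads(4)` gives 4 x 4 = 16 team threads, and eight cores with four-way hyperthreading give 32 hardware contexts.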

Problem 3

For each question below, choose all correct answers. That is, a question may have one or more correct answers.

a. [4 points] Suppose a process wants to send data to a subset of processes. That process has the following options:

1. Split the communicator to smaller ones and use collective communication.

2. Make a series of send and receive to each one of the destination processes.

3. Use broadcast call.

4. Split each process into multiple threads and let threads communicate through shared memory.

b. [4 points] A warp in an NVIDIA GPU:

1. is transparent to the programmer.

2. consists of a maximum of 32 threads.

3. has each four threads share the same fetch and decode hardware.

4. may suffer from thread divergence (also called branch divergence).

c. [4 points] A block in CUDA can be split among two SMs.

1. This statement is always true.

2. This statement is true if the block has more threads than the number of SPs in the SM.

3. This statement is always false.

4. This statement is false only if the number of threads in the block is less than the number of SPs in the SM.

d. [4 points] The following characteristics are needed for a code to be GPU friendly.

1. computation intensive

2. independent computations

3. similar computations

4. large problem size

e. [4 points] Choose all the correct statements from the following:

1. If one program has a higher speedup than another program, for the same number of cores, it means that the program with the higher speedup also has higher efficiency than the other one.

2. MPI can run on distributed memory machines and shared memory machines.

3. OpenMP can run on distributed memory machines and shared memory machines.

4. Power consumption/dissipation is the main reason we moved from single core to multicore.
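The first statement in part e relies on the standard definitions of speedup and efficiency, which the following minimal sketch spells out. The function names are hypothetical; the formulas are the usual ones, speedup = T_serial / T_parallel and efficiency = speedup / cores.

```c
#include <assert.h>

/* Speedup: serial running time divided by parallel running time. */
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency: speedup divided by the number of cores used. */
double efficiency(double t_serial, double t_parallel, int cores) {
    assert(cores > 0);
    return speedup(t_serial, t_parallel) / cores;
}
```

For example, a program that takes 100 time units serially and 25 on eight cores has a speedup of 4 and an efficiency of 0.5.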

From: https://www.cnblogs.com/qq99515681/p/18243709
