首页 > 其他分享 >COMPSCI 753 Algorithms for Massive Data

COMPSCI 753 Algorithms for Massive Data

时间:2024-08-11 08:56:58浏览次数:14  
标签:753 teleport task matrix COMPSCI nodes Data your points

COMPSCI 753

Algorithms for Massive Data

Assignment 1 / Semester 2, 2024

Graph Mining

General instructions and data

This assignment aims at exploring the PageRank algorithm on big real-world network data.

By working on this assignment, you will learn how to implement some of the PageRank

algorithms that we have learned in class.

Data: Download the web-Google web dataset ’web-Google-final.txt’ from the assignment

page on Canvas1 . Each line of the file represents a directed edge from a source node to a

destination node. There are N = 875713 nodes. Nodes are represented by numeric IDs

ranging from 0 to 875712.

Submission

Please submit: (1) a file (.pdf or .html) that reports the answers requested for each task,

and (2) a source code file (.py or .ipynb) that contains your code and detailed comments.

Submit this on the Canvas assignment page by 23:59 NZST, Sunday 11 August. The files

must contain your student ID, UPI and name.

Penalty Dates

The assignment will not be accepted after the last penalty date unless there are special

circumstances (e.g., sickness with certificate). Penalties will be calculated as follows as a

percentage of the marks for the assignment.

  • 23:59 NZST, Sunday 11 August – No penalty
  • 23:59 NZST, Monday 12 August – 25% penalty
  • 23:59 NZST, Tuesday 13 August – 50% penalty

1This dataset is adapted from SNAP http://snap.stanford.edu/data/web-Google.htmlTasks (100 points)

Task 1 [40 points]: Implementation of Power Iteration Algorithm.

In this task you will implement the basic version of the Power Iteration algorithm for PageR

ank. This task involves two sub-tasks, as follows:

(A) [25 points] Implement the power iteration algorithm in matrix form to calculate the

rank vector r, without teleport, using the PageRank formulation:

r (t+1) = M · r (t)

The matrix M is an adjacency matrix representing nodes and edges from your downloaded

dataset, with rows representing destination nodes and columns representing source nodes.

This matrix is sparse2 . Initialize r (0) = [1/N, . . . , 1/N] T . Let the stop criteria of your power

iteration algorithm be ||r (t+1) r (t) ||1 < 0.02 (please note the stop criteria involves the L1

norm). Spider traps and dead ends are not considered in this first task.

(B) [15 points] Run your code on the provided Google web data to calculate the rank score

for all the nodes. Report: (1) The running time of your power iteration algorithm; (2) The

number of iterations needed to stop; (3) The IDs and scores of the top-10 ranked nodes.

Task 2 [10 points]: Understanding dead-ends.

In this task, before extending your code to support dead-ends using teleport, you will run

some analysis on your current implementation from Task 1. This second task involves two

sub-tasks:

(A) [5 points] Calculate and report the number of dead-end nodes in your matrix M.

(B) [5 points] Calculate the leaked PageRank score in each iteration of Task 1 (B). The leaked

PageRank score is the total score you lose in that iteration because of dead-ends (hint: see

example on slide 2 of W1.3 lecture notes). Create a plot that shows how this leaked score

behaves as iterations progress. Explain the phenomenon you observe from this visualization.

2Consider using a sparse matrix (e.g., use scipy.sparse in Python) in your implementation, so that your

algorithm should stop within a few seconds in a basic computer. If your algorithm can’t stop within several

minutes, you may want to check your implementation.

1Task 3 [50 points]: Implementation of Power Iteration with Teleport.

In this task, you will extend your implementation from Task 1 using the teleport mechanism

to handle both dead-ends and spider traps. This task involves three sub-tasks:

(A) [25 points] Extend your PageRank code to handle both spider traps and dead ends using

the idea of teleport. In this task, your implementation will allow to teleport randomly to

any node. Code the PageRank with teleport formulation that, using the sparse matrix M,

for each iteration works in three steps (slide 8 of W1.3 lecture notes):

Step 1: Calculate the r ranks of current iteration r new (in matrix form):

r new = βM · r old

Step 2: Calculate the constant S for teleport:

S = X rj new

j

Step 3: Update r new with teleport:

r new = r new + (1 S)/N

In your implementation, use β = 0.9. Initialize r (0) = [1/N, . . . , 1/N] T . The stop criteria

should be ||r new r old||1 < 0.02.

(B) [15 points] Run your code on the provided Google web data to calculate the rank score

for all the nodes. Report: (1) The running time; (2) The number of iterations needed to

stop; (3) The IDs and scores of the top-10 ranked nodes.

(C) [10 points] Vary the teleport probability β with numbers in the set: {1, 0.9, 0.8, 0.7, 0.6}.

Report the number of iterations needed to stop for each β. Explain, in words, your findings

from this experiment.

2

标签:753,teleport,task,matrix,COMPSCI,nodes,Data,your,points
From: https://www.cnblogs.com/vx-codehelp/p/18353064

相关文章

  • Datawhale X 魔搭 AI夏令营 --生成图
    #环境搭建#在魔搭网址注册登录魔搭社区登录后使用魔搭的免费Notebook实例步骤如下:点击魔搭平台免费实例,选择方式二(GPU环境),再点击下面的启动按钮,等待几分钟。点击查看Notebook跳转到实验平台点击Terminal终端#拉取项目#终端输入gitlfsinstallgitclonehttps://www.......
  • mysql load data file 批量导入数据
    mysql大量数据导入记录工作需要将大量数据导入到mysql中,但是数据量很大且几十个文本数据,每次导入的数据量有限制,所以需要分批导入。为了快速导入记录下使用loaddatainfile方式。1.SQL入数据语句先将数据传入/var/lib/mysql/test/路径mysql>loaddatainfile"/var/li......
  • Datawhale AI夏令营第四期魔搭- AIGC文生图方向 task01笔记
      同学你是否对ai生成图方面感兴趣,同学你想不想进步,同学不要再刷抖音看有声小说里面ai美女了,来吧和我一起探索ai扩图在暑假里面卷鼠他们,重生之我在暑假学AIGC文生图校花开始倒追我现在开始(要是想看专业关于概念或者别的历程之类的,还是跳过我这篇吧,主要我本人也不太喜欢那么......
  • IDERA ER/Studio Data Architect 20.3 Crack
    ER/Studio:不仅仅是数据建模:ER/Studio可以超越基本的数据建模,提供企业数据方法并构建数据生态系统。与ER/Studio合作让我们能够进一步推动数据成为我们业务的推动力。数据发现很容易,而且从输入到跟踪的所有数据移动都足够灵活,让我们能够同时进行多个项目。高级功能前所未......
  • Datawhale X 魔搭 AI夏令营
    一.生成模型的定义二.获得生成模型的步骤三.几种模型的对比四.从跑通最简单的Baseline开始1.在魔搭社区开通阿里云PAI-DSW,创建PAI实例2.下载baseline文件gitlfsinstallgitclonehttps://www.modelscope.cn/datasets/maochase/kolors.git3.安装Data-Juicer和......
  • Datawhale AI夏令营第四期 魔搭-AIGC方向 task01笔记
    DatawhaleAI夏令营第四期魔搭-AIGC方向task01笔记提示词提示词很重要,一般写法:主体描述,细节描述,修饰词,艺术风格,艺术家举个例子【promts】Beautifulandcutegirl,smiling,16yearsold,denimjacket,gradientbackground,softcolors,softlighting,cinematicedge......
  • Datawhale AI夏令营第四期魔搭-AIGC文生图方向Task1笔记
       这是第一次参与文生图方面的训练营,可以说是收获很多吧,以前只是听说过但是上手之后又是另一个感受。首先打开终端,复制如下代码,回车gitlfsinstallgitclonehttps://www.modelscope.cn/datasets/maochase/kolors.git然后进入kolors文件夹,双击baseline,然后一步一步......
  • Datawhale AI夏令营-第四期(AIGC方向)-Task01-可图Kolors-LoRA风格故事挑战赛
    从零入门AI生图原理&实践是Datawhale2024年AI夏令营第四期的学习活动(“AIGC”方向),基于魔搭社区“可图Kolors-LoRA风格故事挑战赛”开展的实践学习。下面将分六部分介绍我的学习&实践情况。一、文生图的历程与基石首先,通过社区提供的学习资料和PPT,对文生图的历程与基石进......
  • IntelliJ IDEA 2024.2 发布:Spring Data JPA即时查询、自动补全cron表达式
    今早看到,IntelliJIDEA2024.2发布的邮件提示,看了一眼这个版本更新的新特性真的太适合我了!也许这些能力对关注DD的小伙伴也有帮助,所以搞篇博客介绍和推荐一下。下面就来一起看看这个版本中推出的几个强大新特性。SpringDataJPA的即时查询在2024.2Ultimate版本中,对Spring......
  • data.includes is not a function
    一.省流:前后端交互的数据格式不一致,前端需要的是一个数组,而后端发送的是一个对象二、情景再现前端需要一个数组,进而显示用户列表letuserList=ref([])由于后端是一星期前写的,写完后端之后就一直在弄前端,所以忘记了后端其实传的的是PageBean对象publicResult<PageBean<......