首页 > 其他分享 >Google三驾马车之二:MapReduce

Google三驾马车之二:MapReduce

时间:2023-10-26 23:01:22浏览次数:30  
标签:map Google key reduce MapReduce 之二 task file output

第一次接触mr还是在入门mit6.824的lab1,最近重新读了一遍原始论文,又有了一些新的想法,简单做一些记录。
作为Google分布式系统的重要组成,本篇文章核心在于map/reduce操作带来的抽象并行化,给出接口之后,编写应用程序的程序员就不需要对底层的机制做过多的处理。而在本质上,mr只是实现了一组分布式的并行框架,而实际依赖的底层分布式infrastructure还是GFS。

MapReduce: Simplified Data Processing on Large Clusters

Programming Model

K/V pairs

original task --map--> intermedicate K/V pairs --shuffle--> --reduce--> result

shuffle: to generate the same key list

the result can be multi-set, the merge work is finished by user function

Here is a word cound app from mit 6.824(golang)

// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
func Map(filename string, contents string) []mr.KeyValue {
	// function to detect word separators.
	ff := func(r rune) bool { return !unicode.IsLetter(r) }

	// split contents into an array of words.
	words := strings.FieldsFunc(contents, ff)

	kva := []mr.KeyValue{}
	for _, w := range words {
		kv := mr.KeyValue{w, "1"}
		kva = append(kva, kv)
	}
	return kva
}

// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
func Reduce(key string, values []string) string {
	// return the number of occurrences of this word.
	return strconv.Itoa(len(values))
}

map (k1,v1) → list(k2,v2)

reduce (k2,list(v2))list(v2)


working flow

  1. split input file
  2. master(coordinator) allocate map-task
  3. do map, generate inter k/v pair
  4. write inter k/v pair in R partition
  5. sort on master, reduce: RPC read inter-file from map machine
  6. final output to GFS
  7. return

After successful completion, the output of the mapreduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.

In practical grogramming, the atomic operation is important(regardless of C++ or Go or...)

fault tolerant

heartbeat: master <---> slave

Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

reduce re-execute if has not read finish from a map-machine(RPC would fail)

Task Granularity

M, R >> machine number -> load balance

common: M > R, to decrease final file number

Backup Tasks

solve straggler: When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.


Refinements

  1. Partitioning function: pre-define the number of output file: use hash
  2. pre-sort
  3. combiner, eg. in wc map-task, append the same key-value here
  4. input/output: different file type(read by line or offset), database and memory are also useful.
  5. side-effects
  6. error in code: skipping bad records(optional)
  7. sequential on local machine(help to debug)
  8. display the task status(command or gui) -> data collect and analyse
  9. counter(in lib) for sth.

Discussion

in some cases: we can also store the inter-file in the global file system,
thus we dont need re-execute the map-task if machine shutdown,
take the reduce RPC as GFS reading.

the bindwidth can be the essential bottleneck, p2p network can decrease the master's I/O pressure

MapReduce: open source version: hadoop(yahoo/apache)

middle step: shuffle, one key run (not) once????? in reduce

so we need combiner?

  • use combiner: reduce one key once
  • dont use combiner: reduce one key map partition times

but where combiner running?
map-local-disk?: local combine
master?: dont do any logic calculating work
before reduce?: shuffle

shuffle & combine could be bottleneck

task failure: restart tasks
node failure: restart tasks on new node: re-run all finished task for lose inter-file

标签:map,Google,key,reduce,MapReduce,之二,task,file,output
From: https://www.cnblogs.com/kazusarua/p/17790703.html

相关文章

  • MapReduce性能优化秘籍
    1.MapReduce跑的慢的原因MapReduce程序效率的瓶颈在于两点:计算机性能CPU、内存、磁盘、网络I/O操作数据倾斜map和reduce数设置不合理map运行时间太长,导致reduce等待过久小文件过多大量的不可分块的超大文件(例:通过gzip压缩后的文件)spill(溢写)次数过多merge(map端合并......
  • MapReduce自定义GroupingComparator
    需求:有如下订单明细数据0000001 01 222.80000002 06 722.40000001 05 25.80000003 01 222.80000003 01 33.80000002 03 522.80000002 04 122.4第一列是订单编号,第二列是商品id,第三列是商品金额,列与列之间用制表符分隔。现在需要求出每一个订单中最贵的商品。思路:将订单id和商......
  • 最好用的Android APK第三方下载站,替代Google play
    最好用的AndroidAPK第三方下载站,推荐以下7个替代Googleplay方案可通过第三方应用程序下载各种apk历史版本1、APKPure:APKPure 提供:网页、AppAPKPure是知名度很高的免费安卓应用商店,基本上大部分GooglePlay上架的软件都可以在这里找到,但最近也有被屏蔽的倾向。2、APKMirror......
  • Net 高级调试之二:CLR和Windows加载器及应用程序域介绍
    一、简介今天是Net高级调试的第二篇文章,第一篇文章记录了自己学习Net高级调试的第一步,认识一些调试工具,有了工具的倚仗,我们开始仗剑走天涯了,开始Net高级调试正式的征程了。我先说一下,我的文章,【调试测试】这部分一般分为两个部分,第一部分是要用到的所有测试代码样例,......
  • 论文:Going Deeper with Convolutions-GoogleNet
    论文名:GoingDeeperwithConvolutions深入了解卷积了解GoogleNet研究问题:研究方法:主要结论:模型:问题:行文结构梳理:......
  • 大数据mapReduce的学习
    .2MapReduce模型简介•MapReduce将复杂的、运行于大规模集群上的并行计算过程高度地抽象到了两个函数:Map和Reduce•编程容易,不需要掌握分布式并行编程细节,也可以很容易把自己的程序运行在分布式系统上,完成海量数据的计算•MapReduce采用“分而治之”策略,一个存储在分布式文件系......
  • 解决:Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf
    首次装载IMDB数据集时可能会出现的错误。解决方案:1、先将数据集单独下载下来:datasets/imdb.npz·黄健/keras-datasets-Gitee.com2、将其复制到 ~/.keras/dataset目录下:cpimdb.npz ~/.keras/dataset ......
  • [911] Read Data from Google Sheets into Pandas without the Google Sheets API (.g
    ref:ReadDatafromGoogleSheetsintoPandaswithouttheGoogleSheetsAPIimportpandasaspdsheet_id="1XqOtPkiE_Q0dfGSoyxrH730RkwrTczcRbDeJJpqRByQ"sheet_name="Sheet1"url=f"https://docs.google.com/spreadsheets/d/{sheet......
  • 软件开发项目文档系列之二如何撰写项目建设方案
    前言建设方案或解决方案是在任何新项目或计划启动之前,必须仔细准备和撰写的关键文档。这个文档扮演着项目的蓝图,将抽象的构想和目标转化为具体的可实施方案。在项目的整个生命周期中,建设方案都具有至关重要的地位,它不仅为项目的启动提供了方向,还为项目的进一步招投标、实施和管理......
  • google三驾马车之一:Bigtable解读(英文版)
    本文重点关注了系统设计相关的内容,paper后半部分的具体应用此处没有过多涉及。从个人笔记修改而来,因此为英文版本。Bigtable:ADistributedStorageSystemforStructuredDataDatamodel:notarelationaldatamodelABigtableisasparse,distributed,persistentmul......