
Hadoop File Splitting: the Source Code



TextInputFormat

Hadoop's file-splitting rules:

1. Each input file is split on its own; a split never spans files.

2. When the remaining bytes divided by the split size is at most 1.1, they form a single split; otherwise a full split is cut off and the check repeats (see the sketch after this list).

3. Each split is handled by one MapTask, and one MapTask represents one unit of parallelism.
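
A minimal standalone sketch of the 1.1 rule (SPLIT_SLOP in the source below); the file and split sizes are made-up example values, not anything read from a real cluster:

// Standalone illustration of the SPLIT_SLOP (1.1) rule used by getSplits.
public class SplitSlopSketch {
    static final double SPLIT_SLOP = 1.1;

    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        // cut full splits while the remainder exceeds 1.1 split sizes
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // the tail (at most 1.1 * splitSize) becomes one last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(130 * mb, 128 * mb)); // 1 -- 130/128 <= 1.1
        System.out.println(countSplits(260 * mb, 128 * mb)); // 2 -- 128 MB + 132 MB
    }
}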

Default split-size settings

By default minSize is 1 and maxSize is Long.MAX_VALUE, so computeSplitSize returns the block size: splits default to one HDFS block (128 MB on Hadoop 2.x and later).

Core source code of split generation

public List<InputSplit> getSplits(JobContext job) throws IOException {
  StopWatch sw = new StopWatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);

  boolean ignoreDirs = !getInputDirRecursive(job)
    && job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
  // iterate over the input files; each file is split independently
  for (FileStatus file: files) {
    if (ignoreDirs && file.isDirectory()) {
      continue;
    }
    // get the file path and its length
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      // check whether this file format can be split at all
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // split size is derived from the cluster's block size and the
        // configured min/max values; see computeSplitSize below
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        // keep carving off full splits while the remainder is more than
        // SPLIT_SLOP (1.1) times the split size; a remainder of at most
        // 1.1 split sizes becomes a single final split
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        if (LOG.isDebugEnabled()) {
          // Log only if the file is big enough to be split
          if (length > Math.min(file.getBlockSize(), minSize)) {
            LOG.debug("File is not splittable so no parallelization "
                + "is possible: " + file.getPath());
          }
        }
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
  }
  return splits;
}
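
To check how many splits (and therefore MapTasks) a given input would produce, getSplits can be called directly from a driver. A minimal sketch; /data/input is a placeholder path and the job is otherwise unconfigured:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // placeholder input directory -- replace with a real path
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Job implements JobContext, so it can be passed to getSplits directly
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Number of splits (= number of MapTasks): " + splits.size());
        for (InputSplit split : splits) {
            // FileSplit.toString() prints the file path, offset and length
            System.out.println(split);
        }
    }
}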

Modifying the split rule

// take the smaller of maxSize and blockSize, then the larger of that and minSize;
// the split size can therefore be tuned by adjusting minSize and maxSize
protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

Summary of the process

1. Compute minSize and maxSize from the configuration, then splitSize = max(minSize, min(maxSize, blockSize)); with the defaults this equals the block size.

2. Iterate over the input files; each file is split independently.

3. While the remaining bytes exceed 1.1 x splitSize, cut a split of splitSize; the final remainder (at most 1.1 x splitSize) becomes the last split.

4. Each split is read by one MapTask, so the number of splits determines the map-side parallelism.

CombineTextInputFormat

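CombineTextInputFormat targets the many-small-files case: instead of one split (and one MapTask) per file, several small files are packed into a single split, capped by a configurable maximum size. A minimal driver sketch; the 4 MB limit is just an example value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());

        // use CombineTextInputFormat instead of the default TextInputFormat
        job.setInputFormatClass(CombineTextInputFormat.class);

        // cap each combined split at 4 MB; small files are packed together
        // until this limit is reached
        CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);

        // ... set mapper/reducer and input/output paths, then submit as usual
    }
}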


From: https://blog.51cto.com/u_15063934/6033048
