2024/10/15

Tags: 10 15 WordCount hadoop Hadoop 2024 job apache import

Today I completed a word count using MapReduce on a virtual machine.

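Below are the detailed steps for running a complete WordCount MapReduce job on Hadoop using Java: preparing the environment, writing the code, compiling, running, and viewing the output.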

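### Step 1: Prepare the environment

1. **Install Hadoop**: make sure Hadoop is installed and configured, including its environment variables.
2. **Install Java**: make sure the Java JDK is installed and the `JAVA_HOME` environment variable is set.
3. **Configure HDFS**: start the Hadoop cluster and confirm that HDFS is running. The following commands start the cluster: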
   ```bash
   start-dfs.sh
   start-yarn.sh
   ```

### Step 2: Create the input file

1. Create the input directory on HDFS:
   ```bash
   hdfs dfs -mkdir -p /user/lty/homework/input
   ```

2. Create a text file (here named `input.txt`) and upload it to HDFS:
   ```bash
   echo -e "Hello Hadoop\nHello World" > input.txt
   hdfs dfs -put input.txt /user/lty/homework/input/
   ```

### Step 3: Write the WordCount program

1. Create a Java file named `WordCount.java` with the following content:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split("\\s+");
            for (String w : words) {
                // split() yields an empty token for blank lines or leading whitespace; skip it
                if (w.isEmpty()) {
                    continue;
                }
                word.set(w);
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts received for one word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner because summing is associative and commutative
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the HDFS input path, args[1] the HDFS output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
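To make the data flow easier to follow, here is a minimal plain-Java sketch of what the mapper and reducer do to the two-line input from step 2. It does not use Hadoop at all: `LocalWordCount`, `mapLine`, and `countWords` are hypothetical names introduced only for this illustration, and the `TreeMap` merely stands in for the sorted grouping that the shuffle phase performs on a real cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the WordCount job: it mimics the
// map -> shuffle -> reduce flow in plain Java, without any Hadoop dependency.
public class LocalWordCount {

    // Map phase: split one line on whitespace, skipping the empty token
    // that String.split produces for blank lines or leading whitespace.
    public static List<String> mapLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }
        return tokens;
    }

    // Shuffle + reduce phases: group identical words and add 1 for each occurrence.
    // TreeMap keeps the keys sorted, similar to the sorted keys a reducer receives.
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : mapLine(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The same two lines that step 2 writes to input.txt
        List<String> lines = Arrays.asList("Hello Hadoop", "Hello World");
        countWords(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Run locally, this prints `Hadoop 1`, `Hello 2`, and `World 1`, each as a tab-separated line, which is the result the real job should also produce for that input.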

### Step 4: Compile the Java program

1. On the command line, change to the directory that contains `WordCount.java`.
2. Compile the program and package it into a jar:
   ```bash
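   # Compile against the Hadoop libraries; class files go to the current directory
   javac -classpath "$(hadoop classpath)" -d . WordCount.java
   # Package only the compiled classes (including the nested mapper/reducer classes)
   jar -cvf wordcount.jar WordCount*.class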
   ```

### Step 5: Run the Hadoop job

Run the WordCount job on the Hadoop cluster with the following command:

```bash
hadoop jar wordcount.jar WordCount /user/lty/homework/input /user/lty/homework/output
```
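The command above passes two positional arguments, the input path and the output path, which `main` reads as `args[0]` and `args[1]`. If either is missing, the program as written fails with an `ArrayIndexOutOfBoundsException`. A small check like the following could be called at the top of `main`; this is only a sketch, and `JobArgs` and `usageError` are hypothetical names that are not part of the original program.

```java
// Hypothetical helper, not part of the original WordCount class.
public class JobArgs {

    // Returns null when exactly two paths are given, otherwise a usage message.
    public static String usageError(String[] args) {
        if (args.length != 2) {
            return "Usage: hadoop jar wordcount.jar WordCount <input path> <output path>";
        }
        return null;
    }

    public static void main(String[] args) {
        String error = usageError(args);
        if (error != null) {
            System.err.println(error);
            System.exit(2);
        }
        System.out.println("input=" + args[0] + ", output=" + args[1]);
    }
}
```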

### Step 6: View the output

1. List the contents of the output directory:
   ```bash
   hdfs dfs -ls /user/lty/homework/output
   ```

2. Print the contents of the output file:
   ```bash
   hdfs dfs -cat /user/lty/homework/output/part-r-00000
   ```
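Because the job does not set an output format, it uses Hadoop's default `TextOutputFormat`, which writes one `key<TAB>value` pair per line. If the counts are needed in another Java program, the lines can be parsed roughly as follows. This is a sketch: `OutputParser` and `parse` are hypothetical names, and the sample lines in `main` are what the step 2 input would be expected to produce, not captured output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading lines such as those in part-r-00000.
public class OutputParser {

    // Parses "word<TAB>count" lines into a map, preserving file order.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip blank lines or lines without a tab separator
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample lines, shaped like the expected job output
        List<String> sample = Arrays.asList("Hadoop\t1", "Hello\t2", "World\t1");
        System.out.println(parse(sample));
    }
}
```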

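### Step 7: Clean up the output (optional)

Hadoop refuses to start a job whose output directory already exists, so delete the directory before rerunning the job: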

```bash
hdfs dfs -rm -r /user/lty/homework/output
```

### Summary

The steps above describe how to write, compile, and run a simple WordCount MapReduce job on Hadoop using Java.

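However, you may run into the following problem: running `hdfs dfs -ls /user/lty/homework/output` prints nothing.

In that case, print the results with `hdfs dfs -cat /user/hadoop/homework/output/*` instead.

Note that this command reads from `/user/hadoop/...` rather than the `/user/lty/...` path used in the steps above; the directory in the command has to match the output path that was actually passed to the job.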

From: https://www.cnblogs.com/litianyu1969/p/18468752
