A Big Data Starter Project

In the previous article we walked through setting up a Hadoop distributed cluster. If you don't have an environment yet, read that article first; everything in this post is done on that existing setup.

I have taken a few big data courses and read some related books, and they all open with the same kind of example. I'll be no exception: the goal of this article is to count how many times each word appears in a text file. Enough talk, let's get started.

Create the test file

Create a file named test.txt with the following content:

hello world
bye world
this is a test txt

Upload to HDFS

Upload the file we just created to HDFS by running the following commands:

# Create a test directory on HDFS
hdfs dfs -mkdir /test
# Upload the test file into the new directory
hdfs dfs -put test.txt /test
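
As a side note, the same two steps can also be done from Java code with the HDFS FileSystem API. The following is only a minimal sketch, assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath so the client knows where the NameNode is (the class name UploadTestFile is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTestFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath,
        // i.e. the cluster built in the previous article.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/test"));                 // hdfs dfs -mkdir /test
            fs.copyFromLocalFile(new Path("test.txt"),    // hdfs dfs -put test.txt /test
                                 new Path("/test"));
        }
    }
}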

Check the uploaded file (viewing it in the web UI is even more convenient):

hdfs dfs -ls /test

Output:

-rw-r--r--   3 ubuntu supergroup         41 2019-11-11 16:39 /test/test.txt

Run the program

Hadoop ships with an examples jar that includes a word count job: hadoop-mapreduce-examples-3.2.1.jar, located under /usr/local/hadoop/share/hadoop/mapreduce.

Change into that directory and run it directly:

hadoop jar hadoop-mapreduce-examples-3.2.1.jar wordcount /test/test.txt /test/output
  • /test/test.txt is the input file path
  • /test/output is the path where the results will be written (it must not already exist)
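
For context, the wordcount job inside the examples jar is essentially the classic WordCount program from the Hadoop MapReduce tutorial: the map phase splits each line into words and emits (word, 1) pairs, and the reduce phase sums the counts per word. The sketch below follows that well-known example rather than the exact source shipped in the jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /test/test.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /test/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer class as the combiner lets each map task pre-aggregate its own (word, 1) pairs before the shuffle, which is why the log below shows separate Combine input/output record counters.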

Job log:

2019-11-11 16:44:35,537 INFO client.RMProxy: Connecting to ResourceManager at master/10.101.18.21:8032
2019-11-11 16:44:36,400 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ubuntu/.staging/job_1573457581616_0001
2019-11-11 16:44:36,710 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,169 INFO input.FileInputFormat: Total input files to process : 1
2019-11-11 16:44:37,235 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,275 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,299 INFO mapreduce.JobSubmitter: number of splits:1
2019-11-11 16:44:37,563 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,696 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573457581616_0001
2019-11-11 16:44:37,700 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-11-11 16:44:39,644 INFO conf.Configuration: resource-types.xml not found
2019-11-11 16:44:39,685 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-11-11 16:44:41,475 INFO impl.YarnClientImpl: Submitted application application_1573457581616_0001
2019-11-11 16:44:42,139 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1573457581616_0001/
2019-11-11 16:44:42,148 INFO mapreduce.Job: Running job: job_1573457581616_0001
2019-11-11 16:44:52,464 INFO mapreduce.Job: Job job_1573457581616_0001 running in uber mode : false
2019-11-11 16:44:52,466 INFO mapreduce.Job: map 0% reduce 0%
2019-11-11 16:44:58,565 INFO mapreduce.Job: map 100% reduce 0%
2019-11-11 16:45:03,608 INFO mapreduce.Job: map 100% reduce 100%
2019-11-11 16:45:03,625 INFO mapreduce.Job: Job job_1573457581616_0001 completed successfully
2019-11-11 16:45:03,781 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=451455
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=138
HDFS: Number of bytes written=51
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3251
Total time spent by all reduces in occupied slots (ms)=2725
Total time spent by all map tasks (ms)=3251
Total time spent by all reduce tasks (ms)=2725
Total vcore-milliseconds taken by all map tasks=3251
Total vcore-milliseconds taken by all reduce tasks=2725
Total megabyte-milliseconds taken by all map tasks=3329024
Total megabyte-milliseconds taken by all reduce tasks=2790400
Map-Reduce Framework
Map input records=3
Map output records=9
Map output bytes=77
Map output materialized bytes=89
Input split bytes=97
Combine input records=9
Combine output records=8
Reduce input groups=8
Reduce shuffle bytes=89
Reduce input records=8
Reduce output records=8
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=417
CPU time spent (ms)=3340
Physical memory (bytes) snapshot=671850496
Virtual memory (bytes) snapshot=5330915328
Total committed heap usage (bytes)=601882624
Peak Map Physical memory (bytes)=448380928
Peak Map Virtual memory (bytes)=2663849984
Peak Reduce Physical memory (bytes)=223469568
Peak Reduce Virtual memory (bytes)=2667065344
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=41
File Output Format Counters
Bytes Written=51

View the results

  1. List the files under the output directory
hdfs dfs -ls /test/output

Output:

-rw-r--r--   3 ubuntu supergroup          0 2019-11-11 16:45 /test/output/_SUCCESS
-rw-r--r--   3 ubuntu supergroup         51 2019-11-11 16:45 /test/output/part-r-00000

_SUCCESS is an empty marker file indicating that the job completed successfully.

Output files follow the naming convention part-m-00000 / part-r-00000:

  • m means the file was written by a map task, r means it was written by a reduce task
  • 00000 is the task (partition) number within the job, numbered from 00000

The part-r-00000 file holds the final word count results.

  2. View the contents of the result file
hdfs dfs -cat /test/output/part-r-00000

Output:

a	1
bye	1
hello	1
is	1
test	1
this	1
txt	1
world	2

The result is correct.
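
If you want to read the result from a program instead of the shell, a minimal sketch using the HDFS FileSystem API could look like the following (again assuming the cluster configuration files are on the classpath; the class name PrintWordCountResult is made up for illustration). Each line of the file is a word and its count separated by a tab:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintWordCountResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/test/output/part-r-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // each line is "<word>\t<count>"
            }
        }
    }
}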

Summary: in everyday development we only need to focus on implementing the business logic; as for resource scheduling and allocation, file reading and storage, and so on, it is enough to understand the principles behind their design.

Treat this old servant to a chicken drumstick 🍨.