A Big Data Starter Project

In the previous article we walked through setting up a Hadoop distributed cluster. If you don't have an environment yet, read that article first; everything in this post is done on that existing setup.

I have taken a few big data courses and read some related books, and they all open with the same kind of example. I'll be no exception: the goal of this article is to count how many times each word appears in a text file. Enough talk, let's get started.

Create the test file

Create a file named test.txt with the following content:

hello world
bye world
this is a test txt

Upload to HDFS

Upload the file we just created to HDFS by running the following commands:

# Create a test directory on HDFS
hdfs dfs -mkdir /test
# Upload the test file into the new directory
hdfs dfs -put test.txt /test
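
As a side note, the same two steps can also be done from Java code with the HDFS FileSystem API. The following is only a minimal sketch, assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath so the client knows where the NameNode is (the class name UploadTestFile is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTestFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath,
        // i.e. the cluster built in the previous article.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/test"));                 // hdfs dfs -mkdir /test
            fs.copyFromLocalFile(new Path("test.txt"),    // hdfs dfs -put test.txt /test
                                 new Path("/test"));
        }
    }
}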

Check the uploaded file (viewing it in the web UI is even more convenient):

hdfs dfs -ls /test

Output:

-rw-r--r--   3 ubuntu supergroup         41 2019-11-11 16:39 /test/test.txt

Run the program

Hadoop ships with an examples jar that includes a word count job: hadoop-mapreduce-examples-3.2.1.jar, located under /usr/local/hadoop/share/hadoop/mapreduce.

Change into that directory and run it directly:

hadoop jar hadoop-mapreduce-examples-3.2.1.jar wordcount /test/test.txt /test/output
  • /test/test.txt is the input file path
  • /test/output is the path where the results will be written (it must not already exist)
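
For context, the wordcount job inside the examples jar is essentially the classic WordCount program from the Hadoop MapReduce tutorial: the map phase splits each line into words and emits (word, 1) pairs, and the reduce phase sums the counts per word. The sketch below follows that well-known example rather than the exact source shipped in the jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /test/test.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /test/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer class as the combiner lets each map task pre-aggregate its own (word, 1) pairs before the shuffle, which is why the log below shows separate Combine input/output record counters.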

Job log:

2019-11-11 16:44:35,537 INFO client.RMProxy: Connecting to ResourceManager at master/10.101.18.21:8032
2019-11-11 16:44:36,400 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ubuntu/.staging/job_1573457581616_0001
2019-11-11 16:44:36,710 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,169 INFO input.FileInputFormat: Total input files to process : 1
2019-11-11 16:44:37,235 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,275 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,299 INFO mapreduce.JobSubmitter: number of splits:1
2019-11-11 16:44:37,563 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-11 16:44:37,696 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573457581616_0001
2019-11-11 16:44:37,700 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-11-11 16:44:39,644 INFO conf.Configuration: resource-types.xml not found
2019-11-11 16:44:39,685 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-11-11 16:44:41,475 INFO impl.YarnClientImpl: Submitted application application_1573457581616_0001
2019-11-11 16:44:42,139 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1573457581616_0001/
2019-11-11 16:44:42,148 INFO mapreduce.Job: Running job: job_1573457581616_0001
2019-11-11 16:44:52,464 INFO mapreduce.Job: Job job_1573457581616_0001 running in uber mode : false
2019-11-11 16:44:52,466 INFO mapreduce.Job: map 0% reduce 0%
2019-11-11 16:44:58,565 INFO mapreduce.Job: map 100% reduce 0%
2019-11-11 16:45:03,608 INFO mapreduce.Job: map 100% reduce 100%
2019-11-11 16:45:03,625 INFO mapreduce.Job: Job job_1573457581616_0001 completed successfully
2019-11-11 16:45:03,781 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=451455
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=138
HDFS: Number of bytes written=51
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3251
Total time spent by all reduces in occupied slots (ms)=2725
Total time spent by all map tasks (ms)=3251
Total time spent by all reduce tasks (ms)=2725
Total vcore-milliseconds taken by all map tasks=3251
Total vcore-milliseconds taken by all reduce tasks=2725
Total megabyte-milliseconds taken by all map tasks=3329024
Total megabyte-milliseconds taken by all reduce tasks=2790400
Map-Reduce Framework
Map input records=3
Map output records=9
Map output bytes=77
Map output materialized bytes=89
Input split bytes=97
Combine input records=9
Combine output records=8
Reduce input groups=8
Reduce shuffle bytes=89
Reduce input records=8
Reduce output records=8
Spilled Records=16
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=417
CPU time spent (ms)=3340
Physical memory (bytes) snapshot=671850496
Virtual memory (bytes) snapshot=5330915328
Total committed heap usage (bytes)=601882624
Peak Map Physical memory (bytes)=448380928
Peak Map Virtual memory (bytes)=2663849984
Peak Reduce Physical memory (bytes)=223469568
Peak Reduce Virtual memory (bytes)=2667065344
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=41
File Output Format Counters
Bytes Written=51

View the results

  1. List the files under the output directory
hdfs dfs -ls /test/output

Output:

-rw-r--r--   3 ubuntu supergroup          0 2019-11-11 16:45 /test/output/_SUCCESS
-rw-r--r--   3 ubuntu supergroup         51 2019-11-11 16:45 /test/output/part-r-00000

_SUCCESS is an empty marker file indicating that the job completed successfully.

Output files follow the naming convention part-m-00000 / part-r-00000:

  • m means the file was written by a map task, r means it was written by a reduce task
  • 00000 is the task (partition) number within the job, numbered from 00000

The part-r-00000 file holds the final word count results.

  2. View the contents of the result file
hdfs dfs -cat /test/output/part-r-00000

Output:

a	1
bye	1
hello	1
is	1
test	1
this	1
txt	1
world	2

The result is correct.
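
If you want to read the result from a program instead of the shell, a minimal sketch using the HDFS FileSystem API could look like the following (again assuming the cluster configuration files are on the classpath; the class name PrintWordCountResult is made up for illustration). Each line of the file is a word and its count separated by a tab:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintWordCountResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/test/output/part-r-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // each line is "<word>\t<count>"
            }
        }
    }
}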

Summary: in everyday development we only need to focus on implementing the business logic; as for resource scheduling and allocation, file reading and storage, and so on, it is enough to understand the principles behind their design.

Treat this old servant to a chicken drumstick 🍨.