Hadoop HDFS Archive Files (archive)
archive
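HDFS stores every file as blocks, and the namenode keeps the metadata for each block in memory, so storing small files in Hadoop is very inefficient.
A large number of small files can use up most of the namenode's memory.
Note, however, that a small file does not take up more disk space than its actual contents require.
For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.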
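To put the memory cost in numbers, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of namenode heap per namespace object (each file and each block counted as one object). That figure is a commonly quoted rule of thumb, not something stated in this article or measured on this cluster, and the function names are made up for illustration.

```shell
#!/bin/sh
# ASSUMPTION: ~150 bytes of namenode heap per file object and per block object.
# This is a rule of thumb, not an exact value for any Hadoop version.
BYTES_PER_OBJECT=150
MB=$((1024 * 1024))
BLOCK_SIZE=$((128 * MB))

# Number of blocks a file of $1 bytes occupies: ceil(size / block size).
blocks_for_file() {
    echo $(( ($1 + BLOCK_SIZE - 1) / BLOCK_SIZE ))
}

# Estimated namenode heap in bytes for $1 files of $2 bytes each.
namenode_bytes() {
    blocks=$(blocks_for_file "$2")
    echo $(( $1 * (1 + blocks) * BYTES_PER_OBJECT ))
}

# Roughly the same amount of data, stored two different ways.
echo "1000000 files of 1 MB: $(namenode_bytes 1000000 "$MB") bytes of heap"
echo "7813 files of 128 MB:  $(namenode_bytes 7813 "$BLOCK_SIZE") bytes of heap"
```

Under that assumption the one million small files cost about 300 MB of namenode heap, while the same data in block-sized files costs a little over 2 MB. The disk usage is the same in both cases.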
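A Hadoop archive, or HAR file, is a more efficient file archiving facility: it packs files into HDFS blocks, reducing namenode memory usage while still allowing transparent access to the files.
Transparent access means the archived files stay readable through the har:// filesystem, so a Hadoop archive can be used directly as MapReduce input.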
Creating an archive
Note: creating an archive runs a MapReduce job, so the YARN cluster must be running before you start.
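One quick way to check this is to look for the ResourceManager process. The sketch below assumes it is run on the ResourceManager host (hadoop01 in this walkthrough) as the user that owns the daemon, because jps only lists that user's local JVMs. The function name is hypothetical.

```shell
#!/bin/sh
# Succeeds if a ResourceManager JVM is visible to jps on this host.
# ASSUMPTION: run on the ResourceManager host as the daemon's owner.
resourcemanager_running() {
    jps 2>/dev/null | grep -q 'ResourceManager'
}

# Only report when jps is available (it ships with the JDK).
if command -v jps >/dev/null 2>&1; then
    if resourcemanager_running; then
        echo "YARN ResourceManager is running"
    else
        echo "ResourceManager not found; start YARN (for example with start-yarn.sh) first"
    fi
fi
```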
Create a file named test.txt containing hahaha, then upload it to the root directory of HDFS.
[root@hadoop01 home]# hadoop fs -cat /test.txt
hahaha
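The step above can be sketched as follows. The temporary working directory and the -f (overwrite) flag are additions for illustration, not part of the original session, and the upload is skipped when no hadoop client is on the PATH.

```shell
#!/bin/sh
# Create the sample file in a scratch directory.
workdir=$(mktemp -d)
echo "hahaha" > "$workdir/test.txt"

# Upload it to the HDFS root and print it back.
# -f overwrites an existing /test.txt (an assumption; drop it to fail instead).
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -put -f "$workdir/test.txt" /test.txt
    hadoop fs -cat /test.txt
fi
```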
hadoop archive -archiveName myhar.har -p /test.txt /
[root@hadoop01 home]# hadoop archive -archiveName myhar.har -p /test.txt /
19/12/12 11:48:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.100.201:8032
19/12/12 11:48:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.100.201:8032
19/12/12 11:48:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.100.201:8032
19/12/12 11:48:55 INFO mapreduce.JobSubmitter: number of splits:1
19/12/12 11:48:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1576121879720_0001
19/12/12 11:48:56 INFO impl.YarnClientImpl: Submitted application application_1576121879720_0001
19/12/12 11:48:58 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1576121879720_0001/
19/12/12 11:48:58 INFO mapreduce.Job: Running job: job_1576121879720_0001
19/12/12 11:49:11 INFO mapreduce.Job: Job job_1576121879720_0001 running in uber mode : true
19/12/12 11:49:11 INFO mapreduce.Job: map 0% reduce 0%
19/12/12 11:49:16 INFO mapreduce.Job: map 100% reduce 100%
19/12/12 11:49:17 INFO mapreduce.Job: Job job_1576121879720_0001 completed successfully
19/12/12 11:49:18 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=166
FILE: Number of bytes written=265
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=608
HDFS: Number of bytes written=313270
HDFS: Number of read operations=49
HDFS: Number of large read operations=0
HDFS: Number of write operations=19
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=0
Total time spent by all reduces in occupied slots (ms)=0
TOTAL_LAUNCHED_UBERTASKS=2
NUM_UBER_SUBMAPS=1
NUM_UBER_SUBREDUCES=1
Total time spent by all map tasks (ms)=4033
Total time spent by all reduce tasks (ms)=615
Total vcore-milliseconds taken by all map tasks=0
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=0
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=59
Map output materialized bytes=67
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=67
Reduce input records=1
Reduce output records=0
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=198
CPU time spent (ms)=4260
Physical memory (bytes) snapshot=792809472
Virtual memory (bytes) snapshot=6170763264
Total committed heap usage (bytes)=553648128
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=167
File Output Format Counters
Bytes Written=0
[root@hadoop01 home]#
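The run above archives a single file by passing it as the parent path. The more common case is archiving one or more directories of small files, using the general form hadoop archive -archiveName NAME -p PARENT SRC... DEST, where the sources are given relative to the parent. A minimal wrapper, with a hypothetical function name and hypothetical paths in the usage comment:

```shell
#!/bin/sh
# Usage: make_har NAME PARENT DEST SRC...
# Runs:  hadoop archive -archiveName NAME -p PARENT SRC... DEST
make_har() {
    name=$1
    parent=$2
    dest=$3
    shift 3
    # Remaining arguments are source paths relative to PARENT.
    hadoop archive -archiveName "$name" -p "$parent" "$@" "$dest"
}

# Hypothetical example (these paths are not from the walkthrough):
#   make_har logs.har /user/hadoop /user/archives dir1 dir2
```

Creating an archive copies the data into the .har directory and leaves the source files in place, so the originals have to be deleted separately if the goal is to free namenode memory.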
Viewing the archive contents
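hdfs dfs -ls -R /myhar.har
hdfs dfs -ls -R har:///myhar.har
(The session below was recorded with the older -lsr form, which still works but prints a deprecation warning.)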
[root@hadoop01 home]# hadoop fs -lsr /myhar.har
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r-- 2 root supergroup 0 2019-12-12 11:49 /myhar.har/_SUCCESS
-rw-r--r-- 3 root supergroup 55 2019-12-12 11:49 /myhar.har/_index
-rw-r--r-- 3 root supergroup 14 2019-12-12 11:49 /myhar.har/_masterindex
-rw-r--r-- 3 root supergroup 7 2019-12-12 11:49 /myhar.har/part-0
[root@hadoop01 ~]# hadoop fs -lsr har:///myhar.har
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r-- 3 root supergroup 7 2019-12-12 11:46 har:///myhar.har
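The two listings differ only in the URI: the plain path shows the archive's internal files (_index, _masterindex, part-0), while the har:// path shows the archived content. A har URI is either har:///path for the default filesystem or har://scheme-host:port/path when the underlying filesystem is named explicitly. The helper below just builds those strings. Its name is hypothetical, and the host and port in the comment are assumptions about this cluster's namenode address.

```shell
#!/bin/sh
# Usage: har_uri ARCHIVE_PATH [AUTHORITY]
# AUTHORITY has the form scheme-host:port, e.g. hdfs-hadoop01:8020
# (8020 is an assumed namenode port, not taken from the session above).
har_uri() {
    archive=$1
    authority=${2:-}
    echo "har://${authority}${archive}"
}

# List the archived content when a hadoop client is available.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls -R "$(har_uri /myhar.har)"
fi
```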
Extracting the archive
hdfs dfs -mkdir -p /har
hadoop fs -cp har:///myhar.har /har/
hdfs dfs -cat /har/myhar.har
hahaha
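The -cp command above copies sequentially from a single client. For a large archive, the copy can instead be run in parallel as a MapReduce job with distcp. The sketch below assumes an archive that contains a directory. The function name and the paths in the usage comment are hypothetical, and this form was not run against the single-file myhar.har used in this walkthrough.

```shell
#!/bin/sh
# Usage: unarchive_parallel PATH_INSIDE_ARCHIVE DEST
# Copies out of a HAR with distcp, which runs as a MapReduce job.
unarchive_parallel() {
    src=$1    # e.g. /user/zoo/foo.har/dir1 (hypothetical)
    dest=$2   # e.g. /user/zoo/newdir (hypothetical)
    hadoop distcp "har://$src" "$dest"
}

# Hypothetical example:
#   unarchive_parallel /user/zoo/foo.har/dir1 /user/zoo/newdir
```

Like archive creation, distcp submits a job to YARN, so the cluster must be running.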