Basic Hive Operations
First, run the hive command to enter the interactive Hive CLI:
[root@hadoop01 sbin]# hive
which: no hbase in (/usr/lib64/qt-3.3/bin::/export/servers/hive-1.1.0-cdh5.14.0//bin::/export/servers/hadoop-2.6.0-cdh5.14.0/bin:/export/servers/hadoop-2.6.0-cdh5.14.0/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_144/bin:/root/bin)
Logging initialized using configuration in jar:file:/export/servers/hive-1.1.0-cdh5.14.0/lib/hive-common-1.1.0-cdh5.14.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)>
Create a database
create database if not exists test01;
hive (default)> create database if not exists test01;
OK
Time taken: 2.256 seconds
hive (default)>
Use a database
use test01;
hive (default)> use test01;
OK
Time taken: 0.016 seconds
hive (test01)>
Note: where Hive stores its tables is specified by a property in hive-site.xml:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
If your hive-site.xml does not have this property, you can add it.
hive-site.xml is in the conf directory under the Hive installation directory.
Let's see what is in this warehouse directory:
[root@hadoop01 conf]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2019-11-19 11:22 /user/hive/warehouse/myhive.db
drwxr-xr-x - root supergroup 0 2019-11-20 09:41 /user/hive/warehouse/test01.db
[root@hadoop01 conf]#
There are two directories ending in .db; they are the two databases that have been created, and test01 is the one we just made.
You can also choose where a database is stored when you create it, using the LOCATION clause:
Create a database and specify its HDFS storage location
create database test02 location '/test02';
hive (default)> create database test02 location '/test02';
OK
Time taken: 3.922 seconds
hive (default)>
Here we placed the test02 database under /test02. Let's verify that it worked:
[root@hadoop01 conf]# hdfs dfs -ls /
Found 25 items
-rw-r--r-- 2 root supergroup 16 2019-11-01 08:41 /a.txt
drwxr-xr-x - root supergroup 0 2019-11-18 11:51 /aaaaa
-rw-r--r-- 2 root supergroup 16 2019-11-01 08:38 /abc
-rwxrwxrwx 3 root supergroup 45 2019-11-07 16:59 /bigfile.xml
drwxr-xr-x - root supergroup 0 2019-11-18 09:35 /ccccc
drwxr-xr-x - root supergroup 0 2019-11-07 16:12 /config
drwxr-xr-x - Administrator supergroup 0 2019-11-07 10:23 /hello
-rw-r--r-- 2 root supergroup 185515842 2019-11-12 07:53 /jdk-8u144-linux-x64.tar.gz
drwxr-xr-x - root supergroup 0 2019-11-07 16:55 /opt
drwxr-xr-x - root supergroup 0 2019-11-16 09:50 /outpartition
drwxr-xr-x - Administrator supergroup 0 2019-11-12 10:22 /output
drwxr-xr-x - root supergroup 0 2019-11-16 14:41 /partition_flow
drwxr-xr-x - root supergroup 0 2019-11-16 14:42 /partition_out
drwxr-xr-x - root supergroup 0 2019-11-16 14:50 /partition_out2
drwxr-xr-x - root supergroup 0 2019-11-18 18:10 /partition_out3
drwxr-xr-x - root supergroup 0 2019-11-18 18:15 /partition_out4
drwxr-xr-x - root supergroup 0 2019-11-15 16:34 /partitioner
drwxr-xr-x - root supergroup 0 2019-11-06 14:21 /system
-rw-r--r-- 2 root supergroup 26598111 2019-11-15 10:16 /test.jar
drwxrwxrwx - root supergroup 0 2019-11-20 09:49 /test02
-rw-r--r-- 2 root supergroup 26601944 2019-11-15 16:34 /test2.jar
drwx------ - root supergroup 0 2019-11-19 11:12 /tmp
drwxr-xr-x - root supergroup 0 2019-11-19 11:22 /user
drwxr-xr-x - root supergroup 0 2019-11-15 09:11 /wordcount
drwxr-xr-x - root supergroup 0 2019-11-15 10:40 /wordcount_out
[root@hadoop01 conf]#
As you can see, there is a new directory /test02, but it is empty. That is because no tables have been created in the database yet.
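As a quick follow-up check (a sketch, not part of the session above; the table name t1 is made up), creating a table inside test02 should make a matching subdirectory appear:
create table test02.t1(id int);
hadoop fs -ls /test02    # should now list /test02/t1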
Alter a database
You can use the alter database command to change some of a database's properties (DBPROPERTIES).
However, a database's core metadata cannot be changed, including its name and its location.
Set a 'createtime' property on the database
alter database test01 set dbproperties('createtime'='200121221');
Result:
hive (default)> alter database test01 set dbproperties('createtime'='200121221');
OK
Time taken: 0.118 seconds
hive (default)>
View database details
View a database's basic information
desc database test02;
Result:
hive (default)> desc database test02;
OK
db_name comment location owner_name owner_type parameters
test02 hdfs://hadoop01:8020/test02 root USER
Time taken: 0.009 seconds, Fetched: 1 row(s)
hive (default)>
View a database's extended details
desc database extended test02;
Result:
hive (default)> desc database extended test02;
OK
db_name comment location owner_name owner_type parameters
test02 hdfs://hadoop01:8020/test02 root USER {createtime=200121221}
Time taken: 0.008 seconds, Fetched: 1 row(s)
hive (default)>
Drop a database
This drops an empty database; if the database still contains tables, it will fail with an error
drop database test02;
Result:
hive (default)> drop database test02;
OK
Time taken: 0.106 seconds
hive (default)>
Force-drop a database, deleting the tables inside it as well
drop database myhive cascade; -- do NOT run this one (dangerous: it would wipe myhive and its data)
Result (demonstrated on the empty test01 instead):
hive (default)> drop database test01 cascade;
OK
Time taken: 0.036 seconds
hive (default)>
Create tables
Syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Everything in square brackets [] is optional.
Notes:
1. CREATE TABLE creates a table with the given name; if a table with the same name already exists, an exception is thrown. IF NOT EXISTS checks for an existing table of that name first:
with IF NOT EXISTS, an existing table is simply left as-is and not created again, so the exception is suppressed, as the sketch below shows.
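A minimal sketch (stu_demo is a made-up table name): running the same statement twice creates the table only once, and the second run returns OK instead of throwing:
create table if not exists stu_demo(id int, name string);
create table if not exists stu_demo(id int, name string);  -- table already exists, statement is a no-op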
2. The EXTERNAL keyword creates an external table; along with the table definition you specify a path (LOCATION) pointing at the actual data.
When Hive creates an internal (managed) table, it moves the data into the path the warehouse points at;
when it creates an external table, it only records where the data lives and does not move or change it.
When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the data stays put; a sketch follows.
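A minimal sketch of an external table (the path /data/stu_ext and the table name are assumptions): Hive only records the location, and dropping the table later removes the metadata while leaving the files under /data/stu_ext untouched:
create external table stu_ext(id int, name string)
row format delimited fields terminated by '\t'
location '/data/stu_ext';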
3. LIKE lets you copy an existing table's structure without copying its data.
4. ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char] | SERDE serde_name
[WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
When creating a table, the user can define a custom SerDe or use one of the built-in SerDes.
If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used.
The user also specifies the table's columns at creation time; specifying a SerDe alongside them
tells Hive how to map the data in each row onto those columns. A delimiter sketch follows.
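As an illustration of the delimited form (a sketch; the table and columns are made up), the built-in SerDe can also split array elements and map entries once the extra delimiters are declared:
create table person_info(
  name string,
  hobbies array&lt;string&gt;,
  address map&lt;string,string&gt;
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
-- a data line like: tom\tread,run\tprovince:beijing,city:haidian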
5. STORED AS SEQUENCEFILE|TEXTFILE|RCFILE
If the file data is plain text, use STORED AS TEXTFILE. If the data needs compression, use STORED AS SEQUENCEFILE.
6. CLUSTERED BY
For each table or partition, Hive can further organize the data into buckets; buckets divide the data at a finer granularity than tables or partitions.
Hive buckets on a specific column: it hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record belongs to.
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing.
Buckets add extra structure to a table, and Hive can exploit this structure for certain queries.
In particular, a join of two tables that are bucketed on the same columns (which include the join columns) can be implemented efficiently as a map-side join.
For a JOIN on a column both tables share, if both tables are bucketed on it, only the buckets holding the same column values need to be joined against each other, which can greatly reduce the amount of data the JOIN has to process.
(2) More efficient sampling.
When working with large data sets, being able to try a query on a small slice of the data is very convenient while developing and refining queries.
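A minimal sketch of a bucketed table (names are assumptions; on Hive 1.x, set hive.enforce.bucketing=true before inserting so the bucket files are actually produced):
set hive.enforce.bucketing=true;
create table stu_buck(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';
-- sampling then only has to read 1 of the 4 buckets:
select * from stu_buck tablesample(bucket 1 out of 4 on id);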
A first taste of Hive table creation
use myhive;
create table stu(id int,name string);
insert into stu values (1,"zhangsan");
select * from stu;
Result:
hive (default)> use myhive;
OK
Time taken: 0.02 seconds
hive (myhive)> create table stu(id int,name string);
OK
Time taken: 0.353 seconds
hive (myhive)> insert into stu values(1,"zhangsan");
Query ID = root_20191120114343_dbd136af-303d-446f-ae84-5f43cc443a6e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0001, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0001/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job -kill job_1574208142489_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 11:43:34,265 Stage-1 map = 0%, reduce = 0%
2019-11-20 11:43:37,438 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.68 sec
MapReduce Total cumulative CPU time: 3 seconds 680 msec
Ended Job = job_1574208142489_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/stu/.hive-staging_hive_2019-11-20_11-43-16_675_1948556116407410864-1/-ext-10000
Loading data to table myhive.stu
Table myhive.stu stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.68 sec HDFS Read: 3785 HDFS Write: 286902 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 680 msec
OK
_col0 _col1
Time taken: 22.25 seconds
hive (myhive)> select * from stu;
OK
stu.id stu.name
1 zhangsan
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive (myhive)>
Column data types available when creating Hive tables

| Category | Type | Description | Literal example |
| --- | --- | --- | --- |
| Primitive | BOOLEAN | true/false | TRUE |
| | TINYINT | 1-byte signed integer, -128~127 | 1Y |
| | SMALLINT | 2-byte signed integer, -32768~32767 | 1S |
| | INT | 4-byte signed integer | 1 |
| | BIGINT | 8-byte signed integer | 1L |
| | FLOAT | 4-byte single-precision floating point | 1.0 |
| | DOUBLE | 8-byte double-precision floating point | 1.0 |
| | DECIMAL | arbitrary-precision signed decimal | 1.0 |
| | STRING | string, variable length | 'a', "b" |
| | VARCHAR | variable-length string | 'a', "b" |
| | CHAR | fixed-length string | 'a', "b" |
| | BINARY | byte array | no literal form |
| | TIMESTAMP | timestamp, millisecond precision | 122327493795 |
| | DATE | date | '2016-03-29' |
| | INTERVAL | time interval | |
| Complex | ARRAY | ordered collection of same-typed elements | array(1,2) |
| | MAP | key-value pairs; keys must be a primitive type, values may be any type | map('a',1,'b',2) |
| | STRUCT | a collection of fields whose types may differ | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0) |
| | UNION | a single value from a bounded set of types | create_union(1,'a',63) |
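The complex-type literals above can be tried out directly; a quick sketch that evaluates them against the one-row stu table from earlier:
select array(1,2), map('a',1,'b',2), struct('1',1,1.0) from stu limit 1;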
Create a table and specify the field delimiter
use myhive;
create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';
insert into stu2 values (1,"zhangsan");
insert into stu2 values (2,"lisi");
insert into stu2 values (3,"wangwu");
[root@hadoop01 conf]# hive
which: no hbase in (/usr/lib64/qt-3.3/bin::/export/servers/hive-1.1.0-cdh5.14.0//bin::/export/servers/hadoop-2.6.0-cdh5.14.0/bin:/export/servers/hadoop-2.6.0-cdh5.14.0/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_144/bin:/root/bin)
Logging initialized using configuration in jar:file:/export/servers/hive-1.1.0-cdh5.14.0/lib/hive-common-1.1.0-cdh5.14.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)> use myhive;
OK
Time taken: 5.369 seconds
hive (myhive)> create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2'
> ;
OK
Time taken: 0.299 seconds
hive (myhive)> insert into stu2 values(1,"zhangsan");
Query ID = root_20191120142828_d05ff958-aa4b-4356-bd8e-79bfc5f42275
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0002, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0002/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job -kill job_1574208142489_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:05,817 Stage-1 map = 0%, reduce = 0%
2019-11-20 14:29:06,894 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_1574208142489_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-28-59_931_8492332465920242136-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.06 sec HDFS Read: 3684 HDFS Write: 287171 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
_col0 _col1
Time taken: 8.243 seconds
hive (myhive)> insert into stu2 values(2,"lisi");
Query ID = root_20191120142929_ac1af6e0-ee28-49dd-b733-8f2d5eceb322
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0003, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0003/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job -kill job_1574208142489_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:25,283 Stage-1 map = 0%, reduce = 0%
2019-11-20 14:29:26,340 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.56 sec
MapReduce Total cumulative CPU time: 3 seconds 560 msec
Ended Job = job_1574208142489_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-29-19_614_3556800740188177467-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.56 sec HDFS Read: 3777 HDFS Write: 287839 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 560 msec
OK
_col0 _col1
Time taken: 8.035 seconds
hive (myhive)> insert into stu2 values(3,"wangwu");
Query ID = root_20191120142929_ef7231de-d5e3-4e27-adba-fa2eb4ed79c9
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0004, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0004/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job -kill job_1574208142489_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:41,526 Stage-1 map = 0%, reduce = 0%
2019-11-20 14:29:42,586 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.07 sec
MapReduce Total cumulative CPU time: 1 seconds 70 msec
Ended Job = job_1574208142489_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-29-36_744_787325926187923568-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=3, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.07 sec HDFS Read: 3778 HDFS Write: 287850 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 70 msec
OK
_col0 _col1
Time taken: 7.033 seconds
hive (myhive)>
Create a table from query results
create table stu3 as select * from stu2;
hive (myhive)>
> select * from stu2;
OK
stu2.id stu2.name
1 zhangsan
2 lisi
3 wangwu
Time taken: 0.069 seconds, Fetched: 3 row(s)
hive (myhive)> create table stu3 as select * from stu2;
Query ID = root_20191120143535_defb7bd8-d614-437f-a7f7-dd867d9988c1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0005, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0005/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job -kill job_1574208142489_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:35:25,318 Stage-1 map = 0%, reduce = 0%
2019-11-20 14:35:29,299 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.02 sec
MapReduce Total cumulative CPU time: 6 seconds 20 msec
Ended Job = job_1574208142489_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/.hive-staging_hive_2019-11-20_14-35-21_406_5003276756024536463-1/-ext-10001
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/stu3
Table myhive.stu3 stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 6.02 sec HDFS Read: 3433 HDFS Write: 286001 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 20 msec
OK
stu2.id stu2.name
Time taken: 9.08 seconds
hive (myhive)>
Let's see what is special about the newly created stu3
hive (myhive)> desc stu3;
OK
col_name data_type comment
id int
name string
Time taken: 0.057 seconds, Fetched: 2 row(s)
hive (myhive)>
hive (myhive)> select * from stu3;
OK
stu3.id stu3.name
1 zhangsan
2 lisi
3 wangwu
Time taken: 0.042 seconds, Fetched: 3 row(s)
hive (myhive)>
As you can see, stu3 took its columns and their types from the stu2 query result, and its contents are the query results themselves.
This CREATE TABLE ... AS SELECT (CTAS) feature is very handy; a variation is sketched below.
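CTAS can also reshape the data as it copies it; a sketch (stu5 is a made-up name) that filters rows and stores the new table in a different file format:
create table stu5 stored as sequencefile as select id, name from stu2 where id > 1;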
Create a table from an existing table's structure
create table stu4 like stu2;
hive (myhive)> create table stu4 like stu2;
OK
Time taken: 0.049 seconds
hive (myhive)> desc stu4;
OK
col_name data_type comment
id int
name string
Time taken: 0.057 seconds, Fetched: 2 row(s)
hive (myhive)> select * from stu4;
OK
stu4.id stu4.name
Time taken: 0.027 seconds
hive (myhive)>
The new table stu4 has the same structure as stu2 but no contents.
View a table's type
desc formatted stu2;
hive (default)> use myhive;
OK
Time taken: 2.38 seconds
hive (myhive)> desc formatted stu2;
OK
col_name data_type comment
# col_name data_type comment
id int
name string
# Detailed Table Information
Database: myhive
Owner: root
CreateTime: Wed Nov 20 14:20:41 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop01:8020/user/stu2
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 3
numRows 3
rawDataSize 24
totalSize 27
transient_lastDdlTime 1574231383
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.748 seconds, Fetched: 33 row(s)
hive (myhive)>
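Table Type: MANAGED_TABLE shows that stu2 is an internal (managed) table, even though we gave it a custom LOCATION. For comparison (a sketch; stu_ext is the hypothetical external table from the syntax notes above), an external table would report EXTERNAL_TABLE here:
desc formatted stu_ext;   -- Table Type: EXTERNAL_TABLE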