Basic Hive Operations


First, run hive to enter the interactive Hive CLI:

[root@hadoop01 sbin]# hive
which: no hbase in (/usr/lib64/qt-3.3/bin::/export/servers/hive-1.1.0-cdh5.14.0//bin::/export/servers/hadoop-2.6.0-cdh5.14.0/bin:/export/servers/hadoop-2.6.0-cdh5.14.0/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_144/bin:/root/bin)

Logging initialized using configuration in jar:file:/export/servers/hive-1.1.0-cdh5.14.0/lib/hive-common-1.1.0-cdh5.14.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)> 

Creating a database

create database if not exists test01;
hive (default)> create database if not exists test01;
OK
Time taken: 2.256 seconds
hive (default)> 

Using a database

use test01;
hive (default)> use test01;
OK
Time taken: 0.016 seconds
hive (test01)> 

Note: where Hive stores its tables is determined by a property in hive-site.xml:

<property>
	<name>hive.metastore.warehouse.dir</name>
	<value>/user/hive/warehouse</value>
</property>

If your hive-site.xml does not contain this property, you can add it. hive-site.xml lives in the conf directory of the Hive installation.
Let's see what is under this warehouse directory:

[root@hadoop01 conf]# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxr-xr-x - root supergroup 0 2019-11-19 11:22 /user/hive/warehouse/myhive.db
drwxr-xr-x - root supergroup 0 2019-11-20 09:41 /user/hive/warehouse/test01.db
[root@hadoop01 conf]#

There are two directories ending in .db; they are the two databases created here, and test01 is the one we just made.

You can also choose where a database is stored when you create it, using the LOCATION clause.
Create a database and specify its HDFS storage location:

create database test02 location '/test02';
hive (default)> create database test02 location '/test02';
OK
Time taken: 3.922 seconds
hive (default)> 

Here we placed the test02 database under /test02. Let's check whether that worked:

[root@hadoop01 conf]# hdfs dfs -ls /
Found 25 items
-rw-r--r--   2 root          supergroup         16 2019-11-01 08:41 /a.txt
drwxr-xr-x   - root          supergroup          0 2019-11-18 11:51 /aaaaa
-rw-r--r--   2 root          supergroup         16 2019-11-01 08:38 /abc
-rwxrwxrwx   3 root          supergroup         45 2019-11-07 16:59 /bigfile.xml
drwxr-xr-x   - root          supergroup          0 2019-11-18 09:35 /ccccc
drwxr-xr-x   - root          supergroup          0 2019-11-07 16:12 /config
drwxr-xr-x   - Administrator supergroup          0 2019-11-07 10:23 /hello
-rw-r--r--   2 root          supergroup  185515842 2019-11-12 07:53 /jdk-8u144-linux-x64.tar.gz
drwxr-xr-x   - root          supergroup          0 2019-11-07 16:55 /opt
drwxr-xr-x   - root          supergroup          0 2019-11-16 09:50 /outpartition
drwxr-xr-x   - Administrator supergroup          0 2019-11-12 10:22 /output
drwxr-xr-x   - root          supergroup          0 2019-11-16 14:41 /partition_flow
drwxr-xr-x   - root          supergroup          0 2019-11-16 14:42 /partition_out
drwxr-xr-x   - root          supergroup          0 2019-11-16 14:50 /partition_out2
drwxr-xr-x   - root          supergroup          0 2019-11-18 18:10 /partition_out3
drwxr-xr-x   - root          supergroup          0 2019-11-18 18:15 /partition_out4
drwxr-xr-x   - root          supergroup          0 2019-11-15 16:34 /partitioner
drwxr-xr-x   - root          supergroup          0 2019-11-06 14:21 /system
-rw-r--r--   2 root          supergroup   26598111 2019-11-15 10:16 /test.jar
drwxrwxrwx   - root          supergroup          0 2019-11-20 09:49 /test02
-rw-r--r--   2 root          supergroup   26601944 2019-11-15 16:34 /test2.jar
drwx------   - root          supergroup          0 2019-11-19 11:12 /tmp
drwxr-xr-x   - root          supergroup          0 2019-11-19 11:22 /user
drwxr-xr-x   - root          supergroup          0 2019-11-15 09:11 /wordcount
drwxr-xr-x   - root          supergroup          0 2019-11-15 10:40 /wordcount_out
[root@hadoop01 conf]# 

As you can see, a new directory /test02 has appeared, but it is empty. That is simply because no tables have been created in the database yet; see the sketch below.
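
If a table is created inside test02, a directory named after the table should appear under /test02. A minimal sketch (the table name t1 is hypothetical):

create table test02.t1(id int);

Back in the shell, listing the directory again:

hadoop fs -ls /test02

should now show a /test02/t1 subdirectory.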

Modifying a database

You can use the alter database command to modify some of a database's properties (its DBPROPERTIES).
The core metadata of a database, however, is immutable: neither its name nor its location can be changed.

Modify the database's creation-date property:

alter database test01 set dbproperties('createtime'='200121221');

Output:

hive (default)> alter database test01 set dbproperties('createtime'='200121221');
OK
Time taken: 0.118 seconds
hive (default)> 

Viewing database details

View basic database information:

desc database test02;

Output:

hive (default)> desc database test02;
OK
db_name comment location        owner_name      owner_type      parameters
test02          hdfs://hadoop01:8020/test02     root    USER
Time taken: 0.009 seconds, Fetched: 1 row(s)
hive (default)> 

View more detailed database information:

desc database extended test02;

Output:

hive (default)> desc database extended test02;
OK
db_name comment location        owner_name      owner_type      parameters
test02          hdfs://hadoop01:8020/test02     root    USER    {createtime=200121221}
Time taken: 0.008 seconds, Fetched: 1 row(s)
hive (default)> 

Dropping a database

Drop an empty database. If the database still contains tables, the drop will fail with an error:

drop database test02;

Output:

hive (default)> drop database test02;
OK
Time taken: 0.106 seconds
hive (default)> 

Force-drop a database, deleting all of its tables along with it (a dangerous action, so double-check the database name before running it):

drop database test01 cascade;

Output:

hive (default)> drop database test01 cascade;
OK
Time taken: 0.036 seconds
hive (default)> 

Creating tables

Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
   [(col_name data_type [COMMENT col_comment], ...)] 
   [COMMENT table_comment] 
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
   [CLUSTERED BY (col_name, col_name, ...) 
   [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
   [ROW FORMAT row_format] 
   [STORED AS file_format] 
   [LOCATION hdfs_path]

The parts in square brackets [] are optional.

Notes:

1. CREATE TABLE creates a table with the given name; if a table of the same name already exists, an exception is thrown.
With IF NOT EXISTS, creation is skipped when the table already exists, so the exception is suppressed.

2. The EXTERNAL keyword creates an external table, where the path to the actual data is given at creation time via LOCATION.
When Hive creates an internal (managed) table, it moves the data into the warehouse path;

for an external table, Hive only records where the data lives and never moves it.
When a table is dropped, an internal table's metadata and data are deleted together, while an external table loses only its metadata and the data stays in place. See the sketch below.
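
A hedged sketch of an external table, assuming some tab-delimited data already sits under /data/stu_ext (both the path and the table name are made up for illustration):

create external table stu_ext(id int, name string)
row format delimited fields terminated by '\t'
location '/data/stu_ext';

-- drop table stu_ext;  would delete only the metadata; the files under /data/stu_ext remain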

3. LIKE lets you copy an existing table's structure without copying its data (demonstrated with stu4 later in this section).

4、ROW FORMAT DELIMITED 
[FIELDS TERMINATED BY char] 
[COLLECTION ITEMS TERMINATED BY char]  
[MAP KEYS TERMINATED BY char] 
[LINES TERMINATED BY char] | SERDE serde_name 
[WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

You can define a custom SerDe when creating a table, or use the built-in one.
If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is used, the built-in SerDe applies.
You specify the table's columns at creation time, and a custom SerDe can be specified alongside them;
Hive uses the SerDe to determine how each column's data is read and written. A delimiter sketch follows.
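
A minimal sketch of the delimited row format with explicit delimiters for collection and map columns (the table t_person and its columns are hypothetical):

create table t_person(
  name    string,
  hobbies array<string>,
  scores  map<string,int>
)
row format delimited
  fields terminated by ','
  collection items terminated by '_'
  map keys terminated by ':';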

5. STORED AS SEQUENCEFILE|TEXTFILE|RCFILE
If the file data is plain text, use STORED AS TEXTFILE; if the data should be stored compressed, use STORED AS SEQUENCEFILE. See the one-line sketch below.
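
For instance (stu_seq is a hypothetical table name):

create table stu_seq(id int, name string) stored as sequencefile;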

6. CLUSTERED BY
For any table or partition, Hive can further organize the data into buckets, a finer-grained division of the data.
Bucketing is organized around a chosen column:
Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket each row goes into.


There are two reasons to organize a table (or partition) into buckets:

(1) More efficient query processing.

Buckets add extra structure to a table, and Hive can exploit that structure for certain queries.

Specifically, a join between two tables that are bucketed on the same columns (columns that include the join key) can be implemented efficiently as a map-side join.

For a JOIN on a column that both tables are bucketed by, only buckets holding the same column values need to be joined against each other, which can greatly reduce the amount of data the JOIN has to process.

(2) More efficient sampling.

When working with very large datasets, being able to trial-run a query on a small slice of the data is a great convenience while developing and refining queries. A bucketing sketch follows.
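
A hedged sketch of a bucketed table and a bucket-based sample. The table name stu_buck is hypothetical, and on Hive 1.x you generally need hive.enforce.bucketing enabled before inserting so that Hive actually writes one file per bucket:

set hive.enforce.bucketing=true;

-- 4 buckets, hashed on id
create table stu_buck(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

-- populate from an existing table (stu2 from the examples later in this section)
insert into stu_buck select * from stu2;

-- reason (2) in action: scan only the first of the four buckets
select * from stu_buck tablesample(bucket 1 out of 4 on id);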

First taste: creating a Hive table

use myhive;
create table stu(id int,name string);
insert into stu values (1,"zhangsan");
select * from stu;

Output:

hive (default)> use myhive;
OK
Time taken: 0.02 seconds

hive (myhive)> create table stu(id int,name string);
OK
Time taken: 0.353 seconds

hive (myhive)> insert into stu values(1,"zhangsan");
Query ID = root_20191120114343_dbd136af-303d-446f-ae84-5f43cc443a6e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0001, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0001/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job  -kill job_1574208142489_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 11:43:34,265 Stage-1 map = 0%,  reduce = 0%
2019-11-20 11:43:37,438 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.68 sec
MapReduce Total cumulative CPU time: 3 seconds 680 msec
Ended Job = job_1574208142489_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/stu/.hive-staging_hive_2019-11-20_11-43-16_675_1948556116407410864-1/-ext-10000
Loading data to table myhive.stu
Table myhive.stu stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.68 sec   HDFS Read: 3785 HDFS Write: 286902 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 680 msec
OK
_col0   _col1
Time taken: 22.25 seconds

hive (myhive)> select * from stu;
OK
stu.id  stu.name
1       zhangsan
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive (myhive)> 

Column data types available when creating Hive tables

Category    Type        Description                                                 Literal example
Primitive   BOOLEAN     true/false                                                  TRUE
Primitive   TINYINT     1-byte signed integer, -128 to 127                          1Y
Primitive   SMALLINT    2-byte signed integer, -32768 to 32767                      1S
Primitive   INT         4-byte signed integer                                       1
Primitive   BIGINT      8-byte signed integer                                       1L
Primitive   FLOAT       4-byte single-precision float                               1.0
Primitive   DOUBLE      8-byte double-precision float                               1.0
Primitive   DECIMAL     arbitrary-precision signed decimal                          1.0
Primitive   STRING      variable-length string                                      "a", 'b'
Primitive   VARCHAR     variable-length string                                      "a", 'b'
Primitive   CHAR        fixed-length string                                         "a", 'b'
Primitive   BINARY      byte array                                                  (no literal form)
Primitive   TIMESTAMP   timestamp, millisecond precision                            122327493795
Primitive   DATE        date                                                        '2016-03-29'
Primitive   INTERVAL    time interval
Complex     ARRAY       ordered collection of same-typed elements                   array(1,2)
Complex     MAP         key-value pairs; keys must be primitive, values any type    map('a',1,'b',2)
Complex     STRUCT      a set of fields whose types may differ                      struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
Complex     UNION       one value out of a finite set of types                      create_union(1,'a',63)
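
A short hedged sketch of the complex types in action (the table t_complex and its columns are made up for illustration):

create table t_complex(
  tags  array<string>,
  props map<string,int>,
  addr  struct<city:string, zip:int>
);

-- arrays are indexed, maps are keyed, struct fields use dot notation
select tags[0], props['a'], addr.city from t_complex;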

Creating a table and specifying the field delimiter

use myhive;

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

insert into stu2 values (1,"zhangsan");
insert into stu2 values (2,"lisi");
insert into stu2 values (3,"wangwu");
[root@hadoop01 conf]# hive
which: no hbase in (/usr/lib64/qt-3.3/bin::/export/servers/hive-1.1.0-cdh5.14.0//bin::/export/servers/hadoop-2.6.0-cdh5.14.0/bin:/export/servers/hadoop-2.6.0-cdh5.14.0/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/export/servers/jdk1.8.0_144/bin:/root/bin)
Logging initialized using configuration in jar:file:/export/servers/hive-1.1.0-cdh5.14.0/lib/hive-common-1.1.0-cdh5.14.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.


hive (default)> use myhive;
OK
Time taken: 5.369 seconds

hive (myhive)> create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2'
             > ;
OK
Time taken: 0.299 seconds


hive (myhive)> insert into stu2 values(1,"zhangsan");
Query ID = root_20191120142828_d05ff958-aa4b-4356-bd8e-79bfc5f42275
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0002, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0002/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job  -kill job_1574208142489_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:05,817 Stage-1 map = 0%,  reduce = 0%
2019-11-20 14:29:06,894 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_1574208142489_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-28-59_931_8492332465920242136-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.06 sec   HDFS Read: 3684 HDFS Write: 287171 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
_col0   _col1
Time taken: 8.243 seconds


hive (myhive)> insert into stu2 values(2,"lisi");
Query ID = root_20191120142929_ac1af6e0-ee28-49dd-b733-8f2d5eceb322
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0003, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0003/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job  -kill job_1574208142489_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:25,283 Stage-1 map = 0%,  reduce = 0%
2019-11-20 14:29:26,340 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.56 sec
MapReduce Total cumulative CPU time: 3 seconds 560 msec
Ended Job = job_1574208142489_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-29-19_614_3556800740188177467-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.56 sec   HDFS Read: 3777 HDFS Write: 287839 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 560 msec
OK
_col0   _col1
Time taken: 8.035 seconds


hive (myhive)> insert into stu2 values(3,"wangwu");
Query ID = root_20191120142929_ef7231de-d5e3-4e27-adba-fa2eb4ed79c9
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0004, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0004/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job  -kill job_1574208142489_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:29:41,526 Stage-1 map = 0%,  reduce = 0%
2019-11-20 14:29:42,586 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
MapReduce Total cumulative CPU time: 1 seconds 70 msec
Ended Job = job_1574208142489_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/stu2/.hive-staging_hive_2019-11-20_14-29-36_744_787325926187923568-1/-ext-10000
Loading data to table myhive.stu2
Table myhive.stu2 stats: [numFiles=3, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.07 sec   HDFS Read: 3778 HDFS Write: 287850 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 70 msec
OK
_col0   _col1
Time taken: 7.033 seconds
hive (myhive)> 

Creating a table from a query result

create table stu3 as select * from stu2;
hive (myhive)> 
             > select * from stu2;
OK
stu2.id stu2.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.069 seconds, Fetched: 3 row(s)


hive (myhive)> create table stu3 as select * from stu2;
Query ID = root_20191120143535_defb7bd8-d614-437f-a7f7-dd867d9988c1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1574208142489_0005, Tracking URL = http://hadoop01:8088/proxy/application_1574208142489_0005/
Kill Command = /export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop job  -kill job_1574208142489_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-11-20 14:35:25,318 Stage-1 map = 0%,  reduce = 0%
2019-11-20 14:35:29,299 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.02 sec
MapReduce Total cumulative CPU time: 6 seconds 20 msec
Ended Job = job_1574208142489_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/.hive-staging_hive_2019-11-20_14-35-21_406_5003276756024536463-1/-ext-10001
Moving data to: hdfs://hadoop01:8020/user/hive/warehouse/myhive.db/stu3
Table myhive.stu3 stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 6.02 sec   HDFS Read: 3433 HDFS Write: 286001 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 20 msec
OK
stu2.id stu2.name
Time taken: 9.08 seconds
hive (myhive)> 

Let's look at what the newly created stu3 is like:

hive (myhive)> desc stu3;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
Time taken: 0.057 seconds, Fetched: 2 row(s)
hive (myhive)> 


hive (myhive)> select * from stu3;
OK
stu3.id stu3.name
1       zhangsan
2       lisi
3       wangwu
Time taken: 0.042 seconds, Fetched: 3 row(s)
hive (myhive)> 

As you can see, stu3's columns and types match the stu2 query result, and it also contains the queried rows.
This CREATE TABLE ... AS SELECT (CTAS) pattern is very handy; a variation is sketched below.
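
CTAS is not limited to select *; any query works, including projections and filters. A hedged example (stu5 is a hypothetical table name):

-- keep only rows with id > 1 and upper-case the names
create table stu5 as
select id, upper(name) as name
from stu2
where id > 1;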

Creating a table from an existing table's structure

create table stu4 like stu2;
hive (myhive)> create table stu4 like stu2;
OK
Time taken: 0.049 seconds


hive (myhive)> desc stu4;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
Time taken: 0.057 seconds, Fetched: 2 row(s)


hive (myhive)> select * from stu4;
OK
stu4.id stu4.name
Time taken: 0.027 seconds
hive (myhive)> 

The new stu4 table has the same structure as stu2 but contains no data; see the sketch below for copying rows in.
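
If you then want stu4 to hold the same rows as stu2, copy them with a plain INSERT ... SELECT, a minimal sketch:

insert into stu4 select * from stu2;

-- should now return the three rows from stu2
select * from stu4;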

Checking a table's type

desc formatted stu2;
hive (default)> use myhive;
OK
Time taken: 2.38 seconds


hive (myhive)> desc formatted stu2;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
                 
# Detailed Table Information             
Database:               myhive                   
Owner:                  root                     
CreateTime:             Wed Nov 20 14:20:41 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop01:8020/user/stu2   
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                3                   
        numRows                 3                   
        rawDataSize             24                  
        totalSize               27                  
        transient_lastDdlTime   1574231383          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe    
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.748 seconds, Fetched: 33 row(s)
hive (myhive)>
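
The Table Type field above reads MANAGED_TABLE: stu2 is an internal (managed) table even though we gave it a custom LOCATION, so dropping it would delete the files under /user/stu2 as well. Had we added the EXTERNAL keyword, for example (stu2_ext is a hypothetical name):

create external table stu2_ext(id int, name string)
row format delimited fields terminated by '\t'
location '/user/stu2_ext';

desc formatted stu2_ext;

the Table Type line should then read EXTERNAL_TABLE, and dropping the table would leave the data files in place.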