[大数据] hive高阶

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> hive高阶—HQL -> 正文阅读

[大数据]hive高阶—HQL

四、HQL（hive SQL）
（二）DML数据操纵语言
1.数据导入:
（1）直接上传数据

--临时表
没有表文件目录，无法上传
 
--内部表
hadoop fs -put emp.txt /user/hive/warehouse/ningxw.db/managed_emp01/aaa.txt
--查询表中数据
select * from managed_emp01;

--外部表
hadoop fs -put emp.txt /user/hive/warehouse/ningxw.db/external_emp01
select * from external_emp01;

--分区表
--单级分区
hadoop fs -put emp.txt /user/hive/warehouse/ningxw.db/partition_emp/age=20/
select * from partition_emp;
--多级分区
hadoop fs -put emp.txt /user/hive/warehouse/ningxw.db/partition_emp2/month=5/day=20
select * from partition_emp2;
注意：如果创建完分区表后，手动创建分区字段目录（mkdir），再上传数据后，
需要修复表才可以用
msck repair table tabName;   --修复表的语句

--分桶表
hadoop fs -put emp.txt /user/hive/warehouse/ningxw.db/buck_emp
select * from buck_emp;
发现在HDFS上的文件目录中，并没有分成多个文件，也就是并没有按照
指定字段进行分桶

（2）向表中装载数据（Load）

语法：load data [local] inpath 'path/target.log' [overwrite] into table tab [partition (partcol1=val1,…)];

说明：
load data:表示加载数据
local：表示从(服务端启动的节点)本地加载数据到hive表；否则从HDFS加载数据到hive表
inpath：表示加载数据的路径
overwrite：表示覆盖表中已有数据，否则表示追加
into table：表示加载到哪张表
tab：表示具体的表名
partition：表示上传到指定分区


示例：
--为临时表加载数据
load data local inpath "/home/ningxw/tmp/emp.txt" into table temporary_emp;
select * from temporary_emp;

--为内部表加载数据
load data local inpath "/home/ningxw/tmp/emp.txt" into table managed_emp01;
load data local inpath "/home/ningxw/tmp/emp.txt" overwrite into table managed_emp01;
select * from managed_emp01;

--为外部表加载数据
load data local inpath "/home/ningxw/tmp/emp.txt" into table external_emp01;
select * from external_emp01;

--为分区表加载数据（单分区）
load data local inpath "/home/ningxw/tmp/emp.txt" into table partition_emp  
partition(age=18);
select * from partition_emp;

--为分区表加载数据（多级分区）
load data local inpath "/home/ningxw/tmp/emp.txt" into table partition_emp2 partition(month="05",day="20");

--为分桶表加载数据
load data local inpath "/home/ningxw/tmp/emp.txt" into table buck_emp; 
hadoop fs -put emp.txt /test/
load data inpath "/test/emp.txt" into table buck_emp;
select * from buck_emp;

（3）通过查询语句向表中插入数据（Insert）
(3.1)insert values

语法：
insert into table tab [partition (partcol1[=val1], partcol2[=val2] ...)] values (value [, value ...])
示例：

--临时表
insert into table temporary_emp(id,name) values(1,"hadoop"); 
select * from temporary_emp;

--内部表
insert into table managed_emp01(id,name) values(1,"hadoop"); 
select * from managed_emp01;

--外部表
insert into table external_emp01(id,name) values(1,"hadoop"); 
select * from external_emp01;

--分区表
insert into table partition_emp(id,name,age) values(1,"hadoop",18); 
select * from partition_emp;
insert into table partition_emp2(id,name,month,day) values(1,"hadoop",5,20); 
select * from partition_emp2;
insert into table partition_emp partition(age=20) values(2,"saprk");  --单级分区
insert into table partition_emp2 partition(month=7,day=13) values(2,"spark",5); --多级分区
 --分桶表
 insert into table buck_emp(id,name) values(1,"hadoop"); 
 select * from buck_emp;

(3.2)insert select

语法：
insert overwrite table tablename [partition (partcol1=val1, partcol2=val2 ...)]    select_statement1 from from_statement;
insert into table tablename [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement;

示例：

--临时表
insert overwrite table temporary_emp select * from emp02;
select * from temporary_emp;

--内部表
insert into table managed_emp01 select * from emp02;
select * from managed_emp01;

--外部表
insert into table external_emp01 select * from emp02;
select * from external_emp01;

--分区表
insert into table partition_emp partition(age=45) select * from emp02;
select * from partition_emp;

insert into table partition_emp2 partition(month="3",day="18") 
select * from emp02;  --会报错
[X 列对不上]
insert into table partition_emp2 partition(month="3",day="18") 
select * from partition_emp;  --此时不会报错
select * from partition_emp2;  --过来查看数据

--分桶表
Hive3.x之前版本，需要开启以下参数：
set hive.enforce.bucketing=true;--开启分桶操作
set hive.enforce.sorting=true;--开启排序操作
insert into table buck_emp select * from partition_emp;
select * from  buck_emp;

(3.3)多重插入

语法：
from from_statement
insert overwrite table tab [partition (partcol1=val1, partcol2=val2 ...)] 
select_statement1
[insert overwrite table tab2 [partition ...] select_statement2]
[insert into table tab2 [partition ...] select_statement2];

说明：当我们需要将一张表中的部分数据分别插入多张表时，就可以使用多重插入
示例：
将partition_emp表中的部分数据分别替换、插入到managed_emp01、external_emp01表中:
from partition_emp
insert overwrite table managed_emp01 
select id,name
insert into table external_emp01
select id,name;

(3.4)动态分区

语法：
insert overwrite table tablename partition (partcol1[=val1], partcol2[=val2] ...)
select_statement from from_statement;
insert into table tablename partition (partcol1[=val1], partcol2[=val2] ...)  
select_statement from from_statement;
说明：
往hive分区表中插入数据时，如果需要创建的分区很多，比如以表中某个字段
进行分区存储，则需要复制粘贴修改很多sql去执行，效率低。
因为hive是批处理系统，所以hive提供了一个动态分区功能
示例：
--设置非严格模式
set hive.exec.dynamic.partition.mode=nonstrict;
--创建动态分区表
create table dynamic_emp(
id int) 
partitioned by (name string) 
row format delimited fields terminated by '\t';
--动态插入数据
insert overwrite table dynamic_emp partition(name)
select id,name from managed_emp01;
--查看全部分区
show partitions dynamic_emp;

(3.5)CTE表达式

语法：
CTE(Common Table Expression) 公用表表达式,它是在单个语句的执行范围内
定义的临时结果集,只在查询期间有效,它可以自引用，也可在同一查询中多次引用，
实现了代码段的重复利用.
CTE最大的好处是提升SQL的可读性,可以更加优雅简洁的方式实现递归等复杂的查询

用法示例：

--选择语句中的CTE
with t1 as (select name from emp01 where emp01.id=2)
select * from t1;

-- from风格
with t1 as (select name from emp01 where emp01.id=2)
from t1   select *;

-- 串联式
with t1 as ( select * from emp01 where emp01.id=2),  
t2 as ( select name from t1)
select * from (select name from t2) haha;

-- union案例
with t1 as (select * from emp01 where emp01.id=2),
     t2 as (select * from emp01 where emp01.id=3)
     select * from t1 union all select * from t2;

-- 插入数据
create table emp03 like emp01;
with t1 as ( select * from emp01 where emp01.id=2)
from t1
insert overwrite table emp03  --对表emp03进行插入数据
select *;

select * from emp03;   --查询新插入数据的表


-- 创建表
create table emp05 as
with t1 as ( select * from emp01 where emp01.id=2)
select * from t1;

select * from emp05; --查看创建的表emp05中有没有数据

（4）创建表时通过Location指定加载数据路径

create table if not exists emp4(
id int, name string
)
row format delimited fields terminated by '\t'
location '/test';

（5）Import数据到指定Hive表中

create table if not exists emp5(
id int, name string
)row format delimited fields terminated by '\t';
import table emp5 from '/test/';
注意：先将数据导出，再进行导入

2、数据导出
（1）Insert导出
（1.1）将查询的结果导出到本地

insert overwrite local directory '/home/ningxw/tmp/export/emp'
select * from managed_emp01;

(1.2)将查询的结果格式化导出到本地

insert overwrite local directory '/home/offcn/tmp/export/emp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from managed_emp01;

(1.3)将查询的结果导出到HDFS上(没有local)

insert overwrite directory   '/test/export/emp' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from managed_emp01;

（2）Hadoop命令导出到本地

dfs -get /user/hive/warehouse/offcn.db/emp02/emp.txt	
/home/offcn/tmp/export/emp2/emp.txt;

（3）Hive Shell 命令导出

hive -e 'select * from offcn.emp02;' > /home/offcn/tmp/export/emp.txt;

（4）Export导出到HDFS上

export table offcn.emp02 to
'/tmp/export/emp';
export和import主要用于两个Hadoop平台集群之间Hive表迁移。

（三）DQL数据查询语言

数据准备：
create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';


create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string, 
sal double, 
comm double,
deptno int)
row format delimited fields terminated by '\t';

load data local inpath '/home/offcn/tmp/dept.txt' into table dept;
load data local inpath '/home/offcn/tmp/emp.txt' into table emp;
dept.txt  文件
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700
emp.txt
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		null
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10

1、基本查询（Select…From）

全表查询
select * from emp;
选择特定列查询

2、条件查询

查询出薪水大于1000的所有员工
select * from emp where sal >1000;
关系运算符:

在这里插入图片描述

(2.1)Between、IN、is null

询出薪水等于5000的所有员工
 select * from emp where sal =5000;
查询工资在500到1000的员工信息
 select * from emp where sal between 500 and 1000;
查询comm为空的所有员工信息
 select * from emp where comm is null;
查询工资是1500或5000的员工信息
 select * from emp where sal IN (1500, 5000);

（2.2）Like和RLike

选择条件可以包含字符或数字:
% 代表零个或多个字符(任意个字符)。
_ 代表一个字符。
RLIKE子句：
RLIKE子句是Hive中这个功能的一个扩展，其可以通过Java的正则表达式
这个更强大的语言来指定匹配条件
查找以”S”开头的员工信息
 select * from emp where ename LIKE 'S%';
 查找第二个数值为”S”的薪水的员工信息
  select * from emp where ename LIKE '_S%';
 查找名字中含有“I”的员工信息
 select * from emp where ename RLIKE '[I]';

(2.3)逻辑运算符（And/Or/Not）

操作符	含义
AND	逻辑并
OR	逻辑或
NOT	逻辑否
查询薪水大于1000，部门是30
select * from emp where sal>1000 and deptno=30;
查询薪水大于1000，或者部门是30
select * from emp where sal>1000 or deptno=30;
查询除了20部门和30部门以外的员工信息
select * from emp where deptno not IN(30, 20);

（2.4）limit语句

LIMIT子句用于限制返回的行数不同于mysql的是，limit后只能跟一个参数
select * from emp limit 5;

3、分组查询
（3.1）Group By语句

GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列队结果进行分组，
然后对每个组执行聚合操作。

注意：分组语句中，select后的字段只能是分组字段或者聚合函数！

计算emp表每个部门的平均工资
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

计算emp每个部门中每个岗位的最高薪水
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

（3.2）Having语句

having与where不同点
where后面不能写分组函数，而having后面可以使用分组函数。
having只用于group by分组统计语句
例如：
求每个部门的平均薪水大于2000的部门
第一步：
求每个部门的平均工资
select deptno, avg(sal) from emp group by deptno;
第二步：
求每个部门的平均薪水大于2000的部门
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

4、连接查询
在这里插入图片描述
（4.1）内连接

select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;

采用交叉方式实现：
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
intersect
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

（4.2）外连接

 4.2.1.左链接
 --左连接
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno;

4.2.2右连接
--右连接
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

4.2.3左独有
--左独有
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null;

4.2.4右独有
--右独有
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;

4.2.5全连接
--全连接
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno;

--用union实现全连接
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
union
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

4.2.6左右独有
--左右独有
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
where e.deptno is null or d.deptno is null;

--用union all实现左右独有
select e.*,d.dname,d.loc
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null
union all
select e.empno ,e.ename ,e.job,e.mgr,e.hiredate,e.sal,e.comm,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;

--except/minus：差集实现
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
except/minus
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;

（4.3）交叉连接

交叉连接：得到笛卡尔积，有隐式、显式两种写法
加上on条件，相当于内连接

--隐式写法
select e.*,d.*
from emp e,dept d;
加上on条件，相当于内连接
select e.*,d.*
from emp e,dept d
where e.deptno=d.deptno;

--显示写法
select e.*,d.*
from emp e cross join dept d;
加上on条件，相当于内连接
select e.*,d.*
from emp e cross join dept d
on e.deptno=d.deptno;

（4.4）左半开连接（left semi-join）

当左边表的一条数据，在右边表中存在时，Hive就停止扫描，因此效率比join高
但是左半开连接的select和where关键字后面只能出现左边表的字段
不能出现右边表的字段。Hive不支持右半开连接

select *
from dept d left semi join emp e
on d.deptno=e.deptno;

select d.*
from dept d left semi join emp e
on d.deptno=e.deptno;

执行以下语句，左半连接用来代替in操作或者exists操作

select * from user left semi join job on user.id=job.user_id;
该语句相当于如下语句
select * from user where id in (select user_id from job);
但是！！！  hive不支持in子句。所以只能变通，使用left semi子句。

（4.5）自连接

普通方法查询：
SELECT * FROM AREA
WHERE parent_code=(SELECT area_code
FROM AREA
WHERE area_name="内蒙古自治区");
自连接查询：
SELECT a.*,b.area_name
FROM AREA a JOIN AREA b
ON a.parent_code=b.area_code
WHERE b.area_name="内蒙古自治区";

（6）多表连接

SELECT e.ename, d.dname, l.loc_name
FROM   emp e 
JOIN   dept d
ON     d.deptno = e.deptno 
JOIN   location l
ON     d.loc = l.loc;
Hive会对每对JOIN连接对象启动一个MapReduce任务。本例中会首先
启动一个 MapReduce job对表e和表d进行连接操作，然后会再
启动一个MapReduce job将第一个MapReduce job的输出和表l;进行连接操作
优化：当对3个或者更多表进行join连接时，如果每个on子句都使用
     相同的连接键的话，那么只会产生一个MapReduce job

（7）hiveJoin的注意事项

（7.1）允许使用复杂的联接表达式,支持非等值连接
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)

（7.2）同一查询中可以连接2个以上的表
SELECT a.val, b.val, c.val FROM a JOIN b 
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
（7.3）如果每个表在联接子句中使用相同的列，
则Hive将多个表上的联接转换为单个MR作业

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
--由于联接中仅涉及b的key1列，因此被转换为1个MR作业来执行

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
--会转换为两个MR作业，因为在第一个连接条件中使用了b中的key1列，而在第二个连接条件中使用了b中的key2列。

（7.4）join时的最后一个表会通过reducer流式传输，并在其中缓冲之前的其他表，因此，将大表放置在最后有助于减少reducer阶段缓存数据所需要的内存
（7.5）在join的时候，可以通过语法STREAMTABLE提示指定要流式传输的表
      如果省略STREAMTABLE提示，则Hive将流式传输最右边的表
（7.6）join在WHERE条件之前进行
（7.7）如果除一个要连接的表之外的所有表都很小，则可以将其作为仅map作业执行（mapjoin）

5.排序
（5.1）全局排序（Order By）

Order By：全局排序，只有一个Reducer
使用 ORDER BY 子句排序
ASC（ascend）: 升序（默认）
DESC（descend）: 降序

ORDER BY 子句在SELECT语句的结尾
--查询员工信息按工资升序排列
select * from emp order by sal;
--查询员工信息按工资降序排列
select * from emp order by sal desc;

（5.2）每个MapReduce内部排序（Sort By）

对于大规模的数据集order by的效率非常低，在很多情况下，并不需要全局排序，
此时可以使用sort by
Sort by为每个reducer产生一个排序文件，每个Reducer内部进行排序，对全局结果集来说不是排序
--设置reduce个数
set mapreduce.job.reduces=3;
--查看设置reduce个数
set mapred.reduce.tasks;

--根据部门编号降序查看员工信息
select * from emp sort by deptno desc;

--将查询结果导入到文件中（按照部门编号降序排序）
insert overwrite local directory '/home/offcn/tmp/sortby-result'
select * from emp sort by deptno desc;

（5.3）分区排序（Distribute By）

在有些情况下，我们需要控制某个特定行应该到哪个reducer，通常是为了进行后续的聚集操作
distribute by 子句可以做这件事
distribute by类似MR中partition（自定义分区），进行分区，结合sort by使用

对于distribute by进行测试，一定要分配多reduce进行处理，
否则无法看到distribute by的效果
--先按照部门编号分区，再按照员工编号降序排序
set mapreduce.job.reduces=3;
insert overwrite local directory '/home/offcn/tmp/distribute-result'
 select * from emp
 distribute by deptno sort by empno desc;
 注意：
 distribute by的分区规则是根据分区字段的hash码与reduce的个数进行模除后，
 余数相同的分到一个区
 Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前

（5.4）Cluster By

当distribute by和sort by字段相同时，可以使用cluster by方式
cluster by除了具有distribute by的功能外还兼具sort by的功能
但是排序只能是升序排序，不能指定排序规则为ASC或者DESC

–以下两种写法等价

1.select * from emp cluster by deptno;
2.select * from emp distribute by deptno sort by deptno;

大数据最新文章

实现Kafka至少消费一次

亚马逊云科技：还在苦于ETL？Zero ETL的时代

初探MapReduce

【SpringBoot框架篇】32.基于注解+redis实现

Elasticsearch：如何减少 Elasticsearch 集

Go redis操作

Redis面试题

专题五 Redis高并发场景

基于GBase8s和Calcite的多数据源查询

Redis——底层数据结构原理

加:2021-07-27 16:17:12 更:2021-07-27 16:20:18

360图书馆购物三丰科技阅读网日历万年历 2025年7日历

-2025/7/2 10:26:59-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码