使用Hive SQL窗口函数进行商务数据分析

相关主题
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

trailing_avg -- 滚动求平均 FROM orders ORDER BY 1 结果输出:
order_ customer cit
add_time
id _id
y
amou amount_t running_s running_sum_by_cus trailing_a
nt otal
um
tomer
month, revenue, lag(revenue) over (order by month) as prev_month_revenue FROM monthly_revenue ) SELECT month, revenue, round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth FROM prev_month_revenue ORDER BY 1 结果输出:
SELECT *, row_number() over (partition by order_id order by add_time desc) as rank FROM orders )t WHERE rank=1 ) ,monthly_revenue as ( SELECT trunc(add_time,'MM') as month, sum(amount) as revenue FROM orders_cleaned GROUP BY 1 ) ,prev_month_revenue as ( SELECT
SELECT trunc(add_time,'MM') as month, sum(amount) as revenue FROM orders GROUP BY 1 ) SELECT month, revenue, sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total FROM monthly_revenue ORDER BY 1 结果输出
(5,"D","上海","2020-02-05 12:00:00.000000",300),显然需要对其进行清洗去重,保留最新的一 条数据,SQL 如下: 我们先进行分组排名,然后保留最新的那条数据即可: SELECT * FROM (
SELECT *, row_number() over (partition by order_id order by add_time desc) as rank FROM orders )t WHERE rank=1 结果输出:
000
3400
2550 500
316.6666 67
9F
上 2020-03-09 海 00:00:00.000 250
000
3400
2800 250
291.6666 67
10 B
上 2020-03-21 海 00:00:00.000 600
000
3400
3400 850
需求 3:处理重复数据 从 上 面 的 数 据 可 以 看 出 , 存 在 两 条 重 复 的 数 据 (5,"D"," 上 海 ","2020-02-05 00:00:00.000000",250),
266.6666 67
6C
北 2020-02-19 京 00:00:00.000 300
000
3400
1900 500
283.3333 33
7A
上 2020-03-01 海 00:00:00.000 150
000
3400
2050 750
266.6666 67
8E
北 2020-03-05 京 00:00:00.000 500
SELECT trunc(add_time,'MM') as month,
city, sum(amount) as revenue FROM orders GROUP BY 1,2 ) ,prev_month_revenue as ( SELECT month, city, revenue, lag(revenue) over (partition by city order by month) as prev_month_revenue FROM monthly_revenue ) SELECT month, city, revenue, round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth FROM prev_month_revenue ORDER BY 2,1 结果输出
SELECT trunc(add_time,'MM') as month, sum(amount) as revenue FROM orders GROUP BY 1 ) ,prev_month_revenue as ( SELECT
month, revenue, lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入 FROM monthly_revenue ) SELECT month, revenue, prev_month_revenue, round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth FROM prev_month_revenue ORDER BY 1 结果输出
month city revenue revenue_growth
2020-01-01 上海 450 NULL
2020-02-01 上海 950 111.1
2020-03-01 上海 1000 5.3
2020-01-01 北京 200 NULL
2020-02-01 北京 300 50
2020-03-01 北京 500 66.7 需求 2:累计求和 累计汇总,即当前元素和所有先前元素的总和,如下面的 SQL: WITH monthly_revenue as (
4A
上 2020-02-04 海 00:00:00.000 400
000
3400
1050 600
262.5
上 2020-02-05
5D
海 00:00:00.000 250 3400
1300 250
260
000
5D
上 2020-02-05 海 12:00:00.000 300
000
3400
1600 550
1
9
Fห้องสมุดไป่ตู้
上海 2020-03-09 00:00:00.000000 250
1
10
B
上海 2020-03-21 00:00:00.000000 600
1
经过上面的清洗过程,对数据进行了去重。重新计算上面的需求 1,正确 SQL 脚本为: WITH orders_cleaned as (
SELECT * FROM (
vg
上 2020-01-01
1A
海 00:00:00.000 200 3400
200
200
200
000
上 2020-01-05
2B
海 00:00:00.000 250 3400
450
250
225
000
3C
北 2020-01-12
京 00:00:00.000 200 3400
650
200
000
216.6666 67
使用 Hive SQL 窗口函数进行商务数据分析
本文会从一个商务分析案例入手,说明 SQL 窗口函数的使用方式。通过本文的 5 个需求分 析,可以看出 SQL 窗口函数的功能十分强大,不仅能够使我们编写的 SQL 逻辑更加清晰, 而且在某种程度上可以简化需求开发。 数据准备 本文主要分析只涉及一张订单表 orders,操作过程在 Hive 中完成,具体数据如下: -- 建表 CREATE TABLE orders(
month revenue running_total
2020-01-01 650 650
2020-02-01 1250 1900
2020-03-01 1500 3400 我们还可以使用下面的组合方式进行分析,SQL 如下: SELECT
order_id, customer_id, city, add_time, amount, sum(amount) over () as amount_total, -- 所有数据求和 sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- 累计求和 sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer, avg(amount) over (order by add_time rows between 5 preceding and current row) as
order_id int, customer_id string, city string, add_time string, amount decimal(10,2));
-- 准备数据 INSERT INTO orders VALUES (1,"A","上海","2020-01-01 00:00:00.000000",200), (2,"B","上海","2020-01-05 00:00:00.000000",250), (3,"C","北京","2020-01-12 00:00:00.000000",200), (4,"A","上海","2020-02-04 00:00:00.000000",400), (5,"D","上海","2020-02-05 00:00:00.000000",250), (5,"D","上海","2020-02-05 12:00:00.000000",300), (6,"C","北京","2020-02-19 00:00:00.000000",300), (7,"A","上海","2020-03-01 00:00:00.000000",150), (8,"E","北京","2020-03-05 00:00:00.000000",500), (9,"F","上海","2020-03-09 00:00:00.000000",250), (10,"B","上海","2020-03-21 00:00:00.000000",600); 需求 1:收入增长 在业务方面,第 m1 个月的收入增长计算如下:100 *(m1-m0)/ m0 其中,m1 是给定月份的收入,m0 是上个月的收入。因此,从技术上讲,我们需要找到每 个月的收入,然后以某种方式将每个月的收入与上一个收入相关联,以便进行上述计算。计 算当时如下: WITH monthly_revenue as (
month revenue prev_month_revenue revenue_growth
2020-01-01 650 NULL
NULL
2020-02-01 1250 650
92.3
2020-03-01 1500 1250
20
我们还可以按照按城市分组进行统计,查看某个城市某个月份的收入增长情况 WITH monthly_revenue as (
t.order_id t.customer_id t.city t.add_time
t.amount t.rank
1
A
上海 2020-01-01 00:00:00.000000 200
1
2
B
上海 2020-01-05 00:00:00.000000 250
1
3
C
北京 2020-01-12 00:00:00.000000 200
1
4
A
上海 2020-02-04 00:00:00.000000 400
1
5
D
上海 2020-02-05 12:00:00.000000 300
1
6
C
北京 2020-02-19 00:00:00.000000 300
1
7
A
上海 2020-03-01 00:00:00.000000 150
1
8
E
北京 2020-03-05 00:00:00.000000 500
相关文档
最新文档