我们先看几组结果,再来分析row_number()、rank()、dense_rank() 的不同之处
spark-sql> with test1 as
> (select 100024 as apptypeid,10 as pv
> union all
> select 100024 as apptypeid,20 as pv
> union all
> select 100024 as apptypeid,30 as pv
> union all
> select 100027 as apptypeid,20 as pv
> union all
> select 100027 as apptypeid,50 as pv
> union all
> select 100027 as apptypeid,50 as pv)
> select
> apptypeid,
> pv,
> row_number() over(distribute by apptypeid sort by pv desc) as rank
> from test1;
100024 30 1
100024 20 2
100024 10 3
100027 50 1
100027 50 2
100027 20 3
Time taken: 2.169 seconds, Fetched 6 row(s)
spark-sql> with test1 as
> (select 100024 as apptypeid,10 as pv
> union all
> select 100024 as apptypeid,20 as pv
> union all
> select 100024 as apptypeid,30 as pv
> union all
> select 100027 as apptypeid,20 as pv
> union all
> select 100027 as apptypeid,50 as pv
> union all
> select 100027 as apptypeid,50 as pv)
> select
> apptypeid,
> pv,
> rank() over(distribute by apptypeid sort by pv desc) as rank
> from test1;
100024 30 1
100024 20 2
100024 10 3
100027 50 1
100027 50 1
100027 20 3
spark-sql> with test1 as
> (select 100024 as apptypeid,10 as pv
> union all
> select 100024 as apptypeid,20 as pv
> union all
> select 100024 as apptypeid,30 as pv
> union all
> select 100027 as apptypeid,20 as pv
> union all
> select 100027 as apptypeid,50 as pv
> union all
> select 100027 as apptypeid,50 as pv)
> select
> apptypeid,
> pv,
> dense_rank() over(distribute by apptypeid sort by pv desc) as rank
> from test1;
100024 30 1
100024 20 2
100024 10 3
100027 50 1
100027 50 1
100027 20 2
Time taken: 3.968 seconds, Fetched 6 row(s)
总结: 从上面几组结果我们可以看出: 1.row_number()如果有两个相同的数据进行排序,相同的数据会随机顺序排序(1、2、3) 2.rank()如果有两个相同的数据进行排序,相同的数据顺序一样,排同样的顺序,而下面的数据排序会跳过上面的计数排序(1、1、3) 3.dense_rank()如果有两个相同的数据进行排序,相同的数据顺序一样,排同样的顺序,下面的数据排序不会跳过计数排序,依然会顺序排序(1、1、2) 4.rank()和dense_rank()的区别是一样跳过计数排序,一个不跳过计数排序
|