
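1: Self-introduction

Why did you choose this position?

What is your understanding of the big data technology stack?

Is it really because I love data and enjoy digging into big data technology? Hahaha 😂

So far I know the Hadoop ecosystem best, but not in enough depth. Next I will focus on Hive fundamentals and go deep on HQL, get familiar with Hadoop, and memorize the principles and concepts behind MapReduce, Flume, Kafka, and the other components. I should also start on the Spark stack as soon as possible.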

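2: Data skew

What is data skew?

Why does data skew occur?

Have you run into data skew in practice? How did you solve it?

2.1. What is it?

The job's progress stays at 99% for a long time, and the monitoring page shows that only a few reduce subtasks are still running.

2.2. Why?

Why does data skew happen? Put simply, if a single key carries a very large share of the data, as one word might in wordcount, the reducer for that key is overloaded.

Which operations usually cause it? Mostly count(distinct), group by, and join: each can send too many rows to one reducer, which then takes far longer than the rest.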
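To make the hot-key effect concrete, here is a minimal sketch that counts how many records each reducer would receive under hash partitioning. The class and method names are hypothetical, and the partition formula is the usual non-negative hash modulo the reducer count; a real job's partitioner may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: per-reducer record counts under hash partitioning.
class SkewDemo {
    // Non-negative hash of the key, modulo the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Number of records each reducer receives for the given keys.
    static int[] reducerLoads(List<String> keys, int numReducers) {
        int[] loads = new int[numReducers];
        for (String key : keys) {
            loads[partition(key, numReducers)]++;
        }
        return loads;
    }

    public static void main(String[] args) {
        // "the" is the hot key: 6 of the 9 records share it.
        List<String> words = Arrays.asList(
            "the", "the", "the", "the", "the", "the", "hive", "spark", "kafka");
        System.out.println(Arrays.toString(reducerLoads(words, 3)));
    }
}
```

Whichever reducer owns "the" gets at least six of the nine records, no matter how many reducers are added, because every copy of a key hashes to the same partition.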

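1) group by

Note: group by is preferable to distinct.
Case: the group by dimension has too few distinct values, and one value has too many rows.
Consequence: the reducer that handles that value takes a very long time.
Fix: replace count(distinct) with sum() ... group by.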

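2) count(distinct)

Case: one special value (for example, null) occurs far too often.
Consequence: the reducer handling that special value is slow; there is only one reduce task.
Fix: handle null values separately when running count(distinct). For example, filter out the rows whose value is null and add 1 to the final result. If other calculations need a group by, process the null records on their own first and then union them with the other results.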

3) Joining columns of different data types

Case: user_id is int in the users table, but in the logs table user_id holds both string and int values, and the two tables are joined on user_id.
Consequence: the default hash partitions rows by the int id, so every record with a string id lands on the same reducer.
Fix: cast the numeric column to string.
select * from users a
left outer join logs b
on cast(a.user_id as string) = b.user_id

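2.3. What are the consequences?

It slows down the whole job (the nodes that have already finished all wait for the one that is still running).

2.4. Classification?

(Borrowing the data skew examples from Spark) (not part of HQL; to be revisited later)

2.5. Additional notes on data skew classification

1) Aggregation skew

(local aggregation + global aggregation)

2) Join skew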
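The "local aggregation + global aggregation" idea for aggregation skew can be sketched as a two-stage count. This is only an in-memory illustration: the class name, the round-robin salt, and the "salt_key" format are assumptions made for the example (real jobs often use a random salt), not something taken from the interview.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo: two-stage (salted) counting to spread a hot key.
class TwoStageCount {
    // Stage 1 (local): prefix each key with a salt so that one hot key is split
    // across several partial counters (in a real job, across several reducers).
    static Map<String, Long> localAggregate(List<String> keys, int salts) {
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (String key : keys) {
            String salted = (i++ % salts) + "_" + key; // round-robin salt (assumption)
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2 (global): strip the salt and merge the partial counts per original key.
    static Map<String, Long> globalAggregate(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String original = e.getKey().substring(e.getKey().indexOf('_') + 1);
            total.merge(original, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

With three salts, six occurrences of one hot key become three partial counters of two each in stage 1, and stage 2 only has to add three numbers for that key.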

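3: Coding problem: the k smallest numbers in an array

3.1 Two approaches

Approach 1: brute force. Call the library sort and return the first k elements. (This is certainly not the answer the interviewer is looking for.)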

import java.util.Arrays;

class Solution {
    public int[] smallestK(int[] arr, int k) {
        // Brute force: sort with the library routine, then copy the first k elements.
        // Note that this sorts the caller's array in place.
        Arrays.sort(arr);
        return Arrays.copyOf(arr, k);
    }
}

Approach 2: use a priority queue as a heap.

Keep k elements in a max-heap, then compare the heap top with each of the remaining elements.

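import java.util.PriorityQueue;

class Solution {
    public int[] smallestK(int[] arr, int k) {
        int[] res = new int[k];
        if (k == 0) {
            return res;
        }
        // Max-heap holding the k smallest elements seen so far.
        // Integer.compare avoids the overflow that b - a can hit on extreme values.
        PriorityQueue<Integer> pqueue = new PriorityQueue<>((a, b) -> Integer.compare(b, a));
        for (int i = 0; i < k; i++) {
            pqueue.offer(arr[i]);
        }
        // Replace the heap top whenever a smaller element shows up.
        for (int i = k; i < arr.length; i++) {
            if (pqueue.peek() > arr[i]) {
                pqueue.poll();
                pqueue.offer(arr[i]);
            }
        }
        // Polling a max-heap yields the result in descending order.
        for (int i = 0; i < k; i++) {
            res[i] = pqueue.poll();
        }
        return res;
    }
}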

3.2 Summary: using a priority queue as a heap

1. Using PriorityQueue in Java

PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>();

With no comparator, it is a min-heap by default, with an initial capacity of 11.

Passing a custom comparator turns it into a max-heap:

PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>(new Comparator<Integer>() {
    public int compare(Integer a, Integer b) {
        return b - a;
    }
});

Or the short version:

PriorityQueue<Integer> pqueue = new PriorityQueue<Integer>((a, b) -> b - a);
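One caveat with the b - a comparator is worth a small sketch: the subtraction can overflow when the operands are far apart. The example below, with hypothetical class and method names and assuming non-empty input, builds the max-heap with Collections.reverseOrder() instead and shows the overflow case directly.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical demo: min-heap vs. max-heap ordering with PriorityQueue.
class HeapOrderDemo {
    // Largest value, read from the top of a max-heap. Assumes values is non-empty.
    static int maxOf(int... values) {
        // reverseOrder() compares without subtracting, so it cannot overflow.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        for (int v : values) {
            maxHeap.offer(v);
        }
        return maxHeap.peek();
    }

    // Smallest value, read from the top of the default min-heap. Assumes values is non-empty.
    static int minOf(int... values) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        for (int v : values) {
            minHeap.offer(v);
        }
        return minHeap.peek();
    }

    // With a = Integer.MIN_VALUE and b = 1, b - a wraps around to a negative
    // number, so a comparator written as b - a would order this pair wrongly.
    static boolean subtractionOverflows() {
        int a = Integer.MIN_VALUE;
        int b = 1;
        return (b - a) < 0;
    }
}
```

For interview inputs with small values, b - a behaves correctly; Integer.compare(b, a) or Collections.reverseOrder() is simply the safer habit.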

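3.3. Summary: custom comparators

This is the same pattern as (a, b) -> b - a, that is, a custom comparator.

For example, an earlier problem, "Minimum Number of Arrows to Burst Balloons", first requires sorting the data by its second dimension.

Given intervals of the form [left, right], we sort by the right endpoint.

We use Arrays.sort(points, (a, b) -> Integer.compare(a[1], b[1])), which, unlike a[1] > b[1] ? 1 : -1, returns 0 for equal endpoints as the Comparator contract expects.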
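As a sketch of how the sort by right endpoint is used in that problem: the write-up only mentions the sorting step, so the greedy loop below is the commonly taught approach rather than something quoted from it, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo: sort intervals by right endpoint, then shoot greedily.
class ArrowsSketch {
    static int minArrows(int[][] points) {
        if (points.length == 0) {
            return 0;
        }
        // Copy the outer array so the caller's row order is left untouched.
        int[][] sorted = points.clone();
        // Sort by the second dimension (right endpoint) without subtraction.
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[1], b[1]));

        int arrows = 1;
        int arrowPos = sorted[0][1]; // shoot at the first interval's right end
        for (int i = 1; i < sorted.length; i++) {
            // An interval that starts after the current arrow needs a new one.
            if (sorted[i][0] > arrowPos) {
                arrows++;
                arrowPos = sorted[i][1];
            }
        }
        return arrows;
    }
}
```

For the intervals [10,16], [2,8], [1,6], [7,12], the sorted order is [1,6], [2,8], [7,12], [10,16]; arrows at 6 and 12 cover all four, so the method returns 2.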

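4: Writing SQL

4.1 Two approaches

Given the table zijie_ads below: for each natural week, among new users, find the page referral source (resource) of the users ranked in the top 5 by completion rate (play_rate).

day (date)	id (int)	user_type (int)	play_rate (double)	resource (string)
2021-01-04	1	1	0.4	type_a
2021-09-22	2	0	0.4	type_b
......	......	......	......	......

Approach: there are two difficulties here:

1) How do we get the top 5 by completion rate?

2) How do we restrict the ranking to each natural week, that is, how do we represent a natural week?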

First question: how to get the top 5 by completion rate. See problem 4, "Department Top Three Salaries", in my blog post

https://blog.csdn.net/yezonghui/article/details/115283626

Approach 1: MySQL

select h1.resource
from zijie_ads h1
where h1.user_type = 1
  and (select count(distinct h2.play_rate)
       from zijie_ads h2
       where h2.user_type = 1
         and h2.play_rate > h1.play_rate
      ) < 5

Approach 2: dense_rank() over()

Note that dense_rank() does not require a partition by clause, but it does generally need an order by.

select temp_table.resource
from
(
  select resource, dense_rank() over(order by play_rate desc) as orderrank
  from zijie_ads
  where user_type = 1
) temp_table
where orderrank < 6;

[Note]: count(distinct ...) in approach 1 corresponds to dense_rank in approach 2;
without distinct, it corresponds to rank.
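The correspondence between the counting subquery and the two window functions can be checked with a small sketch. It restates the definitions in plain Java: a row's dense rank is one plus the number of distinct larger values, and its rank is one plus the number of larger rows. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo: rank vs. dense_rank expressed as "count the larger values".
class RankDemo {
    // dense_rank: 1 + number of DISTINCT values strictly greater than v.
    static int denseRank(double[] values, double v) {
        return 1 + (int) Arrays.stream(values).filter(x -> x > v).distinct().count();
    }

    // rank: 1 + number of rows (duplicates included) strictly greater than v.
    static int rank(double[] values, double v) {
        return 1 + (int) Arrays.stream(values).filter(x -> x > v).count();
    }
}
```

For the completion rates 0.9, 0.9, 0.6, 0.5, the value 0.6 has dense rank 2 but rank 3, because the tie at 0.9 counts once with distinct and twice without.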

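Second question: how do we handle natural weeks?

This is where Hive's date functions come in. For example, weekofyear() returns the week of the year that a given date falls in.

So how do we group by natural week? As with the first question, there are two ways:

match the weeks in the correlated subquery, with weekofyear(h1.day1) = weekofyear(h2.day1),

or partition inside the window function, with dense_rank() over(partition by weekofyear(day1) ...).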
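To see which week a date falls into, here is a small sketch using java.time. It assumes that weekofyear() follows ISO-8601 week numbering (weeks start on Monday); the class and method names are hypothetical. It also shows one thing to watch for: a week number alone does not identify the year, so a key that includes the week-based year is safer when the data spans more than one year.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

// Hypothetical demo: ISO-8601 week numbers for dates given as yyyy-MM-dd.
class WeekKeyDemo {
    // ISO week number of the date (1 to 53).
    static int isoWeek(String date) {
        return LocalDate.parse(date).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    // Week key that also carries the week-based year, e.g. "2021-W01".
    static String weekKey(String date) {
        LocalDate d = LocalDate.parse(date);
        return d.get(IsoFields.WEEK_BASED_YEAR)
            + "-W" + String.format("%02d", d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR));
    }
}
```

Under ISO numbering, 2021-01-02 (a Saturday) still belongs to week 53 of 2020, while 2021-01-04 (a Monday) starts week 1 of 2021, so those two dates end up in different weekly groups.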

Putting the two parts together, the final answers (written against the practice table haokan_ads_test02 from 4.2) are:

Approach 1:

select h1.resource
from haokan_ads_test02 h1
where h1.user_type = 1
  and (select count(distinct h2.play_rate)
       from haokan_ads_test02 h2
       where h2.user_type = 1
         and h2.play_rate > h1.play_rate
         and weekofyear(h1.day1) = weekofyear(h2.day1)
      ) < 5;

Approach 2:

select temp_table.resource
from (
  select resource,
         dense_rank() over (partition by weekofyear(day1) order by play_rate desc) as orderrank
  from haokan_ads_test02
  where user_type = 1
) temp_table
where orderrank < 6;

4.2 Working through the problem

1) Create the table

create table if not exists haokan_ads_test02 (
    user_id   int,
    user_type int,
    day1      date,
    play_rate double,
    resource  string
)
row format delimited fields terminated by ' '
lines terminated by '\n';

2) Prepare the data

1 1 2021-01-02 0.6 ads1
2 1 2021-01-08 0.9 ads2
3 0 2021-01-03 0.52 ads3
4 1 2021-01-07 0.62 ads4
5 1 2021-01-11 0.19 ads5
6 0 2021-01-02 0.18 ads6
7 1 2021-01-02 0.49 ads7
8 0 2021-01-03 0.39 ads8
9 0 2021-01-09 0.21 ads9
10 0 2021-01-03 0.39 ads10
11 0 2021-01-04 0.25 ads11
12 0 2021-01-03 0.35 ads12
13 0 2021-01-09 0.1 ads13

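3) Upload the data file from the local Linux machine to HDFS

In the Linux shell, change to the directory that contains the file

and run

hdfs dfs -put haokan_ads_test02.txt /user/hive/warehouse

4) Load the data into the table created above

(The load can read from the local filesystem or from HDFS. The statement below uses local inpath, so it reads the local file; drop the local keyword and give an HDFS path to load from HDFS.)

load data local inpath '/home/atguigu/bin/haokan_ads_test02.txt'
overwrite into table haokan_ads_test02;

5) Run the select statements for the problem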

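5: Résumé and project questions

Describe your skills on the résumé modestly, with wording such as "familiar with" or "have a working knowledge of". Anything that does go on the résumé you must know thoroughly, so rehearse it several times beforehand.

Prepare in advance for the questions the interviewer is likely to ask.

6: Summary

My hands-on Hive skills are weak; I have not yet built up fluency with window functions, date functions, or the way of thinking they require.

Practice more and think more. When learning a topic, master it completely rather than leaving it half-understood.

Memorize more, understand more.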

Original link: https://blog.csdn.net/yezonghui/article/details/115412927

2021 ByteDance Big Data R&D Interview Retrospective
