目录
- hive outline
- hive parse_url
- hive parse_url_tuple
hive outline
链接
对以下数据建表:tb_url
1 http://facebook.com/path/p1.php?query=1
2 http://tongji.baidu.com/news/index.jsp?uuid=frank
3 http://www.jdwz.com/index?source=baidu
4 http://www.itcast.cn/index?source=alibaba
hive parse_url
该函数属于UDF类型
语法:
parse_url(url, partToExtract[, key]) - extracts a part from a URL
Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO key
需求:实现对URL进行分析,从URL中获取每个ID对应HOST、PATH以及QUERY
id | url |
1 | |
2 | |
3 | |
4 |
select id,
parse_url(url, "HOST") as host,
parse_url(url, "PATH") as path,
parse_url(url, "QUERY") as query
from tb_url;
id | host | path | query |
1 | facebook.com | /path/p1.php | query=1 |
2 | tongji.baidu.com | /news/index.jsp | uuid=frank |
3 | www.jdwz.com | /index | source=baidu |
4 | www.itcast.cn | /index | source=alibaba |
注意:
使用parse_url函数每次只能解析一个参数,导致构建多列需要调用多次该函数,我们希望能实现调用一次函数,就可以将多个参数进行解析,得到多列结果。parse_url_tuple可以实现该效果
hive parse_url_tuple
该函数属于UDTF类型(常结合lateral view侧视图)
语法:
parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL.
It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string.
Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME>
parse_url在使用时可以指定多个参数
第一个参数:url:指定要解析的URL
第二个参数:key1:指定要解析的内容1
……
第N个参数:keyN:指定要解析的内容N
select a.id,
b.host,
b.path,
b.query
from tb_url a lateral view parse_url_tuple(url, "HOST", "PATH", "QUERY") b as host, path, query;
id | host | path | query |
1 | facebook.com | /path/p1.php | query=1 |
2 | tongji.baidu.com | /news/index.jsp | uuid=frank |
3 | www.jdwz.com | /index | source=baidu |
4 | www.itcast.cn | /index | source=alibaba |