欢迎关注Hadoop、Spark、Flink、Hive、Hbase、Flume等大数据资料分享微信公共账号:iteblog_hadoop
  1. 文章总数:978
  2. 浏览总数:11,978,651
  3. 评论:3938
  4. 分类目录:106 个
  5. 注册用户数:6129
  6. 最后更新:2018年12月15日
过往记忆博客公众号iteblog_hadoop
欢迎关注微信公众号:
iteblog_hadoop
大数据技术博客公众号bigdata_ai
大数据猿:
bigdata_ai

Hive创建索引

  索引是标准的数据库技术,hive 0.7版本之后支持索引。Hive提供有限的索引功能,这不像传统的关系型数据库那样有“键(key)”的概念,用户可以在某些列上创建索引来加速某些操作,给一个表创建的索引数据被保存在另外的表中。 Hive的索引功能现在还相对较晚,提供的选项还较少。但是,索引被设计为可使用内置的可插拔的java代码来定制,用户可以扩展这个功能来满足自己的需求。 当然不是说有的查询都会受惠于Hive索引。用户可以使用EXPLAIN语法来分析HiveQL语句是否可以使用索引来提升用户查询的性能。像RDBMS中的索引一样,需要评估索引创建的是否合理,毕竟,索引需要更多的磁盘空间,并且创建维护索引也会有一定的代价。 用户必须要权衡从索引得到的好处和代价。
  下面说说怎么创建索引:
  1、先创建表:

hive> create table user( id int, name string)  
    > ROW FORMAT DELIMITED  
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;

  2、导入数据:

hive> load data local inpath '/export1/tmp/wyp/row.txt' 
    > overwrite into table user;

  3、创建索引之前测试

hive> select * from user where id =500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than 
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL = 
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command=/home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers:2; number of reducers:0
2013-11-13 15:09:53,336 Stage-1 map = 0%,  reduce = 0%
2013-11-13 15:09:59,500 Stage-1 map=50%,reduce=0%, Cumulative CPU 2.0 sec
2013-11-13 15:10:00,531 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.63 sec
2013-11-13 15:10:01,560 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.63 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   
HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000 wyp.
Time taken: 14.107 seconds, Fetched: 1 row(s)

一共用了14.107s
  4、对user创建索引

hive> create index user_index on table user(id) 
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
    > with deferred rebuild
    > IN TABLE user_index_table;
hive> alter index user_index on user rebuild;
hive> select * from user_index_table limit 5; 
0       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [0]
1       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [352]
2       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [704]
3       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1056]
4       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1408]
Time taken: 0.244 seconds, Fetched: 5 row(s)

这样就对user表创建好了一个索引。
  5、对创建索引后的user再进行测试

hive> select * from user where id =500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than 
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL = 
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command=/home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers:2; number of reducers:0
2013-11-13 15:23:12,336 Stage-1 map = 0%,  reduce = 0%
2013-11-13 15:23:53,240 Stage-1 map=50%,reduce=0%, Cumulative CPU 2.0 sec
2013-11-13 15:24:00,253 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.27 sec
2013-11-13 15:24:01,650 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.27 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   
HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000 wyp.
Time taken: 13.042 seconds, Fetched: 1 row(s)

时间用了13.042s这和没有创建索引的效果差不多。

  在Hive创建索引还存在bug:如果表格的模式信息来自SerDe,Hive将不能创建索引:

hive> CREATE INDEX employees_index
    > ON TABLE employees (country)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD
    > IDXPROPERTIES ('creator' = 'me','created_at' = 'some_time')
    > IN TABLE employees_index_table
    > COMMENT 'Employees indexed by country and name.';
FAILED: Error in metadata: java.lang.RuntimeException:             \
Check the index columns, they should appear in the table being indexed.
FAILED: Execution Error, return code 1 from                       \
org.apache.hadoop.hive.ql.exec.DDLTask

这个bug发生在Hive0.10.0、0.10.1、0.11.0,在Hive0.12.0已经修复了,详情请参见:https://issues.apache.org/jira/browse/HIVE-4251

本博客文章除特别声明,全部都是原创!
转载本文请加上:转载自过往记忆(https://www.iteblog.com/)
本文链接: 【Hive创建索引】(https://www.iteblog.com/archives/836.html)
喜欢 (13)
分享 (0)
发表我的评论
取消评论

表情
本博客评论系统带有自动识别垃圾评论功能,请写一些有意义的评论,谢谢!
(3)个小伙伴在吐槽
  1. 意思是hive的索引很鸡肋?我可以这么说嘛?
    朝夕奔梦2017-06-16 22:10 回复
  2. 你好,我看了您的很多文章,但是本文是有错误的,你没有启动索引,如果启动索引后是只有一个mappper的,可以参考一下别人的文章比如:http://www.cnblogs.com/end/archive/2013/01/22/2871147.html,还有希望作者再写点有关索引的文章,毕竟这个对hive的效率有很大影响,谢谢!
    xuanjinlee2014-06-22 20:07 回复