Big Data: Hadoop + Hive + Spark Cluster Installation (Part 3)

陈德兵 · 1 year ago · 3245 views

Continued from: Big Data: Hadoop + Hive + Spark Cluster Installation (Part 2)


1. Hive Commands

Connect through beeline:

    beeline
    !connect jdbc:hive2://localhost:10000

Log in with username hadoop and password 123456789, as configured in hive-site.xml:

    <property>
      <name>hive.jdbc_passwd.auth.hadoop</name>
      <value>123456789</value>
    </property>

Alternatively, run the hive CLI directly on the server.

Create a table:

    CREATE TABLE goods_ids(
      id string,
      goods_id BIGINT,
      source int,
      cat_id_1 int,
      cat_id_2 int,
      cat_id_3 int,
      cat_id_4 int,
      cat_name_1 string,
      cat_name_2 string,
      cat_name_3 string,
      cat_name_4 string,
      category string
    ) row format delimited fields terminated by "," stored as orc;

ORC compresses well and supports fast columnar access; it is an improved version of RCFile and more efficient than it. Just append the "stored as orc" clause to the DDL, as shown at the end of the statement above. Without it, the table defaults to the TextFile format.


If the table uses the TextFile format, a plain-text data file can be loaded into it directly:

    load data local inpath '/data/data/goods_ids.txt' into table goods_ids1;


A common pattern: create the ORC table (goods_ids) plus a TextFile staging table (goods_ids1), load the file into goods_ids1, copy the data into the ORC table with "insert into goods_ids select * from goods_ids1;", then drop the TextFile table.
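Putting the steps above together as one script (a sketch; it assumes the goods_ids ORC table created earlier, and a staging table goods_ids1 with the same columns):

```sql
-- TextFile staging table, same schema as the ORC table goods_ids
CREATE TABLE goods_ids1(
  id string, goods_id BIGINT, source int,
  cat_id_1 int, cat_id_2 int, cat_id_3 int, cat_id_4 int,
  cat_name_1 string, cat_name_2 string,
  cat_name_3 string, cat_name_4 string,
  category string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- Load the raw delimited file into the staging table
LOAD DATA LOCAL INPATH '/data/data/goods_ids.txt' INTO TABLE goods_ids1;

-- Copy into the ORC table; Hive rewrites the rows in ORC format on insert
INSERT INTO goods_ids SELECT * FROM goods_ids1;

-- Drop the staging table once the copy has been verified
DROP TABLE goods_ids1;
```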


Indexes:

Create (the index data is not built yet, because of "with deferred rebuild"):

    create index goods_ids_goodsid_index on table goods_ids (goods_id)
    as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    with deferred rebuild
    in table goods_ids_goodsid_index_table;

Build (and rebuild again after new data arrives):

    alter index goods_ids_goodsid_index on goods_ids rebuild;

Show:

    show index on goods_ids;

Drop:

    drop index goods_ids_goodsid_index on goods_ids;

(Note: Hive's built-in indexes were removed in Hive 3.0; columnar formats such as ORC, and materialized views, are the recommended replacements.)



If Hive fails to start with the error below, open the MySQL metastore database (hivedb) and change the SCHEMA_VERSION value in the VERSION table to 2.3.0:

    Exception in thread "main" MetaException(message:Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt)
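The version fix can be applied in MySQL like this (a sketch; it assumes the metastore database is named hivedb, as set up earlier in this series):

```sql
USE hivedb;
-- The metastore records its schema version in the VERSION table
UPDATE VERSION SET SCHEMA_VERSION = '2.3.0';
```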

Or (as a permanent fix) add to hive-site.xml:

    <property>
      <name>hive.metastore.schema.verification</name>
      <value>false</value>
    </property>


2. Optimization (1)

For large, frequently filtered columns, use partitioning instead of indexes. Indexes are cumbersome to use: the index table does not update automatically when new data arrives and has to be rebuilt by hand.

Partitioning stores the data in separate directories keyed by a virtual partition column.

    CREATE TABLE goods_info_daysales(
      id string,
      goods_id BIGINT,
      goods_name string,
      sales int,
      `date` date
    ) PARTITIONED BY (day string) row format delimited fields terminated by "," stored as orc;

Here day is the virtual partition column.

insert into goods_info_daysales partition(day='2020-08-24') select * from goods_info_daysales1 where `date`='2020-08-24';

    select * from goods_info_daysales where day='2020-08-24';

The query is resolved against the matching partition directory, which makes it much faster.
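To check that the partition exists and that a query is pruned to it, the standard Hive commands below can be used (a sketch against the table defined above):

```sql
-- Lists partitions; each maps to a directory such as .../day=2020-08-24
SHOW PARTITIONS goods_info_daysales;

-- The plan shows the scan restricted to the matching partition
EXPLAIN SELECT * FROM goods_info_daysales WHERE day = '2020-08-24';
```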



3. Maintenance

    -- Drop a database
    drop database if exists db_name;
    -- Force-drop a database together with its tables
    drop database if exists db_name cascade;
    -- Drop a table
    drop table if exists employee;
    -- Truncate a table
    truncate table employee;
    -- Truncate a table, second way
    insert overwrite table employee select * from employee where 1=0;
    -- Drop partitions
    alter table employee_table drop partition (stat_year_month>='2018-01');
    -- Delete rows by condition (the overwrite keeps only the selected rows)
    insert overwrite table employee_table select * from employee_table where id>'180203a15f';


4. Java

Run a class from a jar with its dependencies on the classpath:

    java -cp "yourFile.jar:jars/*" package.className

(The ":" classpath separator is for Linux/macOS; on Windows use ";".)


5. Reference Configuration

Configuration files

