Hi I have a hive table on HBASE that has 200gb of records . I am running simple hive query to fetch 20 gb records . But this takes around 4 hours of time . I can not create partition on HIVE table cause it is integrated on HBASE. Please suggest any idea to improve performance This is my HIVE query INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FundamentalAnalytic/FundamentalAnalytic_2014.txt' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from hbase_table_FundamentalAnalytic where FilePartition='ThirdPartyPrivate' and FilePartitionDate='2014';
Use Phoenix instead of Hive. Phoenix will give better performance. If your HBase table is already there, you can create a view in Phoenix on the HBase table. You could also use Phoenix to create HBase tables.
Here is an example in which I'm creating a view on an HBase table for the column family named pageview:
create view "pageviews_hbase"(event_id varchar not null primary key,"pageview"."event_time" varchar,"pageview"."url" varchar, "pageview"."country" varchar,"pageview"."session_id" varchar,"pageview"."user_cookie" varchar,"pageview"."os" varchar,"pageview"."browser" varchar)
Phoenix is not installed on my cluster as it is not bundeled in CDH 5 distribution ,Any other idea
Spark is another option; however, this will require you to write either Java or Scala code using the Spark DataFrames API.
IMO, and I have tested this many time with high volumes of data, Hive on HBase is not a good solution for SQL analytics of your HBase data.
Phoenix does work with CDH. Here is a link on how to install: