Member since: 07-16-2015
Posts: 177
Kudos Received: 28
Solutions: 19
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 14021 | 11-14-2017 01:11 AM |
|  | 60486 | 11-03-2017 06:53 AM |
|  | 4292 | 11-03-2017 06:18 AM |
|  | 13492 | 09-12-2017 05:51 AM |
|  | 1981 | 09-08-2017 02:50 AM |
08-16-2016
02:03 AM
The "multiplier" is not a parameter. You directly set the number of vcpu you want yarn to use. I guess you have to do the math yourself before setting the value.
08-11-2016
06:36 AM
Ok, since the default behaviour is inefficient, I have searched for a way to make the "bulk load" more efficient. I think I found a more efficient way, but there seems to be a blocker bug on it (referenced here: https://issues.apache.org/jira/browse/HIVE-13539).

1- The point is to set these two properties before running the insert command:
   SET hive.hbase.generatehfiles=true;
   SET hfile.family.path=/<a_path>/<thecolumn_family_name>;
2- Then run the insert query, which will prepare the HFiles at the designated location (instead of directly loading the HBase table).
3- And then only, perform a bulk load on HBase using the prepared HFiles:
   export HADOOP_CLASSPATH=`hbase classpath`
   yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /<a_path>/<thecolumn_family_name>

Problem: the query creating the HFiles is failing because it "found" multiple column families, since it looks at the wrong folder. I'm doing my tests on CDH 5.7.1. Has someone already tested this method? If yes, are there some properties to set that I have forgotten? Or is this really a blocker issue? Then I'll raise this with the support. Regards, mathieu
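For readers who want to reproduce the three steps above end to end, a minimal sketch (database, table and path names are placeholders; note that completebulkload usually also takes the target table name as a second argument):

```
#!/usr/bin/env bash
# Steps 1+2: generate HFiles from Hive instead of writing to the live HBase table.
hive -e "
SET hive.hbase.generatehfiles=true;
SET hfile.family.path=/tmp/hfiles/cf;
INSERT INTO TABLE hbase_backed_table SELECT * FROM source_table;
"

# Step 3: hand the prepared HFiles to HBase in one shot.
export HADOOP_CLASSPATH=$(hbase classpath)
yarn jar /usr/hdp/current/hbase-client/lib/hbase-server.jar \
  completebulkload /tmp/hfiles/cf my_hbase_table
```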
08-09-2016
05:53 AM
Did you specify the POST parameter "execute" with the query? https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Hive#WebHCatReferenceHive-URL
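If it helps, a minimal call against that endpoint might look like this (host, query and output directory are placeholders; execute and statusdir are the parameters documented at the link above):

```
# Submit a Hive query through WebHCat; "execute" carries the query itself.
curl -s -d 'user.name=hive' \
  --data-urlencode 'execute=SELECT count(*) FROM mytable' \
  --data-urlencode 'statusdir=/tmp/webhcat.out' \
  'http://webhcat-host:50111/templeton/v1/hive'
```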
08-09-2016
01:00 AM
In the HDFS role there is an "NFS gateway" service that lets you mount an NFS image of HDFS. That is one way (you can directly copy files to it; check the performance). Hue (the web UI) also lets you upload files into HDFS (this is a more manual approach). In our enterprise, for an automated process, we are using a custom Java application that uses the HCatWriter API for writing into Hive tables. But you can also use HttpFS or WebHDFS.
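If WebHDFS fits your case, a minimal sketch (NameNode host, port, user and file names are placeholders; op=CREATE is the standard WebHDFS operation):

```
# Upload a local file to HDFS through the WebHDFS REST API.
# -L follows the NameNode's redirect to the DataNode that actually receives the data.
curl -i -L -X PUT -T localfile.csv \
  "http://namenode-host:50070/webhdfs/v1/user/etl/localfile.csv?op=CREATE&user.name=etl&overwrite=true"
```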
08-09-2016
12:23 AM
Thank you for this explanation. This will help me a lot with the next steps.
08-08-2016
07:59 AM
Hi, we are facing a performance issue while loading data into HBase (using Hive queries). The Hive query is quite simple:

INSERT INTO TABLE <hive_table_name_targeting_hbase_table> SELECT * FROM <hive_table>

The table <hive_table_name_targeting_hbase_table> is a Hive table using the HBaseStorageHandler (so there is an HBase table as the storage). The table <hive_table> is a regular Hive table. There are millions of rows in <hive_table>, and <hive_table_name_targeting_hbase_table> is empty.

When running the query we can see that the YARN job generates 177 mappers (more or less depending on the data size in <hive_table>). This part is quite normal. But when I check the execution log of each mapper, I can see that some mappers take A LOT MORE TIME than others. Some mappers can take up to an hour (whereas the normal time of a mapper is around 10 minutes). In the log file of the "slow" mappers I can see a lot of retries on HBase operations (and finally some exceptions about NotServingRegionException). After some time (and a lot of retries) it's OK. But unfortunately, this slows down the processing a lot.

Has someone already encountered this (while loading an HBase table using Hive queries)? Could it be related to regions being split during the write? If yes, why? Is there some bug in the HBaseStorageHandler with too much data?

Of course, the HBase table is online and can be accessed normally after loading the data, so there is no HBase configuration issue here (at least not a basic one). HBase compaction is set to 0 (and is launched manually).

Log sample:

2016-08-08 10:18:25,962 INFO [htable-pool1-t31] org.apache.hadoop.hbase.client.AsyncProcess: #2, table=prd_piste_audit_gsie_traite_001, attempt=13/35 failed=28ops, last exception: null on <a_host>,60020,1467474218569, tracking started null, retrying after=20126ms, replay=28ops
2016-08-08 10:18:46,091 INFO [htable-pool1-t31] org.apache.hadoop.hbase.client.AsyncProcess: #2, table=prd_piste_audit_gsie_traite_001, attempt=14/35 failed=28ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region prd_piste_audit_gsie_traite_001,15a55dd4-5c6e-41b3-9d2e-304015aae5e9,1470642880612.e8868eaa5ac33c4612632c2c89474ecc. is not online on <a_host>,60020,1467474218569
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2786)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:922)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1893)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
    on <a_host>,60020,1467474218569, tracking started null, retrying after=20099ms, replay=28ops
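If mid-load region splits turn out to be the cause, one common mitigation (an assumption on my part, not something confirmed in this thread) is to pre-split the HBase table so the regions already exist before the mappers start writing. Table name, column family and split points below are placeholders, chosen for hex-prefixed row keys like the one in the log:

```
# Create the table pre-split on the first hex character of the row key,
# so the heavy initial load does not trigger splits while mappers write.
echo "create 'my_table', 'cf', SPLITS => ['1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']" \
  | hbase shell
```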
Labels:
- Apache HBase
05-19-2016
12:41 AM
I'm not seeing the same issue here. Check the YARN application logs; they will surely contain information about the issue.
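If it helps, the aggregated logs can be fetched from the command line (the application id below is a placeholder):

```
# Retrieve the YARN container logs for a finished application.
yarn logs -applicationId application_1234567890123_0042
```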
04-27-2016
07:53 AM
Not sure this is the latest documentation for Impala, but the Hive DATE type is not supported in Impala; use TIMESTAMP instead, for example. Search for "Impala supported data types" on Google. (Sorry, I can't paste the URL, I don't know why.)
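A minimal illustration of the workaround (table and column names are made up):

```
# DATE is not a supported type here; declare the column as TIMESTAMP instead.
impala-shell -q "CREATE TABLE events (id INT, event_ts TIMESTAMP) STORED AS PARQUET"
```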
03-03-2016
02:28 AM
Why not create a Hive table on top of the text file and then simply use a Hive query to load the data into the Avro table?
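A minimal sketch of that approach (table names, columns, delimiter and path are placeholders):

```
# 1) Expose the raw text file as an external Hive table.
# 2) Let Hive convert it by inserting into the existing Avro-backed table.
hive -e "
CREATE EXTERNAL TABLE staging_txt (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/';

INSERT INTO TABLE target_avro SELECT * FROM staging_txt;
"
```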
12-18-2015
02:29 AM
That's great to know. Best regards.