My team and I are using Hive 3.1.0 as part of HDInsight 4.0 on Azure.
We have a dimensional model that stores data in Hive managed tables (ACID by default), with ADLS Gen2 as the underlying file system and Azure SQL DB as the metastore.
To populate those tables, after some conforming done in Spark, we use MERGE statements that perform upserts.
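For context, our upsert merges look roughly like this (the table and column names here are illustrative, not our real schema):

```sql
-- Illustrative upsert MERGE against an ACID managed table.
-- call_fact, call_staging, and the columns are made-up names.
MERGE INTO call_fact AS t
USING call_staging AS s
ON t.call_id = s.call_id
WHEN MATCHED THEN
  UPDATE SET duration = s.duration, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT VALUES (s.call_id, s.duration, s.updated_at);
```

Each such MERGE on an ACID table writes new delta (and delete-delta) directories, which is what the compactor later merges away.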
We are currently facing one very important problem: sometimes a long-running query fails when it reads a table against which a MERGE was executed just before. As far as I can tell, the compactor runs on the underlying data at the same time and deletes (merges) some of the delta files that the query is still reading, which makes the query fail.
The error is this one (I replaced the ADLS path with a dummy value for anonymity):
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 27, vertexId=vertex_1584051627112_5934_1_12, diagnostics=[Task failed, taskId=task_1584051627112_5934_1_12_000565, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1584051627112_5934_1_12_000565_0:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://myADLSUri.dfs.core.windows.net/myADLSContainer/hive/warehouse/managed/new_repartition.db/call_fact/country_code%3DBR/year%3D2020/month%3D2/delete_delta_0000005_0000005_0001/bucket_00015?timeout=90, PathNotFound, "The specified path does not exist. RequestId:bde0f6fa-201f-0095-043a-008c49000000 Time:2020-03-22T11:08:04.4571403Z"
at java.security.AccessController.doPrivileged(Native Method)
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://myADLSUri.dfs.core.windows.net/myADLSContainer/hive/warehouse/managed/new_repartition.db/call_fact/country_code%3DBR/year%3D2020/month%3D2/delete_delta_0000005_0000005_0001/bucket_00015?timeout=90, PathNotFound, "The specified path does not exist. RequestId:bde0f6fa-201f-0095-043a-008c49000000 Time:2020-03-22T11:08:04.4571403Z"
... 16 more
Does anyone have ideas on what might be causing this conflict between the query and the compactor?
Also, any suggestions on how to fix it? I know we can disable automatic compaction and run compactions manually, but we have all sorts of dynamic partitions and around 200 tables with multiple ETL flows running in parallel, so we would see a significant performance decrease and it would be a mess to manage.
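In case it helps the discussion, this is the kind of manual-compaction setup I mean; `NO_AUTO_COMPACTION` and the `ALTER TABLE ... COMPACT` / `SHOW COMPACTIONS` statements are real Hive features, but the table and partition names are illustrative:

```sql
-- Turn off automatic compaction for a single table.
ALTER TABLE call_fact SET TBLPROPERTIES ('NO_AUTO_COMPACTION' = 'true');

-- Trigger compaction explicitly, e.g. at the end of an ETL flow,
-- for a specific partition (the partition spec is illustrative).
ALTER TABLE call_fact
  PARTITION (country_code = 'BR', year = 2020, month = 2)
  COMPACT 'major';

-- Monitor queued and running compactions.
SHOW COMPACTIONS;
```

The problem is doing this reliably across ~200 tables with dynamic partitions, which is exactly what we want to avoid.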
Hi,

First of all, I am new to Cloudera and big data solutions, and I am trying to set up a training lab. I am using a virtual machine on an ESXi server (hypervisor) with 24 GB of RAM allocated to it. I downloaded the VM from the Cloudera site (Cloudera Express 5.7.0 Vanilla) and deployed it successfully.

The problem I am facing is with Java heap allocation. Cloudera Manager automatically assigned Java heap sizes for the NameNode and Secondary NameNode that are not enough (I don't understand why, because the machine has 24 GB and I saw that 8+ GB should be enough). I manually raised those values to 4 GB each, but then got another error about memory:

Memory on host quickstart.cloudera is overcommitted. The total memory allocation is 21.3 GiB bytes but there are only 23.5 GiB bytes of RAM (4.7 GiB bytes of which are reserved for the system). Visit the Resources tab on the Host page for allocation details. Reconfigure the roles on the host to lower the overall memory allocation. Note: Java maximum heap sizes are multiplied by 1.3 to approximate JVM overhead

So basically, if I increase the heap for the NameNodes to the suggested size, I get the error above; if I keep it lower, I get errors about not having enough heap.

My questions are:
1. Is 24 GB enough for the version of the VM I am using? It has only one DataNode and one NameNode + Secondary NameNode, and as it is a training machine I expect low load on it.
2. If 24 GB is not enough, what size do you recommend? I think we can go up to 32 GB.
3. What settings should I use for the heap allocation: keep the defaults, increase them to 4 GB (which is what the tool suggests), or some other value?
4. Any other tips?

Thanks in advance for your answers,
Cristi.
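To make sense of the warning, the overcommit check can be sanity-checked with a quick calculation based on the 1.3x note it mentions. This is only a sketch: the roles and heap sizes below are illustrative assumptions, not the actual QuickStart VM defaults.

```python
# Rough sketch of the overcommit check described in the warning above:
# Cloudera Manager multiplies each Java max heap by 1.3 to approximate
# JVM overhead. The roles and heap values are illustrative assumptions.
JVM_OVERHEAD = 1.3

TOTAL_RAM_GIB = 23.5       # RAM on the host, from the warning
RESERVED_SYSTEM_GIB = 4.7  # reserved for the system, from the warning

# Hypothetical per-role maximum heap sizes in GiB.
role_heaps_gib = {
    "NameNode": 4.0,
    "SecondaryNameNode": 4.0,
    "DataNode": 1.0,
}

def total_allocation_gib(heaps):
    """Sum of heap * 1.3 across all JVM roles."""
    return sum(h * JVM_OVERHEAD for h in heaps.values())

available = TOTAL_RAM_GIB - RESERVED_SYSTEM_GIB
allocation = total_allocation_gib(role_heaps_gib)
print(f"allocation {allocation:.1f} GiB, available {available:.1f} GiB")
print("overcommitted" if allocation > available else "fits")
```

With only these three roles the allocation stays under the available memory; the real VM runs many more JVM roles, which is how the total climbs to the 21.3 GiB the warning reports.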