03-27-2018 06:41 PM
Hi,

When Sqoop connects to a DB to describe an object, it does so from the host where Sqoop is run. But the actual data extract runs on the cluster - can the worker nodes also connect to the DB? Sometimes firewall rules get in the way here. If you can't fix connectivity from the cluster to the DB, and you're only running a single map task anyway, you could also run the extract in local mode with "-jt local", which runs the whole extract on the Sqoop host.

Regards,
Mark
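P.S. A minimal sketch of the local-mode invocation; the connection string, credentials, table, and paths here are hypothetical placeholders:

  # -jt is a generic Hadoop option, so it goes right after the tool name,
  # before any Sqoop-specific arguments
  sqoop import -jt local \
    --connect jdbc:mysql://dbhost:3306/sales \
    --username loader -P \
    --table orders \
    --num-mappers 1 \
    --target-dir /data/staging/orders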
03-27-2018 06:26 PM
10 hours is a long time for a refresh. Since this is a partitioned table, if you know which partitions are being added then you could use the new "refresh table ... partition ..." syntax to only look at those partitions. Alternatively, recover partitions is faster than refresh for partitioned tables if you're only adding new partitions (vs. updating existing ones): https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_refresh.html

Regards,
Mark
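P.S. A sketch of both statements, with hypothetical table, partition key, and value:

  -- Refresh only the newly loaded partition instead of the whole table
  REFRESH sales.orders PARTITION (load_date='2018-03-27');

  -- Or, if partitions are only ever added (never updated), just pick up
  -- the new partition directories:
  ALTER TABLE sales.orders RECOVER PARTITIONS;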
03-27-2018 06:19 PM
Wish we could have fixed this for you... The only ideas I had involved some serious low-level debugging / tracing while it was stuck in this state. Hope things are better now.
03-27-2018 05:58 PM
Hi,

When I've had to support data rollback in Hive (and on every other data warehouse project concerned with reliability), I've leveraged partitions to support logical rollback, vs. physically moving data around or restoring the metastore. The process flow looked something like:

1. Create a table, partitioned by load date or id.
2. Insert data into that partition.
3. Create a view in an interface layer (another DB) that queries from the base table and filters to that one partition. Query tools / clients only query from this interface layer.
4. Next load, add a new partition.
5. Insert data into the new partition.
6. Update the interface view to point to the new partition.

When rollback is needed, it's easy to update all the views to point to a previous partition that represents an earlier point in time. With the views filtering by a partition key, Hive and Impala (but not Spark SQL) will use partition pruning to only select from the correct partitions.

Another benefit of the above approach is that you have point-in-time snapshots of data over time, which makes it very easy to build a data warehouse with snapshot facts. It can also be made to work with tables that are appended to (vs. replaced or merged into), assuming you aren't updating data in place, e.g., you're using new partitions with each load.

The downside of the above approach is that you can end up with lots of partitions, which can be a problem with Impala (it scales less well than Hive on partition count), but that's mitigated with a good retention policy or table archive implementation.

Regards,
Mark
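P.S. A minimal sketch of the flow in Hive DDL; the database, table, view, and column names are hypothetical:

  -- Steps 1-2: base table partitioned by load id, loaded per partition
  CREATE TABLE base.orders (order_id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (load_id INT);

  INSERT OVERWRITE TABLE base.orders PARTITION (load_id = 1)
  SELECT order_id, amount FROM staging.orders;

  -- Step 3: interface view pinned to that partition; clients query only this
  CREATE VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 1;

  -- Steps 4-6: next load goes to a new partition, then the view is repointed
  INSERT OVERWRITE TABLE base.orders PARTITION (load_id = 2)
  SELECT order_id, amount FROM staging.orders;

  ALTER VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 2;

  -- Rollback is just repointing the view at the earlier partition
  ALTER VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 1;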
05-23-2016 08:44 AM
No - CDH 5.7 includes Hive 1.1.0 plus backported fixes, not Hive 2.0.0, so it doesn't ship HPL/SQL. But HPL/SQL is also available as a separate download and claims to work with any Hive version, per its download page: http://www.hplsql.org/download
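As a sketch, assuming the standalone download's hplsql launcher script is on your path:

  # Run a statement through the standalone HPL/SQL tool against your
  # existing Hive installation
  hplsql -e "PRINT 'Hello from HPL/SQL';"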
05-21-2016 09:31 AM
Hive 2.0.0 includes HPL/SQL, which provides an EXECUTE statement for dynamic SQL: http://www.hplsql.org/execute
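For example, a minimal sketch (the table name is a placeholder):

  -- Build the query string at run time, then execute it
  DECLARE tab STRING DEFAULT 'sales.orders';
  EXECUTE 'SELECT COUNT(*) FROM ' || tab;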