03-27-2018 06:41 PM
Hi,

When Sqoop connects to a DB to describe an object, it does so from the host where Sqoop is run. But the actual data extract runs on the cluster - can the worker nodes also connect to the DB? Sometimes firewall rules get in the way here. If you can't fix connectivity from the cluster to the DB, and you're only running a single map task anyway, you could also run the extract in local mode with "-jt local", which runs the whole extract on the Sqoop host.

Regards,
Mark
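P.S. A minimal sketch of the local-mode invocation; the connection string, credentials, table, and paths here are hypothetical placeholders:

  # -jt is a generic Hadoop option, so it goes right after the tool name,
  # before any Sqoop-specific arguments
  sqoop import -jt local \
    --connect jdbc:mysql://dbhost:3306/sales \
    --username loader -P \
    --table orders \
    --num-mappers 1 \
    --target-dir /data/staging/orders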
03-27-2018 06:26 PM
10 hours is a long time for a refresh. Since this is a partitioned table, if you know which partitions are being added then you could use the new "refresh table ... partition ..." syntax to only look at those partitions. Alternatively, recover partitions is faster than refresh for partitioned tables if you're only adding new partitions (vs. updating existing ones): https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_refresh.html

Regards,
Mark
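P.S. A sketch of both statements, with hypothetical table, partition key, and value:

  -- Refresh only the newly loaded partition instead of the whole table
  REFRESH sales.orders PARTITION (load_date='2018-03-27');

  -- Or, if partitions are only ever added (never updated), just pick up
  -- the new partition directories:
  ALTER TABLE sales.orders RECOVER PARTITIONS;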
03-27-2018 06:19 PM
Wish we could have fixed this for you... The only ideas I had involved some serious low-level debugging / tracing while it was stuck in this state. Hope things are better now.
03-27-2018 05:58 PM
Hi,

When I've had to support data rollback in Hive (and on every other data warehouse project concerned with reliability), I've leveraged partitions to support logical rollback, vs. physically moving data around or restoring the metastore. The process flow looked something like:

1. Create a table, partitioned by load date or id.
2. Insert data into that partition.
3. Create a view in an interface layer (another DB) that queries from the base table and filters to that one partition. Query tools / clients only query from this interface layer.
4. Next load, add a new partition.
5. Insert data into the new partition.
6. Update the interface view to point to the new partition.

When rollback is needed, it's easy to update all the views to point to a previous partition that represents an earlier point in time. With the views filtering by a partition key, Hive and Impala (but not Spark SQL) will use partition pruning to only select from the correct partitions.

Another benefit of the above approach is that you have point-in-time snapshots of data over time, which makes it very easy to build a data warehouse with snapshot facts. It can also be made to work with tables that are appended to (vs. replaced or merged into), assuming you aren't updating data in place, e.g., you're using new partitions with each load.

The downside of the above approach is that you can end up with lots of partitions, which can be a problem with Impala (it scales less well than Hive on partition count), but that's mitigated with a good retention policy or table archive implementation.

Regards,
Mark
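P.S. A minimal sketch of the flow in Hive DDL; the database, table, view, and column names are hypothetical:

  -- Steps 1-2: base table partitioned by load id, loaded per partition
  CREATE TABLE base.orders (order_id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (load_id INT);

  INSERT OVERWRITE TABLE base.orders PARTITION (load_id = 1)
  SELECT order_id, amount FROM staging.orders;

  -- Step 3: interface view pinned to that partition; clients query only this
  CREATE VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 1;

  -- Steps 4-6: next load goes to a new partition, then the view is repointed
  INSERT OVERWRITE TABLE base.orders PARTITION (load_id = 2)
  SELECT order_id, amount FROM staging.orders;

  ALTER VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 2;

  -- Rollback is just repointing the view at the earlier partition
  ALTER VIEW interface.orders AS
  SELECT order_id, amount FROM base.orders WHERE load_id = 1;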
05-23-2016 08:44 AM
No - CDH 5.7 includes Hive 1.1.0 plus backported fixes, not Hive 2.0.0, so it doesn't ship HPL/SQL. But HPL/SQL is also available as a separate download and claims to work with any Hive version, per its download page: http://www.hplsql.org/download
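As a sketch, assuming the standalone download's hplsql launcher script is on your path:

  # Run a statement through the standalone HPL/SQL tool against your
  # existing Hive installation
  hplsql -e "PRINT 'Hello from HPL/SQL';"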
05-21-2016 09:31 AM
Hive 2.0.0 includes HPL/SQL, which provides an EXECUTE statement for dynamic SQL: http://www.hplsql.org/execute
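For example, a minimal sketch (the table name is a placeholder):

  -- Build the query string at run time, then execute it
  DECLARE tab STRING DEFAULT 'sales.orders';
  EXECUTE 'SELECT COUNT(*) FROM ' || tab;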