About mbigelow

mbigelow · 01-19-2017

Ah, that will do it, as all new tables inherit the DB path unless one is specified in the CREATE TABLE statement. There is no way to alter it through Hive/Impala. You will need to log into the metastore DB and change it there. You can find it in <metastore_db_name>.DBS, and I believe the column is called DB_LOCATION_URI. Find the DB_ID for the default DB and run something like: update DBS set DB_LOCATION_URI = 'hdfs://NN_URI:8020/user/hive/warehouse' where DB_ID = <default_db_id>;
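A minimal sketch of that metastore change, with an in-memory SQLite table standing in for the real metastore database (which is usually MySQL or PostgreSQL). As far as I know the stock schema names the relevant DBS columns DB_ID, NAME, and DB_LOCATION_URI, so the sketch uses those; the helper names relocate_database and make_demo_metastore are made up for illustration. Check the column names against your own metastore and back it up before running any update.

```python
import sqlite3


def make_demo_metastore():
    """Build a tiny stand-in for the metastore DBS table (assumed column names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT, DB_LOCATION_URI TEXT)"
    )
    conn.execute("INSERT INTO DBS VALUES (1, 'default', 'file:/user/hive/warehouse')")
    conn.commit()
    return conn


def relocate_database(conn, db_name, new_location):
    """Point one database entry at a new location and return the old location."""
    cur = conn.cursor()
    # Look up the id first so the update touches exactly one row.
    cur.execute("SELECT DB_ID, DB_LOCATION_URI FROM DBS WHERE NAME = ?", (db_name,))
    row = cur.fetchone()
    if row is None:
        raise LookupError(f"no database named {db_name!r} in DBS")
    db_id, old_location = row
    cur.execute(
        "UPDATE DBS SET DB_LOCATION_URI = ? WHERE DB_ID = ?", (new_location, db_id)
    )
    conn.commit()
    return old_location


if __name__ == "__main__":
    demo = make_demo_metastore()
    previous = relocate_database(
        demo, "default", "hdfs://NN_URI:8020/user/hive/warehouse"
    )
    print("old location:", previous)
```

SQLite uses ? placeholders; MySQL and PostgreSQL drivers typically use %s instead. Only tables created afterwards pick up the new path, since existing tables keep the location they were created with.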

mbigelow · 01-19-2017

Does the user 'administrator' exist on the HS2 node, and preferably on the rest of the nodes? Does the user have an HDFS user directory, /user/administrator, with full access to it? These are what users need in order to access the cluster and run jobs, regardless of the means of authentication.
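Those two checks can be scripted. A rough sketch, assuming it runs on the node being checked, that the hdfs client is on the PATH, and that home directories follow the /user/<name> convention; the function names are made up for illustration.

```python
import pwd
import subprocess


def local_user_exists(username):
    """True if the OS on this node knows the account (passwd, LDAP via NSS, ...)."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


def hdfs_home_check_command(username):
    """Command that exits 0 when /user/<username> exists as a directory in HDFS."""
    return ["hdfs", "dfs", "-test", "-d", f"/user/{username}"]


def hdfs_home_exists(username):
    """Run the HDFS check; needs a working hdfs client on this node."""
    return subprocess.run(hdfs_home_check_command(username)).returncode == 0


if __name__ == "__main__":
    user = "administrator"
    print("local account:", local_user_exists(user))
    print("HDFS check   :", " ".join(hdfs_home_check_command(user)))
```

This only confirms that the account and the directory exist. Ownership and permissions on the directory still need a look, for example with hdfs dfs -ls /user.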

mbigelow · 01-19-2017

This is going to be rough. You could manually copy the data from the CM server over to each node. You could also deploy a new cluster to those same nodes. I have a feeling that either way the old configs will no longer be present. Before doing anything, I would try to take a backup of the cluster configuration using the CM API. Then you can try to restore the configs from that backup if you end up with a new cluster with default configs. https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cm_intro_api.html
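A sketch of taking that backup from a script, using the CM API's /cm/deployment resource, which exports the full deployment description as JSON. The API version (v13 here), the default port 7180, plain HTTP, and the host name and credentials are all assumptions; /api/version on your CM server should report the version it supports.

```python
import base64
import json
import urllib.request


def deployment_url(cm_host, api_version="v13", port=7180):
    """URL of the CM API resource that exports the whole deployment."""
    return f"http://{cm_host}:{port}/api/{api_version}/cm/deployment"


def export_deployment(cm_host, user, password, out_path, api_version="v13"):
    """Fetch the deployment JSON with HTTP basic auth and save it to out_path."""
    request = urllib.request.Request(deployment_url(cm_host, api_version))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        deployment = json.load(response)
    with open(out_path, "w") as handle:
        json.dump(deployment, handle, indent=2)
    return deployment


if __name__ == "__main__":
    # Hypothetical host; the export itself needs a reachable CM server.
    print(deployment_url("cm.example.com"))
```

Keep the exported file somewhere off the cluster, since it is the only copy of the configuration if the nodes get rebuilt.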

mbigelow · 01-19-2017

This may be a silly question, but does the test table exist prior to running the CTAS statement?

mbigelow · 01-19-2017

It sounds like Hive impersonation is not turned on. Can you verify? Do you have this same issue from Beeline or other JDBC connections? The setting to check is hive.server2.enable.doAs=true. See https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Impersonation

mbigelow · 01-17-2017

I was wondering if stats were needed to have describe extended output the actual file size. I recall something like that.

mbigelow · 01-17-2017

On the setting changes: stats, as stated, will help with counts, as that info is precalculated and stored in the metadata. The CBO and stats also help a lot with joins. It is possible that the OS cache has more to do with the improvement if this was a subsequent run with little other activity. You could look at Hive on Spark for more consistent performance: set hive.execution.engine=spark; On the times, the big factor between job submission and start is the scheduler. That is a deep topic. It is best to read up on the schedulers, review your settings, and ask any specific questions that come up, preferably in a new topic. The other factor, not captured in the job stats, is the time it takes to return the results to the client. This will vary depending on the client and there isn't much to do about it. In general, small result sets can be handled by the Hive CLI. You can increase the client heap if needed. Otherwise use HS2 connections like Beeline or Hue.
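For reference, a small sketch that spells out the kind of statements involved, generated as strings so they can be pasted into Beeline. The post does not name the stats settings, so hive.compute.query.using.stats, hive.cbo.enable, and hive.stats.fetch.column.stats are my assumption of what is meant; the table name and the helper name are hypothetical.

```python
def tuning_statements(table, engine="spark"):
    """HiveQL statements that gather stats and switch on stats-based planning."""
    return [
        # Table-level stats (row counts, sizes) stored in the metastore.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        # Column-level stats, which the cost-based optimizer uses for joins.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS",
        # Let simple counts be answered from stored stats.
        "SET hive.compute.query.using.stats=true",
        "SET hive.cbo.enable=true",
        "SET hive.stats.fetch.column.stats=true",
        # Execution engine for the session, as suggested in the post.
        f"SET hive.execution.engine={engine}",
    ]


if __name__ == "__main__":
    for statement in tuning_statements("sales"):
        print(statement + ";")
```

Stats go stale as data is loaded, so the ANALYZE statements need rerunning after significant changes to the table.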

mbigelow · 01-16-2017

Yeah, that is expected behavior. Each batch writes to the staging directory and, when it is done, the data is moved to the actual table/partition directory. I have experienced these same staging directories being left behind. In general, if the data is successfully moved, then there will be no data left in them. I ended up having a separate process that would check for entries since the last run (regular Spark jobs, not streaming), check whether each directory was empty, remove it if so, and repeat. I also used these directory checks to see if something had gone wrong in a job, as the data would remain.
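A rough sketch of that kind of cleanup pass. It runs against a local directory tree as a stand-in for HDFS, and the ".hive-staging" name prefix is an assumption based on the default staging directory name; against a real cluster the same logic would go through an HDFS client or hdfs dfs commands. The function name is made up for illustration.

```python
from pathlib import Path


def clean_staging_dirs(root, since_epoch, prefix=".hive-staging"):
    """Remove empty staging dirs modified since the last run; report non-empty ones.

    Returns (removed, leftover). Leftover directories still hold data, which
    suggests the job that wrote them did not finish moving its output.
    """
    removed, leftover = [], []
    for path in sorted(Path(root).iterdir()):
        if not path.is_dir() or not path.name.startswith(prefix):
            continue  # skip real table/partition directories and plain files
        if path.stat().st_mtime < since_epoch:
            continue  # already handled by an earlier run
        if any(path.iterdir()):
            leftover.append(path)
        else:
            path.rmdir()
            removed.append(path)
    return removed, leftover
```

Each run would record its start time and pass it as since_epoch on the next run. Anything returned in leftover is worth investigating before it is deleted.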

mbigelow · 01-16-2017

The configuration file it is set in is hive-site.xml. CM provides Advanced Configuration Snippets where this can be added. The trick is getting it into the right one. I don't know for sure, as I haven't tested it. The settings would apply to the specific jobs being launched through Hive. I would think that at a minimum you need it on the Gateway and also on HiveServer2. You could play it safe: filter by Advanced, search for hive-site.xml in CM, and add it to all of the Advanced Configuration Snippets that come up.
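The snippet fields take hive-site.xml property elements. A small sketch that renders one, using hive.exec.parallel (the property named in the thread title) with a value of true as the example; the helper name is made up.

```python
import xml.etree.ElementTree as ET


def hive_site_property(name, value):
    """Render one hive-site.xml <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(hive_site_property("hive.exec.parallel", "true"))
```

The printed element is what would be pasted into each hive-site.xml snippet field, followed by a redeploy of the client configuration and a restart of the affected roles.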

mbigelow · 01-16-2017

Disclaimer: I haven't done this at all. Did you change the service.sdl so that HDFS wasn't a requirement, or so that Isilon is one? I don't think that matters for what you are experiencing, as the dependencies and everything else in service.sdl come into play when you go through the Add a Service wizard in CM. The parcel.json should define what comes in the parcel. I would think that being listed in the packages/components is what is needed for it to show up as a service that can be added. Did it create the nifi user? Did it unpack the parcel on all nodes? Those are a couple of items that may point to whether the parcel deployed correctly. The users entry in parcel.json:

    "users": {
      "nifi": {
        "longname": "NiFi",
        "home": "/var/lib/nifi",
        "shell": "/bin/bash",
        "extra_groups": []
      }
    }
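A quick way to see what a parcel.json declares is to load it and list the sections mentioned above. This is only a sketch: it assumes components and packages are lists of objects with a name field and that users is an object keyed by user name, and the function name is made up.

```python
import json


def parcel_summary(parcel_json_text):
    """List the component, package, and user names declared in a parcel.json."""
    parcel = json.loads(parcel_json_text)
    return {
        "components": [c.get("name") for c in parcel.get("components", [])],
        "packages": [p.get("name") for p in parcel.get("packages", [])],
        "users": sorted(parcel.get("users", {})),
    }


if __name__ == "__main__":
    sample = json.dumps({
        "components": [{"name": "nifi"}],
        "users": {
            "nifi": {
                "longname": "NiFi",
                "home": "/var/lib/nifi",
                "shell": "/bin/bash",
                "extra_groups": [],
            }
        },
    })
    print(parcel_summary(sample))
```

If the users list comes back empty, that would line up with the nifi account not being created on the nodes.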

Member Since	08-16-2016 08:51 PM
Last Visited	03-25-2019 05:55 PM
Posts	642
Kudos received	129

Cloudera Community

Re: Configuring the HDFS superuser in Kerberos

Re: Hive process crash

Re: Upgrade from CDH 5.11 Express to Enterprise

Re: Adding user to Cloudera Manager using REST AP...

Re: Running in non-interactive mode, and data appe...

Re: CREATE TABLE AS SELECT returns error 'Failed t...

Re: Failed to validate proxy privilege of hue_hive...

Re: Extrated CDH5.4.8 folders deleted from /opt/cl...

Re: CREATE TABLE AS SELECT returns error 'Failed t...

Re: Failed to validate proxy privilege of hue_hive...

Re: Can we check size of Hive tables? If so - how?

Re: Hive Queries run slowly

Re: CDH 5.4.7 spark-streaming(1.3.0) kafka messag...

Re: Hive hive.exec.parallel property

Re: Installing Nifi with cloudera