Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3976 | 10-13-2017 09:42 PM |
| | 7474 | 09-14-2017 11:15 AM |
| | 3798 | 09-13-2017 10:35 PM |
| | 6033 | 09-13-2017 10:25 PM |
| | 6601 | 09-13-2017 10:05 PM |
06-15-2017
12:21 PM
Make sure that the location value is the last column in the DataFrame, then add .partitionBy("location") to your DataFrame write statement: empDF.write.partitionBy("location").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
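For context, a minimal end-to-end sketch in Scala, assuming a hypothetical staging table emp_staging with id, name, and location columns (only the write statement above comes from the original answer):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Hypothetical source: reorder columns so the partition column ("location") is last.
val empDF = spark.table("emp_staging").select("id", "name", "location")

// Append Parquet files partitioned by location under the emptab warehouse directory.
empDF.write
  .partitionBy("location")
  .format("parquet")
  .mode(SaveMode.Append)
  .save("/user/hive/warehouse/emptab")
```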
06-14-2017
08:58 PM
@csguna No, it isn't a requirement. They just all need to resolve the same way on all hosts. For me it is about keeping things within enterprise standards, and like I said I usually use DNS, which usually contains multiple zones. As a user, I find it cleaner to use the FQDN rather than just the hostname when it comes to the URLs (although an alias can be used as well).
06-13-2017
01:07 PM
I just replied on the other topic. Let's continue the discussion over there.
06-13-2017
01:03 PM
1 Kudo
TL;DR: The results will be identical if used in the same manner, but the runtime and resource requirements will be different.

If I understand the question correctly, you are asking this: if there is a timestamp column that you use to create the partition columns, is there a difference in querying on each? This goes back to partition columns being virtual columns. If you create partition columns based on an actual column and just change the names, then the physical column (the timestamp) remains and the virtual columns (year/month/day) exist in the form of the directory structure in HDFS. When you query on the partition columns it will perform partition pruning; when you query on the physical timestamp column it will not. But in effect the results of the aggregation will be the same. The same applies if you partition by subsets, i.e. year/month/day.
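To make the distinction concrete, here is a minimal Spark sketch; the events_raw table, the ts column, and the paths are hypothetical and only stand in for whatever your data looks like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Derive the partition columns from the physical timestamp column.
val events = spark.table("events_raw")
  .withColumn("y", year(col("ts")))
  .withColumn("m", month(col("ts")))
  .withColumn("d", dayofmonth(col("ts")))

// The y/m/d values become the HDFS directory structure (y=2017/m=6/d=13/...).
events.write
  .partitionBy("y", "m", "d")
  .format("parquet")
  .save("/user/hive/warehouse/events")

val part = spark.read.parquet("/user/hive/warehouse/events")

// Filtering on the virtual partition columns prunes directories...
part.filter(col("y") === 2017 && col("m") === 6).count()

// ...while filtering on the physical timestamp scans every partition,
// even though both queries return the same result.
part.filter(col("ts") >= "2017-06-01" && col("ts") < "2017-07-01").count()
```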
06-13-2017
12:56 PM
In the yarn-site.xml you have the mapreduce.* settings, but in the mapred-site.xml you have the older mapred.* settings. The settings for MRv2 containers need to be in mapred-site.xml, and they should use the MRv2 API, whose settings start with mapreduce.*. The error "GC overhead limit exceeded" indicates that the JVM spent too much time in GC while freeing up too little of the heap. You could increase the container and heap values. You can also try adding the '-XX:-UseGCOverheadLimit' option to the MR container java opts.
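For illustration, a mapred-site.xml sketch with placeholder values (the 2048 MB containers and the heap at roughly 80% of the container are assumptions; size these for your own workload) showing where the MRv2 settings and the GC option go:

```xml
<!-- mapred-site.xml: MRv2 container sizing (placeholder values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m -XX:-UseGCOverheadLimit</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1638m -XX:-UseGCOverheadLimit</value>
</property>
```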
06-13-2017
12:44 PM
Oh, if you are searching within CM for hosts to install the agent on, include the IPs for all hosts, which would be the ones you entered into the hosts file on each host.
06-13-2017
12:43 PM
@mercedes012345 The hosts file is a mapping between IPs and hostnames. Each line corresponds to a single entry. The first column is for the IP address, the second column is for the fully qualified domain name, and the last column is for the hostname (columns can be separated by spaces or tabs). You can have as many entries as needed. I honestly haven't done an install that didn't use DNS first, but unless my memory is failing, all hosts need to be able to resolve each other. So if you are only using hosts files for name resolution, you need to include all hosts in the cluster on every host. @csguna provided some examples. I would flesh them out by including all three columns in /etc/hosts:

192.168.1.1 master.example.com master
192.168.1.2 worker.example.com worker

The hostname in the network file should be in the format HOSTNAME=master.example.com
06-13-2017
12:32 PM
@Borg that link is not working for me.
06-13-2017
12:32 PM
I have been told that in theory you could install CM and get it to manage an existing CDH cluster. I don't know the specifics though. It probably involves manually updating the CM database so CM is aware of which services are on which host. I imagine it will be painful. If possible, install CM, set up a new CDH cluster, and then migrate. If not, back up any HDFS data and metadata, configs, etc. Anything in an external DB, like the Hive metadata, should be fine, and you can put in the DB configs to reconnect to it.
06-13-2017
12:12 PM
You have two exit codes: 143 and 255. I have never seen the latter, but based on the exception and messages I think it is trying to write out some log info due to the failure and failing at that as well. To be more clear: the job failed, and then the RM tried to write something to /tmp/hadoop-yarn but failed to do so. The permissions on that folder do not include the yarn account. The yarn account should be part of the hadoop group (run 'id yarn' to confirm), so you should be able to run 'sudo chown -R hdfs:hadoop /tmp/hadoop-yarn'. I am not positive, but that folder should only be used by processes run by YARN, so it should be safe to just give yarn ownership over it as well.

Now, on to the job failure. There is only the exit code for each of the containers, which is 143. You would need to access the container logs for each to get more specific information. Generally, that exit code indicates an out-of-memory event: either the container ran out of physical memory, exceeded its virtual memory limit, or exhausted the heap itself. Can you provide the following settings for the job?

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
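For reference, a quick sketch of the permission check and fix described above (assuming, as in the reply, that /tmp/hadoop-yarn is the directory the RM failed to write to and that hdfs:hadoop is the intended ownership):

```
# Confirm the yarn account is in the hadoop group
id yarn

# Restore hdfs:hadoop ownership on the directory the RM could not write to
sudo chown -R hdfs:hadoop /tmp/hadoop-yarn
```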