Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3976 | 10-13-2017 09:42 PM |
| | 7474 | 09-14-2017 11:15 AM |
| | 3798 | 09-13-2017 10:35 PM |
| | 6033 | 09-13-2017 10:25 PM |
| | 6601 | 09-13-2017 10:05 PM |
06-15-2017
12:21 PM
Make sure that the location value is the last column in the DataFrame, then add .partitionBy("location") to your DataFrame write statement: empDF.write.partitionBy("location").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/emptab")
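For context, a minimal end-to-end sketch in Scala, assuming a hypothetical staging table emp_staging with id, name, and location columns (only the write statement above comes from the original answer):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Hypothetical source: reorder columns so the partition column ("location") is last.
val empDF = spark.table("emp_staging").select("id", "name", "location")

// Append Parquet files partitioned by location under the emptab warehouse directory.
empDF.write
  .partitionBy("location")
  .format("parquet")
  .mode(SaveMode.Append)
  .save("/user/hive/warehouse/emptab")
```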
06-14-2017
08:58 PM
@csguna No, it isn't a requirement. They just all need to resolve the same way on all hosts. For me it is about keeping things within enterprise standards, and like I said I usually use DNS, which usually contains multiple zones. As a user, I find it cleaner to use the FQDN rather than just the hostname when it comes to the URLs (although an alias can be used as well).
06-13-2017
01:07 PM
I just replied on the other topic. Let's continue the discussion over there.
06-13-2017
01:03 PM
1 Kudo
TL;DR: The results will be identical if used in the same manner, but the runtime and resource requirements will be different.

If I understand the question correctly, you are asking this: if there is a timestamp column that you use to create the partition columns, is there a difference in querying on each? This goes back to partition columns being virtual columns. If you create partition columns based on an actual column and just change the names, then the physical column (the timestamp) remains and the virtual columns (year/month/day) exist in the form of the directory structure in HDFS. When you query on the partition columns it will perform partition pruning; when you query on the physical timestamp column it will not. But in effect the results of the aggregation will be the same. The same applies if you partition by subsets, i.e. year/month/day.
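To make the distinction concrete, here is a minimal Spark sketch; the events_raw table, the ts column, and the paths are hypothetical and only stand in for whatever your data looks like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Derive the partition columns from the physical timestamp column.
val events = spark.table("events_raw")
  .withColumn("y", year(col("ts")))
  .withColumn("m", month(col("ts")))
  .withColumn("d", dayofmonth(col("ts")))

// The y/m/d values become the HDFS directory structure (y=2017/m=6/d=13/...).
events.write
  .partitionBy("y", "m", "d")
  .format("parquet")
  .save("/user/hive/warehouse/events")

val part = spark.read.parquet("/user/hive/warehouse/events")

// Filtering on the virtual partition columns prunes directories...
part.filter(col("y") === 2017 && col("m") === 6).count()

// ...while filtering on the physical timestamp scans every partition,
// even though both queries return the same result.
part.filter(col("ts") >= "2017-06-01" && col("ts") < "2017-07-01").count()
```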
06-13-2017
12:56 PM
In the yarn-site.xml you have the mapreduce.* settings, but in the mapred-site.xml you have the older mapred.* settings. The settings for MRv2 containers need to be in mapred-site.xml, and they should use the MRv2 API, whose settings start with mapreduce.*. The error "GC overhead limit exceeded" indicates that the JVM spent too much time in GC while freeing up too little of the heap. You could increase the container and heap values. You can also try adding the '-XX:-UseGCOverheadLimit' option to the MR container java opts.
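For illustration, a mapred-site.xml sketch with placeholder values (the 2048 MB containers and the heap at roughly 80% of the container are assumptions; size these for your own workload) showing where the MRv2 settings and the GC option go:

```xml
<!-- mapred-site.xml: MRv2 container sizing (placeholder values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m -XX:-UseGCOverheadLimit</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1638m -XX:-UseGCOverheadLimit</value>
</property>
```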
06-13-2017
12:44 PM
Oh, if you are searching within CM for hosts to install the agent on, include the IPs for all hosts, which would be the ones you entered into the hosts file on each host.
06-13-2017
12:43 PM
@mercedes012345 The hosts file is a mapping between IPs and hostnames. Each line corresponds to a single entry. The first column is for the IP address, the second column is for the fully qualified domain name, and the last column is for the hostname (columns can be separated by spaces or tabs). You can have as many entries as needed. I honestly haven't done an install that didn't use DNS first, but unless my memory is failing, all hosts need to be able to resolve each other. So if you are only using hosts files for name resolution, you need to include all hosts in the cluster on every host. @csguna provided some examples. I would flesh them out by including all three columns in /etc/hosts:

192.168.1.1 master.example.com master
192.168.1.2 worker.example.com worker

The hostname in the network file should be in the format HOSTNAME=master.example.com
06-13-2017
12:32 PM
@Borg that link is not working for me.
06-13-2017
12:32 PM
I have been told that in theory you could install CM and get it to manage an existing CDH cluster. I don't know the specifics though. It probably involves manually updating the CM database so CM is aware of which services are on which host. I imagine it will be painful. If possible, install CM, set up a new CDH cluster, and then migrate. If not, back up any HDFS data and metadata, configs, etc. Anything in an external DB, like the Hive metadata, should be fine, and you can put in the DB configs to reconnect to it.
06-13-2017
12:12 PM
You have two exit codes: 143 and 255. I have never seen the latter, but based on the exception and messages I think it is trying to write out some log info due to the failure and failing at that as well. To be more clear: the job failed, and then the RM tried to write something to /tmp/hadoop-yarn but failed to do so. The permissions on that folder do not include the yarn account. The yarn account should be part of the hadoop group (run 'id yarn' to confirm), so you should be able to run 'sudo chown -R hdfs:hadoop /tmp/hadoop-yarn'. I am not positive, but that folder should only be used by processes run by YARN, so it should be safe to just give yarn ownership over it as well.

Now, on to the job failure. There is only the exit code for each of the containers, which is 143. You would need to access the container logs for each to get more specific information. Generally, that exit code indicates an out-of-memory event: either the container ran out of physical memory, exceeded its virtual memory limit, or exhausted the heap itself. Can you provide the following settings for the job?

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
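For reference, a quick sketch of the permission check and fix described above (assuming, as in the reply, that /tmp/hadoop-yarn is the directory the RM failed to write to and that hdfs:hadoop is the intended ownership):

```
# Confirm the yarn account is in the hadoop group
id yarn

# Restore hdfs:hadoop ownership on the directory the RM could not write to
sudo chown -R hdfs:hadoop /tmp/hadoop-yarn
```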