Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 810 | 10-21-2019 05:16 AM
 | 3768 | 01-29-2018 07:05 AM
 | 2584 | 06-27-2017 06:42 AM
 | 37072 | 05-26-2016 04:05 AM
 | 26746 | 05-17-2016 02:15 PM
09-08-2021
03:19 AM
Hi, I am having the same issue on CDP 7.1.6 with Oozie 5.1.0, but the suggested solution does not seem to work anymore. Setting

<property>
  <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
  <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
</property>

has no effect. Is there anything else I can do? Did the setting change?
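For anyone else debugging the same symptom, a minimal sanity check worth running first (host name is a placeholder; the parcel path is the one from the property above): confirm the Oozie Spark sharelib is actually registered and that SPARK_HOME exists on the NodeManager hosts.

```bash
# Hedged sketch: verify the pieces the property depends on.
# <OOZIE_HOST> is a placeholder; adjust the parcel path to your install.

# 1) Does the Oozie server have a Spark sharelib registered?
oozie admin -oozie http://<OOZIE_HOST>:11000/oozie -shareliblist spark

# 2) Does SPARK_HOME actually exist on the host running the launcher?
ls -d /opt/cloudera/parcels/CDH/lib/spark/
```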
02-15-2021
12:08 PM
Good Day, Effective January 31, 2021, all Cloudera software requires a valid subscription and is only accessible from behind the paywall. This includes all legacy versions of Cloudera's Distribution including Apache Hadoop (CDH), Hortonworks Data Platform (HDP), Data Flow (HDF/CDF), and Cloudera Data Science Workbench (CDSW). Information regarding paywall access will be available in the technical documentation by software type and version: https://www.cloudera.com/downloads/paywall-expansion.html If you have a valid Cloudera subscription, you can obtain your credentials for downloads by following the directions outlined here: https://docs.cloudera.com/cdp-private-cloud-base/latest/installation/topics/cdpdc-cm-download-information.html
10-21-2019
05:16 AM
1 Kudo
You can query the API exposed by Cloudera Manager and simplify your life. For example, you can run the following:

curl -u <CM_USER>:<CM_PASSWD> http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2

You'll get a JSON answer in reply to your query, with all the details related to the desired service's status. You can then parse the JSON (e.g. using "jq" or directly inside your bash script) and take the desired actions. HTH
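For completeness, a minimal sketch of what that parsing could look like. The placeholders are the same as above; the "serviceState" field name comes from the CM API's service object, and the commented-out restart endpoint is an assumption about how you might act on the result.

```bash
#!/usr/bin/env bash
# Hedged sketch: act on the service state reported by Cloudera Manager.
# Fill in the <...> placeholders before running.

STATE=$(curl -s -u <CM_USER>:<CM_PASSWD> \
  "http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2" \
  | jq -r '.serviceState')

if [ "$STATE" != "STARTED" ]; then
  echo "hive2 is $STATE -- taking action"
  # e.g. trigger a restart through the same API (endpoint assumed):
  # curl -s -u <CM_USER>:<CM_PASSWD> -X POST \
  #   "http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2/commands/restart"
fi
```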
01-29-2018
07:26 AM
1) Apparently, yes.
2) The name of the user you're trying to use to log in to the remote system, I suppose. Please note that the user you specify here is the user Oozie will run the action as, so you could eventually run into other, hard-to-predict problems when using Oozie.
3) I don't really know, sorry about that... The fact is that even if I'm pretty sure I've understood the cause of your issue, I've never had to deal with it directly myself. Maybe the easiest way would be to follow the additional suggestions I wrote in my previous answer (give the OS user "yarn" permission to "ssh" and/or "su"). Another possibility would be to create a "yarn" user on the remote system and grant it the permissions needed to reach the final working directory (see the sketch below). I hope you'll manage to get through the problems and make it 🙂
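A rough sketch of that last option, run on the remote host. Everything here is hypothetical: the working directory /data/jobs is invented, and whether you prefer ownership, group membership, or ACLs depends on your setup.

```bash
# Hedged sketch: create a "yarn" user on the remote host and let it reach
# the working directory. /data/jobs is a hypothetical path.
sudo useradd -m yarn
sudo chown -R yarn:yarn /data/jobs
# ...or keep existing ownership and grant access via an ACL instead:
# sudo setfacl -R -m u:yarn:rwx /data/jobs

# Quick permission check as the new user:
sudo -u yarn ls /data/jobs
```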
07-22-2017
08:45 AM
Thanks mbigelow, following your suggestions I solved the massive error-logging issue.

I ran the specific log file referenced in the Java stack trace through a JSON validator: /user/spark/applicationHistory/application_1494352758818_0117_1. The format was correct according to the validator, so I just moved the file away into a temporary directory. As soon as I did, the error messages stopped clogging the system logs. So it was probably corrupted in some very subtle way... but it was definitely corrupted.

That JSON file had indeed been generated by the Spark Action that is giving me problems, but it was an OLD file. New instances of that Spark Action are generating new JSON logs, and they are not giving the History Server any trouble (the flood of logged exceptions has stopped, as I said).

Unfortunately, the Spark job itself is still failing and needs further investigation on my side, so the failure is apparently not related to that specific error message. But I've solved an annoying problem, and at the same time ruled out the possibility that the Spark Action issue is related to that Java exception. Thanks!
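For anyone hitting the same flood of exceptions, this is roughly the check-and-quarantine procedure I followed (the quarantine directory is arbitrary). Note that Spark event logs are newline-delimited JSON, so a stream-capable parser like jq is a better fit than a single-document validator, which may explain why my validator said "correct".

```bash
# Hedged sketch: validate a suspect Spark event log, then move it aside.
EVLOG=/user/spark/applicationHistory/application_1494352758818_0117_1

# jq parses a stream of JSON values, matching the event-log format;
# a non-zero exit hints at corruption (subtle corruption may still pass).
hdfs dfs -cat "$EVLOG" | jq -e . > /dev/null && echo "parses OK" || echo "corrupt"

# Quarantine it either way and watch whether the History Server calms down.
hdfs dfs -mkdir -p /tmp/quarantined-evlogs
hdfs dfs -mv "$EVLOG" /tmp/quarantined-evlogs/
```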
06-27-2017
06:42 AM
In the end I've been able to solve the issue. I was tricked by the fact that re-applying the "YARN Resources Allocation Tuning Guide" from scratch proposes a (in my opinion) misleading way of calculating a few important parameters. The guide can be found here: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html

As a matter of fact, the guide contains a downloadable XLS file which is a tool for calculating optimal parameters. This XLS automatically calculates and proposes a few values to be assigned to the YARN configuration: at step 4 it proposed "2" for "yarn.nodemanager.resource.cpu-vcores" and "5632" for "yarn.nodemanager.resource.memory-mb". I later found out that the correct values to assign to those configurations are the two values proposed at step 5.

Definitely partly my fault (I do not have deep knowledge of YARN configuration), but partly misleading documentation indeed. I am now fine-tuning, trying different settings for the various Java heap sizes etc. (see the sanity check below).

Still, I have no idea why everything was working fine until recently and stopped working after upgrading to 5.11, as I did not change any configuration while upgrading and the physical resources are identical.
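While iterating on these values, one quick way to confirm what a running NodeManager has actually picked up is its /conf endpoint (8042 is the default NodeManager web UI port; the host is a placeholder, and the grep may need adjusting to the XML formatting your version emits):

```bash
# Hedged sketch: read the effective values straight from a NodeManager.
curl -s "http://<NM_HOST>:8042/conf" \
  | grep -A1 -E "yarn.nodemanager.resource.(cpu-vcores|memory-mb)"
```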
01-17-2017
05:20 PM
1 Kudo
A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)
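To illustrate the rule with a concrete (invented) schema: the inner record below is defined once, at its first use, and only referenced by fullname afterwards. Repeating the full definition, or swapping the two fields so the reference comes first, violates the rule quoted above. The avro-tools jar name/version is an assumption.

```bash
# Hedged sketch: "com.example.Point" is defined in field "left" and merely
# referenced in field "right" -- exactly the pattern the spec requires.
cat > pair.avsc <<'EOF'
{
  "type": "record", "name": "Pair", "namespace": "com.example",
  "fields": [
    {"name": "left", "type": {"type": "record", "name": "Point", "fields": [
      {"name": "x", "type": "int"}, {"name": "y", "type": "int"}]}},
    {"name": "right", "type": "com.example.Point"}
  ]
}
EOF
# Compile to check validity (avro-tools path/version assumed):
java -jar avro-tools-1.11.1.jar compile schema pair.avsc /tmp/generated
```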
10-17-2016
02:18 PM
Ah, my error was not using the hdfs:// prefix for the .py. Thanks!
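In case the fix is unclear to later readers, this is the shape of it as I understand it (script name, user, and paths are made up): the application file has to be addressed with an HDFS URI rather than a local path.

```bash
# Hedged sketch: put the script in HDFS and reference it by hdfs:// URI.
hdfs dfs -mkdir -p /user/<USER>/apps
hdfs dfs -put -f my_job.py /user/<USER>/apps/
spark-submit --master yarn --deploy-mode cluster \
  hdfs:///user/<USER>/apps/my_job.py
```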
08-15-2016
05:15 AM
Thank you. Useful insight and crystal-clear argumentation, as usual from you.

I have to say that in the meanwhile I had the chance to study a bit more, and in the end I came to a conclusion which matches your considerations, so I'm glad that apparently I moved in the right direction. As a matter of fact, I've looked at the Open Source project http://opentsdb.net , and generally speaking the approach it uses is the last one you explained.

To provide a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, column versioning is NOT used at all (as you suggested)
- A "speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This sort of key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should also be good from a "table splitting" perspective (when the table grows, it will split up evenly, and all the splits will take care of a share of the future inserts, balancing the load across machines)
- The same set of columns for each record, containing the newly sampled value of each field for that week, e.g. "<Total progressive time used Service X>"

This is the approach I used in the end (sketched below), which has nothing to do with my original idea of using versions but perfectly matches the last approach you described in your answer.

Regarding the fixed values (e.g. "Name", "Surname"), I've decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I plan on modifying this structure soon, moving the fixed values into another table (Hive or HBase, I don't know yet) and picking up the information I'd eventually need on the fly (for instance, during data processing, I'll join the relevant customer master data into the relevant DataFrames).

I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
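A minimal HBase-shell rendering of that key design, for the record (table, column family, and column names are invented for illustration):

```bash
# Hedged sketch: one row per customer per week, key = <CUST_ID>-<YYYY-MM-DD>,
# a single version per cell since history lives in the row key, not in versions.
hbase shell <<'EOF'
create 'cust_weekly', {NAME => 'm', VERSIONS => 1}
put 'cust_weekly', 'C000123-2016-08-15', 'm:svc_x_total_secs', '86400'
put 'cust_weekly', 'C000123-2016-08-22', 'm:svc_x_total_secs', '172800'
scan 'cust_weekly', {ROWPREFIXFILTER => 'C000123'}
EOF
```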