Member since: 01-05-2016
Posts: 58
Kudos Received: 40
Solutions: 6
My Accepted Solutions
Views | Posted |
---|---|
1117 | 10-21-2019 05:16 AM |
4341 | 01-29-2018 07:05 AM |
3158 | 06-27-2017 06:42 AM |
38932 | 05-26-2016 04:05 AM |
30348 | 05-17-2016 02:15 PM |
07-19-2017
07:05 AM
Additional info: when I run the Spark CLI (where my Spark procedures work, unlike when they are launched from Oozie), as soon as I try to define a DataFrame I get the following warnings, which I had never seen before the upgrade:

In [17]: utenti_DF = sqlContext.table("xxxx.yyyy")
17/07/19 15:48:58 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
17/07/19 15:48:58 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

Anyway, as I said, things work from the CLI. I just thought this could be relevant.
07-19-2017
06:47 AM
Hi all,

after recently upgrading to CDH 5.11 I get a large number of "Unexpected end-of-input" log entries related to Spark (running on YARN) and classified as errors. I'm also seeing failed Oozie jobs, which I believe are related to these errors, so I'd really like to fix the underlying issue and see whether the situation improves.

In the logs, the "source" is:
FsHistoryProvider

And the "message" is:
Exception encountered when attempting to load application log hdfs://xxxxx.xxxxx.zz:8020/user/spark/applicationHistory/application_1494352758818_0117_1
com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
at [Source: java.io.StringReader@1fec7fc4; line: 1, column: 3655]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1369)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:532)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._finishString2(ReaderBasedJsonParser.java:1517)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._finishString(ReaderBasedJsonParser.java:1505)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.getText(ReaderBasedJsonParser.java:205)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:20)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:28)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2888)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2034)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:583)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$16.apply(FsHistoryProvider.scala:410)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$16.apply(FsHistoryProvider.scala:407)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:407)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:309)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Any suggestions or ideas? Thanks!
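Spark event logs store one JSON object per line, and the exception above reports a string value that ends before its closing quote, which is what a cut-off line looks like. A minimal sketch for locating such lines follows; it assumes the event log has first been copied to the local filesystem (for example with `hdfs dfs -get`), and the function name and sample events are hypothetical, made up for illustration only.

```python
import json
from typing import Iterable, List, Tuple


def find_bad_event_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines carry no event, skip them
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            bad.append((number, err.msg))
    return bad


# Hypothetical sample: the second event is cut off in the middle of a string value.
sample = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo"}',
    '{"Event":"SparkListenerJobStart","Job ID":0,"Description":"trunc',
]
bad_lines = find_bad_event_lines(sample)
print(bad_lines)
```

On a real file the same function can be fed an open file handle, e.g. `with open(local_copy) as fh: find_bad_event_lines(fh)`, where `local_copy` is whatever path the log was downloaded to. If only the last line is reported, the log was most likely cut off while it was being written.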
Labels:
- Apache Spark
- Apache YARN
06-27-2017
06:42 AM
In the end I was able to solve the issue. I was tricked by the fact that, when I applied the "YARN Resources Allocation Tuning Guide" again from scratch, it proposed a (in my opinion) misleading way of calculating a few important parameters.

The guide can be found here: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html

As a matter of fact, the guide contains a downloadable XLS file, which is a tool for calculating optimal parameters. This XLS automatically calculates and proposes a few values to be assigned to the YARN configuration. At step 4 it proposed "2" for "yarn.nodemanager.resource.cpu-vcores" and "5632" for "yarn.nodemanager.resource.memory-mb". I later found out that the correct values for those settings are the two values proposed at "step 5".

This is definitely partly my fault (I do not have deep knowledge of YARN configuration), but the documentation is partly misleading too.

I am now fine-tuning, trying different settings for the various Java heap sizes and so on.

I still have no idea why everything was working fine until recently and stopped working after upgrading to 5.11, as I did not change any configuration while upgrading and the physical resources are identical.
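To get a feeling for how small the "step 4" values are, here is a rough back-of-the-envelope sketch. Only the two NodeManager values come from the post; the container size, the number of NodeManagers and the assumption that the scheduler enforces both memory and vcores are mine, chosen purely for illustration. It is not a diagnosis of the cluster described above.

```python
def containers_per_node(node_memory_mb: int, node_vcores: int,
                        container_memory_mb: int, container_vcores: int) -> int:
    """Containers of one size that fit on a node when memory and vcores are both enforced."""
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Values quoted in the post (the "step 4" proposal).
NODE_MEMORY_MB = 5632  # yarn.nodemanager.resource.memory-mb
NODE_VCORES = 2        # yarn.nodemanager.resource.cpu-vcores

# Assumptions for illustration only, not taken from the post.
CONTAINER_MEMORY_MB = 1024
CONTAINER_VCORES = 1
NODEMANAGER_COUNT = 5  # assumes every node of the 5-node cluster runs a NodeManager

per_node = containers_per_node(NODE_MEMORY_MB, NODE_VCORES,
                               CONTAINER_MEMORY_MB, CONTAINER_VCORES)
cluster_slots = per_node * NODEMANAGER_COUNT
print(f"containers per node: {per_node}, cluster-wide: {cluster_slots}")
```

Under these assumptions the vcore setting, not the memory setting, is the limiting factor. An application whose first container cannot be placed stays in "ACCEPTED" until resources free up, so a low ceiling like this is worth checking when many jobs are submitted in parallel.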
06-26-2017
12:37 PM
As I believe the problem is due to differences between CDH 5.7 and CDH 5.11 in how YARN allocates resources to containers, I've tried to follow the YARN Tuning Guide again from scratch. The latest available version of the YARN Tuning Guide is apparently the one for CDH 5.10: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html

On that page, an XLS sheet is available to help plan the various parameters in a correct and working fashion. No luck. I always end up with jobs stuck in "ACCEPTED" state that never start running.

I also found this interesting page suggesting how to configure Dynamic Resource Pools for YARN: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cm_mc_resource_pools.html#concept_xkk_l1d_wr__section_c3f_vwf_4n

I tried to limit the "number of concurrent jobs" to just 2 in the relevant configuration page of the Dynamic Resource Pools, but again without success.

Can anybody please point me to any new feature related to YARN resource allocation that was introduced in CDH 5.11 (and that I have not mentioned here)? My Workflows were running smoothly before the upgrade, and now I'm facing serious trouble.

Workarounds are welcome too, as well as methods for monitoring or tracing resource usage that would help me understand which parameters I have set up in a way that no longer works in CDH 5.11.

Thanks a lot for any hints or insights!
06-21-2017
11:31 AM
Hello,

after successfully upgrading a small (5 nodes) CDH 5.7 cluster to CDH 5.11, I am experiencing various problems with existing Oozie Workflows that used to work correctly.

The most significant example: I have a Workflow that schedules 8 jobs in parallel (a mix of Hive, Shell and Sqoop actions). The 8 jobs are acquired and start running, but the 8 sub-jobs performing the actual actions are stuck in "ACCEPTED" status and never switch to "RUNNING".

After hours of work, I have not been able to find anything significant in the logs, apart from a few complaints about log4j. So I decided to upgrade the JDK from 1.7 to 1.8 too, but the situation did not improve.

Any help or suggestion pointing me in the right direction would be very much appreciated!

Thanks
01-03-2017
12:38 PM
Hi AdrianMonter, sorry to say I haven't found a specific solution for the Avro file format in the meantime. I've been sticking to the Parquet file format since I had this problem, and for now it covers all my needs... Maybe this has been fixed in the latest CDH/Spark releases? Maybe somebody from @Former Member can tell us something more?
10-17-2016
01:40 PM
2 Kudos
Hi aj, yes, I did manage to solve it. Please take a look at the following thread and see if it helps. It may seem a bit unrelated to the "test.py not found" issue, but it contains detailed info about how to specify all the parameters needed to make the whole thing run smoothly: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-workflow-Spark-action-using-simple-Dataframe-quot-Table/m-p/40834 HTH
08-15-2016
05:15 AM
Thank you. Useful insight and crystal-clear argumentation, as usual from you.

I have to say that in the meantime I had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad that I apparently moved in the right direction. As a matter of fact, I've looked at this open source project, http://opentsdb.net, and generally speaking the approach they use is the last one you explained.

To provide a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, column Versioning is NOT used at all! (like you suggested)
- "Speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This sort of key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should also be good from a "Table Splitting" perspective (when the Table grows, it will split up "evenly" and all the Splits will take care of a part of the future inserts, balancing the load across the machines evenly)
- Same set of columns for each record, containing the new sampled value for that field for that week, e.g. "<Total progressive time used Service X>"

This is the approach I used in the end. It has nothing to do with my original idea of using Versions, but it perfectly matches the last approach you described in your answer.

Regarding the Fixed Values (e.g. "Name", "Surname"), I've decided to replicate them every week too, as if they were Time Series themselves... I know, a waste of storage. I'm planning to modify this structure soon and move the Fixed Values into another Table (Hive or HBase, I don't know yet) and pick up the information when I need it (for instance, during Data Processing, I'll bring the relevant Anagraphic Data into the relevant Dataframes via a Join).

I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂

Thanks again!
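The key layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "_" separator, the fixed-width customer IDs and the helper names are hypothetical choices of mine, since the post only gives the "<CUST_ID> + <YYYY-MM-DD>" pattern. Sorting the encoded keys mimics the byte-wise lexicographic order in which HBase stores rows.

```python
from datetime import date, timedelta
from typing import List

SEPARATOR = "_"  # assumption: the post does not say how the two parts are joined


def make_row_key(cust_id: str, week_day: date) -> bytes:
    """Build a '<CUST_ID><SEPARATOR><YYYY-MM-DD>' row key as bytes."""
    return f"{cust_id}{SEPARATOR}{week_day.isoformat()}".encode("utf-8")


def weekly_keys(cust_id: str, first_day: date, weeks: int) -> List[bytes]:
    """One key per weekly sample for a single customer."""
    return [make_row_key(cust_id, first_day + timedelta(weeks=i)) for i in range(weeks)]


# Hypothetical customers with fixed-width IDs, inserted week by week.
keys = weekly_keys("C0002", date(2016, 8, 1), 3) + weekly_keys("C0001", date(2016, 8, 1), 3)

# Byte-wise sort, the same ordering HBase applies to row keys.
sorted_keys = sorted(keys)
for key in sorted_keys:
    print(key.decode("utf-8"))
```

With this layout all weekly samples of one customer sit next to each other in chronological order, so the time series of a customer can be read as one contiguous range, while each week's new rows are spread across the key space instead of all landing at its end.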
07-31-2016
08:06 AM
Hi all,

I have the following design question for my new table in HBase.

Scenario:
-------------
- Table will contain Customers Information
- Table would be refreshed every week by a procedure, inserting new info (see below)
- Row Key would be "Customer ID" (fixed)
- There would be fixed contents columns, e.g. "Name", "Surname"
- There would be variable contents columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"

The question:
------------------
- Should I take advantage of Column Versioning, e.g. putting in a new version every week for the Column (e.g.) "Total Time used Service X"? The Table would then have a fixed number of Columns, some of them with versions and others fixed.
- Or is it a better approach to NOT use Column Versioning, and for every new week of Data coming in just add a new Column named (e.g.) "Total Time used Service X - WEEK YY"? In this case I'd put the Week Number in the Column Name to be able to look it up in later analysis.

Please keep in mind that:
----------------------------------
- The main use will be to process the "Variable Information" columns later using a Spark procedure, therefore it is of CRITICAL IMPORTANCE to be able to process each and every "Time Series" easily, on the fly, without convoluted workarounds to manage e.g. Column Names and then loop through Columns in weird ways (this is why at the moment I think the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)
- I'm proposing that the Row Key would be FIXED, but I'm open to other suggestions (e.g. multiple rows with a variable Key for the same Customer Entity) if this would be the best approach in the described scenario. I just didn't want to complicate the description of my problem too much.

Any insight and/or link to examples for this particular case will be very much appreciated!

Thanks
Labels:
- Apache HBase