Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 259 | 10-21-2019 05:16 AM
 | 2808 | 01-29-2018 07:05 AM
 | 1783 | 06-27-2017 06:42 AM
 | 33198 | 05-26-2016 04:05 AM
 | 21572 | 05-17-2016 02:15 PM
10-29-2020
08:49 AM
Hi, I have a Cloudera Express 6.3.2 setup where I'm trying to submit an Oozie Spark Action that reads from an HBase table (correctly mapped in Hive). If I get into the PySpark CLI, adding the relevant "--jars" when calling it, I can read from the table without any problems. The list of jars I'm adding is the following:
hive-hbase-handler-2.1.1-cdh6.3.2.jar
hbase-client-2.1.0-cdh6.3.2.jar
guava-11.0.2.jar
hbase-common-2.1.0-cdh6.3.2.jar
hbase-hadoop-compat-2.1.0-cdh6.3.2.jar
hbase-hadoop2-compat-2.1.0-cdh6.3.2.jar
hbase-protocol-2.1.0-cdh6.3.2.jar
hbase-server-2.1.0-cdh6.3.2.jar
htrace-core4-4.2.0-incubating.jar
But if I try to run the same script via Oozie + Spark Action, I get the exception reported below. The exact same thing works on a CDH 5 setup; nevertheless I've tried several things hoping to make it work here too, but to no avail.
What I've tried:
- Adding a "hive.aux.jars.path" section to BOTH the "hive-site.xml" and "hbase-site.xml" that I'm passing to my Spark Action with "--files"
- Configuring the Oozie Sharelib and putting the jar files under BOTH the "spark" and "hive" HDFS directories of the Sharelib itself (and of course setting "oozie.use.system.libpath" to "true" in my Oozie Workflow Configuration)
- Setting an "oozie.libpath" option, in the same way as the previous step, pointing to a complete list of the involved HBase jars (I then removed this config, as I was getting an error telling me the relevant jars could not be loaded multiple times since they are already present in the Sharelib)
- Passing a relevant "SPARK_CLASSPATH" along with a "SPARK_HOME" as "oozie.launcher.yarn.app.mapreduce.am.env" Hadoop properties in my Oozie Workflow Configuration, with the list of all the jars involved
- Adding "--conf spark.driver.extraClassPath=..." and "--conf spark.executor.extraClassPath=..." configuration options, BOTH inside the Python script being called and in my Oozie Workflow Spark Action window in the Hue GUI (a small sketch for checking what actually reaches the job is included after the stack trace below)
I've extensively searched the Community Forum and the web before posting this, but no joy. Also, this is the first time this is happening; I have other setups where this works as expected. I don't know what to try anymore; any help would be greatly appreciated. Below I'm posting the full stack trace. Thank you for any insights!
Log Type: stdout
Log Upload Time: Thu Oct 29 15:36:22 +0100 2020
Log Length: 9563
Traceback (most recent call last):
File "reportServicesCreditExtraction.py", line 74, in <module>
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/context.py", line 371, in table
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/session.py", line 791, in table
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.table.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableInputFormatBase
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:246)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:235)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getInputFormatClass(HBaseStorageHandler.java:133)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12$$anonfun$apply$10.apply(HiveClientImpl.scala:463)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12$$anonfun$apply$10.apply(HiveClientImpl.scala:463)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12.apply(HiveClientImpl.scala:463)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12.apply(HiveClientImpl.scala:463)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:462)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:376)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:376)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:374)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:374)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:120)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:737)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:737)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:736)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:146)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:685)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:715)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:654)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:637)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:633)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:255)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:235)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 84 more
15:36:20.966 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User application exited with status 1
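One way to narrow this down is to compare the classpath the Oozie launcher actually hands to the job with what the working PySpark CLI session sees. Below is a minimal debugging sketch (assuming PySpark 2.x on CDH 6; the app name is illustrative and the configuration keys are the standard Spark ones mentioned above):
# Hedged debugging sketch: dump the jar / classpath settings visible inside the job,
# so the Oozie-launched run can be compared with the working pyspark CLI run.
from __future__ import print_function
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("classpath-debug"))

for key in ("spark.jars",
            "spark.yarn.dist.jars",
            "spark.driver.extraClassPath",
            "spark.executor.extraClassPath"):
    print(key, "=", sc.getConf().get(key, "<not set>"))

# Raw JVM classpath of the container running the driver
print("java.class.path:", sc._jvm.System.getProperty("java.class.path"))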
05-26-2020
05:01 AM
Hi, as we all know, CDH 6.3.3 and subsequent versions are no longer available under the "Express licensing" model. Yet, I was in the process of setting up a 6.3.1 installation, and apparently this is not possible because valid (Enterprise) authentication is required. Below, a screenshot from the Installation Guide. When I try to import the 6.3.1 repository key (WITHOUT username and password, as stated in the documentation itself for versions < 6.3.3), I get the following error:
# rpm --import https://archive.cloudera.com/p/cm6/6.3.1/redhat7/yum/RPM-GPG-KEY-cloudera
curl: (22) The requested URL returned error: 401 Authentication required
error: https://archive.cloudera.com/p/cm6/6.3.1/redhat7/yum/RPM-GPG-KEY-cloudera: import read failed(2).
What am I doing wrong? @Cloudera1 Is this a mistake on my part, or a problem with the way Cloudera is implementing its policies? Thanks for any insights
05-22-2020
09:31 AM
Hi all, I have the following situation that I can't solve even after extensive searching and trials. I have two clusters: "Cluster 1" is Hortonworks HDP 2.6.5.0 and "Cluster 2" is Cloudera CDH 5.13.1 Enterprise. The final goal is to run a PySpark script on "Cluster 1" and remotely create a Hive table on "Cluster 2". The script (of course this is just an example to illustrate the issue) is the following:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pyspark.sql.functions as func
sconf = SparkConf().setAppName("rob_example")
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.setConf("fs.defaultFS","hdfs://<REMOTE_HDFS_HOST>:8020")
sqlContext.sql("create table rob.test_output_table as select * from rob.test_input_table"); Now, what happens it that if I use Spatk 1.6.3 on "Cluster 1", the above script runs just fine, and if I log in into Hive on "Cluster 2" I can see the destination table and data are created correctly. But if I use Spark 2.3 on "Cluster 1" (and I need to use 2.3) I get the following exception instead: pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Wrong FS: hdfs://<REMOTE_HDFS_HOST>/user/hive/warehouse/rob.db/test_output_table/.hive-staging_hive_2020-05-22_17-59-00_296_5401117384982980379-1/-ext-10000/part-00000-733a2646-fc76-47f0-80d6-a14b28677f7e-c000, expected: hdfs://cluster1_host1.domain1.local:8020;' First of all, I notice a suspect ";" trailing the "expected" Filesystem where Spark is trying to read/write (maybe pointing to wrong Hive Metastore?). In fact, it's also strange that it's expecting me to tell it to read/write on "Cluster 1" itself, differently from what I specified in my Spark session Hive Context configuration (fs.defaultFS parameter inside the Script) More than this, even if I try setting the following parameters I can't make it work. And it's strange, because as I said if I use Spark 1.6 everything runs smoothly (even without the following additional configurations): sqlContext.setConf("default.fs.name","hdfs://<REMOTE_HDFS_NODE>:8020")
sqlContext.setConf("hive.metastore.uris","thrift://<REMOTE_THRIFT_NODE>:9083") Also please note that if I'd use a pure Spark Dataframes approach as follows, things would work, but I need to use Spark SQL otherwise the output table I'll get won't be in Hive compatible format: test_DF = sqlContext.sql("select * from rob.test_input_table")
test_DF.write.mode("overwrite").saveAsTable("rob.test_output_table") So, the above last approach is not feasible (not Hive compatible format of the output table). I found a lot of people having similar issues, but none of the cases I found applied exactly to my case, and I'm stuck at this point. Any help/hints would be greatly appreciated! Thank you for any insights on this
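For comparison, here is a minimal sketch of supplying the same remote endpoints at session-creation time instead of via setConf afterwards (this assumes Spark 2.3's SparkSession API; the <REMOTE_*> placeholders are the same ones used above, and this is just an illustrative variant, not a verified fix):
from pyspark.sql import SparkSession

# Hedged sketch: pass the remote HDFS and metastore endpoints as "spark.hadoop.*"
# configuration when the session is built, so the Hive client is initialised with them.
spark = (SparkSession.builder
         .appName("rob_example")
         .config("spark.hadoop.fs.defaultFS", "hdfs://<REMOTE_HDFS_HOST>:8020")
         .config("spark.hadoop.hive.metastore.uris", "thrift://<REMOTE_THRIFT_NODE>:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("create table rob.test_output_table as select * from rob.test_input_table")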
10-21-2019
05:16 AM
1 Kudo
You can query the API exposed by Cloudera Manager and simplify your life. For example, you can run the following:
curl -u <CM_USER>:<CM_PASSWD> http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2
You'll get a JSON answer in reply to your query, with all the details related to the desired service's status. You can then parse the JSON answer (e.g. using "jq" or directly inside your bash script) and take the desired actions. HTH
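As an illustration, a rough Python sketch of the same idea (the host, credentials, cluster name and service name are placeholders, and the "requests" library is assumed to be available):
import requests

# Query the Cloudera Manager API for one service and print its reported state.
resp = requests.get(
    "http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2",
    auth=("<CM_USER>", "<CM_PASSWD>"))
resp.raise_for_status()
service = resp.json()

# "serviceState" and "healthSummary" are the fields of interest in the returned JSON
print("{0} {1}".format(service.get("serviceState"), service.get("healthSummary")))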
10-20-2019
12:32 PM
Hello Community, after upgrading from CDH 5.11 to CDH 6.3 I've noticed that when launching an Oozie Workflow from the Hue GUI, I can't see the "progress bars" on the single actions making up a workflow graph. I have just one progress bar on the last action of the graph, but it doesn't help in understanding the real progress of complex workflows made up of many sub-actions, possibly running in parallel, etc. It shouldn't be a browser issue, as I've tried Chrome, Safari and Firefox. Am I doing something wrong here? Are there configuration parameters that I should set up in the Hue 4.4.0 config, Oozie config, etc.? Thanks in advance for any hints and helpful insights!
11-12-2018
12:09 PM
Hi, late reply but I hope it can still be useful. To achieve what you want you should do something like:
dataframe2 = dataframe1.persist(StorageLevel.MEMORY_AND_DISK)
HTH
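A slightly fuller sketch, mainly to show the import that's needed (PySpark assumed; the example DataFrame is purely illustrative):
from pyspark import SparkContext, StorageLevel
from pyspark.sql import SQLContext

sc = SparkContext(appName="persist_example")
sqlContext = SQLContext(sc)

# Illustrative DataFrame; in the original question this would come from a real source.
dataframe1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# persist() marks the DataFrame for caching at the given storage level and returns the
# same DataFrame object, so reassigning it is just a convenience.
dataframe2 = dataframe1.persist(StorageLevel.MEMORY_AND_DISK)
dataframe2.count()  # an action is needed before anything is actually materialized in the cache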
11-12-2018
04:17 AM
1 Kudo
Hello community, in Spark >= 2.0 the following statement should be correct:
df2_DF = df1_DF.checkpoint()
But of course, if I try to checkpoint a DataFrame in Spark 1.6, I get the following exception, as checkpointing wasn't implemented yet:
AttributeError: 'DataFrame' object has no attribute 'checkpoint'
Is there a quick way to implement this in Spark 1.6.0 using a workaround? What I've tried until now (similar to what I saw in a post on Stack Overflow) is the following approach, using an intermediate RDD conversion for the original DataFrame df1:
...
sc.setCheckpointDir("hdfs:///tmp")
...
df1 = sqlContext.table("<TABLE_NAME>")
...
df1.rdd.checkpoint()
df1.rdd.count()  # an action, so the checkpoint is actually materialized
df2 = sqlContext.createDataFrame(df1.rdd, df1.schema)
But I'd like to ask:
- Is everybody else using this approach? Can you please give feedback if you do?
- Is there a different, better approach? What I'm doing in reality is looping over a DataFrame that gets "heavier" at each loop.
- If the above is the only option, does it make sense if, in the last statement, I directly reassign the now-checkpointed RDD to the original DataFrame? Something like: df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)
- Any other observations/comments?
Thanks in advance for any insight or help
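For what it's worth, a minimal sketch of how the RDD-based workaround above can be wrapped in a helper for the looping use case (Spark 1.6 PySpark assumed; the table name, the loop, and the filter are placeholders, and whether this is the best available option is exactly the open question):
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setAppName("checkpoint_workaround"))
sqlContext = HiveContext(sc)
sc.setCheckpointDir("hdfs:///tmp")  # same checkpoint dir as above

def checkpoint_df(df, sql_context):
    """Truncate a DataFrame's lineage in Spark 1.6 by checkpointing its underlying RDD."""
    rdd = df.rdd
    rdd.checkpoint()
    rdd.count()  # an action, so the checkpoint is actually written
    return sql_context.createDataFrame(rdd, df.schema)

df = sqlContext.table("<TABLE_NAME>")  # placeholder table name, as above
for _ in range(10):  # illustrative loop that keeps growing the plan
    df = df.filter(df[df.columns[0]].isNotNull())  # placeholder transformation
    df = checkpoint_df(df, sqlContext)  # cut the lineage on each pass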
01-29-2018
07:26 AM
1) Apparently, yes.
2) The name of the user you're trying to use to log in to the remote system, I suppose. Please note that the user you specify here would be the user Oozie runs the action as, so you might eventually hit other, unpredictable problems when using Oozie.
3) I don't really know, sorry about that... The fact is that even if I'm pretty sure I've understood the cause of your issue, I never had to deal with it directly myself. Maybe the easiest way would be to follow the additional suggestions I wrote in my previous answer (give the OS user "yarn" permission to "ssh" and/or "su"). Or, maybe, another possibility would be to create a "yarn" user on the remote system and grant that user the correct permissions on the final working directory.
I hope you'll manage to get through the problems and make it 🙂
01-29-2018
07:19 AM
Hi all, I have this File Action (part of an Oozie Workflow) that runs every 15 minutes, and moves all the files in a "receiving" HDFS directory, putting them into an "in_process" HDFS directory.
Everything is OK unless, for whatever reason, the number of files in the "receiving" HDFS directory grows too much. If, let's say, that number gets to be > 30000 (not exactly, but around that number), the File Action fails without meaningful errors in the logs.
I'd need help to sort out a few possible options:
- Is it possible to manually specify the maximum number of files a File Action would be able to handle? E.g. setting more resources in whatever parameter, or specifying an explicit value somewhere?
- If I used a Shell Action instead, would it be a better choice? I'm a bit hesitant because, given that the "live" files landing in the "receiving" HDFS directory are coming in at a very high rate, the Shell Action might not be able to keep up with the changes inside the directory... (a rough sketch of the kind of batched move I have in mind is below)
Looking forward to receiving your insights, thanks a lot!
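A rough sketch of the kind of batched move a Shell Action could run (plain Python driving the hdfs CLI; the directory names and batch size are placeholders, and this is an untested illustration rather than a drop-in script):
#!/usr/bin/env python
# Hedged sketch: move whatever is currently in "receiving" into "in_process" in
# fixed-size batches, so a single command never carries tens of thousands of paths.
# Assumes the hdfs CLI is available on the PATH of the Shell Action.
import subprocess

SRC = "/data/receiving"      # placeholder paths
DST = "/data/in_process"
BATCH = 1000                 # arbitrary batch size

# "-C" prints bare paths only; the listing is taken once, so files that arrive later
# are simply picked up by the next run of the action.
listing = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", SRC]).decode("utf-8")
files = [line.strip() for line in listing.splitlines() if line.strip()]

for i in range(0, len(files), BATCH):
    batch = files[i:i + BATCH]
    subprocess.check_call(["hdfs", "dfs", "-mv"] + batch + [DST])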
01-29-2018
07:05 AM
This is probably related to the fact that the shell action, when run from Oozie, runs as the "yarn" user and not as the user you're specifying in the ssh command. You can refer to this thread for more information about the issue: https://community.cloudera.com/t5/Batch-Processing-and-Workflow/How-to-run-Oozie-workfllow-or-action-as-another-user/td-p/26794 It should all boil down (in case your cluster is not secured with Kerberos) to setting up your environment, specifically the "linux-container-executor" configuration parameter (go to the Cloudera Manager admin UI --> YARN --> Configuration). It's all explained in the linked document. Another alternative could be to grant the OS user "yarn" permission to execute "ssh" and/or "su", so you can switch user in your script before executing the remote ssh command. HTH
11-23-2017
08:43 AM
Hi all, I'm putting up a log parser in Pig and I'm trying to use "Pyasn", a Python extension allowing offline querying of an ASN database, to extract Autonomous System Number information from IP addresses.
The link to the project is here:
https://pypi.python.org/pypi/pyasn
What happens is that:
1) I successfully installed pyasn (in a previous attempt via pip install; currently I have built it manually, but either way it doesn't work)
2) I wrote a custom UDF to be later registered in Pig, wrapped via Jython:
#!/usr/bin/python
import sys
sys.path.append('/usr/lib64/python2.6/site-packages/')
sys.path.append('/usr/lib64/python2.6/site-packages/pyasn-1.6.0b1-py2.6-linux-x86_64.egg/')
sys.path.append('/usr/lib/python2.6/site-packages/')
import pyasn
@outputSchema("asn:chararray")
def asnLookup(ip):
asndb = pyasn.pyasn('asn.dat')
asn = asndb.lookup(ip)
return asn
@outputSchema("asn_prefix:chararray")
def asnGetAsPrefixes(nbr):
asndb = pyasn.pyasn('asn.dat')
asn_prefix = asndb.get_as_prefixes(nbr)
return asn_prefix
3) But when I try to register my UDF, I get the following exception:
grunt> register 'hdfs:///user/xxxxxx/LIB/PYASN/python_pyasn.py' using jython as pythonPyasn;
2017-11-23 17:18:10,468 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-11-23 17:18:10,939 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_8271942503558994412
2017-11-23 17:18:12,468 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2017-11-23 17:18:13,236 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "/tmp/pig6864734086775637011tmp/python_pyasn.py", line 8, in <module>
import pyasn
File "/usr/lib64/python2.6/site-packages/pyasn-1.6.0b1-py2.6-linux-x86_64.egg/pyasn/__init__.py", line 20
SyntaxError: future feature print_function is not defined
4) The puzzling thing is that I'm currently doing the exact same thing with another Python extension for geolocation (PyGeoIP), and it works smoothly. The concept is the same: I wrote a UDF, imported it in Pig wrapped in Jython, and I can call it successfully!
5) If, just to check things are formally OK, I open a PySpark shell and use the extension, it works without any problems. But I don't want to (and can't) use Spark in this case, for a number of reasons.
Any ideas/insight would be very much appreciated!
Thanks
09-26-2017
09:14 AM
Have you tried a configuration similar to the following?
- 3 executors (1 per datanode)
- A very low initial memory setting for the executors (e.g. 1 GB)
- Limiting the executors to 1 vcore (at least initially)
- Maybe 2 vcores and 2 GB for the driver initially
- It's important to allow a generous overhead on top of the RAM usable by the executors, in case they run out of their initial resources (that 1 GB we specified before), but only when/if they need it. So set the following parameters accordingly (a sketch of these settings in a PySpark SparkConf follows below):
spark.yarn.executor.memoryOverhead = 8 GB
spark.yarn.driver.memoryOverhead = 8 GB
You can read the following docs to get a better grasp of the concepts behind resource allocation:
https://www.cloudera.com/documentation/enterprise/5-11-x/topics/cdh_ig_yarn_tuning.html
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
The first doc also includes a link to a "YARN Tuning Spreadsheet". The docs contain details on how to configure all the features I've mentioned just above 🙂 HTH
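A rough sketch of those settings expressed in a PySpark script's SparkConf (the app name is illustrative, the numbers simply mirror the suggestion above, and the overhead values are expressed in MB as Spark expects):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tuning_example")                        # illustrative name
        .set("spark.executor.instances", "3")                # 1 executor per datanode
        .set("spark.executor.cores", "1")
        .set("spark.executor.memory", "1g")                  # low initial executor memory
        .set("spark.yarn.executor.memoryOverhead", "8192")   # overhead values are in MB
        .set("spark.yarn.driver.memoryOverhead", "8192"))
# Driver memory/cores generally have to be supplied at submit time (e.g. via
# spark-submit --driver-memory 2g --driver-cores 2) rather than inside the script.
sc = SparkContext(conf=conf)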
07-22-2017
08:45 AM
Thanks mbigelow, following your suggestions I solved the massive error-logging issue. I ran the specific log file referenced in the Java stack trace through a JSON validator:
/user/spark/applicationHistory/application_1494352758818_0117_1
But the format was correct according to the validator, so I just moved the file away into a temporary directory. As soon as I did, the error messages stopped clogging the system logs. So it was probably corrupted in a very subtle way... but it was definitely corrupted. That JSON file had indeed been generated by the Spark Action that is giving me problems, but it was an OLD file. New instances of that Spark Action are generating new JSON logs, and they are not giving any trouble to the History Server (the tons of logged exceptions stopped, as I just said). Unfortunately, the Spark job itself is still failing and needs further investigation on my side, so apparently that failure is not related to this specific error message. But I've solved an annoying problem, and at the same time I've ruled out the possibility that the Spark Action issue is related to that Java exception. Thanks!
07-19-2017
08:58 AM
Hi mbigelow, I've tried what you suggested (stopping Hive + running the action from the dropdown menu). The process was successful, but the warning in the Spark CLI is still there...
07-19-2017
07:05 AM
Additional info: if I run the Spark CLI (where my Spark procedures are working, by the way, differently from when they are launched via Oozie), as soon as I try to define a DataFrame I receive the following warning, which I'd never seen before the upgrade:
In [17]: utenti_DF = sqlContext.table("xxxx.yyyy")
17/07/19 15:48:58 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
17/07/19 15:48:58 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
Anyway, as I said, from the CLI things work. I just thought this could be relevant.
07-19-2017
06:47 AM
Hi all, after recently upgrading to CDH 5.11 I get tons of the following "Unexpected end-of-input" log entries, related to SPARK (running on YARN) and classified as ERROR. I'm experiencing malfunctions (failed Oozie jobs) and I believe they are related to these errors, so I'd really like to solve the underlying issue and see if the situation improves. In the logs, "source" is: FsHistoryProvider. And "message" is:
Exception encountered when attempting to load application log hdfs://xxxxx.xxxxx.zz:8020/user/spark/applicationHistory/application_1494352758818_0117_1
com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
at [Source: java.io.StringReader@1fec7fc4; line: 1, column: 3655]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1369)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:532)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._finishString2(ReaderBasedJsonParser.java:1517)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._finishString(ReaderBasedJsonParser.java:1505)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.getText(ReaderBasedJsonParser.java:205)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:20)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:28)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2888)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2034)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:583)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$16.apply(FsHistoryProvider.scala:410)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$16.apply(FsHistoryProvider.scala:407)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:407)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:309)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any suggestions/ideas? Thanks!
06-27-2017
06:42 AM
In the end I've been able to solve the issue. I was tricked by the fact that reapplying the "YARN Resource Allocation Tuning Guide" from scratch led me to a (in my opinion) misleading way of calculating a few important parameters. The guide can be found here: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html As a matter of fact, the guide contains a downloadable XLS file, which is a tool for calculating optimal parameters. This XLS automatically calculates and proposes a few values to be assigned to the YARN configuration. In the spreadsheet, at step 4, I was proposed "2" for "yarn.nodemanager.resource.cpu-vcores" and "5632" for "yarn.nodemanager.resource.memory-mb". I later found out that the correct values to assign to those configurations are the two values proposed at step 5. Definitely partly my fault (I do not have deep knowledge of YARN configuration), but partly a misleading doc indeed. I am now fine tuning, trying different settings for the various Java heap sizes, etc. Still, I have no idea why everything was working fine until recently and stopped working after upgrading to 5.11, as I did not change any configuration while upgrading and the physical resources are identical.
06-26-2017
12:37 PM
As I believe the problem is definitely due to differences between CDH 5.7 and CDH 5.11 in how resources are allocated to containers by YARN, I've tried to follow the YARN Tuning Guide again from scratch. The latest version of the YARN Tuning Guide available is apparently for CDH 5.10: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html On that page, an XLS sheet is available to help plan the various parameters in a correct and working fashion. No luck: I always find myself with jobs stuck in "ACCEPTED" state that never start running. I also found this interesting page suggesting how to configure Dynamic Resource Pools for YARN: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cm_mc_resource_pools.html#concept_xkk_l1d_wr__section_c3f_vwf_4n I tried to limit the "number of concurrent jobs" to just 2 in the relevant configuration page of the Dynamic Resource Pools, but again, no success. Can anybody please point me to any new feature implemented in CDH 5.11 related to YARN resource allocation (and that I have not mentioned here)? My workflows were running smoothly before the upgrade, and now I'm facing heavy trouble! Workarounds are welcome too, as well as methods for monitoring/tracing resource usage in a way that lets me understand which parameters I've set up in a way that no longer works in CDH 5.11. Thanks a lot for any hints or insights!
06-21-2017
11:31 AM
Hello, after successfully upgrading a small (5-node) CDH 5.7 cluster to CDH 5.11, I am experiencing various problems with existing Oozie Workflows that used to work correctly. The most significant example: I have a Workflow scheduling 8 jobs in parallel (a mix of Hive, Shell and Sqoop actions). The 8 jobs are acquired and start running, but the 8 sub-jobs performing the actual work are stuck in "ACCEPTED" status and never switch to the "RUNNING" state. After hours of work, I've not been able to find anything significant in the logs, apart from a few entries complaining about log4j. So I decided to upgrade the JDK from 1.7 to 1.8 too, but without any improvement in the situation. Any help or suggestion pointing me in the right direction would be very much appreciated! Thanks
02-06-2017
10:58 AM
Hi, thanks for your reply. I have tried to create the table first, and indeed the overall behaviour changed. Now I get a different exception, which I'm pasting just below. The strange thing is that the class referenced in the exception, "ClientBackoffPolicyFactory", IS loaded and present (it's in "--jars", as detailed in earlier posts above this one). Here is the main excerpt from the error stack:
2017-02-06 19:31:51,426 WARN [Thread-8] ipc.RpcControllerFactory (RpcControllerFactory.java:instantiate(78)) - Cannot load configured "hbase.rpc.controllerfactory.class" (org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory) from hbase-site.xml, falling back to use default RpcControllerFactory
2017-02-06 19:31:51,431 ERROR [Thread-8] datasources.InsertIntoHadoopFsRelation (Logging.scala:logError(95)) - Aborting job.
java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
2017-02-06 19:31:51,538 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1
Given that "hbase-site.xml" is mentioned in the error stack, I'm also pasting that file just below:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://xxx01.yyy.it:8020/hbase</value>
</property>
<property>
<name>hbase.replication</name>
<value>true</value>
</property>
<property>
<name>hbase.client.write.buffer</name>
<value>2097152</value>
</property>
<property>
<name>hbase.client.pause</name>
<value>100</value>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>35</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>100</value>
</property>
<property>
<name>hbase.client.keyvalue.maxsize</name>
<value>10485760</value>
</property>
<property>
<name>hbase.ipc.client.allowsInterrupt</name>
<value>true</value>
</property>
<property>
<name>hbase.client.primaryCallTimeout.get</name>
<value>10</value>
</property>
<property>
<name>hbase.client.primaryCallTimeout.multiget</name>
<value>10</value>
</property>
<property>
<name>hbase.regionserver.thrift.http</name>
<value>false</value>
</property>
<property>
<name>hbase.thrift.support.proxyuser</name>
<value>false</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.enabled</name>
<value>true</value>
</property>
<property>
<name>hbase.snapshot.master.timeoutMillis</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.region.timeout</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.master.timeout.millis</name>
<value>60000</value>
</property>
<property>
<name>hbase.security.authentication</name>
<value>simple</value>
</property>
<property>
<name>hbase.rpc.protection</name>
<value>authentication</value>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>60000</value>
</property>
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
</property>
<property>
<name>zookeeper.znode.rootserver</name>
<value>root-region-server</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>xxx02.yyy.it,xxx01.yyy.it,xxx03.yyy.it</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.rest.ssl.enabled</name>
<value>false</value>
</property>
</configuration>
Thanks for any help... Meanwhile, I'll go on testing things, but maybe I'll try to find a workaround and do something completely different. This is starting to be a bit too much for my skills / patience 🙂
02-05-2017
04:02 PM
Update: adding the following comma-separated list of jars to the Spark Action's "Options List", I've been able to get past the Java exceptions:
--jars hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hive-hbase-handler-1.1.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-server-1.2.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-client-1.2.0-cdh5.7.0.jar
Now I'm facing another problem. The workflow still fails, and in the Spark Action's log I see the following error:
Traceback (most recent call last):
File "AnalisiVoipOut1.py", line 42, in <module>
... pyspark.sql.utils.AnalysisException: u'path hdfs://myhostname.mydomain.it:8020/user/hive/warehouse/xxx.db/analisi_voipout_utenti_credito_2016_rif already exists.;'
2017-02-05 22:46:51,405 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1
Now, I can confirm that even when changing the table's name every time (therefore: THAT PATH DOES NOT EXIST), I still get the error every time I run the workflow. I've also thought that, given that I was using 4 executors, maybe for some strange reason they got in conflict when generating the output directory on HDFS. Therefore I also made several tries with just one executor, but with no success. This "HBaseStorageHandler" thing is really puzzling me; can anybody help me get past this last error? Thanks
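For completeness, the DataFrameWriter's default save mode errors out when the target path already exists, so one thing worth double-checking is that the mode is spelled out explicitly. A minimal sketch (Spark 1.6 PySpark with a HiveContext available as sqlContext assumed; the source table below is a placeholder standing in for the real query, and the target table name is the one from the failing run):
# Hedged sketch: make the save mode explicit so a leftover target path from a
# previous run does not abort the job with "path ... already exists".
result_df = sqlContext.sql("select * from msgnet.some_source_table")  # placeholder source
(result_df
    .write
    .mode("overwrite")
    .saveAsTable("msgnet.analisi_voipout_utenti_credito_2016_rif"))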
02-05-2017
02:58 AM
Hello, I am on CDH 5.7. I have an HBase table. To be able to access it from Impala/Hive, I have created a Hive external table pointing to it via the HBaseStorageHandler, using the following approach:
CREATE EXTERNAL TABLE hbase_utenti(key string, val1 string, ... , valX string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:a,cf1:b, ... ,cfX:X")
TBLPROPERTIES("hbase.table.name" = "utenti"); This WORKS smoothly, I can query the table from Impala/Hive Now I have the following PySpark script that I want to run via Spark Action (just an excerpt): ## IMPORT FUNCTIONS
import sys
import getopt
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
## CREATE SQLCONTEXT
sconf = SparkConf().setAppName("ChurnVoip").setMaster("yarn-cluster").set("spark.driver.memory", "3g").set("spark.executor.memory", "1g").set("spark.executor.cores", 2)
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)
## CREATE MAIN DATAFRAME
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
## SAVE HBASE TABLE AS "REAL" HIVE TABLE
hbase_utenti_filtered_DF = hbase_utenti_DF \
.select(hbase_utenti_DF["USERID"].alias("HHUTE_USERID"), \
hbase_utenti_DF["RIVENDITORE"].alias("HHUTE_RIVENDITORE"), \
hbase_utenti_DF["DATA_COLL"].alias("HHUTE_DATA_COLL"), \
hbase_utenti_DF["CREDITO"].alias("HHUTE_CREDITO"), \
hbase_utenti_DF["VIRTUAL"].alias("HHUTE_VIRTUAL"), \
hbase_utenti_DF["POST"].alias("HHUTE_POST")) \
.filter("((HHUTE_RIVENDITORE = 1) and \
(HHUTE_DATA_COLL='2016-12-30T00:00:00Z'))") \
.saveAsTable('msgnet.analisi_voipout_utenti_credito_2016_rif')
Please note that if I run the above Python script via "spark-submit", OR if I run the PySpark shell and copy-paste the code into it, IT WORKS smoothly. Problems arise when I try to schedule an Oozie Spark Action doing the job. In this case, I get the following exception:
...
2017-02-05 10:47:52,106 ERROR [Thread-8] hive.log (MetaStoreUtils.java:getDeserializer(387)) - error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
...
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
File "/hdp01/yarn/nm/usercache/msgnet/appcache/application_1486053545979_0053/container_1486053545979_0053_02_000001/pyspark.zip/pyspark/sql/context.py", line 593, in table
---
py4j.protocol.Py4JJavaError: An error occurred while calling o56.table.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
Now, I'm pretty sure the whole problem boils down to jars not being included. But I've tried and tried to add paths and variables to my Oozie Spark Action, with no success. What I've tried includes the following:
- Adding the following "Hadoop Properties" to my Workflow Settings (gear icon on top right of the Workflow in Hue):
spark.executor.extraClassPath /opt/cloudera/parcels/CDH/lib/hive/lib/*
spark.executor.extraClassPath /opt/cloudera/parcels/CDH/lib/hbase/lib/*
- Adding the following to my Spark Action's --files
hdfs:///user/<XXX>/hive-site.xml,
hdfs:///user/<XXX>/hive-hbase-handler-1.1.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-client-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-server-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-hadoop-compat-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-hadoop2-compat-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/metrics-core-2.2.0.jar
--driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/*
Can anybody please shed some light on what I'm missing here? I'm really out of ideas, after trying and trying. Thanks for any insight!
01-03-2017
12:38 PM
Hi AdrianMonter, sorry to say I haven't found a specific solution for the Avro file format in the meantime. I've been sticking to the Parquet file format since I had this problem, and for now it covers all my needs... Maybe in the latest CDH/Spark releases this has been fixed? Maybe somebody from @Former Member can tell us something more?
10-17-2016
01:40 PM
2 Kudos
Hi aj, yes I did manage to solve it. Please take a look at the following thread and see if it can be of help. It may seem a bit unrelated to the "test.py not found" issue, but it contains detailed info about how to specify all the needed parameters to let the whole thing run smoothly: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-workflow-Spark-action-using-simple-Dataframe-quot-Table/m-p/40834 HTH
09-12-2016
09:25 AM
Thanks for the clarification. I thought that, as this widget was not part of the "new features" that you can manually activate in Hue 3.9, this would work... My bad
09-09-2016
04:37 AM
I mentioned that I have CDH 5.7.0 and Solr 4.10.3, but I forgot to say that the Hue version is 3.9. I've tried to build the Timeline Chart both with and without the "latest features enabled" option in the Hue configuration...
09-08-2016
09:13 AM
Hi all, I have CDH 5.7.0 and Solr 4.10.3. I have a Solr index that's apparently working correctly, but if in Hue Search I try to create a dashboard on it containing a Timeline widget, it doesn't work as expected. I have a "date" type field in the index (namely "data_coll"); it's indexed and stored (like ALL the other fields in the index). When I drag the Timeline widget into my new dashboard, a dropdown asks me to select a date field to use. It proposes one and only one field, "data_coll" (that's the only "date" field in the index), so I click on it. A chart shows up in the destination pane, but... the problem is, on the Y axis I just get the total document count for that index on that day, and I can't change the metric to be used (even just for count) on the Y axis. Am I doing something wrong? Screenshots attached. Thanks for any insights...
Screenshot 1: when I drag the widget onto its destination pane (I'm prompted to choose "data_coll" as the X-axis timeline field).
Screenshot 2: after choosing "data_coll" as the time field (I can't change the Y-axis metric after that!)
08-19-2016
11:35 AM
I don't see any macroscopic problems in the logs you posted. Nevertheless, in the "yarn logs" section we can see your container gets killed by the resource manager... ...
2016-08-19 08:38:44,046 INFO [Thread-71] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1471289760180_0031_m_000000_0
...
This looks a lot like a problem that happened to me before. As a matter of fact, it looks like if you use Oozie to schedule a Shell Action, the action gets executed as the "yarn" user, and not as the user logged in to Hue. In my case, this caused a trivial permission problem when it all boiled down to the final write into the destination HDFS directory. Maybe you can try the following:
1) Insert a "sleep 100" as the first thing in your shell script, and then check whether the total running time of the Workflow before getting killed is 100 seconds longer than without the "sleep". If this is the case, we'll know that the shell script has been correctly loaded and is executing, and the problem is not in the way you launch it.
2) If (1) is confirmed, try changing the permissions of your destination folder "/user/hive/blah" to 777 (I see you are saving as a parquet file in that directory) and try again.
3) In any case, have a look at this thread, maybe it can be of inspiration to work something out: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-Workflow-Shell-Action-getting-killed-Memory-problem/m-p/42899#M2448
4) If nothing works, try drilling down in the logs again with the "Diagnostics" --> "Logs" tool in Cloudera Manager, minimum log level "ERROR", Sources "EVERYTHING".
HTH