Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 271 | 10-21-2019 05:16 AM |
| | 2837 | 01-29-2018 07:05 AM |
| | 1798 | 06-27-2017 06:42 AM |
| | 33366 | 05-26-2016 04:05 AM |
| | 21736 | 05-17-2016 02:15 PM |
02-15-2021
12:08 PM
Good day,
Effective January 31, 2021, all Cloudera software requires a valid subscription and is only accessible from behind the paywall. This includes all legacy versions of Cloudera's Distribution including Apache Hadoop (CDH), Hortonworks Data Platform (HDP), Data Flow (HDF/CDF), and Cloudera Data Science Workbench (CDSW). Information regarding paywall access will be available in the technical documentation by software type and version:
https://www.cloudera.com/downloads/paywall-expansion.html
If you have a valid Cloudera subscription, you can obtain your download credentials by following the directions outlined here:
https://docs.cloudera.com/cdp-private-cloud-base/latest/installation/topics/cdpdc-cm-download-information.html
11-06-2020
08:45 AM
Since no one has responded, I'm following up on this: it is a defect, and it is addressed via HUE-9110.
10-29-2020
08:49 AM
Hi, I have a Cloudera Express 6.3.2 setup where I'm trying to submit an Oozie Spark Action that reads from an HBase table (correctly mapped in Hive). If I open the PySpark CLI and add the relevant "--jars" when calling it, I can read from the table without any problems. The list of jars I'm adding is the following:
hive-hbase-handler-2.1.1-cdh6.3.2.jar
hbase-client-2.1.0-cdh6.3.2.jar
guava-11.0.2.jar
hbase-common-2.1.0-cdh6.3.2.jar
hbase-hadoop-compat-2.1.0-cdh6.3.2.jar
hbase-hadoop2-compat-2.1.0-cdh6.3.2.jar
hbase-protocol-2.1.0-cdh6.3.2.jar
hbase-server-2.1.0-cdh6.3.2.jar
htrace-core4-4.2.0-incubating.jar
But if I try to run the same script via Oozie + Spark Action, I get the exception reported below. The exact same thing works on a CDH 5 setup, but nevertheless I've tried several things hoping to make it work here too, to no avail.
What I've tried:
------------------
- Adding a "hive.aux.jars.path" section to BOTH the "hive-site.xml" and "hbase-site.xml" that I'm passing to my Spark Action with "--files"
- Configuring the Oozie Sharelib and putting the jar files under BOTH the "spark" and "hive" HDFS directories of the Sharelib (and of course setting "oozie.use.system.libpath" to "true" in my Oozie Workflow Configuration)
- Setting an "oozie.libpath" option in the same way as the previous step, pointing to a complete list of the involved HBase jars (I then removed this config, as I was getting an error telling me the relevant jars could not be loaded multiple times because they are already present in the Sharelib)
- Passing a relevant "SPARK_CLASSPATH" along with a "SPARK_HOME" as "oozie.launcher.yarn.app.mapreduce.am.env" Hadoop properties in my Oozie Workflow Configuration, with the list of all the jars involved
- Adding "--conf spark.driver.extraClassPath=..." and "--conf spark.executor.extraClassPath=..." configuration options, BOTH inside the Python script being called and in my Oozie Workflow Spark Action window in the Hue GUI (a sketch of the in-script variant is shown after the stack trace below)
I've extensively searched the Community Forum and the web before posting this, but no joy. Also, this is the first time this is happening; I have other setups where this works as expected. I don't know what to try anymore, and any help would be greatly appreciated. Here below I'm posting the full stack trace. Thank you for any insights!
Log Type: stdout
Log Upload Time: Thu Oct 29 15:36:22 +0100 2020
Log Length: 9563
Traceback (most recent call last):
File "reportServicesCreditExtraction.py", line 74, in <module>
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/context.py", line 371, in table
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/session.py", line 791, in table
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/data/1/yarn/nm/usercache/msgnet/appcache/application_1602174532153_0765/container_1602174532153_0765_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.table.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableInputFormatBase
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:246)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:235)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getInputFormatClass(HBaseStorageHandler.java:133)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12$$anonfun$apply$10.apply(HiveClientImpl.scala:463)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12$$anonfun$apply$10.apply(HiveClientImpl.scala:463)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12.apply(HiveClientImpl.scala:463)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7$$anonfun$12.apply(HiveClientImpl.scala:463)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:462)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:376)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:376)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:374)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:374)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:84)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:120)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:737)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:737)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:736)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:146)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:685)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:715)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:708)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:654)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:637)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:633)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:255)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:235)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 84 more
15:36:20.966 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User application exited with status 1
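As mentioned above, this is roughly what the in-script variant of the classpath attempt looked like (a sketch only; the jar paths are placeholders and the list is abbreviated):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Sketch of setting the classpath options from inside the Python script
# (one of the attempts listed above); jar locations are placeholders.
extra_cp = ":".join([
    "/path/to/hive-hbase-handler-2.1.1-cdh6.3.2.jar",
    "/path/to/hbase-client-2.1.0-cdh6.3.2.jar",
    # ... remaining HBase/htrace jars from the list above ...
])

conf = (SparkConf()
        .setAppName("reportServicesCreditExtraction")
        .set("spark.driver.extraClassPath", extra_cp)
        .set("spark.executor.extraClassPath", extra_cp))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")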
05-22-2020
09:31 AM
Hi all, I have the following situation that I can't solve even after extensive searching and trials. I have 2 clusters: "Cluster 1" is Hortonworks HDP 2.6.5.0 and "Cluster 2" is Cloudera CDH 5.13.1 Enterprise. The final goal is to run a PySpark script on "Cluster 1" and remotely create a Hive table on "Cluster 2". The script (of course this is just an example to illustrate the issue) is the following:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pyspark.sql.functions as func
sconf = SparkConf().setAppName("rob_example")
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.setConf("fs.defaultFS","hdfs://<REMOTE_HDFS_HOST>:8020")
sqlContext.sql("create table rob.test_output_table as select * from rob.test_input_table")
Now, what happens is that if I use Spark 1.6.3 on "Cluster 1", the above script runs just fine, and if I log in to Hive on "Cluster 2" I can see that the destination table and data are created correctly. But if I use Spark 2.3 on "Cluster 1" (and I need to use 2.3) I get the following exception instead:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Wrong FS: hdfs://<REMOTE_HDFS_HOST>/user/hive/warehouse/rob.db/test_output_table/.hive-staging_hive_2020-05-22_17-59-00_296_5401117384982980379-1/-ext-10000/part-00000-733a2646-fc76-47f0-80d6-a14b28677f7e-c000, expected: hdfs://cluster1_host1.domain1.local:8020;'
First of all, I notice a suspicious ";" trailing the "expected" filesystem where Spark is trying to read/write (maybe it's pointing to the wrong Hive Metastore?). It's also strange that it expects to read/write on "Cluster 1" itself, which differs from what I specified in my Spark session's Hive Context configuration (the fs.defaultFS parameter inside the script). On top of this, even if I set the following parameters I can't make it work. And it's strange, because as I said, with Spark 1.6 everything runs smoothly even without these additional configurations:
sqlContext.setConf("default.fs.name","hdfs://<REMOTE_HDFS_NODE>:8020")
sqlContext.setConf("hive.metastore.uris","thrift://<REMOTE_THRIFT_NODE>:9083")
Also please note that a pure Spark DataFrames approach such as the following would work, but I need to use Spark SQL, otherwise the output table won't be in a Hive-compatible format:
test_DF = sqlContext.sql("select * from rob.test_input_table")
test_DF.write.mode("overwrite").saveAsTable("rob.test_output_table")
So, this last approach is not feasible (the output table is not in a Hive-compatible format). I found a lot of people having similar issues, but none of the cases I found applied exactly to mine, and I'm stuck at this point. Any help/hints would be greatly appreciated! Thank you for any insights on this.
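For completeness, here is a sketch of expressing the same options at session-creation time on Spark 2.x (via the SparkSession builder) instead of via setConf afterwards; this is only a sketch with placeholder host names, not something I have verified:

from pyspark.sql import SparkSession

# Sketch only: same fs.defaultFS / metastore settings, but applied when the session is built.
spark = (SparkSession.builder
         .appName("rob_example")
         .config("spark.hadoop.fs.defaultFS", "hdfs://<REMOTE_HDFS_HOST>:8020")
         .config("hive.metastore.uris", "thrift://<REMOTE_THRIFT_NODE>:9083")
         .enableHiveSupport()
         .getOrCreate())
spark.sql("create table rob.test_output_table as select * from rob.test_input_table")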
10-21-2019
05:16 AM
1 Kudo
You can query the API exposed by Cloudera Manager and simplify your life. For example, you can run the following:
curl -u <CM_USER>:<CM_PASSWD> http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2
You'll get a JSON answer in reply to your query, with all the details related to the desired service's status. You can then parse the JSON (e.g. using "jq" or directly inside your bash script) and take the desired actions.
HTH
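If you prefer doing the parsing in Python rather than jq, here is a rough sketch of the same idea (Python 2 style; the placeholders are the same as in the curl example, and field names such as "serviceState" and "healthSummary" may vary between API versions):

import base64
import json
import urllib2

# Sketch only: call the CM API and print the status fields from the JSON answer.
url = "http://<CM_IP_ADDRESS>:7180/api/v19/clusters/<CLUSTER_NAME>/services/hive2"
req = urllib2.Request(url)
req.add_header("Authorization", "Basic " + base64.b64encode("<CM_USER>:<CM_PASSWD>"))
service = json.loads(urllib2.urlopen(req).read())
print(service.get("serviceState"), service.get("healthSummary"))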
03-29-2019
06:07 AM
Dear All, I am facing an issue with Oozie while running a simple job from the Hue GUI, and I'm getting the error below. Please help me!
Error:
"traceback": [
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", 112, "get_response", "response = wrapped_callback(request, *callback_args, **callback_kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", 371, "inner", "return func(*args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 113, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 75, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 373, "submit_workflow", "return _submit_workflow_helper(request, workflow, submit_action=reverse('oozie:editor_submit_workflow', kwargs={'doc_id': workflow.id}))" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 428, "_submit_workflow_helper", "'is_oozie_mail_enabled': _is_oozie_mail_enabled(request.user)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 435, "_is_oozie_mail_enabled", "oozie_conf = api.get_configuration()" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/libs/liboozie/src/liboozie/oozie_api.py", 319, "get_configuration", "resp = self._root.get('admin/configuration', params)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 100, "get", "return self.invoke(\"GET\", relpath, params, headers=headers, allow_redirects=True, clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 80, "invoke", "clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", 196, "execute", "raise self._exc_class(ex)" ]
]
}
Thanks, HadoopHelp
11-12-2018
12:09 PM
Hi, this is a late reply, but I hope it can still be useful. To achieve what you want, you should do something like:
dataframe2 = dataframe1.persist(StorageLevel.MEMORY_AND_DISK)
HTH
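A slightly more complete sketch (the table name is just a placeholder, and the import is the part that's easy to miss):

from pyspark import SparkConf, SparkContext, StorageLevel
from pyspark.sql import HiveContext

# Minimal sketch: persist a DataFrame to memory, spilling to disk when needed.
sc = SparkContext(conf=SparkConf().setAppName("persist_example"))
sqlContext = HiveContext(sc)
dataframe1 = sqlContext.table("some_db.some_table")   # placeholder table name
dataframe2 = dataframe1.persist(StorageLevel.MEMORY_AND_DISK)
dataframe2.count()  # an action is needed before anything is actually cached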
11-12-2018
04:17 AM
1 Kudo
Hello community,
In Spark >= 2.0, the following statement should be correct:
df2_DF = df1_DF.checkpoint()
But of course, if I try to checkpoint a DataFrame in Spark 1.6, I get the following exception, as DataFrame checkpointing wasn't implemented yet:
AttributeError: 'DataFrame' object has no attribute 'checkpoint'
Is there a quick way to implement this in Spark 1.6.0 using a workaround? What I've tried so far (similar to what I saw in a post on Stack Overflow) is the following approach, using an intermediate RDD conversion of the original DataFrame df1:
...
sc.setCheckpointDir("hdfs:///tmp")
...
df1 = sqlContext.table("<TABLE_NAME>")
...
df1.rdd.checkpoint()
df1.rdd.count()
df2 = sqlContext.createDataFrame(df1.rdd, df1.schema)
But I'd like to ask:
- Is everybody else using this approach? Can you please give feedback if you do?
- Is there a different, better approach? What I'm doing in reality is looping over a DataFrame, and it gets "heavier" at each loop.
- If the above is the only option, does it make sense to directly reassign the "should-now-be-checkpointed" RDD to the original DataFrame in the last statement? Something like: df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)
- Any other observations/comments?
Thanks in advance for any insight or help.
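For reference, here is the same workaround wrapped in a small helper function, which is what I'd call inside the loop (just a sketch; the function name is mine, not part of any API):

# Sketch of the Spark 1.6 workaround above as a reusable helper.
def checkpoint_df(df, sql_context):
    rdd = df.rdd              # grab the RDD once so we checkpoint and count the same object
    rdd.checkpoint()
    rdd.count()               # an action is required to actually materialize the checkpoint
    return sql_context.createDataFrame(rdd, df.schema)

# inside the loop that makes the DataFrame "heavier":
# df1 = checkpoint_df(df1, sqlContext)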
01-29-2018
07:26 AM
1) Apparently, yes
2) The name of the user you're trying to use to log in to the remote system, I suppose. Please note that the user you specify here is the user "oozie" will run as, so you might eventually run into other, unpredictable problems when using Oozie
3) I don't really know, sorry about that... Even though I'm pretty sure I've understood the cause of your issue, I've never had to deal with it directly myself. Maybe the easiest way would be to follow the additional suggestions I wrote in my previous answer (give the OS user "yarn" permission to "ssh" and/or "su"). Or, another possibility would be to create a "yarn" user on the remote system and grant that user the correct permissions to reach the final working directory.
I hope you'll manage to get through the problems and make it 🙂
01-29-2018
07:19 AM
Hi all, I have a File Action (part of an Oozie Workflow) that runs every 15 minutes and moves all the files in a "receiving" HDFS directory into an "in_process" HDFS directory.
Everything is OK unless, for whatever reason, the number of files in the "receiving" HDFS directory grows too much. If, let's say, that number gets to be > 30000 (not exactly, but around that number), the File Action fails without meaningful errors in the logs.
I'd need help sorting out a few possible options:
- Is it possible to manually specify the maximum number of files a File Action can handle? E.g. by giving it more resources through some parameter, or by specifying an explicit value somewhere?
- If I used a Shell Action instead, would it be a better choice? I'm a bit hesitant because, given that the "live" files are landing in the "receiving" HDFS directory at a very high rate, I'm afraid the Shell Action might not be able to keep up with the changes inside the directory... (a rough sketch of what I have in mind is below)
Looking forward to receiving your insights, thanks a lot!
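Here is roughly what I have in mind for the Shell Action script, written as a small Python sketch (paths are placeholders and it assumes the "hdfs" CLI is on the PATH of the node running the action): snapshot the file list first and move only that batch, so files that keep arriving are simply left for the next run.

import subprocess

# Sketch only: move a snapshot of the current files from "receiving" to "in_process" in chunks.
RECEIVING = "/data/receiving"      # placeholder
IN_PROCESS = "/data/in_process"    # placeholder
CHUNK = 1000                       # how many files to pass to a single "hdfs dfs -mv"

listing = subprocess.check_output(["hdfs", "dfs", "-ls", RECEIVING])
# keep only real entries (lines starting with a permission string), not the "Found N items" header
batch = [line.split()[-1] for line in listing.splitlines()
         if line.startswith("-") or line.startswith("d")]

for i in range(0, len(batch), CHUNK):
    subprocess.check_call(["hdfs", "dfs", "-mv"] + batch[i:i + CHUNK] + [IN_PROCESS])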
12-08-2017
07:45 AM
Hi, I had the same problem... After going through the logs, I found that the ojdbc driver could not be located. The simplest solution was to copy ojdbc.jar into the YARN home directory (/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hadoop-yarn) on all nodes.
11-23-2017
08:43 AM
Hi all, I'm building a log parser in Pig and I'm trying to use "pyasn", a Python extension that allows offline querying of an ASN database, to extract Autonomous System Number information from IP addresses.
The link to the project is here:
https://pypi.python.org/pypi/pyasn
What happens is that:
1) I successfully installed pyasn (first via pip install; currently I have built it manually, but it still doesn't work)
2) I wrote a custom UDF to be imported into Pig later, after wrapping it with Jython:
#!/usr/bin/python
import sys
sys.path.append('/usr/lib64/python2.6/site-packages/')
sys.path.append('/usr/lib64/python2.6/site-packages/pyasn-1.6.0b1-py2.6-linux-x86_64.egg/')
sys.path.append('/usr/lib/python2.6/site-packages/')
import pyasn

@outputSchema("asn:chararray")
def asnLookup(ip):
    asndb = pyasn.pyasn('asn.dat')
    asn = asndb.lookup(ip)
    return asn

@outputSchema("asn_prefix:chararray")
def asnGetAsPrefixes(nbr):
    asndb = pyasn.pyasn('asn.dat')
    asn_prefix = asndb.get_as_prefixes(nbr)
    return asn_prefix
3) But when I try to register my UDF, I get the following exception:
grunt> register 'hdfs:///user/xxxxxx/LIB/PYASN/python_pyasn.py' using jython as pythonPyasn;
2017-11-23 17:18:10,468 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-11-23 17:18:10,939 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_8271942503558994412
2017-11-23 17:18:12,468 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2017-11-23 17:18:13,236 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "/tmp/pig6864734086775637011tmp/python_pyasn.py", line 8, in <module>
import pyasn
File "/usr/lib64/python2.6/site-packages/pyasn-1.6.0b1-py2.6-linux-x86_64.egg/pyasn/__init__.py", line 20
SyntaxError: future feature print_function is not defined
4) The puzzling thing is that I'm currently doing the exact same thing with another Python extension for geolocation (PyGeoIP), and it works smoothly: the concept is the same, I wrote a UDF, imported it into Pig wrapped with Jython, and I can call it successfully!
5) If, just to check things are formally OK, I open a PySpark shell and use the extension, it works without any problems. But I don't want to (and can't) use Spark in this case, for a number of reasons.
Any ideas/insight would be very much appreciated!
Thanks
09-26-2017
09:14 AM
Have you tried a configuration similar to the following?
- 3 executors (1 per DataNode)
- A very low initial memory setting for the executors (e.g. 1 GB)
- Limiting the vCPUs to 1 for the executors (at least initially)
- Maybe you can start with 2 vCPUs and 2 GB for the driver
- It's important to allow a generous overhead on top of the RAM usable by the executors, in case they run out of their initial resources (the 1 GB we specified before), but only when/if they need it. So set the following parameters accordingly:
spark.yarn.executor.memoryOverhead = 8 GB
spark.yarn.driver.memoryOverhead = 8 GB
- You can read the following docs to get a better grasp of the concepts behind resource allocation:
https://www.cloudera.com/documentation/enterprise/5-11-x/topics/cdh_ig_yarn_tuning.html
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
The first doc also includes a link to a "YARN Tuning Spreadsheet". The docs contain details on how to configure all the features mentioned above 🙂 A sketch of these settings expressed in code follows below.
HTH
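Something along these lines (just a sketch of the starting values above; the overhead properties take values in MB, and in practice the driver settings are usually passed on the spark-submit command line rather than inside the script):

from pyspark import SparkConf, SparkContext

# Sketch only: suggested starting values expressed as Spark properties (8192 MB = 8 GB).
conf = (SparkConf()
        .setAppName("tuning_starting_point")
        .set("spark.executor.instances", "3")
        .set("spark.executor.cores", "1")
        .set("spark.executor.memory", "1g")
        .set("spark.driver.cores", "2")
        .set("spark.driver.memory", "2g")
        .set("spark.yarn.executor.memoryOverhead", "8192")
        .set("spark.yarn.driver.memoryOverhead", "8192"))
sc = SparkContext(conf=conf)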
07-22-2017
08:45 AM
Thanks mbigelow, following your suggestions I solved the massive error-logging issue. I ran the specific log file referenced in the Java stack trace through a JSON validator:
/user/spark/applicationHistory/application_1494352758818_0117_1
The format was correct according to the validator, so I just moved the file into a temporary directory. As soon as I did, the error messages stopped clogging the system logs. So it was probably corrupted in a very subtle way... but it was definitely corrupted. That JSON file had indeed been generated by the Spark Action that is giving me problems, but it was an OLD file. New instances of that Spark Action are generating new JSON logs, and they are not giving the History Server any trouble (the flood of logged exceptions has stopped, as I just said). Unfortunately, the Spark job itself is still failing and needs further investigation on my side, so apparently the failure is not related to that specific error message. But I've solved an annoying problem, and at the same time I have ruled out the possibility that the Spark Action issue is related to that Java exception. Thanks!
06-27-2017
06:42 AM
In the end I've been able to solve the issue. I was tricked by the fact that re-applying the "YARN Resources Allocation Tuning Guide" from scratch proposes (in my opinion) a misleading way of calculating a few important parameters. The guide can be found here: https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cdh_ig_yarn_tuning.html
As a matter of fact, the guide contains a downloadable XLS file which is a tool for calculating optimal parameters. This XLS automatically calculates and proposes a few values to be assigned to the YARN configuration. In the spreadsheet, step 4 proposed "2" for "yarn.nodemanager.resource.cpu-vcores" and "5632" for "yarn.nodemanager.resource.memory-mb". I later found out that the correct values to assign to those configurations are the two values proposed at step 5.
Definitely, partly my fault (I do not have deep knowledge of YARN configuration), but a partly misleading doc indeed. I am now fine-tuning, trying different settings for the various Java heap sizes etc. Still, I have no idea why everything was working fine until recently and stopped working after upgrading to 5.11, as I did not change any configuration while upgrading and the physical resources are identical.
02-06-2017
10:58 AM
Hi, thanks for your reply. I have tried to create the table first, and indeed the overall behaviour changed. Now I get a different exception, which I'm pasting just below. The strange thing is that the class referenced in the exception, "ClientBackoffPolicyFactory", IS loaded and present (it's in "--jars", as detailed in earlier posts above this one). Here is the main excerpt from the error stack:
2017-02-06 19:31:51,426 WARN [Thread-8] ipc.RpcControllerFactory (RpcControllerFactory.java:instantiate(78)) - Cannot load configured "hbase.rpc.controllerfactory.class" (org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory) from hbase-site.xml, falling back to use default RpcControllerFactory
2017-02-06 19:31:51,431 ERROR [Thread-8] datasources.InsertIntoHadoopFsRelation (Logging.scala:logError(95)) - Aborting job.
java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
2017-02-06 19:31:51,538 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1
Given that "hbase-site.xml" is mentioned in the error stack, I'm also pasting that file just below:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://xxx01.yyy.it:8020/hbase</value>
</property>
<property>
<name>hbase.replication</name>
<value>true</value>
</property>
<property>
<name>hbase.client.write.buffer</name>
<value>2097152</value>
</property>
<property>
<name>hbase.client.pause</name>
<value>100</value>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>35</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>100</value>
</property>
<property>
<name>hbase.client.keyvalue.maxsize</name>
<value>10485760</value>
</property>
<property>
<name>hbase.ipc.client.allowsInterrupt</name>
<value>true</value>
</property>
<property>
<name>hbase.client.primaryCallTimeout.get</name>
<value>10</value>
</property>
<property>
<name>hbase.client.primaryCallTimeout.multiget</name>
<value>10</value>
</property>
<property>
<name>hbase.regionserver.thrift.http</name>
<value>false</value>
</property>
<property>
<name>hbase.thrift.support.proxyuser</name>
<value>false</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.enabled</name>
<value>true</value>
</property>
<property>
<name>hbase.snapshot.master.timeoutMillis</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.region.timeout</name>
<value>60000</value>
</property>
<property>
<name>hbase.snapshot.master.timeout.millis</name>
<value>60000</value>
</property>
<property>
<name>hbase.security.authentication</name>
<value>simple</value>
</property>
<property>
<name>hbase.rpc.protection</name>
<value>authentication</value>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>60000</value>
</property>
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
</property>
<property>
<name>zookeeper.znode.rootserver</name>
<value>root-region-server</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>xxx02.yyy.it,xxx01.yyy.it,xxx03.yyy.it</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.rest.ssl.enabled</name>
<value>false</value>
</property>
</configuration>
Thanks for any help... Meanwhile, I'll go on testing things, but maybe I'll try to find a workaround and do something completely different. This is starting to be a bit too much for my skills/patience 🙂
01-17-2017
05:20 PM
1 Kudo
A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)
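As a quick illustration of the rule, here is a sketch using the Python avro package (the package and the exact parse call are assumptions on my side): the record "Address" is defined once, on first use, and only referenced by its name afterwards, rather than being defined a second time.

import avro.schema

# Sketch: "com.example.Address" is defined before it is used; a second full
# definition of the same fullname would be rejected.
schema_json = """
{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "home", "type": {"type": "record", "name": "Address",
                              "fields": [{"name": "street", "type": "string"}]}},
    {"name": "work", "type": "Address"}
  ]
}
"""
schema = avro.schema.parse(schema_json)
print(schema)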
10-17-2016
02:18 PM
Ah, my error was not using an "hdfs:" path for the .py file. Thanks!
09-12-2016
09:25 AM
Thanks for the clarification. I thought that, as this widget was not part of the "new features" that you can manually activate in Hue 3.9, this would work... My bad
08-15-2016
05:15 AM
Thank you. Useful insight and crystal-clear argumentation, as usual from you. In the meanwhile I've had the chance to study a bit more, and in the end I came to a conclusion that matches your considerations, so I'm glad that I apparently moved in the right direction. As a matter of fact, I've looked at this open source project, http://opentsdb.net , and generally speaking the approach they use is the last one you explained. To provide a practical example, in my case:
- A new record every week for the same Customer entity
- Therefore, column versioning is NOT used at all (like you suggested)
- A "speaking" record key, e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This sort of key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should also be good from a table-splitting perspective (when the table grows, it will split up "evenly" and all the splits will take care of a part of the future inserts, balancing the machine load evenly)
- The same set of columns for each record, containing the newly sampled value of each field for that week, e.g. "<Total progressive time used Service X>"
This is the approach I used in the end (a small sketch is below). It has nothing to do with my original idea of using versions, but it perfectly matches the last approach you described in your answer. Regarding the fixed values (e.g. "Name", "Surname"), I've decided to replicate them every week too, as if they were time series themselves... I know, a waste of storage. I'm planning to modify this structure soon, move the fixed values into another table (Hive or HBase, I don't know yet), and pick up the information when I need it (for instance, during data processing I'll join the relevant anagraphic data into the relevant DataFrames). I just wanted to write a few more lines about the issue for posterity. I hope this post will be useful to people 🙂 Thanks again!
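A small sketch of the key/column layout described above (happybase is just a convenient client I'm using for illustration, and the host, table, and column names are placeholders, not what I actually use):

import datetime
import happybase

# Sketch only: one new row per customer per week, "speaking" row key, no versioning.
def weekly_row_key(cust_id, day=None):
    day = day or datetime.date.today()
    return "{0}+{1}".format(cust_id, day.strftime("%Y-%m-%d"))

connection = happybase.Connection("<HBASE_THRIFT_HOST>")
table = connection.table("customer_weekly")
table.put(weekly_row_key("C000123"),
          {"usage:total_time_service_x": "3600"})   # new sampled value for this week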
07-28-2016
03:11 AM
Thanks. It seems like a good alternative, and as a matter of fact I was not aware of its availability in CDH 5.7. I'm marking the thread as solved, even though for now I don't know yet whether all the features I'd need will be there in the native hbase-spark connector.
07-16-2016
02:46 PM
I ended up changing the permissions to 777 on both the "source" and "destination" directories on HDFS. Of course, I have a limited understanding of all the security implications going on behind the scenes, but this seems to me like a bug and not a feature. If I log in to Hue as a particular user, and this user has been granted the correct read/write permissions on the relevant directories, I don't see why I should be obliged to change the permissions to 777 for the container to be able to do the job. Therefore I'm not marking this post as "resolved", even though I did work around the issue somehow (in a very bad way, actually).
05-19-2016
11:39 AM
Can you please share your "workflow.xml" and "job.properties" files? Also, can you try to adapt your "spark_easy.py" as follows and give it a try?
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("SparkEasy").set("spark.driver.memory", "1g")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)
simple_DF = sqlCtx.sql("select * from <WHATEVER_EXISTING_TABLE_HERE>")
HTH
03-06-2016
11:13 AM
Thanks for the suggestion about registering the class, and for the additional info. I have to say that I had already read the link you sent me, but I didn't really get what "you have to register the classes first" meant. In fact, I gave up on my attempts at the time and stuck with the Java serializer for my testing purposes. Maybe I'll get back to this in the future, and I'll do it with this additional knowledge now (for the record, a sketch of what the registration looks like is below). Thanks a lot. Also, I have to say that, after reading all this, I now find it a bit strange that Cloudera sets Kryo as the default serializer. Anyway.
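A sketch of the registration, as I understand it (the class names below are purely illustrative examples, and registration concerns the JVM-side classes being serialized):

from pyspark import SparkConf, SparkContext

# Sketch only: enable Kryo and register classes up front; the class names are hypothetical.
conf = (SparkConf()
        .setAppName("kryo_registration_example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.classesToRegister",
             "org.example.MyClass,org.example.MyOtherClass"))
sc = SparkContext(conf=conf)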
01-21-2016
01:43 AM
Maybe it has something to do with the approach/configuration explained at the link here below? Just an idea... http://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html