Member since: 03-04-2015
Posts: 96
Kudos Received: 12
Solutions: 1

My Accepted Solutions

Title | Views | Posted |
---|---|---|
| 4542 | 01-04-2017 02:33 PM |
| 12352 | 07-17-2015 03:11 PM |
10-12-2020
10:08 PM
I imported our existing v5.12 workflows via command-line loaddata. They show up in the Hue 3 Oozie Editor, but not in Hue 4. We are using CDH 5.16. I find the new "everything is a document" paradigm confusing and misleading: Oozie workflows, Hive queries, Spark jobs, etc. are not physical documents in the Unix/HDFS sense that normal users would expect, with absolute paths that can be accessed and manipulated directly. The traditional-style Hue 3 UI lets one focus on working with the technology at hand, instead of imposing a Grand Unifying Design on the user.
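For reference, the import was roughly along these lines; this is only a hedged sketch, and the parcel path and fixture file name are placeholders for whatever your environment and Hue version use, not values from our setup:

# export on the source cluster (dumpdata/loaddata are standard Django management commands shipped with Hue)
/opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue dumpdata --indent 2 > hue_docs.json
# import on the target cluster
/opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue loaddata hue_docs.json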
07-01-2020
01:23 PM
The Phoenix-Hive storage handler as of v4.14.0 (CDH 5.12) seems buggy. I was able to get the Hive external wrapper table working for simple queries, after tweaking the column mapping around upper/lower-case gotchas. However, it fails when I try an "INSERT OVERWRITE DIRECTORY ... SELECT ..." command to export to a file:

org.apache.phoenix.schema.ColumnNotFoundException: ERROR 504 (42703): Undefined column. columnName=<table name>

This is a known problem that no one is apparently looking at: https://issues.apache.org/jira/browse/PHOENIX-4804
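To make the failure mode concrete, here is a minimal sketch of the setup; the table name, columns, ZooKeeper hosts, HiveServer2 URL, and output path are all hypothetical placeholders, not values from my cluster. The lower-case Hive columns are mapped to the upper-case Phoenix columns via phoenix.column.mapping. Plain SELECTs through the wrapper table work; the export step is the one that throws the ColumnNotFoundException:

cat > phoenix_export.hql <<'EOF'
-- external Hive table wrapping an existing Phoenix table
CREATE EXTERNAL TABLE jobs_hive (
  jobid  string,
  status string
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "NS.JOBS",
  "phoenix.zookeeper.quorum" = "zkhost1,zkhost2,zkhost3",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.column.mapping" = "jobid:JOBID,status:STATUS",
  "phoenix.rowkeys" = "jobid"
);

-- simple queries through the wrapper table work fine
SELECT count(*) FROM jobs_hive;

-- this export step fails with ERROR 504 (42703) Undefined column (PHOENIX-4804)
INSERT OVERWRITE DIRECTORY '/tmp/phoenix_export'
SELECT jobid, status FROM jobs_hive;
EOF

beeline -u "jdbc:hive2://hiveserver2:10000/default" -f phoenix_export.hql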
02-15-2019
06:40 PM
Try putting the extraClassPath settings into the global Spark config in Ambari instead, in the spark-defaults section (you may have to add them as custom properties). This works for us with Cloudera and Spark 1.6.
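For illustration, these are the kinds of entries we mean; the /opt/ext-libs path is just a placeholder for wherever your extra jars live. In Ambari they go under Spark > Configs, in the custom spark-defaults section:

# placeholder path - point this at your external library directory
spark.driver.extraClassPath=/opt/ext-libs/*
spark.executor.extraClassPath=/opt/ext-libs/*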
07-24-2018
01:29 PM
We were unable to access Spark app log files from either YARN or Spark History Server UIs, with error "Error getting logs at <worker_node>:8041". We can see the logs with "yarn logs" command. Turns out our yarn.nodemanager.remote-app-log-dir = /tmp/logs, but the directory was owned by "yarn:yarn". Following your instruction fixed the issue. Thanks a lot! Miles
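For anyone hitting the same thing, these were the quick checks (the application id below is just a placeholder): the aggregated logs are readable from the CLI, and the HDFS listing shows who owns the aggregation directory:

yarn logs -applicationId application_1234567890123_0001   # logs are readable from the CLI
hdfs dfs -ls -d /tmp/logs                                  # owner/group of yarn.nodemanager.remote-app-log-dir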
04-20-2018
10:04 PM
Ambari server was getting HTTP 502 trying to query timeline metrics - this fixed it for us behind corporate firewall. Thanks!
04-10-2018
08:57 AM
Hi Eric: Thanks for your explanation. Would you be able to point us to the formal licensing statements that say the same? Our corporate legal would require them in order to approve CDK (and CDS2, for that matter) for production use. Miles
04-09-2018
03:58 PM
1 Kudo
You can choose to either compile the package into your application jar, or manually install it on every Spark/YARN worker node and include that directory in your extraClassPath. Sample pom.xml on HDP 2.6.3:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.3.2.6.3.0-235</version>
  <scope>provided</scope>
</dependency>
...
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>

Use <scope>provided</scope> if you choose external installation; leave it out if you want to compile the package in. Compiling in is simpler, but if you have a large cluster or multiple Spark applications that will share such external libraries, the "provided" scope may be the better choice. In that case, you need to specify:

--conf "spark.driver.extraClassPath=...:<your ext lib path>/*"
--conf "spark.executor.extraClassPath=...:<your ext lib path>/*"

on your spark-submit command line.
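Putting it together, a hedged example of the external-installation case; the class name, application jar, and /opt/ext-libs path are placeholders:

spark-submit --master yarn-cluster --class com.example.MyApp \
  --conf "spark.driver.extraClassPath=/opt/ext-libs/*" \
  --conf "spark.executor.extraClassPath=/opt/ext-libs/*" \
  myapp.jar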
03-30-2018
09:18 AM
We are interested in using CDK instead of the base Apache version for its additional operational features. However, I cannot find any information about its pricing and licensing terms, which our corporate legal requires. There is nothing in the CDK documentation. The main Cloudera product and download pages seem to have been redesigned and no longer even provide links to the individual component distros like Spark 2 and Kafka. Pricing has also changed to be based on high-level "solutions" instead of individual software products. Since CDK is essentially CM + ZooKeeper + Kafka in a parcel, would it be licensed on the same basis as base CDH? I believe (the simpler) Cloudera Spark 2 is indeed free, but I cannot find official information on that, either. Can the Cloudera corporate folks help answer? Thanks, Miles Yao
Labels:
- Apache Kafka
- Apache Spark
03-20-2018
07:48 PM
1 Kudo
Note that phoenix-spark2.jar MUST precede phoenix-client.jar in extraClassPath, otherwise connection will fail with: java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
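As a sketch of the working ordering (using the standard HDP 2.6 paths from this thread; adjust for your layout):

SPARK_CP=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf
spark-shell --master yarn-client \
  --conf "spark.driver.extraClassPath=$SPARK_CP" \
  --conf "spark.executor.extraClassPath=$SPARK_CP"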
02-01-2018
06:39 PM
Found the solution - place phoenix-spark2.jar before phoenix-client.jar, and everything worked. The Spark2/Scala 2.11 versions of org.apache.phoenix.spark classes need to overlay those included in the main phoenix-client.jar. Try it and let us know. 🙂
01-31-2018
06:07 PM
We tried this too on our HDP 2.6.3 cluster. Sure enough, we got the same issue:

/usr/hdp/current/spark2-client/bin/spark-shell --master yarn-client \
  --driver-memory 3g --executor-memory 3g --num-executors 2 --executor-cores 2 \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/etc/hbase/conf" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/etc/hbase/conf"

scala> val jobsDF = spark.read.format("org.apache.phoenix.spark").options(Map(
     |   "table" -> "ns.Jobs", "zkUrl" -> zkUrl)).load
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,file:/usr/hdp/2.6.3.0-235/phoenix/phoenix-4.7.0.2.6.3.0-235-client.jar!/ivysettings.xml will be used
2018-01-30 16:24:33,254 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x79bb14d8 connecting to ZooKeeper ensemble=zkhost1:2181,zkhost2:2181,zkhost3:2181
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1575)
...
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 83 more
Tweaking extraClassPath and --jars using phoenix-client.jar, phoenix-4.7.0.2.6.3.0-235-spark2.jar, and spark-sql_2.11-2.2.0.2.6.3.0-235.jar made no difference. I am inclined to agree with the other poster that Hortonworks' phoenix-client.jar is not actually Spark2-compatible, the release notes notwithstanding.
01-30-2018
06:48 PM
If you look outside the Hortonworks distribution 😉, Cloudera is pushing Kudu, which is supposed to be a middle ground between Hive and Phoenix. There is also Splice Machine, an MVCC SQL engine on top of HBase that is now open-sourced. Good luck!
09-14-2017
02:54 PM
When you run OfflineMetaRepair, you will most likely run it from your own userid or as root. You may then get opaque errors like "java.lang.AbstractMethodError: org.apache.hadoop.hbase.ipc.RpcScheduler.getWriteQueueLength()". If you check HDFS, you may see that the meta directory is no longer owned by hbase:

$ hdfs dfs -ls /hbase/data/hbase/
Found 2 items
drwxr-xr-x   - root  hbase          0 2017-09-12 13:58 /hbase/data/hbase/meta
drwxr-xr-x   - hbase hbase          0 2016-06-15 15:02 /hbase/data/hbase/namespace

Manually running chown -R on it and restarting HBase fixed it for me.
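Concretely, the repair was along these lines (a hedged sketch; run it as a user allowed to change HDFS ownership):

# give the meta directory back to the hbase user, then restart HBase
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /hbase/data/hbase/meta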
08-22-2017
08:38 PM
Thanks for the write-up. Does the above imply that the newly split region will always stay on the same RS, or is it configurable? If it's always local, then won't the load on the "hot" region server just get heavier and heavier, until the global load balancer thread kicks in? Shouldn't HBase just create the new daughter regions on the least-loaded RS instead? There was a lot of discussion related to this in HBASE-3373, but it isn't clear what the resulting implementation was.
08-13-2017
09:20 PM
This feature behaves unexpectedly when the table has been migrated from another HBase cluster. In that case, the table creation time can be much later than the row timestamps of all of its data. A flashback query meant to select an earlier subset of the data returns the following failure instead:

scala> df.count
2017-08-11 20:12:40,550 INFO [main] mapreduce.PhoenixInputFormat: UseSelectColumns=true, selectColumnList.size()=3, selectColumnList=TIMESTR,DBID,OPTION
2017-08-11 20:12:40,550 INFO [main] mapreduce.PhoenixInputFormat: Select Statement: SELECT "TIMESTR","DBID","OPTION" FROM NS.USAGES
2017-08-11 20:12:40,558 ERROR [main] mapreduce.PhoenixInputFormat: Failed to get the query plan with error [ERROR 1012 (42M03): Table undefined. tableName=NS.USAGES]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#13L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#16L])
+- Project
+- Scan ExistingRDD[TIMESTR#10,DBID#11,OPTION#12]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
...

This apparently means that Phoenix considers the table nonexistent at that point in time. I tested the same approach in sqlline and, sure enough, the table is missing from "!tables". Any workaround?
08-03-2017
09:31 PM
Same question about accessing Phoenix tables from full HDP 2.6 Spark 2 SQL.
07-21-2017
01:29 PM
That's good news. But I think the requester would like to know when Cloudera plans to integrate Spark 2 into CDH, not as a separate install (like what Hortonworks does). Thanks, Miles
07-12-2017
06:35 PM
On HDP 2.6, appending $CLASSPATH seems to break the Spark2 interpreter with:

org.apache.zeppelin.interpreter.InterpreterException: Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

Is the included Phoenix-Spark driver (phoenix-spark-4.7.0.2.6.1.0-129.jar) certified to work with Spark2? I thought it was the preferred route, rather than going through JDBC. Thanks!
07-07-2017
03:45 PM
I had the same problem with a valid Linux/HDFS user as Ambari ID, the solution worked - thanks!
05-22-2017
06:36 PM
Debian 8 (Jessie) has been the current stable version for a year now. When do you plan to support it? Is there any known issue that blocks its adoption?
05-22-2017
06:09 PM
We have HDP 2.4 on Debian 7. There is no /usr/lib/python2.6/site-packages/ambari_server/os_type_check.sh installed - only os_check_type.py. And all it checks is whether the current node's OS matches the cluster's, not whether the OS version is supported. /usr/lib/ambari-server/lib/ambari_commons/resources/os_family.json seems to list the supported OS versions (e.g. RedHat 6/7, Debian 7, Ubuntu 12/14), which matches the documentation.
03-16-2017
09:20 AM
2 Kudos
First, thanks for the helpful, detailed explanation. We have a similar issue migrating from the default embedded DB to a separate PostgreSQL instance. Some comments:
- The documentation needs to be clearer: the criteria you listed for determining "embeddedness" are not intuitive and could not have been inferred from the documentation. Your write-up should have been included right there.
- The embeddedness criteria seem over-strict. Insisting that the DB be off-cluster is based on the old 3-tier architecture assumption; the Hadoop architectural principle, on the other hand, is about co-hosting data and software. On the practical side, basing such a central component off-cluster just seems needlessly inefficient and difficult to manage. Can't the best practice be to use one dedicated node for CM, CMS, and the DB? Can Cloudera provide some guidelines?
- For production use, the external DB option requires too many manual steps across multiple services. Can Cloudera Manager provide more central administration and integration, including transparent migration from the embedded DB? This again requires the DB node to be part of the cluster under CM management.

Thanks, Miles Yao
Tags:
- helpful
01-19-2017
09:33 PM
Can you elaborate a bit on how to set up the environment properly in the shell wrapper before calling spark-submit? Which login should the action run as (owner/yarn/spark/oozie)? We had a lot of problems getting the setup right when we implemented shell actions that wrap Hive queries (to process the query output). spark-submit itself is a shell wrapper that does a lot of environment initialization, so I imagine it won't be smooth. Thanks! Miles
01-04-2017
02:33 PM
We were able to install the official parcel. The only problem we encountered was that all the signature files in the repository have the extension .sha1, while our CM (5.8.3) was expecting .sha. Manually renaming them allowed the install to complete.
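For anyone else hitting this, the rename was along these lines (a hedged sketch; /opt/cloudera/parcel-repo is the usual CM parcel repository path, adjust if your repo lives elsewhere):

cd /opt/cloudera/parcel-repo
# rename every .sha1 signature file to the .sha extension CM expects
for f in *.sha1; do mv "$f" "${f%.sha1}.sha"; done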
12-13-2016
01:18 PM
Hi Cloudera folks: The new official Spark2 release looks identical to the beta version released last month. Is there any difference to expect if we already have the beta installed? Should we re-install? Thanks, Miles
Labels:
- Apache Spark
11-08-2016
10:03 AM
Yes, that works. "CSD file" sounds like a text config file; a brief note on the instruction page that it is actually a JAR would have made this clearer. Thanks again. Miles
11-03-2016
03:01 PM
The CSD download link given actually points to a jar file with the following directory tree:
$ jar tvf SPARK2_ON_YARN-2.0.0.cloudera.beta1.jar
     0 Wed Sep 21 17:24:28 CDT 2016 descriptor/
     0 Wed Sep 21 17:24:28 CDT 2016 scripts/
     0 Wed Sep 21 17:24:28 CDT 2016 aux/
     0 Wed Sep 21 17:24:28 CDT 2016 aux/client/
  3312 Wed Sep 21 17:24:28 CDT 2016 images/icon.png
  1711 Wed Sep 21 17:24:28 CDT 2016 aux/client/spark-env.sh
     0 Wed Sep 21 17:24:28 CDT 2016 images/
 18456 Wed Sep 21 17:24:28 CDT 2016 descriptor/service.sdl
     0 Wed Sep 21 17:50:46 CDT 2016 meta/
    20 Wed Sep 21 17:50:46 CDT 2016 meta/version
     0 Wed Sep 21 17:50:58 CDT 2016 META-INF/
  1813 Wed Sep 21 17:24:28 CDT 2016 scripts/control.sh
 12362 Wed Sep 21 17:24:28 CDT 2016 scripts/common.sh
   104 Wed Sep 21 17:50:58 CDT 2016 META-INF/MANIFEST.MF
Now, when the documentation specifies "install Spark2 CSD", which file(s) is it referring to exactly? Just descriptor/service.sdl, or the entire jar to /opt/cloudera/csd? The two scripts above look like operational scripts for CM.
Thanks,
Miles Yao
Labels:
- Apache Spark
09-15-2016
12:26 PM
CDH 5.7.1 - Same issue, but for configuring app-specific log4j.

Working spark-submit command line:

--master yarn-cluster --files hdfs:/user/myao/config/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --class <class> <jar>

Cannot get the log4j portion above to work in the Spark action - everything else is OK:

<action name="spark-7844">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>spark.driver.extraJavaOptions</name>
        <value>-Dlog4j.configuration=log4j.properties</value>
      </property>
      <property>
        <name>spark.executor.extraJavaOptions</name>
        <value>-Dlog4j.configuration=log4j.properties</value>
      </property>
    </configuration>
    <master>yarn-cluster</master>
    <mode>cluster</mode>
    <name>...</name>
    <class>...</class>
    <jar>...</jar>
    <spark-opts>--executor-memory 2G --files hdfs:/user/myao/config/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" </spark-opts>
  </spark>
  <ok to="End"/>
  <error to="Kill"/>
</action>

Driver stderr:

Using properties file: null
Warning: Ignoring non-spark config property: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
Warning: Ignoring non-spark config property: "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
Parsed arguments:
  master                  yarn-cluster
  deployMode              cluster
  executorMemory          2G
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    ........
  driverExtraLibraryPath  /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   hdfs:/user/myao/config/log4j.properties
  pyFiles                 null
  archives                null
  mainClass               ....
  primaryResource         ....
  name                    ....
  childArgs               []
  jar                     .............
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Thanks!
08-26-2016
09:09 AM
Thanks for your detailed reply. That's a valid and understandable concern. We chose Cloudera for our production Hadoop platform precisely for the quality of integration and the maturity you offer. We as users simply need some clarity from the vendor about observed feature discrepancies from the official distro, especially for such a critical component as Spark. Are there any other discrepancies or customizations that we should be aware of? Could Cloudera be more transparent in its release notes whenever features are removed or modified from the official open-source versions? Searching for "SparkR" in the CDH 5.7 release notes for Spark finds 4 JIRAs, which would give one the impression that SparkR is included. Thanks again, Miles