Member since: 11-08-2016
Posts: 32
Kudos Received: 7
Solutions: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 562 | 11-09-2016 08:58 AM |
06-19-2017
11:51 AM
Unfortunately it does not change anything. The tests are still not run. 😞
06-13-2017
01:36 PM
Hi guys, I have read the article about testing and I would like to try out spark-testing-base with Spark. Unfortunately I am not an expert with Maven, so I can't get the tests to run. My project looks like this:
pom.xml
src/main/scala/com/test/spark/mycode.scala
src/test/scala/com/test/spark/test.scala
I can run 'mvn package' without problems, but it says:
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mycode ---
[INFO] Nothing to compile - all classes are up to date
Why does my test not run? I have added the dependency as explained on the GitHub page for spark-testing-base and used the first example from its wiki. My pom.xml looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.test.spark</groupId>
<artifactId>mycode</artifactId>
<version>0.0.1</version>
<name>${project.artifactId}</name>
<description>Simple test</description>
<inceptionYear>2017</inceptionYear>
<!-- change from 1.6 to 1.7 depending on Java version -->
<properties>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.5</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>1.6.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark sql dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark hive dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- spark-testing-base dependency -->
<dependency>
<groupId>com.holdenkarau</groupId>
<artifactId>spark-testing-base_2.11</artifactId>
<version>${spark.version}_0.6.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalactic</groupId>
<artifactId>scalactic_2.11</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>3.2.0-SNAP5</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<!-- Create JAR with all dependencies -->
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<scalaCompatVersion>${scala.compat.version}</scalaCompatVersion>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- for testing scala code -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.20</version>
<configuration>
<argLine>-Xmx2048m -XX:MaxPermSize=2048m</argLine>
</configuration>
</plugin>
</plugins>
</build>
</project>
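One thing I am not sure about: the scala-maven-plugin execution above only binds the compile goal, so maybe the sources under src/test/scala are never compiled and Surefire has nothing to run (as far as I understand, maven-compiler-plugin only handles Java sources, which would explain the "Nothing to compile" message for testCompile). A rough sketch of what I am thinking of adding: an extra testCompile execution for scala-maven-plugin plus the scalatest-maven-plugin (its version here is a guess):
<!-- extra execution inside the existing scala-maven-plugin -->
<execution>
  <id>scala-test-compile</id>
  <phase>test-compile</phase>
  <goals>
    <goal>testCompile</goal>
  </goals>
</execution>
<!-- separate plugin to run ScalaTest suites in the test phase -->
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>1.0</version>
  <executions>
    <execution>
      <id>test</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>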
Do you have any advice on what I need to change in order to have my tests run? Best, Ken
Labels:
- Apache Spark
06-12-2017
12:24 PM
Hi Laurent, thanks for your answer. Is it advisable to use the Sqoop client with a Java action, or should one stick to the Sqoop action?
The idea is to have better error handling within the Java action.
06-12-2017
08:32 AM
Hi, how can I start a Sqoop2 server on my Sandbox and use the Sqoop Client API to import data into HDFS?
Labels:
- Apache Sqoop
06-08-2017
11:37 AM
Hi everybody, I have a question regarding Oozie and stabilizing my development process. First of all, to write safe code I found this article about test-driven development in Hadoop. As far as I understand, this means that I have to provide developer tests separately for each tool (i.e. Sqoop, Hive, Spark, ...). So how would a typical development process look in your opinion? Should all code for Hive and Spark be written first and then be checked by unit tests which were defined before developing the actual code? This would mean using Beeline or HiveRunner as well as spark-testing-base, and only then testing the Oozie workflow with MiniOozie?

In addition I would like to know how I can handle errors in Oozie appropriately. I had the feeling that sometimes an error occurred (maybe in Hive or something else) and the complete workflow was stopped at that point. So the action was stopped and did not even reach the point where Oozie decides whether to take the OK or ERROR branch, and all my error handling in Oozie was not useful. When and how can that happen? Is it a type of error which has to be tested beforehand in the tool itself and not in Oozie? Maybe I do not really understand how Oozie delegates the action to YARN and where those errors arise.

Any help on this topic is really appreciated. Browsing the web I didn't find much input that refers precisely to this topic. Thanks in advance!
Labels:
- Apache Oozie
04-20-2017
05:51 AM
Hi @Jay Zhou, can you be a bit more specific about what you changed? What exactly did you do with this line?
val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
I have a similar problem where I get the warning
WARN Hive: Failed to access metastore. This class should not accessed in runtime.
but only when I run the job via Oozie. When I use spark-submit the code works, so I guess the dependencies are right. Do you have any idea what can cause this?
04-07-2017
08:39 AM
Hi guys, I am on HDP 2.5 and I would like to run a Spark action within Oozie. The jar was tested with spark-submit before. My action looks like this (NameNode etc. come from the global settings):
<action name="spark-action" retry-max="1">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn-cluster</master>
<name>Spark Test</name>
<class>org.SparkTest</class>
<jar>/myhdfs/spark_test.jar</jar>
<spark-opts>--executor-memory 2G</spark-opts>
</spark>
<ok to="end"/>
<error to="kill"/>
</action>
At first I tried with mode="locale", but then I got an error that the master and mode don't fit together (Client deploy mode is not compatible with master "yarn-cluster"). When I leave the mode out entirely, I get an error that the file does not exist, but the file is definitely in HDFS in the right place. What do I need to change? The error is:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Application application_1491479953113_0414 finished with failed status
org.apache.spark.SparkException: Application application_1491479953113_0414 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1169)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:289)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:211)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:51)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:242)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
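I am also wondering whether the way I reference the jar is the problem. One variant I might try (just a sketch, not verified): uploading spark_test.jar into a lib/ directory next to workflow.xml on HDFS and then referencing only the file name in the action:
<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn-cluster</master>
  <name>Spark Test</name>
  <class>org.SparkTest</class>
  <!-- jar placed in the workflow application's lib/ directory on HDFS -->
  <jar>spark_test.jar</jar>
  <spark-opts>--executor-memory 2G</spark-opts>
</spark>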
In my job.properties I added the following:
oozie.use.system.libpath=true
oozie.action.sharelib.for.spark=spark,hcatalog,hive
Thanks for your help!!
Labels:
- Apache Oozie
- Apache Spark
03-23-2017
06:39 AM
Hi @Vinod Bonthu, your answer really helped, but unfortunately it wasn't the complete solution. As stated in my error log, I had to add some jars via spark-submit --jars, and I also added my hive-site.xml with spark-submit --files /usr/hdp/current/spark-client/conf/hive-site.xml. With those two changes and removing .setMaster("local[2]") from the code, it worked! Thanks for your help!
03-22-2017
11:21 AM
Yes I did. The result is:
-rwxr-xr-x 3 USER USER 12252 2017-03-17 10:49 /path-to-jar/my.jar
And the output when I run the spark-submit command is stated above.
03-22-2017
10:33 AM
Hi @Aditya Deshpande,
17/03/22 11:27:53 INFO HiveContext: Initializing execution hive, version 1.2.1
17/03/22 11:27:53 INFO ClientWrapper: Inspected Hadoop version: 2.7.3.2.5.0.0-1245
17/03/22 11:27:53 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.3.2.5.0.0-1245
17/03/22 11:27:53 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/03/22 11:27:53 INFO ObjectStore: ObjectStore, initialize called
17/03/22 11:27:53 WARN HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
...
...
That is the first error I find in the stderr logs. Another one is:
17/03/22 11:27:53 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/03/22 11:27:53 INFO ObjectStore: ObjectStore, initialize called
17/03/22 11:27:53 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
...
...
Can you make use of this?
03-22-2017
06:41 AM
Hi @Vinod Bonthu, thanks for the quick guide, but I have a follow-up question. It seems my pom is fine, since I can package it with Maven, but when I run it via spark-submit I get the error stated in my original post. However, running one of the Spark examples with the command below works. Why is that? I have the feeling that something is still not right, and as a first step I just want to run an example packaged by myself via spark-submit. Afterwards I will try an Oozie workflow. Thanks again!
spark-submit --class org.apache.spark.examples.SOMEEXAMPLE --master yarn-cluster hdfs://path-to-spark-examples.jar
03-21-2017
05:20 PM
Hi pbarna, oh I am sorry. Yes, I could run mvn package without any errors (at the beginning some dependencies were missing, but I fixed that). The error in my original post occurs when I try to run my packaged jar on HDP. So I do not need any cluster- or Hortonworks-specific things in my pom, right?
03-21-2017
02:59 PM
Hi folks, I would like to build a minimal example packaged with Maven based on Scala code (like a hello world) and run it on an HDP 2.5 sandbox. What do I need to specify in my pom.xml? So far I have this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.test.spark</groupId>
<artifactId>Test</artifactId>
<version>0.0.1</version>
<name>${project.artifactId}</name>
<description>Simple test app</description>
<inceptionYear>2017</inceptionYear>
<!-- change from 1.6 to 1.7 depending on Java version -->
<properties>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.5</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>1.6.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark sql dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark hive dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<!-- Create JAR with all dependencies -->
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<scalaCompatVersion>${scala.compat.version}</scalaCompatVersion>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
And my Scala code is this:
package com.test.spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql
import org.apache.commons.lang
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.rdd.RDD
object Test {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Test") .setMaster("local[2]")
val spark = new SparkContext(conf)
println( "Hello World!" )
}
}
I run the code with:
spark-submit --class com.test.spark.Test --master yarn --deploy-mode cluster hdfs://HDP25/test.jar
Unfortunately it does not run. 😞 See the image attached. Am I missing something? Can you please help me to get a minimal example running? Thanks and kind regards
Labels:
- Apache Spark
02-27-2017
10:15 AM
Hi Sunile, thanks for your answer. We think we will store our initial model and then all alter scripts, but the alter scripts will be folded back into the initial model in case a complete re-deployment is wanted. To view a logical model, a tool will be used which can reverse-engineer the DDL. We will try to establish a workflow in that fashion and hope that it works.
02-14-2017
09:38 AM
Hi guys,
I am looking for best practices for using Hive, especially regarding database modeling, software development and, if possible, version control. At the moment I am struggling at the point where the logical world meets the code.
I have found tools which assist in modeling databases in Hive, e.g. Embarcadero (a Hortonworks partner?). So I could model my databases there and create DDL scripts, I guess. To get version control I can add those scripts to git or something else. What happens if many users want to work on the logical model? How do you handle such problems? Jumping back and forth between database model versions is only possible with git versioning, not with the tool alone, or is it?
All other scripts regarding the Hive databases and tables (ingestion and so on) live in a git repository. So they are under perfect version control, but if something changes, many adaptions have to be made (at least one in the config file and maybe in INSERT statements etc.). What I am missing in the code world is a nice view of databases and tables, even an entity-relationship diagram, but that's not of main interest.
What is, in your opinion, a good way to tackle these problems? I mean, someone like Facebook does not want to manage all tables and databases via a Hive view or solely based on code, or do they? How do you keep the overview in the big data world? Any help is really appreciated! Thanks in advance! Kind regards, Ken
Labels:
- Apache Hive
02-13-2017
08:02 AM
1 Kudo
Hi Timothy, okay, I had a closer look into it. To me it looks like Apache NiFi (Hortonworks DataFlow) is more or less a tool for piping data from non-Hadoop systems (RDBMS, IoT, ...) into Hadoop. Thereafter, another tool is needed to manage the data; here Apache Falcon has its strengths. Airflow, Luigi and Azkaban are solutions for broader scheduling tasks and need more effort to be installed next to your cluster. Quickly dipping my toe into scheduling with Spark, I didn't come up with many resources. Last but not least, Oozie (e.g. managed via Hue) seems like the easiest fit to manage all kinds of workflows (Sqoop, Hive, Shell, Spark, ...) within a cluster. Of course I have dependencies between single actions, whereas dependencies between coordinators are missing; in my humble opinion this functionality can be added with flag files, as sketched below. I think Oozie is still the best fit, although it is cumbersome to handle via XML files. Of course there is the Eclipse plugin to visualize workflows and create them as well. Feel free to correct my views. Thanks!
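To illustrate what I mean by flag files: a rough sketch of a coordinator that waits for an upstream dataset via a done-flag before it triggers its own workflow (all names, paths and frequencies below are made up):
<coordinator-app name="downstream-coord" frequency="${coord:days(1)}"
                 start="2017-02-01T00:00Z" end="2018-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="upstream-output" frequency="${coord:days(1)}"
             initial-instance="2017-02-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/upstream/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- the action below only starts once this flag file exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="upstream-output">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/downstream-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
The upstream coordinator (or a step at its end) would then have to write the _SUCCESS flag into the matching directory.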
02-11-2017
03:34 AM
Thanks, I will have a look into it. Especially controlling jobs with Spark sounds interesting; I haven't heard of it before. Do you have a source? Thanks again!
02-10-2017
07:55 PM
1 Kudo
Hi Timothy, thanks for your quick reply. The point is that I am quite unhappy with Oozie. Well, it does its job, but handling the XMLs is not my favourite. So I was looking for something more sophisticated where I can have dependencies between different job packages (i.e. coordinators in Oozie). I thought Airflow could be my solution.
02-10-2017
12:33 PM
1 Kudo
Is there any best practice or installation guide out there from Hortonworks to set up Airflow within HDP and start arbitrary jobs? I have seen there are some operators available, and the rest could be managed via shell.
11-29-2016
01:55 PM
Hey guys, I would like to pass properties defined in the <global> section via <job-xml> or <configuration> to a Hive action parameter, but I cannot manage to do so. Example config:
<workflow>
<global>
<job-xml>hive_params.xml</job-xml>
</global>
...
<hive xmlns="uri:oozie:hive-action:0.6">
<script>h_script.hql</script>
<param>test=${test_from_hive_params_xml}</param>
</hive>
How can I achieve this properly without changing my hive-site.xml? Or how can I pass a second XML file to Hive without touching hive-site.xml? This is needed because I do not want to type all parameters by hand and pass them via the global section. Looking forward to your help! Thanks and regards
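One idea I am considering, in case job-xml properties simply cannot be used for parameterization: putting the values into a config-default.xml next to workflow.xml instead, since properties from that file should be available for EL substitution (this is only a sketch, I have not tested it):
<!-- config-default.xml in the workflow application directory -->
<configuration>
  <property>
    <name>test_from_hive_params_xml</name>
    <value>some_value</value>
  </property>
</configuration>
The Hive action would then keep <param>test=${test_from_hive_params_xml}</param> as above.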
Labels:
- Apache Hive
- Apache Oozie
11-09-2016
08:58 AM
It seems that I managed to solve the problem. What was missing:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
One can check the sharelib with:
oozie admin -oozie http://localhost:11000/oozie -shareliblist sqoop
I hope this helps others as well; it should be the same for different database systems.
11-08-2016
03:25 PM
Hey guys,
I am new to HDP (v2.5 as a sandbox) and I am trying to submit an Oozie job using a Sqoop action to connect to an Oracle database.
I managed to run Oozie jobs with a simple Hive action and a simple Sqoop action (just sending 'version') - both worked. The error occurs when sending the command:
<command>sqoop-list-tables --connect jdbc:oracle:thin:@//XX.XX.XX.XXX:XXXX/NAME --username USER --password PW</command>
The error is:
2016-11-08 15:14:46,581 WARN SqoopActionExecutor:523 - SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[sqoopaction] JOB[0000065-161107122627815-oozie-oozi-W] ACTION[0000065-161107122627815-oozie-oozi-W@sqoopjob] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Actually, the command is working from the command line, so it has something to do with Oozie. At some point I read that there might be an issue with the ojdbc6.jar file, so I copied it to several /lib/ folders without success.
To get rid of the spaces in the command I also tried <arg></arg> elements (roughly as sketched at the end of this post), but it didn't work. What am I doing wrong here? Any advice is appreciated!
I really have no clue what to do and the error log doesn't help. Thanks and regards, Ken (If you need further information, just let me know!)
-------------
job.properties:
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=hdfs://sandbox.hortonworks.com:8050
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.wf.application.path=data/oozie/sqoopaction
oozie.use.system.libpath=true
oozie.action.sharelib.for.sqoop = hive,hcatalog,sqoop
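For completeness, the <arg> variant I tried looks roughly like this (the schema version is an assumption, and maybe I split the arguments incorrectly?):
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <arg>list-tables</arg>
  <arg>--connect</arg>
  <arg>jdbc:oracle:thin:@//XX.XX.XX.XXX:XXXX/NAME</arg>
  <arg>--username</arg>
  <arg>USER</arg>
  <arg>--password</arg>
  <arg>PW</arg>
</sqoop>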
Labels:
- Apache Oozie
- Apache Sqoop