
Creating a Spark program with aws POM dependencies to load to s3 bucket


I'm using HDP 2.6.3 with Spark 2.2 (not HDP Cloud), and I'm trying to write to S3 from an IntelliJ project. I have no problem writing to the S3 bucket from the shell, but when I test my app on my local machine in IntelliJ I get strange errors after adding the hadoop-aws and aws-java-sdk dependency jars, and the error changes depending on where I place them in the ordering of the dependencies in my POM file:

- With the Spark dependencies at the top, I get: ERROR MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantiated.
- If I take the hadoop-aws dependency out and then invalidate the cache, everything runs fine except saving to S3, where I get a class-not-found error for org/apache/http/message/TokenParser.
- If I put hadoop-aws above my Spark dependencies, I get class-not-found errors for Spark classes.

I configure s3a access by setting the fs.s3a.impl, fs.s3a.access.key, and fs.s3a.secret.key properties through sc.hadoopConfiguration.set; a sketch is below. Again, saving to S3 from the shell works fine with these properties set the same way.

I have been moving the hadoop-aws dependency around in my POM without success. Are there dependencies it must come before or after? I wasn't aware that the ordering mattered, but apparently it does. My guess is that some classes conflict between the hadoop-aws jar and one of the other Hadoop or Spark jars. Any help would be greatly appreciated. My POM file is pasted below the sketch.
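For reference, here is roughly how I set the s3a properties (a minimal sketch; the app name, bucket, credentials, and output path are placeholders, not my real values):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-test").setMaster("local[*]"))

    // Map the s3a:// scheme to the S3A filesystem and supply credentials.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "MY_ACCESS_KEY") // placeholder
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "MY_SECRET_KEY") // placeholder

    // This is the kind of write that fails in IntelliJ but works from the shell.
    sc.parallelize(Seq("test")).saveAsTextFile("s3a://my-bucket/test-output") // placeholder path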

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.lendingtree.data_lake</groupId>
  <artifactId>Spark2_DL_ETL</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2015</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.8</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spark.version>2.2.0.2.6.3.0-235</spark.version>
    <kafka.version>0.10.1</kafka.version>
    <hbase.version>1.1.2.2.6.3.0-235</hbase.version>
    <hadoop.version>2.7.3.2.6.3.0-235</hadoop.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <shc.version>1.1.0.2.6.3.0-235</shc.version>
  </properties>
  <repositories>
    <repository>
      <id>hortonworks</id>
      <name>hortonworks repo</name>
      <url>http://repo.hortonworks.com/content/repositories/releases/</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-nfs</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoop.version}</version>
      <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-hbase-handler</artifactId>
      <version>1.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-spark</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.10.6</version>
    </dependency>
    <!--
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    -->
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-core_${scala.compat.version}</artifactId>
      <version>2.4.16</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-junit_${scala.compat.version}</artifactId>
      <version>2.4.16</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.compat.version}</artifactId>
      <version>2.2.4</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <!--<arg>-make:transitive</arg>-->
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.18.1</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issue like NoDefClassError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Re: Creating a Spark program with aws POM dependencies to load to s3 bucket

There's a risk here that you are being burned by Jackson versions: the AWS SDK needs one set of Jackson jars, while Spark uses another. On a normal spark-submit everything works because Spark has shaded its copy, but the IDE doesn't do that (lovely as IntelliJ is), so it refuses to play. FWIW, I hit the same problem.
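One quick way to confirm this (just a sketch, not tied to any Spark API) is to print which jar Jackson's ObjectMapper is actually loaded from when running inside the IDE; mvn dependency:tree -Dincludes=com.fasterxml.jackson.core should also show which artifacts pull in which Jackson versions:

    // Classpath probe: prints the jar that Jackson's ObjectMapper came from.
    // If it is a jar dragged in by aws-java-sdk rather than the one Spark
    // expects, you have found the conflict.
    val source = classOf[com.fasterxml.jackson.databind.ObjectMapper]
      .getProtectionDomain.getCodeSource
    println(if (source != null) source.getLocation else "bootstrap classpath")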

The workaround I use is to start the job as an executable via spark-submit, have it pause for a while, and then attach the IDE to it via "Attach to Local Process". How do you get it to wait? The simplest way is to put a sleep() in. The most flexible is to have it poll for a file's existence, sleeping 1000 ms and repeating until the file appears. That way, all you have to do is create that file and the job sets off. A sketch of the polling version is below.
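Something like this, where /tmp/debug-go is an arbitrary trigger path of your choosing:

    import java.io.File

    // Pause the job until a trigger file appears, so the IDE can attach first.
    def waitForTrigger(path: String = "/tmp/debug-go"): Unit = {
      while (!new File(path).exists()) {
        Thread.sleep(1000) // poll once a second
      }
    }

    // Call this at the start of the job, attach IntelliJ with
    // "Attach to Local Process", then create the file: touch /tmp/debug-go
    waitForTrigger()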