While this article provides a mechanism through which we can set up Spark with a HiveContext, there are some limitations when using Spark with a HiveContext. For example, Hive supports writing query results to HDFS using the "INSERT OVERWRITE DIRECTORY" statement:

INSERT OVERWRITE DIRECTORY 'hdfs://cl1/tmp/query'

SELECT * FROM REGION

The above command writes the result of the query to HDFS. However, if the same query is passed to Spark with a HiveContext, it fails, since "INSERT OVERWRITE DIRECTORY" is not a supported feature in Spark. This is tracked via a Spark JIRA. If the same result needs to be achieved via Spark, it can be done using the Spark CSV library (required in the case of Spark 1).

Below is a code snippet showing how to achieve the same result.

DataFrame df = hiveContext.sql("SELECT * FROM REGION");
df.write()
    .format("com.databricks.spark.csv")
    .option("delimiter", "\u0001")
    .save("hdfs://cl1/tmp/query");

The above command saves the result in HDFS under the directory /tmp/query. Note the delimiter used: \u0001 is the same default field delimiter that Hive currently uses.
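To make the delimiter concrete: \u0001 is the Ctrl-A control character, so each output row is just the column values joined by that character. Below is a minimal plain-Java sketch (no Spark required; the sample REGION-like row and the helper names are only for illustration) showing how such a row is built and split back into columns:

```java
public class DelimiterDemo {
    // Hive's default field delimiter: the control character \u0001 (Ctrl-A)
    private static final String HIVE_DELIMITER = "\u0001";

    // Join column values into one output row, as the saved file stores them
    static String toRow(String... columns) {
        return String.join(HIVE_DELIMITER, columns);
    }

    // Split a stored row back into its columns; limit -1 keeps empty trailing fields
    static String[] toColumns(String row) {
        return row.split(HIVE_DELIMITER, -1);
    }

    public static void main(String[] args) {
        // Hypothetical REGION row: (regionkey, name)
        String row = toRow("1", "AMERICA");
        String[] cols = toColumns(row);
        System.out.println(cols.length);   // 2
        System.out.println(cols[1]);       // AMERICA
    }
}
```

Because Ctrl-A virtually never appears in real column data, this delimiter avoids the quoting issues a comma would introduce.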

Also, the below dependency needs to be added to pom.xml:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.5.0</version>
</dependency>
Last update: 06-30-2017 09:57 AM