Support Questions

Find answers, ask questions, and share your expertise

Java Spark insert JSON into Hive from the local file system instead of HDFS

avatar
Explorer

I have the following Java code that read a JSON file from HDFS and output it as a HIVE view using Spark.

package org.apache.spark.examples.sql.hive;
import java.io.File;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// $example off:spark_hive$
public class JavaSparkHiveExample {
  public static void main(String[] args) {

    // $example on:spark_hive$
    SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
            .master("local[*]")
            .config("hive.metastore.uris", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate();

    Dataset<Row> jsonTest = spark.read().json("/tmp/testJSON.json");
    jsonTest.createOrReplaceTempView("jsonTest");
    Dataset<Row> showAll = spark.sql("SELECT * FROM jsonTest");

    showAll.show();
    spark.stop();
  }
}

I would like to change so the JSON file is read from the system instead of HDFS (for instance from the same location where the program is executed). Furthermore, how could I remake it to INSERT the JSON into table test1 instead of just making a view out of it?

Help is very appreciated!

1 REPLY 1

avatar
Contributor

Spark by default looks for files in HDFS but for some reason if you want to load file from the local filesystem, you need to prepend "file://" before the file path. So your code will be

Dataset<Row> jsonTest = spark.read().json("file:///tmp/testJSON.json");

However this will be a problem when you are submitting in cluster mode since cluster mode will execute on the worker nodes. All the worker nodes are expected to have that file in that exact path so it will fail. To overcome, you can pass the file path in the --files parameter while running spark-submit which will put the file on the classpath so you can refer the file by simply calling the file name alone.

For ex, if you submitted the following way:

> spark-submit --master <your_master> --files /tmp/testJSON.json --deploy-mode cluster --class <main_class> <application_jar>

then you can simply read the file the following way:

Dataset<Row> jsonTest = spark.read().json("testJSON.json");