Created 04-04-2018 09:55 AM
Hi everyone,
I am going crazy trying to figure out why I cannot read a Hive external table that points to a directory containing Parquet files.
The Parquet files are created with a Spark program like this:
eexTable.repartition(1).write.mode("append").save(dataPath.concat(eexFileName))
I created the external table using this DDL:
CREATE EXTERNAL TABLE my_db.eex_actual_plant_gen_line (
  meta_date timestamp,
  meta_directory string,
  meta_filename string,
  meta_host string,
  meta_numberofrecords int,
  meta_recordnumber int,
  meta_sequenceid string,
  country string,
  source string,
  data_timestamp timestamp,
  actual_generation double,
  publication_timestamp timestamp,
  modification_timestamp timestamp,
  created_on timestamp
)
COMMENT 'External table eex transparency actual plant generation line'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/development/data/EEX/transparency/production/usage'
I am able to query this table from Ambari or the CLI, but when I try Spark I can retrieve the table schema, yet no rows are returned:
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.functions._

val warehouseLocation = "/apps/hive/warehouse"

val spark = SparkSession
  .builder()
  .appName("EEX_Trans")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("hive.metastore.uris", "thrift://myserver1:9083,thrift://myserver2:9083")
  .enableHiveSupport()
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

spark.sql("Select * from dev_sdsp.facilities").count()
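A direct read of the same location (bypassing the metastore) would be one way to tell whether the problem is in the data itself or in how Spark resolves the table's directories. A sketch of that check; the wildcard depth is only a guess at the folder layout:

// Read the Parquet files straight from the table location, ignoring the metastore.
val direct = spark.read.parquet("/development/data/EEX/transparency/production/usage/*")
direct.printSchema()
println(direct.count())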
I cannot find the error, and I have already read 1000 posts without luck.
Any comments will be appreciated.
Kind regards, Paul
Created 04-04-2018 11:42 AM
Does the user have access (at the file system level) to the warehouse directory you've specified?
The docs seem to indicate that the 'spark.sql.warehouse.dir' is optional when Hive is already present and you're attaching to a metastore.
---
Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started.
---
Try omitting that setting from your application.
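A minimal builder for that test might look like the following. This is only a sketch, assuming the hive-site.xml with the metastore URIs is already on the application's classpath, so nothing Hive-related needs to be set in code:

import org.apache.spark.sql.SparkSession

// Let Spark pick up hive-site.xml from the classpath instead of configuring Hive by hand.
val spark = SparkSession
  .builder()
  .appName("EEX_Trans")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM dev_sdsp.facilities").count()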
Created 04-04-2018 12:10 PM
Hi @David Streever, thanks for your reply.
I think I found the problem but I still have to test the solution.
It seems that Spark is not able to traverse my directory structure unless I create partitions.
It still does not work even when I set these properties:
.config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
.config("mapred.input.dir.recursive", "true")
.config("hive.mapred.supports.subdirectories", "true")
Created 06-20-2018 04:49 PM
Hey Paul - did you find a solution to this? It looks like it's only Parquet that's affected; CSV doesn't have this problem. I too have data in subdirectories, and Spark SQL returns null.
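The only lead I have so far is to stop Spark from swapping in its own Parquet reader for the Hive table, so the recursive-directory settings actually apply, but I have not been able to verify it yet. Sketch only:

// Keep Spark on Hive's Parquet SerDe instead of its built-in reader (untested idea).
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Same recursive-directory settings as above, on the Hadoop configuration.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("mapred.input.dir.recursive", "true")
hc.set("hive.mapred.supports.subdirectories", "true")

spark.sql("SELECT COUNT(*) FROM dev_sdsp.facilities").show()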