Created 04-04-2018 09:55 AM
Hi everyone,
I am going crazy trying to figure out why I cannot read a Hive external table that points to a directory containing Parquet files.
The Parquet files are created with a Spark program like this:
eexTable.repartition(1).write.mode("append").save(dataPath.concat(eexFileName))
I created the external table using this DDL:
CREATE EXTERNAL TABLE my_db.eex_actual_plant_gen_line (
  meta_date timestamp,
  meta_directory string,
  meta_filename string,
  meta_host string,
  meta_numberofrecords int,
  meta_recordnumber int,
  meta_sequenceid string,
  country string,
  source string,
  data_timestamp timestamp,
  actual_generation double,
  publication_timestamp timestamp,
  modification_timestamp timestamp,
  created_on timestamp
)
COMMENT 'External table eex transparency actual plant generation line'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/development/data/EEX/transparency/production/usage'
I am able to query this table from Ambari or the CLI, but when I try Spark I can retrieve the table schema, yet no rows are returned:
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.functions._

val warehouseLocation = "/apps/hive/warehouse"

val spark = SparkSession
  .builder()
  .appName("EEX_Trans")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("hive.metastore.uris", "thrift://myserver1:9083,thrift://myserver2:9083")
  .enableHiveSupport()
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

spark.sql("Select * from dev_sdsp.facilities").count()
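A direct read of the same location (bypassing the metastore) would be one way to tell whether the problem is in the data itself or in how Spark resolves the table's directories. A sketch of that check; the wildcard depth is only a guess at the folder layout:

// Read the Parquet files straight from the table location, ignoring the metastore.
val direct = spark.read.parquet("/development/data/EEX/transparency/production/usage/*")
direct.printSchema()
println(direct.count())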
I cannot find the error, and I have already read 1000 posts without luck.
Any comments will be appreciated.
Kind regards, Paul
Created 04-04-2018 11:42 AM
Does the user have access (at the file system level) to the warehouse directory you've specified?
The docs seem to indicate that the 'spark.sql.warehouse.dir' is optional when Hive is already present and you're attaching to a metastore.
---
Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started.
---
Try omitting that setting from your application.
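A minimal builder for that test might look like the following. This is only a sketch, assuming the hive-site.xml with the metastore URIs is already on the application's classpath, so nothing Hive-related needs to be set in code:

import org.apache.spark.sql.SparkSession

// Let Spark pick up hive-site.xml from the classpath instead of configuring Hive by hand.
val spark = SparkSession
  .builder()
  .appName("EEX_Trans")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM dev_sdsp.facilities").count()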
Created 04-04-2018 12:10 PM
Hi @David Streever, thanks for your reply.
I think I found the problem but I still have to test the solution.
It seems that Spark is not able to traverse my directory structure unless I create partitions.
It still does not work even when I set these properties:
.config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
.config("mapred.input.dir.recursive", "true")
.config("hive.mapred.supports.subdirectories", "true")
Created 06-20-2018 04:49 PM
Hey Paul - did you find a solution to this? It looks like it's only Parquet that's affected; CSV doesn't have this problem. I too have data in subdirectories, and Spark SQL returns null.
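The only lead I have so far is to stop Spark from swapping in its own Parquet reader for the Hive table, so the recursive-directory settings actually apply, but I have not been able to verify it yet. Sketch only:

// Keep Spark on Hive's Parquet SerDe instead of its built-in reader (untested idea).
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Same recursive-directory settings as above, on the Hadoop configuration.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("mapred.input.dir.recursive", "true")
hc.set("hive.mapred.supports.subdirectories", "true")

spark.sql("SELECT COUNT(*) FROM dev_sdsp.facilities").show()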