Support Questions

Create a Parquet table in Hive from a DataFrame in Scala

New Contributor

1) Read data from a file in Hadoop into a Spark DataFrame in Scala:

// sc is the SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Read the CSV file from HDFS using the spark-csv package
val hadoopFileDataFrame = hiveContext.read.format("com.databricks.spark.csv").load(filePath)
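
For reference, spark-csv can also pick up column names and types from the file itself; a minimal sketch, assuming a hypothetical HDFS path:

// Sketch: read a CSV that has a header row, letting spark-csv infer types.
// The path "hdfs:///data/employees.csv" is a hypothetical example.
val typedDataFrame = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true")      // first line holds column names
  .option("inferSchema", "true") // infer column types instead of defaulting to string
  .load("hdfs:///data/employees.csv")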

2) Using the DataFrame schema, create a table in Hive in Parquet format and load the data from the DataFrame into the Hive table.

Issue 1: The dependency for parquet-hive-bundle-1.6.0.jar has been added in pom.xml.

Using the following code:

var query = "CREATE TABLE Test(EMP_ID string,Organisation string,Org_Skill string,EMP_Name string)ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY')"

val dataFrame = hiveContext.sql(query)
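
For comparison, Hive 0.13 and later can declare the same table through its built-in Parquet support, which avoids the old SerDe bundle entirely; a sketch, assuming a recent enough Hive:

// Sketch: equivalent DDL using Hive's native Parquet support (Hive 0.13+),
// which removes the need for parquet-hive-bundle in pom.xml.
val nativeParquetQuery =
  """CREATE TABLE Test(EMP_ID string, Organisation string, Org_Skill string, EMP_Name string)
    |STORED AS PARQUET
    |TBLPROPERTIES ('parquet.compression'='SNAPPY')""".stripMargin
hiveContext.sql(nativeParquetQuery)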

The code hangs: the table gets created, but I am unable to fire a select query on it.

Issue 2: Creating the table with CTAS from a registered temp table:

hadoopFileDataFrame.registerTempTable("temp")

val query = "CREATE TABLE TEST AS SELECT * FROM TEMP"

hiveContext.sql(query)

val dataFrame = hiveContext.sql("select * from test")

dataFrame.show()

Note: the console logs show that the data is successfully loaded from the DataFrame into the Hive table. But when I check the Hive table using the same select statement, there is no data in the table. What is the cause of this?

How can I copy the data from a DataFrame into a Hive table, store it as Parquet, and perform dynamic partitioning of the data, while ensuring that the data is correctly written to the Hive table?


2 REPLIES

@Neha Jain

Use

df.write.format("parquet").partitionBy('..').saveAsTable(...)


(or)


df.write.format("parquet").partitionBy('...').insertInto(...)

New Contributor

I have a similar requirement: reading data from a Hive table that is stored in Parquet format. I am getting the below exception when reading the table data through Spark SQL:

WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2,...........)
java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:250)
    at org.apache.parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:50)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:255)
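
One way to narrow this down: the failure in bytesToLong while decoding column statistics can indicate that the Hive metastore declares a column as bigint while the Parquet files store something else, so comparing the two schemas is a useful first check. A minimal diagnostic sketch; the table name and warehouse path are hypothetical:

// Sketch: compare the schema Hive declares with the schema in the files.
// "test_parquet" and the warehouse path are hypothetical examples.
val fromMetastore = hiveContext.table("test_parquet")
fromMetastore.printSchema()

// Read the Parquet files directly, bypassing the metastore schema.
val fromFiles = hiveContext.read.parquet("/apps/hive/warehouse/test_parquet")
fromFiles.printSchema()

// If the schemas disagree, setting spark.sql.hive.convertMetastoreParquet
// to false (so Spark uses the Hive SerDe instead of its own Parquet reader)
// is a commonly tried workaround.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")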