Created 07-07-2016 04:02 PM
1) Read data from a file in Hadoop into a DataFrame in Spark using Scala
// sc is the existing SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// read a CSV file from HDFS using the spark-csv package
val hadoopFileDataFrame = hiveContext.read.format("com.databricks.spark.csv").load(filePath)
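If the file has a header row, spark-csv can also pick up the column names and infer the column types; a minimal sketch, where the HDFS path is a placeholder:
// hypothetical HDFS path, for illustration only
val filePath = "hdfs:///data/employees.csv"
val hadoopFileDataFrame = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first row as column names
  .option("inferSchema", "true") // sample the data to infer column types
  .load(filePath)
hadoopFileDataFrame.printSchema()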
2) Using the DataFrame schema, create a table in Hive in Parquet format and load the data from the DataFrame into the Hive table.
Issue 1: The dependency for parquet-hive-bundle-1.6.0.jar is added in pom.xml.
Using the following code:
val query = "CREATE TABLE Test(EMP_ID string, Organisation string, Org_Skill string, EMP_Name string) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY')"
val dataFrame = hiveContext.sql(query)
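As an aside, on Hive 0.13 and later the same table can usually be declared with Hive's built-in Parquet support instead of the deprecated parquet-hive-bundle SerDe; a minimal sketch, assuming a recent enough Hive:
// built-in Parquet support (Hive 0.13+), no external SerDe jar needed
val createQuery = """CREATE TABLE Test(
    EMP_ID string, Organisation string, Org_Skill string, EMP_Name string)
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY')"""
hiveContext.sql(createQuery)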
The code hangs: the table is created, but a SELECT query against it cannot be run.
Issue 2:
// expose the DataFrame to SQL as a temporary table
hadoopFileDataFrame.registerTempTable("temp")
val query = "CREATE TABLE TEST AS SELECT * FROM TEMP"
hiveContext.sql(query)
val dataFrame = hiveContext.sql("select * from test")
dataFrame.show()
Note: The console logs show that the data is successfully loaded from the DataFrame into the Hive table. But when I check the Hive table with the same SELECT statement, there is no data in the table. What is the cause of this?
How can I copy the data from a DataFrame to a Hive table, store it as Parquet, and dynamically partition the data, while ensuring that the data is correctly written to the Hive table?
Created 07-07-2016 06:27 PM
Use
df.write.format("parquet").partitionBy('..').saveAsTable(...) (or) df.write.format("parquet").partitionBy('...').insertInto(...)
Created 07-05-2017 12:21 PM
I have a similar query, but it is about reading data from Hive tables that are stored in Parquet format. I get the error below when reading the table data with Spark SQL.
java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:250)
    at org.apache.parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:50)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:255)
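This trace fails while decoding the min/max statistics of a long column, which often indicates a mismatch between the Hive table schema and the types actually written to the Parquet files (for example, a column declared bigint in Hive but stored as int in the files). A quick way to compare the two, as a sketch with placeholder names:
// what the metastore thinks the schema is (table name is a placeholder)
hiveContext.table("my_parquet_table").printSchema()
// what the Parquet files actually contain (path is a placeholder)
hiveContext.read.parquet("/apps/hive/warehouse/my_parquet_table").printSchema()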