07-07-2016 04:02 PM
1) Read data from a file in Hadoop into a DataFrame in Spark (Scala):

```scala
// sc -- SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
var hadoopFileDataFrame = hiveContext.read.format("com.databricks.spark.csv").load(filePath)
```

2) Using the DataFrame schema, create a table in Hive in Parquet format and load the data from the DataFrame into the Hive table.
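As a side note, spark-csv reads every column as a string by default; a sketch of the same read with header handling and schema inference enabled (both are standard spark-csv options, shown on the assumption that the file has a header row):

```scala
// Sketch: spark-csv treats all columns as strings unless inferSchema is on;
// "header" assumes the first line of the file holds the column names.
var hadoopFileDataFrame = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(filePath)
```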
Issue 1: The dependency for parquet-hive-bundle-1.6.0.jar is added in pom.xml. Using the following code:

```scala
var query = """CREATE TABLE Test(EMP_ID string, Organisation string, Org_Skill string, EMP_Name string)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
  TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY')"""
val dataFrame = hiveContext.sql(query)
```

The code hangs: the table does get created, but I am unable to run a SELECT query against it.
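For comparison, here is a sketch of the same table declared with Hive's built-in Parquet support (available in Hive 0.13+), which avoids the deprecated parquet.hive.* classes entirely; this assumes a recent enough Hive version:

```scala
// Sketch: on Hive 0.13+ the native STORED AS PARQUET clause replaces the
// deprecated parquet.hive.* SerDe/input/output format classes.
val createNative = """CREATE TABLE Test(
    EMP_ID string, Organisation string, Org_Skill string, EMP_Name string)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY')"""
hiveContext.sql(createNative)
```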
Issue 2:

```scala
hadoopFileDataFrame.registerTempTable("temp")
var query = "CREATE TABLE TEST AS SELECT * FROM TEMP"
hiveContext.sql(query)
val dataFrame = hiveContext.sql("select * from test")
dataFrame.show()
```

Note: this successfully loads the data from the DataFrame into the Hive table, as printed in the console logs. But when I check the Hive table with the same SELECT statement, there is no data in the table. What is the cause of this? How can I copy the data from a DataFrame into a Hive table, store it as Parquet, and perform dynamic partitioning of the data, while ensuring the data is correctly written to the Hive table? A sketch of what I am trying to get working is below.
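A minimal sketch of the write path I am attempting, assuming dynamic partitioning is enabled on the session; "Organisation" is only an example partition column (any column from the schema would do) and "test_partitioned" is a placeholder table name:

```scala
// Sketch only: enable dynamic partitioning, then write the DataFrame as a
// partitioned, Parquet-backed Hive table via the DataFrameWriter API.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

hadoopFileDataFrame.write
  .format("parquet")
  .partitionBy("Organisation")   // example column; adjust to the real schema
  .mode("overwrite")             // placeholder; pick the mode that fits
  .saveAsTable("test_partitioned")
```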
Labels:
Apache Hive