Hi, is there a way to load an xlsx file into a Hive table?
Labels: Apache Hive
Created 05-06-2016 04:18 PM
Created 05-06-2016 04:21 PM
It's better to export it as CSV (or another delimited format) and load that into the Hive table.
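For instance, once the sheet has been saved as a comma-delimited file and copied to HDFS, a minimal sketch looks like this (the table name, columns, and path below are made up purely for illustration):

```sql
-- Hypothetical table whose delimiter matches the exported file.
CREATE TABLE sales_from_excel (
  id     INT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the CSV that was exported from the xlsx sheet (hypothetical path).
LOAD DATA INPATH '/user/hdfs/staging/sales.csv' INTO TABLE sales_from_excel;
```

If the exported file contains quoted fields with embedded commas, Hive's OpenCSVSerde (org.apache.hadoop.hive.serde2.OpenCSVSerde) can be used instead of the plain delimited row format.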
Created 05-06-2016 04:21 PM
I don't think we have any direct method to do that; however, there are a few workarounds to achieve it.
One way is to write a custom Java MapReduce job to convert the xls to CSV, or to create your own custom SerDe to access the xls directly.
Created 05-06-2016 04:26 PM
Not directly, I am afraid. You can write a MapReduce job that transforms the files into normal delimited data, similar to the way it was done with Tika here (assuming you have lots of small files):
https://community.hortonworks.com/repos/4576/apache-tika-integration-with-mapreduce.html
You would, however, need to use a Java library like POI instead of Tika.
To read the files directly in Hive you would need to write a Hive InputFormat. You can use this InputFormat class as an example:
https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/
If you return a delimited row for each record and present it to the Hive SerDe as if it were a text input format, you might be able to get it working.
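As a rough sketch of that POI-based conversion step (the class name, file paths, and tab-delimited layout here are hypothetical, and it assumes poi-ooxml is on the classpath), the core logic would look something like this:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Hypothetical converter: reads the first sheet of an xlsx file and writes
// one tab-delimited line per row, which Hive can then read as a text table.
public class XlsxToDelimited {

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0]);
             Workbook workbook = new XSSFWorkbook(in);
             PrintWriter out = new PrintWriter(args[1], "UTF-8")) {

            Sheet sheet = workbook.getSheetAt(0);
            DataFormatter formatter = new DataFormatter(); // renders cells as displayed text

            for (Row row : sheet) {
                StringBuilder line = new StringBuilder();
                // Note: iterating cells this way skips completely blank cells.
                for (Cell cell : row) {
                    if (line.length() > 0) {
                        line.append('\t');
                    }
                    line.append(formatter.formatCellValue(cell));
                }
                out.println(line);
            }
        }
    }
}
```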
Created 05-06-2016 04:26 PM
There are multiple options:
1. You can use Apache Tika (from a programming language like Java) to read the xlsx and load it into Hive.
2. If it's a single xls sheet, you can use Pig's CSVExcelStorage() and insert into the Hive table using HCatStorer().
3. Convert it to a delimited CSV and load it.
Created 05-06-2016 05:58 PM
Mine are xlsx files with single sheets. Can you please explain how to use Pig's CSVExcelStorage() and insert into a Hive table using HCatStorer()?
Created 05-06-2016 06:06 PM
Here's an example of CSVExcelStorage: https://community.hortonworks.com/questions/5775/best-practice-for-extractoutput-data-generated-by.h... You can then store the result into a Hive table from Pig using HCatalog's load/store interface: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
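To make that concrete, here is a minimal Pig sketch under a few assumptions: the single xlsx sheet has already been saved as a CSV file in HDFS (CSVExcelStorage parses Excel-style CSV text, not the binary xlsx itself), the target Hive table already exists, and the path, schema, and table name below are made up for illustration:

```pig
-- Register Piggybank, which provides CSVExcelStorage (jar path varies by installation).
REGISTER piggybank.jar;

-- Load the CSV exported from the single xlsx sheet.
-- CSVExcelStorage handles Excel-style quoting and can skip the header row.
raw = LOAD '/user/hdfs/staging/report.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
      AS (id:int, name:chararray, amount:double);

-- Store into an existing Hive table through HCatalog.
STORE raw INTO 'default.report_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Running it as `pig -useHCatalog script.pig` lets HCatStorer resolve the Hive table through the metastore.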
