Created 05-06-2016 04:18 PM
Created 05-06-2016 06:06 PM
here's an example of CSVExcelStorage https://community.hortonworks.com/questions/5775/best-practice-for-extractoutput-data-generated-by.h... and then you can execute sql commands in pig using https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
Created 05-06-2016 04:21 PM
its better to export it as csv or any delimited format and load it into hive table.
Created 05-06-2016 04:21 PM
I don't think we have any direct method to do that, however there are few workaround to achieve this
One way is to write a custom java mapreduce job to convert xls to csv or create your own custom serd to access xls.
Created 05-06-2016 04:26 PM
Not directly I am afraid. You can write a MapReduce job that transforms them into normal delimited data. Similar to the way it was done with Tika here. ( Assuming you have lots of small files )
https://community.hortonworks.com/repos/4576/apache-tika-integration-with-mapreduce.html
You would however need to use a Java library like POI instead of Tika
To read it directly in Hive you need to write an HiveInputFormat. You can use this inputformat class as an example:
https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/
If you return a row for each record that is delimited and pretend to the Hive Serde that its a text inputformat you might be able to get it working.
Created 05-06-2016 04:26 PM
There are multiple options:
1. You can use apache tika (using a programming language like Java) to read the xlxs and load into hive.
2. If its a single xls sheet, then you can use pig's CSVExcelStorage() and insert into hive table using HCatStorer()
3. Convert to a delimited CSV and load it.
Created 05-06-2016 05:58 PM
Mine is xlsx files with single sheets. Can you please explain how to use pig's CSVExcelStorage() and insert into hive table using HCatStorer().
Created 05-06-2016 06:06 PM
here's an example of CSVExcelStorage https://community.hortonworks.com/questions/5775/best-practice-for-extractoutput-data-generated-by.h... and then you can execute sql commands in pig using https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore