How to load CSV file directly into Hive ORC table without using temporary Hive table?
Labels: Apache Hive
Created 08-04-2016 06:11 AM
I have a CSV file with 2 attributes, and a Hive ORC-based table with the same attributes and data types. Can I load this CSV file directly into the Hive ORC table, without using a temporary Hive table? If there is no way to load the CSV file directly, can anyone help me convert the CSV file into ORC format, so that I can load the resulting ORC file directly into the Hive ORC table?
Created 08-04-2016 10:32 AM
Why don't you want a Hive external table? It is just a temporary entry without any significant overhead.
You can also use OrcStorage in Pig to write ORC files directly (see the sketch below):
http://pig.apache.org/docs/r0.15.0/func.html#OrcStorage
Similar functions are available for Spark.
Or you might be able to write a custom MapReduce job using an ORC OutputFormat.
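For example, a minimal Pig sketch of the OrcStorage approach might look like this; the input path, schema, and output directory are assumptions for illustration:

-- load a two-column CSV, supplying a schema (names and types assumed here)
csv_rows = LOAD '/tmp/test.csv' USING PigStorage(',') AS (id:int, name:chararray);
-- write the same rows back out as ORC files
STORE csv_rows INTO '/tmp/test_orc' USING OrcStorage();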
Created 08-04-2016 02:17 PM
I'm not familiar with Spark, but it looks like it has some functions that meet your requirement.
http://stackoverflow.com/questions/36436020/converting-csv-to-orc-with-spark
Created 08-04-2016 03:05 PM
As @Benjamin Leonhardi said, there is very little overhead to using an external table to do this. The only things stored in the Hive metastore are the schema of the CSV and a pointer to where the data lives on HDFS; the data itself is left where you put it. You can then populate the ORC table with a simple INSERT ... SELECT from the external table. Using an external table is a very common way of solving this problem.
Having said that, you can use Pig to load CSV data directly from HDFS. You have to define the schema for the CSV within the Pig script, and you can then write the data to a Hive ORC table (the store step is sketched after the example below). Be aware that the Hive ORC table must be created before you can write to it with Pig.
Here is a tutorial that covers this: http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
Here is an example of loading CSV data via Pig:
STOCK_A = LOAD '/user/maria_dev/NYSE_daily_prices_A.csv' USING PigStorage(',')
    AS (exchange:chararray, symbol:chararray, date:chararray, open:float,
        high:float, low:float, close:float, volume:int, adj_close:float);
DESCRIBE STOCK_A;
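To finish the job, the loaded relation can be stored into the pre-created Hive ORC table with HCatStorer. This is a sketch, not the tutorial's code: the target table name stock_a_orc is hypothetical, and Pig has to be started with the -useHCatalog flag:

-- write into an existing Hive ORC table (table name assumed for illustration)
STORE STOCK_A INTO 'default.stock_a_orc' USING org.apache.hive.hcatalog.pig.HCatStorer();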
Created 08-04-2016 03:59 PM
Here is a tutorial on using Spark to load CSV into Hive:
http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/
Created 08-04-2016 05:00 PM
Assume you have an ORC table "test" in Hive that matches the CSV file "test.csv".
SparkSQL:
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("/tmp/test.csv")
  .insertInto("test")
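Note that com.databricks.spark.csv is a separate package, so it has to be on the classpath, e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.4.0 (the coordinates shown are illustrative; match them to your Spark and Scala versions). Also, insertInto requires the target table "test" to exist already.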
Created 05-15-2017 02:36 PM
If you're looking for a standalone tool to convert CSV to ORC, have a look at https://github.com/cartershanklin/csv-to-orc
It's a standalone Java tool that can run anywhere, including off of your Hadoop cluster. It supports custom null strings, row skipping, and basic Hive types (no complex types currently).
