Convert ORC table data into CSV

Expert Contributor

I have text data (a .csv file) in a Hive external table. I am planning to compress it by creating a new ORC table (the compression is more than 10x better) and then deleting the external table's CSV file. If I later want to convert the ORC table back into a CSV file, what is the easiest way to do it? I cannot do it directly with hive -e "select ... from orc table" >> /tmp/csvfile; it just gets stuck.
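For context, the ORC table was created from the text table with something along these lines (the names csv_table and orc_table here are only placeholders, not my actual table names):

hive# -- csv_table is the existing external text table; orc_table is the new compressed copy
hive# CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM csv_table;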


5 REPLIES

@PJ

Have you tried downloading it as CSV from the Ambari View or Hue?

Expert Contributor

@Divakar Annapureddy

I have hourly data with about 9+ million rows. Firstly, it takes a lot of time to download it as CSV from the Ambari View, and secondly, it only exports 47K+ rows into the CSV .. not sure why.

Is there a better way?

Also, what are the pros and cons of converting a text file to ORC, in terms of joins and anything else worth noting?

Thanks.

Master Guru

@PJ

When you have a huge amount of data in an ORC table, it will take a while to convert all the results and store them as a CSV file.

Here is what I tried:

foo is an ORC table:

hive#select * from foo;
+---------+--+
| foo.id  |
+---------+--+
| 1       |
| 2       |
| 3       |
| 4       |
+---------+--+
bash# hive -e "select * from foo" >> foo.txt
bash# cat foo.txt
+---------+--+
| foo.id  |
+---------+--+
| 1       |
| 2       |
| 3       |
| 4       |
+---------+--+

When we have only a small set of data, this finishes very quickly.
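Note that the output above still contains the ASCII table borders, so it is not valid CSV by itself. If Beeline is available in your environment, something along these lines (the JDBC URL is just a placeholder for your HiveServer2 host) writes plain comma-separated rows instead; treat it as a sketch, not a drop-in command:

bash# beeline -u "jdbc:hive2://<hiveserver2-host>:10000" --silent=true \
      --outputformat=csv2 --showHeader=true \
      -e "select * from foo" > foo.csv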

If the number of records is really big, then the ideal way to do this is as follows:

hive#INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;

Or you can write the data to a local directory instead; just add LOCAL:

hive#INSERT OVERWRITE LOCAL DIRECTORY '<Local-Dir-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;

Also, note that INSERT OVERWRITE DIRECTORY removes all existing files under the specified folder before writing the data as part files. It may create multiple files, so you may want to concatenate them on the client side after the export is done.
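For example, assuming the export went to <Hdfs-Directory-Path> as above, hdfs dfs -getmerge can pull all the part files into one local CSV (the local path /tmp/foo.csv is just an example):

bash# hdfs dfs -ls <Hdfs-Directory-Path>                      # shows the generated part files, e.g. 000000_0, 000001_0
bash# hdfs dfs -getmerge <Hdfs-Directory-Path> /tmp/foo.csv   # concatenates them into a single local file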

With this approach you don't need to worry about the format of the source table, and you can choose your own delimiter and output format.

** I would suggest avoiding saving a large file to a local directory if possible. Use INSERT OVERWRITE DIRECTORY and store the results in an HDFS directory. **

For more details, refer to this link:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Insertingdatai...

Expert Contributor
@Shu

I am trying to use hive -e "select ..." to write to a txt file, but after the logging is initialized nothing happens:

Logging initialized using configuration in file:/etc/hive/2.5.3.0-37/0/hive-log4j.properties

OK

Is it because of the size of the table?

Master Guru
@PJ

Yes, it might be the case. If you have a large number of records, it will take a lot of time to convert the ORC data to CSV format. If you compare the two approaches, executing the query with INSERT OVERWRITE DIRECTORY will perform much faster with no issues, we can keep whatever delimiter we need, and we don't need to worry about the size of the data.
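If you want to confirm how big the ORC table actually is before choosing an approach, one option is to look up its location and check the size of its files on HDFS. The warehouse path below is only an assumption; use the Location shown by DESCRIBE FORMATTED for your table:

hive# DESCRIBE FORMATTED foo;                     -- the output includes the table's Location on HDFS
bash# hdfs dfs -du -h /apps/hive/warehouse/foo    # path is an assumption; substitute the Location from above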