Created on 10-24-2017 06:58 PM - edited 09-16-2022 05:26 AM
I have text data (.csv) in a Hive external table. I am planning to compress it by creating a new ORC table (the compression is more than 10x better) and then deleting the external table's CSV files. If I later want to convert the ORC table back into a CSV file, what is the easiest way to do it? I cannot do it directly with hive -e "select ... from orc table" >> /tmp/csvfile; it just gets stuck.
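Roughly what I am doing to create the ORC copy (database/table names below are just placeholders, not my real ones):
hive# CREATE TABLE mydb.events_orc STORED AS ORC AS SELECT * FROM mydb.events_csv_ext;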
Created 10-24-2017 07:38 PM
When the ORC table holds a huge amount of data, it will take a while to convert all the results and store them as a CSV file.
Here is what I tried:
foo is an ORC table
hive# select * from foo;
+---------+--+
| foo.id  |
+---------+--+
| 1       |
| 2       |
| 3       |
| 4       |
+---------+--+
bash# hive -e "select * from foo1" >> foo1.txt
bash# cat foo1.txt
+----------+--+
| foo1.id  |
+----------+--+
| 1        |
| 2        |
| 3        |
| 4        |
+----------+--+
With a small data set this finishes very quickly.
If the number of records is really big, then the ideal way to do this is as follows:
hive# INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      SELECT * FROM default.foo;
Or you can write the data to a local directory instead; just add LOCAL:
hive# INSERT OVERWRITE LOCAL DIRECTORY '<Local-Dir-Path>'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      SELECT * FROM default.foo;
Also note that INSERT OVERWRITE DIRECTORY removes all existing files under the specified folder and then writes the data as part files. This can produce multiple files, so you may want to concatenate them on the client side after the export is done.
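For example, one rough way to pull the exported part files down and merge them into a single CSV on the client (paths here are just placeholders):
bash# hdfs dfs -getmerge <Hdfs-Directory-Path> /tmp/foo_export.csv
getmerge concatenates all the part files in the source directory into one local file, so you don't have to cat them together yourself.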
Using this approach, you don't need to worry about the format of the source table, and you can choose your own delimiters and output formats.
** I would suggest avoiding saving large files to a local directory if possible; use INSERT OVERWRITE DIRECTORY and store the results in an HDFS directory instead. **
For more details refer to this link.
Created 10-24-2017 07:04 PM
Have you tried downloading it as CSV from the Ambari Hive View or Hue?
Created 10-24-2017 07:36 PM
I have hourly data with about 9+ million rows. Firstly, it takes a lot of time to download it as CSV from the Ambari view, and secondly it only exports 47K+ rows into the CSV .. not sure why.
Is there any better way?
Also, what are the pros and cons of converting a text file to ORC, in terms of joins and anything else worth noting?
Thanks.
Created 10-24-2017 09:13 PM
I am trying to use hive -e "select ..." to write the output to a txt file, but after the logging is initialized nothing happens:
Logging initialized using configuration in file:/etc/hive/2.5.3.0-37/0/hive-log4j.properties
OK
Is it because of the size of the table?
Created 10-25-2017 03:41 AM
Yeah, it might be the case. If you have a large number of records, it will take a lot of time to convert the ORC data to CSV that way. Comparing the two approaches, running the query with INSERT OVERWRITE DIRECTORY will perform much faster with no issues; we can also keep whatever delimiter we need and don't need to worry about the size of the data.
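For example, if you wanted a tab-separated export instead of comma-separated (output path here is just an example):
hive# INSERT OVERWRITE DIRECTORY '/tmp/foo_export_tsv'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      SELECT * FROM default.foo;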