Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
Labels (2)
avatar
Master Mentor

I’m going to show you a neat way to work with CSV files and Apache Hive. Usually, you’d have to do some preparatory work on CSV data before you can consume it with Hive but I’d like to show you a built-in SerDe (Serializer/Deseriazlier) for Hive that will make it a lot more convenient to work with CSV. This work was merged in Hive 0.14 and there’s no additional steps necessary to work with CSV from Hive.

Suppose you have a CSV file with the following entries

id first_name last_name email gender ip_address

1 James Coleman jcoleman0@cam.ac.uk Male 136.90.241.52

2 Lillian Lawrence llawrence1@statcounter.com Female 101.177.15.130

3 Theresa Hall thall2@sohu.com Female 114.123.153.64

4 Samuel Tucker stucker3@sun.com Male 89.60.227.31

5 Emily Dixon edixon4@surveymonkey.com Female 119.92.21.19

to consume it from within Hive, you’ll need to upload it to hdfs

hdfs dfs -put sample.csv /tmp/serdes/

now all it takes is to create a table schema on top of the file

drop table if exists sample;
create external table sample(id int,first_name string,last_name string,email string,gender string,ip_address string)
  row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  stored as textfile
location '/tmp/serdes/';

now you can query the table as is

select * from sample limit 10;

but what if your CSV file was tab-delimited rather than comma?

well the SerDe got you covered there too:

drop table if exists sample;
create external table sample(id int,first_name string,last_name string,email string,gender string,ip_address string)
  row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  "separatorChar" = "\t"
  )
  stored as textfile
location '/tmp/serdes/';

notice the separatorChar argument, in all, the SerDe accepts two more arguments; custom escape characters and quote characters Take a look at the wiki for more info

https://cwiki.apache.org/confluence/display/Hive/CSV+Serde.

39,465 Views