I've seen some postings (including this one) where people are using CSVSerde for processing input data. CSVSerde is a magical piece of code, but it isn't meant to be used for all input CSVs.

tl;dr - Use CSVSerde only when you have quoted text or really strange delimiters (such as blanks) in your input data - otherwise you will take a rather substantial performance hit...

When to use CSVSerde:

For example, suppose we have a text file with the following data:

col1  col2         col3
------------------------
121   Hello World  4567
232   Text         5678
343   More Text    6789

Pipe-delimited, it would look like:

121|Hello World|4567|
232|Text|5678|
343|More Text|6789|

but blank-delimited with quoted text it would look like this (don't laugh - Progress database dumps are blank-delimited and text-quoted in this exact format):

121 'Hello World' 4567
232 Text 5678
343 'More Text' 6789

Notice that the text may or may not have quote marks around it - text only needs to be quoted if it contains a blank. This is a particularly nasty set of data. You need custom coding - unless you use CSVSerde.

CSVSerde can handle this data with ease. Blank-delimited/quoted text files are parsed perfectly, without any coding, when you use the following table declaration:

CREATE TABLE my_table (col1 string, col2 string, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = " ",
                      "quoteChar"     = "'");

Performance Hit when Using CSVSerde on conventional CSV data

tl;dr Using CSVSerde for conventional CSV files is about 3X slower...

The following code shows timings encountered when processing a simple pipe-delimited CSV file. One Hive table definition uses conventional delimiter processing, and one uses CSVSerde.

The timings were taken on a small cluster (28 data nodes). The file used for testing had 62,825,000 rows - again, rather small.

Table DDL using conventional delimiter definition:

CREATE EXTERNAL TABLE test_csv_serde (
  `belnr` string,
  `bukrs` string,
  `budat` string,
  `bstat` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/<uname>/elt/test_csvserde/';

-- Load the data one-time
insert overwrite table test_csv_serde
select * from <large table>;

Table DDL using CSVSerde (same file/source data as the other table):

CREATE EXTERNAL TABLE test_csv_serde_using_CSV_Serde_reader (
  `belnr` string,
  `bukrs` string,
  `budat` string,
  `bstat` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = "|")
LOCATION '/user/<uname>/elt/test_csvserde/';

Results:

hive> select count(*) from test_csv_serde;
Time taken: 8.683 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.442 seconds, Fetched: 1 row(s)

hive> select count(*) from test_csv_serde;
Time taken: 8.707 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.41 seconds, Fetched: 1 row(s)

hive> select min(belnr) from  test_csv_serde;
Time taken: 10.267 seconds, Fetched: 1 row(s)
hive> select min(belnr) from  test_csv_serde_using_CSV_Serde_reader;
Time taken: 29.271 seconds, Fetched: 1 row(s) 
Comments
Rising Star

This is awesome! Saved me a lot of research 🙂


Good stuff. Glad it helped you, Ned!

Contributor

I'd like to be able to use the CSVSerDe from within Spark SQL. Do you know what configuration changes need to be made (on the Sandbox or otherwise) for Spark to have the CSVSerDe in its classpath?

Contributor

I think I've got my DDL right, but I don't have Spark (via Zeppelin) seeing the CSVSerDe. Any thoughts? I've tried playing around with various driver-class-path and library-class-path settings, both in the Zeppelin interpreter settings and in the Spark configuration via Ambari, but haven't figured this one out yet. Specifically, this is via Zeppelin on the HDP Sandbox 2.3.2.0-2950.

%sql
CREATE EXTERNAL TABLE demo_table (
  rid STRING,
  hospital_name STRING,
  hospital_type STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/zeppelin/directory'
TBLPROPERTIES ('skip.header.line.count'='1')
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hadoop.hive.serde2.OpenCSVSerde
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:350)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:327)

Master Mentor

@Paul Boal this is an article, and our thread is becoming too large; next time, please open a new question on HCC instead. Here's a solution you can try: http://zeltov.blogspot.com/2015/11/external-jars-not-getting-picked-up-in_9.html
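
In short, the usual fix is to get the serde jar onto Spark's classpath before issuing the DDL. A minimal sketch - the jar path below is an assumption and varies by HDP version, so check where hive-serde*.jar lives on your install:

%sql
-- Hypothetical jar location; adjust to your HDP install
ADD JAR /usr/hdp/current/hive-client/lib/hive-serde.jar

Alternatively, pass the same jar to spark-submit via --jars, or set spark.driver.extraClassPath and spark.executor.extraClassPath, so both the driver and the executors can load org.apache.hadoop.hive.serde2.OpenCSVSerde.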

Expert Contributor

a) What is the source for the tables test_csv_serde_using_CSV_Serde_reader and test_csv_serde? Which one do I need to consider?

b) Is it the same for both tables?

Option 1:
col1  col2         col3
------------------------
121   Hello World  4567
232   Text         5678
343   More Text    6789

Option 2:
121|Hello World|4567|
232|Text|5678|
343|More Text|6789|

Option 3:
121 'Hello World' 4567
232 Text 5678
343 'More Text' 6789

Hi @vamsi valiveti.

The example above uses the exact same source file in the exact same location for both external tables. Both the test_csv_serde_using_CSV_Serde_reader and test_csv_serde tables read the external file(s) stored in the directory '/user/<uname>/elt/test_csvserde/'.

The file I used was pipe-delimited and contains 62,000,000 rows - so I didn't attach it. 😉

It would look like Option 2 above, but of course with 4 columns:

121|Hello World|4567|34345
232|Text|5678|78678
343|More Text|6789|342134
New Contributor

If there are multi-character line separators like '\r\n' in the input, how do you handle them in the serde?