Created on 11-04-2015 09:46 PM
I've seen some postings (including this one) where people are using CSVSerde to process input data. CSVSerde is a magical piece of code, but it isn't meant to be used for every input CSV.
tl;dr - Use CSVSerde only when you have quoted text or really strange delimiters (such as blanks) in your input data - otherwise you will take a rather substantial performance hit...
For example: If we have a text file with the following data:
col1  col2         col3
-----------------------
121   Hello World  4567
232   Text         5678
343   More Text    6789
Pipe delimited it would look like:
121|Hello World|4567|
232|Text|5678|
343|More Text|6789|
but Blank delimited with quoted text would look like (don't laugh - Progress database dumps are blank delimited and text quoted in this exact format):
121 'Hello World' 4567
232 Text 5678
343 'More Text' 6789
Notice that the text may or may not have quote marks around it - text only needs to be quoted if it contains a blank. This is a particularly nasty set of data. You need custom coding - unless you use CSVSerde.
CSVSerde can handle this data with ease. Blank-delimited, quoted text files are parsed perfectly without any coding when you use the following table declaration:
CREATE TABLE my_table (col1 string, col2 string, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = " ",
  "quoteChar" = "'");
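Conceptually, the quote-aware parsing that OpenCSVSerde performs resembles what Python's csv module does when given the same separator and quote character. This is only an illustration of the parsing behavior, not the Serde's actual code:

```python
import csv
import io

# The blank-delimited, optionally-quoted sample from above.
data = (
    "121 'Hello World' 4567\n"
    "232 Text 5678\n"
    "343 'More Text' 6789\n"
)

# Space as the separator, single quote as the quote character --
# equivalent in spirit to "separatorChar" = " ", "quoteChar" = "'".
reader = csv.reader(io.StringIO(data), delimiter=" ", quotechar="'")
rows = list(reader)

for row in rows:
    print(row)
# Every line parses into exactly three fields: the quotes are consumed,
# and 'Hello World' remains one field despite the embedded blank.
```

Note that unquoted fields like `Text` parse identically, which is why the mixed quoted/unquoted format needs no special handling.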
tl;dr Using CSVSerde for conventional CSV files is about 3X slower...
The following code shows timings encountered when processing a simple pipe-delimited csv file. One Hive table definition uses conventional delimiter processing, and one uses CSVSerde.
The timings were taken on a small cluster (28 data nodes). The file used for testing had 62,825,000 rows - again, rather small.
Table DDL using conventional delimiter definition:
CREATE external TABLE test_csv_serde (
  `belnr` string,
  `bukrs` string,
  `budat` string,
  `bstat` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
location '/user/<uname>/elt/test_csvserde/';

-- Load the data one-time
insert overwrite table test_csv_serde select * from <large table>;
Table DDL using CSVSerde (same file/source data as the other table):
CREATE external TABLE test_csv_serde_using_CSV_Serde_reader (
  `belnr` string,
  `bukrs` string,
  `budat` string,
  `bstat` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = "|")
location '/user/<uname>/elt/test_csvserde/';
Results:
hive> select count(*) from test_csv_serde;
Time taken: 8.683 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.442 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde;
Time taken: 8.707 seconds, Fetched: 1 row(s)
hive> select count(*) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 27.41 seconds, Fetched: 1 row(s)
hive> select min(belnr) from test_csv_serde;
Time taken: 10.267 seconds, Fetched: 1 row(s)
hive> select min(belnr) from test_csv_serde_using_CSV_Serde_reader;
Time taken: 29.271 seconds, Fetched: 1 row(s)
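The overhead makes intuitive sense: a conventional delimited table only has to split each line on the separator, while CSVSerde runs a full quote-aware parser over every character. A rough Python analogy (not a Hive benchmark - just an illustration that both approaches yield identical fields on plain pipe-delimited input, with the quote-aware parser doing strictly more work):

```python
import csv
import io
import timeit

line = "121|Hello World|4567|"
data = "\n".join([line] * 10000)

def plain_split():
    # Analogous to ROW FORMAT DELIMITED: a simple split per line.
    return [l.split("|") for l in data.splitlines()]

def quote_aware():
    # Analogous to OpenCSVSerde: a full CSV state machine per line.
    return list(csv.reader(io.StringIO(data), delimiter="|"))

t_split = timeit.timeit(plain_split, number=20)
t_csv = timeit.timeit(quote_aware, number=20)
print(f"split: {t_split:.3f}s  quote-aware: {t_csv:.3f}s")
# For unquoted input both produce the same rows; only the cost differs.
```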
Created on 11-05-2015 03:25 AM
this is awesome! saved me a lot of research 🙂
Created on 11-05-2015 06:33 PM
Good Stuff. Glad it helped you Ned!
Created on 02-14-2016 09:24 PM
I'd like to be able to use the CSVSerDe from within Spark SQL. Do you know what configuration changes need to be made (on the Sandbox or otherwise) for Spark to have the CSVSerDe in its classpath?
Created on 02-14-2016 11:48 PM
@Paul Boal use this guide to work with Hive UDFs in Spark: http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/
And here's an example of invoking CSVSerde: https://community.hortonworks.com/content/kbentry/8313/apache-hive-csv-serde-example.html
Created on 02-15-2016 07:19 AM
I think I've got my DDL right, but I can't get Spark (via Zeppelin) to see the CSVSerDe. Any thoughts? I've tried playing around with various driver-class-path and library-class-path settings, both in the Zeppelin interpreter settings and in the Spark configuration via Ambari, but haven't figured this one out yet. Specifically, this is via Zeppelin on the HDP Sandbox 2.3.2.0-2950.
%sql
CREATE EXTERNAL TABLE demo_table (
  rid STRING,
  hospital_name STRING,
  hospital_type STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/zeppelin/directory'
TBLPROPERTIES ('skip.header.line.count'='1')

org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hadoop.hive.serde2.OpenCSVSerde
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:350)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:327)
Created on 02-15-2016 02:18 PM
@Paul Boal this is an article, and our thread is becoming too large - next time, please open a new question on HCC instead. Here's a solution you can try: http://zeltov.blogspot.com/2015/11/external-jars-not-getting-picked-up-in_9.html
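For anyone hitting the same "Cannot validate serde" error: one common approach is to put the Hive serde jar on Spark's classpath when launching. A hypothetical sketch only - the jar name and path vary by HDP version, so locate the actual jar on your own cluster first:

```shell
# Find the hive-serde jar on your cluster (path below is a typical
# HDP layout, not guaranteed):
find /usr/hdp -name 'hive-serde*.jar'

# Pass it to Spark via --jars so the driver and executors can load
# OpenCSVSerde:
spark-shell --jars /usr/hdp/current/hive-client/lib/hive-serde.jar
```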
Created on 11-08-2016 12:27 PM
a) What is the source data for the two tables, test_csv_serde and test_csv_serde_using_CSV_Serde_reader? Which of the options below should I consider?
b) Is it the same for both tables?
Option 1:
col1 col2 col3
----------------------
121 Hello World 4567
232 Text 5678
343 More Text 6789

Option 2:
121|Hello World|4567|
232|Text|5678|
343|More Text|6789|

Option 3:
121 'Hello World' 4567
232 Text 5678
343 'More Text' 6789
Created on 11-08-2016 09:56 PM
Hi @vamsi valiveti.
The example above uses the exact same source file in the exact same location for both external tables. Both the test_csv_serde and test_csv_serde_using_CSV_Serde_reader tables read the external file(s) stored in the directory '/user/<uname>/elt/test_csvserde/'.
The file I used was pipe delimited and contains about 62,000,000 rows - so I didn't attach it. 😉
It would look like Option 2 above, but of course with 4 columns:
121|Hello World|4567|34345
232|Text|5678|78678
343|More Text|6789|342134
Created on 11-22-2017 05:54 AM
If there is a multi-character line separator like '\r\n', how do you handle that in the serde?