Created 08-31-2016 09:27 AM
I get some rows broken like:
admin,base,Default,configurable,"50,76,188,467",IN1541MTODREWHT-187,1,Plush Maxi Dress,Buy Plush Maxi Dress Online | Maxi Dresses | StalkBuyLove,/i/n/in1541mtodrewht-187-front.jpg,/i/n/in1541mtodrewht-187-front.jpg,plush-maxi-dress,plush-maxi-dress-97763-SBLPR.html,"""","""",Block after Info Column,"""","""","""", ,No,"""",No,Tax 5,"""",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL ",61.8""",
61.8 actually comes into next row and when I open it in excel , it looks like;
,61.8"
in the field.
I don't want my rows breaking. It happens after I use CSV serde and then download the data from hive.
Created 08-31-2016 01:25 PM
Without seeing the input file you may need to do a pre-processing step in pig or do a second 'create table as' step in order to reformat the data correctly.
It's great to use csv-serde since it does such a good job of stripping out quoted text among other things, but you may need that extra processing in order to use csv-serde effectively (handling NULLs and double-quotes the way you want it to).
Created 08-31-2016 10:38 AM
Hi @Simran Kaur.
This looks like a very complex set of data with a wildly varying structure. Pipe characters, empty strings (""""), forward slashes, quoted and unquoted text, overloaded string fields delimited internally with pipes, etc.
From what you state, It appears that this data came from a source (file or RDBMS) -> then was loaded into HDFS -> then a Hive table structure placed on the file using CSVSerde.
Are you using comma as the delimiter for the table? If you could paste in the table DDL it would help.
My suspicion is that the data in the source data file is not correctly formatted so it causing the CSVSerde to have a weird value in that column. I suspect that [in the Hive table] the final column in question actually holds this 14 character literal string:
NULL ",61.8"""
Please validate if that is true via a select statement.
Also please confirm that the source file does not have a <CR><LF> character after the NULL and before the double-quote.
In either case - the file may require some data cleansing, and it may make sense to use a different delimiter on the source file - perhaps a tilde "~".
Created 08-31-2016 11:39 AM
@bpreachuk: You are right about that. I exported it from RDB and mapped it to hive. but my CSV seems to have just the right structure. Just that when I map it to CSVserde, it changed nulll values into """" and then I see " as value in the field instead of NULLS. similarly, it gives me values like 61.8 with random quotes and that is creating the problem
Created 08-31-2016 01:25 PM
Without seeing the input file you may need to do a pre-processing step in pig or do a second 'create table as' step in order to reformat the data correctly.
It's great to use csv-serde since it does such a good job of stripping out quoted text among other things, but you may need that extra processing in order to use csv-serde effectively (handling NULLs and double-quotes the way you want it to).