
How to process commas in Spark when the source file is CSV (comma-delimited) and the data itself contains commas?


I need to load a CSV file into Hive tables through Spark. However, the data itself contains commas in several places, as content rather than as separators. This raises two questions:

1) How will Spark recognize that such a comma is not a separator and treat it as part of the data?

2) How can we process this data and load it into Hive while keeping the commas that are content and not separators?

Please share some techniques for achieving the above.


Re: How to process commas in Spark when the source file is CSV (comma-delimited) and the data itself contains commas?

Hi @HDave,

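If your text fields are enclosed in double quotation marks (or a similar quote character), this shouldn't be a problem: a CSV parser treats a comma inside a quoted field as data, not as a separator.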
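To illustrate why quoted fields are safe, here is a minimal sketch using Python's standard `csv` module rather than Spark itself. The sample rows and the helper name `parse_quoted_csv` are made up for illustration.

```python
import csv
import io

# Made-up sample: the address field contains a comma but is wrapped in double quotes.
raw = 'id,address,city\n1,"12 Main St, Apt 4",Boston\n2,"5 Oak Ave",Denver\n'


def parse_quoted_csv(text):
    """Parse comma-delimited text, treating commas inside double quotes as data."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    return [row for row in reader]


rows = parse_quoted_csv(raw)
for row in rows:
    print(row)
```

Every row comes back with three fields, and the quoted comma stays inside the address. Spark's CSV reader follows the same convention: its `quote` option defaults to `"`, so something like `spark.read.option("header", "true").csv(path)` should keep quoted commas as content. If the file escapes quotes by doubling them, setting the `escape` option to `"` may also be needed.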

Otherwise, you can't reliably split on a delimiter that also occurs unquoted inside the fields you want to separate.

So to answer your questions:

1) Do you have a sample dataset? Without quoting, Spark has no way to tell a content comma from a separator. You could try a regular expression tailored to your data, though I don't think it will work in most cases.

2) As mentioned above, text fields should be enclosed in double quotation marks. Even better, the best practice is to use a delimiter that never occurs in your fields.
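As a rough sketch of the second suggestion, the snippet below re-exports rows with a pipe delimiter and refuses to do so if any field already contains that character. It uses plain Python, not Spark; the function name `to_delimited` and the sample rows are hypothetical.

```python
def to_delimited(rows, delimiter="|"):
    """Join fields with `delimiter`, failing loudly if a field would be ambiguous."""
    lines = []
    for row in rows:
        for field in row:
            # A field containing the delimiter or a newline could not be split back safely.
            if delimiter in field or "\n" in field:
                raise ValueError(f"field {field!r} conflicts with delimiter {delimiter!r}")
        lines.append(delimiter.join(row))
    return "\n".join(lines) + "\n"


# Made-up sample rows; the address contains a comma, which is harmless here.
rows = [["1", "12 Main St, Apt 4", "Boston"], ["2", "5 Oak Ave", "Denver"]]
text = to_delimited(rows)
print(text)
```

A file written this way can then be read in Spark by setting the CSV reader's `sep` option to `|`, assuming the pipe character really never appears in the data.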

Best regards

Jan
