Created 07-29-2020 02:47 PM
Hi,
I'm new to NiFi. I'm trying to parse CSV records with the SplitRecord processor using the CSVReader controller service, but I'm getting an IllegalArgumentException on the file even though the CSV seems to be well formatted and I'm able to parse and read it by other means. I'm getting the file through the GetFile processor, and when I view the content from the queue I notice a red dot (see screenshot) at the beginning, which seems to indicate an invalid character. When I open the file in a text editor, though, I don't see anything abnormal. This file comes from a third party and would be hard to change. When I copy the content into another file and save it, the new file works! Can someone please help me find what is causing the issue and how to fix it? Thanks
Created 07-30-2020 01:14 AM
Hello @SAMSAL ,
Thank you for reaching out with your issue about parsing a CSV with the SplitRecord processor.
The described behaviour sounds like an encoding issue. Have you tried to identify the encoding of your source data?
I've just quickly googled for "identify encoding" and found this tool, which might work.
(I have not tested it, so feel free to search further on this topic.)
Once you know the character encoding of the source data, please set the "Character Set" property in NiFi accordingly.
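As a rough illustration of the "identify the encoding" step, here is a minimal Python sketch that checks whether the data starts with a byte order mark (BOM). This is only one heuristic and is not the tool mentioned above: a file without a BOM returns None and needs a different detection method. The function names are hypothetical and the file path is whatever your source file is.

```python
import codecs

# Known byte order marks. UTF-32 entries come first because the UTF-32LE
# BOM begins with the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_encoding_from_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def guess_encoding_of_file(path):
    """Read only the first four bytes of the file and check them for a BOM."""
    with open(path, "rb") as f:
        return guess_encoding_from_bom(f.read(4))
```

Calling guess_encoding_of_file on the vendor file gives a candidate value to try in the "Character Set" field; whether NiFi then also skips the BOM itself is something to verify in your own flow.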
Please let us know if it helped by pressing the "Accept as Solution" button.
Kind regards:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Created 07-30-2020 09:31 AM
Hi Bender,
Thanks for the reply. I found a way to detect the encoding and set it accordingly in the CSV reader's Character Set field. That worked for some files but not others: I'm still getting errors that I don't get with other online CSV parsers such as https://csvjson.com/csv2json. Is there a way I can upload the file for you to check? I can't go back to the vendor and tell them there is a formatting issue if it works fine with other parsers.
Thanks
Created 07-31-2020 01:58 AM
Hello @SAMSAL ,
Thanks for your reply and the additional information.
What I would personally do is to keep collecting the data and note down the encodings detected to see if there is a pattern and do some research based on it.
Since the issue is not caused by any Cloudera product defects, we’ve reached the limit of what assistance can be provided via Community support. This thread will remain open if other community peers want to contribute.
If you’re a Cloudera Subscription Support customer, we can connect you with your Account team to explore a possible Services engagement for this request. Let us know if you’re interested in this path, and we’ll private message you to collect more information.
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 07-31-2020 12:17 PM
Thanks, Bender, for your help. I think your previous answer put me on the right path, because I found that my issue was a combination of figuring out the correct encoding and bad formatting. My statement about the other parser was inaccurate: the conversion to JSON happened without errors, but some records were incorrect. So kudos to NiFi for catching the inconsistencies, even though the error message did not provide much information on which record or field was inconsistent. I'm still curious to know what the red dot at the beginning of the text in the content viewer means, since I did not see it with other content. Can you elaborate on this? I would also like to know which encoding values the "Character Set" property in the CSV reader supports, and I wish they were listed in the CSV Reader manual (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services...)
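For anyone who needs to locate the bad records outside NiFi, here is a minimal Python sketch that reports rows whose field count differs from the header. It assumes the "bad formatting" is a field-count mismatch, which is only one possible kind of inconsistency, and the function name is hypothetical.

```python
import csv
import io


def find_inconsistent_rows(csv_text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to compare against
    expected = len(header)
    problems = []
    for row in reader:
        # Blank lines come back as empty lists; skip them rather than flag them.
        if row and len(row) != expected:
            problems.append((reader.line_num, len(row)))
    return problems
```

The line numbers come from the reader's own line counter, so a quoted field that spans several lines is reported at the line where the record ends.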
Thank you.
Created 08-05-2020 06:09 AM
Hi SAMSAL,
I believe we're hitting an issue similar to the one in this Stack Overflow thread. That is, the third-party utility that creates the CSV files adds a non-breaking space character to the file, which displays as a red dot. You mentioned that it's not trivial to change the input files, but maybe this is a minor adjustment that the vendor can make.
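If the vendor cannot change the files, the leading character can be inspected and removed before parsing. The Python sketch below is a minimal example under an assumption: this thread does not confirm the exact code point, so it treats both U+FEFF (byte order mark, also called zero-width no-break space) and U+00A0 (non-breaking space) as candidates. The function names are hypothetical.

```python
# Candidate code points for the invisible leading character. This is an
# assumption; check the output of describe_leading_chars on the real file.
INVISIBLE_LEADING = "\ufeff\u00a0"


def describe_leading_chars(text, count=3):
    """Return the first few characters as U+XXXX strings for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text[:count]]


def strip_leading_invisible(text):
    """Remove any run of the candidate characters from the start of the text."""
    return text.lstrip(INVISIBLE_LEADING)
```

Running describe_leading_chars on the decoded file content shows which character is actually present, which is also useful evidence to send to the vendor.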