I'm trying to use the ConsumeIMAP -> ExtractEmailAttachments processors with Google Mail. It gets the messages, but fails with: "Message failed RFC2822 validation".
From the source code I see that the From and sentDate headers must be set.
I thought something might be wrong with the Gmail server, so I wrote Python code using imaplib and email, and it works. This code runs outside of NiFi.
Does anyone have any experience with this problem, or a clue how to fix it?
If it is not sensitive, would you be able to provide an example of the flow file content that is failing?
You could get this from using provenance, or from routing the failure of ExtractEmailAttachments to a directory using PutFile.
I get this from provenance:
--Apple-Mail=_851C8E02-DA29-4CEB-8309-895E2E5B1FB3 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Test body --Apple-Mail=_851C8E02-DA29-4CEB-8309-895E2E5B1FB3 Content-Disposition: attachment; filename=test.csv Content-Type: text/csv; name="test.csv" Content-Transfer-Encoding: quoted-printable foo;bar;;;;;=0D 1;2;;;;;=0D 3;4;;;;;=0D 5;6;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;=0D ;;;;;;= --Apple-Mail=_851C8E02-DA29-4CEB-8309-895E2E5B1FB3--
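The captured content starts directly at a MIME boundary, so there is no top-level header block at all. A minimal Python sketch of a header check like the one described above (not NiFi's actual code; it assumes the sent date maps to the standard Date header, and `missing_headers` is a hypothetical helper name) makes this visible:

```python
from email import message_from_string

# Beginning of the content captured from provenance: it opens with a MIME
# boundary line and has no top-level headers in front of it.
captured = (
    "--Apple-Mail=_851C8E02-DA29-4CEB-8309-895E2E5B1FB3\n"
    "Content-Transfer-Encoding: 7bit\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Test body\n"
)


def missing_headers(raw, required=("From", "Date")):
    """Return the required header names that are absent from the raw message.

    Assumption: the "sentDate" check corresponds to the Date header.
    """
    msg = message_from_string(raw)
    return [name for name in required if msg[name] is None]


print(missing_headers(captured))
```

Both required headers are reported as missing for this content, which is consistent with the validation error.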
I had a feeling it was a problem with the output of ConsumeIMAP; that JIRA definitely looks like what you are seeing. I'm glad we have already captured the issue, although I'm sorry that it is causing you problems.
No problem, I just need to create some hack. And the good thing is that I now know how the email processors work in NiFi (I read all the code).
Maybe I can still use the ConsumeIMAP processor to watch for new messages and then route to ExecuteScript, which will run Python code to extract the attachment and pass it to the next processor? The only issue I see is how to detect new messages; I am not sure whether the session is persistent or something like that. I could mark them as read after I download them. I don't have much experience with email, but I will try. If you have a suggestion for how to hack this, please feel free to share it.
I think what you suggested makes sense. I am not very familiar with these email processors, but if you are still using ConsumeIMAP, I think it would handle getting the new messages and marking them as read. All your script would do is receive a flow file with the message in it and parse it like ExtractEmailAttachments does, but working around the missing headers.
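As a rough illustration of that idea, here is a standalone Python sketch that parses such content and pulls out the attachments while working around the missing headers. It is only a sketch under assumptions: the function names are hypothetical, the From address and Date value are placeholders rather than recovered data, the multipart/mixed type is guessed from the leading boundary line, and the wiring into ExecuteScript (reading and writing the flow file) is left out.

```python
from email import message_from_bytes
from email.utils import formatdate


def add_missing_headers(raw, sender="unknown@example.invalid"):
    """Prepend a synthetic header block when the content starts at a MIME boundary.

    The sender and the Date value are placeholders (assumptions), not values
    taken from the original message.
    """
    first_line = raw.split(b"\n", 1)[0].strip()
    if not first_line.startswith(b"--"):
        return raw  # assume a header block is already present
    boundary = first_line[2:].decode("ascii")
    headers = (
        "From: {}\r\n"
        "Date: {}\r\n"
        "MIME-Version: 1.0\r\n"
        'Content-Type: multipart/mixed; boundary="{}"\r\n'
        "\r\n"
    ).format(sender, formatdate(), boundary)
    return headers.encode("ascii") + raw


def extract_attachments(raw):
    """Return a list of (filename, payload_bytes), one entry per attachment part."""
    msg = message_from_bytes(add_missing_headers(raw))
    found = []
    for part in msg.walk():
        if part.get_content_disposition() == "attachment":
            # decode=True undoes the quoted-printable / base64 transfer encoding
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found
```

Calling `extract_attachments(flow_file_bytes)` on content like the provenance sample above would be expected to yield a single entry for test.csv.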
Hi, you can certainly mark the message as read and adjust the Python script to only read when needed. On a separate note, have you tested the POP3 processor? Many email providers, like Gmail, Exchange, etc., offer both protocols to user agents. I'm curious to know if the same issue happens with those as well. Cheers
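For the "only read when needed" part, a minimal sketch with Python's imaplib could search for unread messages only. It assumes an already logged-in connection object; `fetch_unseen` is a hypothetical helper name, and the host and credentials in the usage comment are placeholders.

```python
import imaplib


def fetch_unseen(conn, mailbox="INBOX"):
    """Return the raw RFC 822 bytes of every unread message in the mailbox.

    `conn` is an already logged-in imaplib.IMAP4 or IMAP4_SSL object.
    Fetching the full RFC822 body marks the message as seen on the server,
    so the next UNSEEN search does not return it again.
    """
    conn.select(mailbox)
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        return []
    messages = []
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        # The message itself arrives as the second item of a tuple.
        for part in parts:
            if isinstance(part, tuple):
                messages.append(part[1])
    return messages


# Usage (placeholder host and credentials, not run here):
# conn = imaplib.IMAP4_SSL("imap.example.com")
# conn.login("user", "password")
# raw_messages = fetch_unseen(conn)
```

Relying on the unread flag means a message that was opened in another mail client first would be skipped, so this only works if the mailbox is read by the script alone.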
POP3 is also failing.
In the end I wrote a Python script using imaplib to fetch the messages after ConsumeIMAP fires.
ConsumeIMAP -> ExecuteScript -> ExtractEmailAttachments
But I don't like this solution, because I am downloading the messages twice.
@blood9raven, thanks for your message. I have just added a patch attempting to solve the bug you hit; would you be able to test it and let me know if it works?
The patch can be found on the JIRA page you linked previously. Cheers