Support Questions

Find answers, ask questions, and share your expertise

Nifi ListenSyslog howto ingest into HAWQ

avatar

Hi,

1. step we ingest SYSLOG messages with the ListenSyslog and PutHDFS in our DataLake.

2. step ingest with PXF external table technology into HAWQ.

CREATE EXTERNAL TABLE ext_syslog_hist
(message TEXT)
LOCATION ('pxf://xxx.xxx.com:8020/apps/nifi/syslog/history/*?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E'\t');

Every minute there are thousands of small (100 bytes) files created in the HDFS.
So I'm looking for a way to ingest the incoming ListenSyslog directly into a HAWQ table.

Regards
Timo

1 ACCEPTED SOLUTION

avatar

You could use PutSQL to issue INSERT statements directly against HAWQ. Assign the data item values to attributes, as described in the PutSQL documentation. Then put the INSERT statement in the content of the Flow File, with '?' placeholders for the individual values.

View solution in original post

9 REPLIES 9

avatar

You could use PutSQL to issue INSERT statements directly against HAWQ. Assign the data item values to attributes, as described in the PutSQL documentation. Then put the INSERT statement in the content of the Flow File, with '?' placeholders for the individual values.

avatar

Thanks,
this was really easy.
I just generating "insert into table" statements including the Flow File with the ReplaceText processor and connect it with the PutSQL processor.

regards
Timo

avatar
Contributor

Is this equivalent to: INSERT INTO <table> (col1, col2, col3) VALUES (?, ?, ?);

avatar
Master Guru

You could also use MergeContent before PutHDFS so that you can merge together all of the small files based on size or time before writing to HDFS.

avatar

Hi Bryan,

how can i merge the FlowData Files to a multiline File?
Using the MergeCentent processor i got a concatenated line.
input:

1 FlowData: <30>Aug 11 06:27:26 xxx.xxx.com systemd[28892]: Stopping Timers.

2 FlowData:<30>Aug 11 06:27:15 xxx.xxx.com systemd[24517]: Stopping Paths

etc...

MergeContent output:

<30>Aug 11 06:27:26 xxx.xxx.com systemd[28892]: Stopping Timers.<30>Aug 11 06:27:15 xxx.xxx.com systemd[24517]:Stopping Paths

I need the following multiline structure:

<30>Aug 11 06:27:26 xxx.xxx.com systemd[28892]: Stopping Timers.
<30>Aug 11 06:27:15 xxx.xxx.com systemd[24517]:Stopping Paths

any solution?

regards
Timo

avatar
Master Guru

Just set the "Delimiter Strategy" in MergeContent to "Text" and then set the "Demarcator" property to shift+enter to make a new line. It will insert the demarcator between each flow file that it merges.

avatar
Contributor

You definitely do not want to execute an INSERT for every row of data. That will be really, really slow.

I suggest you use the gpfdist protocol to load the log files directly into HAWQ. You start a gpfdist process where the files are located, create an external table that uses the gpfdist location, and then execute a single INSERT statement to load all of the data from the files into your HAWQ table.

avatar
Contributor

The PutSQL option is by far the worst option. This is the wrong answer. Please, don't do that. Hellmer clearly doesn't use HAWQ.

I wrote a blog post over a year ago on this subject. The singleton INSERT solution that is common for OLTP databases is the absolute worst solution for HAWQ. In other words, "INSERT INTO table_name (columns...) VALUES (...);"

Here is my blog post covering loading data into HAWQ: http://www.pivotalguru.com/?p=1425

avatar

Great, works !!!
Thanks!