Created 02-03-2016 01:07 PM
Hi all,
I need to import Windows IIS web logs from multiple servers into HDFS.
Each server has a file share that exposes the logs folder, for example:
\\server01\web_log\00001.log
\\server01\web_log\00002.log
\\server02\web_log\00001.log
...
I would like to use Pig to import the data into Hadoop, and the final result on HDFS should be:
/tmp/iis_logs/server01/00001.log
/tmp/iis_logs/server01/00002.log
/tmp/iis_logs/server02/00001.log
What is the best approach to accomplish this?
Thank you in advance.
Andrea
Created 02-03-2016 01:09 PM
You cannot use Pig for data import. Look into the WebHDFS option.
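For example, here is a minimal sketch of pushing one log file over the WebHDFS REST API from Python (the NameNode host, port 50070, user name and paths below are placeholder assumptions for an unsecured cluster):

import requests

NAMENODE = "http://namenode.example.com:50070"   # assumption: default WebHDFS port, no Kerberos
LOCAL_FILE = r"\\server01\web_log\00001.log"     # UNC path readable from the machine running this
HDFS_PATH = "/tmp/iis_logs/server01/00001.log"

# Step 1: ask the NameNode where to write; it replies with a 307 redirect to a DataNode
url = "{0}/webhdfs/v1{1}?op=CREATE&user.name=hdfs&overwrite=true".format(NAMENODE, HDFS_PATH)
resp = requests.put(url, allow_redirects=False)
datanode_url = resp.headers["Location"]

# Step 2: send the file body to the DataNode URL returned above
with open(LOCAL_FILE, "rb") as f:
    requests.put(datanode_url, data=f)

Since it is plain HTTP, the same two calls can also be made from a PowerShell or shell script if you don't want Python on the web servers.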
Created 02-03-2016 01:11 PM
Take a look at Apache NiFi or Flume. You can watch a directory, and your files will be uploaded automatically when they appear.
Created 02-03-2016 01:44 PM
There are different options, depending on how you want to do it. Often I end up with a bit of Python glue code on the edge node.
There is not really a "best" way to do it.
I have used:
- Flume (good if you want to merge all files into a log stream and perhaps filter events)
- WebHDFS (good if you want to upload files as-is but cannot access an edge node)
- A folder mounted on the edge node and a shell script running in cron. This is perhaps the easiest; for a secure mount there is sshfs, and the script just runs hadoop fs -put commands.
- rsync to sync a folder to an edge node plus a Python program there to pick up the files. The file logic was more involved, so Python was a better fit than shell.
- rsync to copy a log folder plus a Python script to load files incrementally. Since the log files had to be loaded incrementally, the script kept an offset with tell() for each file and uploaded only the new records (rough sketch below).
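Roughly what that incremental loader did, as a sketch only (the directories, the offsets file and the use of hdfs dfs -appendToFile are assumptions, not the original code):

import glob
import json
import os
import subprocess

LOG_DIR = "/data/web_log/server01"      # assumption: local copy of the logs kept up to date by rsync
OFFSETS = "/var/tmp/iis_offsets.json"   # remembers how much of each file was already uploaded
HDFS_DIR = "/tmp/iis_logs/server01"

offsets = {}
if os.path.exists(OFFSETS):
    with open(OFFSETS) as f:
        offsets = json.load(f)

for path in glob.glob(os.path.join(LOG_DIR, "*.log")):
    name = os.path.basename(path)
    with open(path, "rb") as f:
        f.seek(offsets.get(name, 0))    # skip what was uploaded in earlier runs
        new_data = f.read()
        offsets[name] = f.tell()        # remember how far we got this time
    if new_data:
        # append only the new bytes to the matching file in HDFS ("-" reads the data from stdin)
        p = subprocess.Popen(["hdfs", "dfs", "-appendToFile", "-", HDFS_DIR + "/" + name],
                             stdin=subprocess.PIPE)
        p.communicate(new_data)

with open(OFFSETS, "w") as f:
    json.dump(offsets, f)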
My tip:
If you can, mount the log folders on the edge node and use the Hadoop client for full-file loads (small sketch below).
If you want incremental loads and pre-processing before HDFS, look at NiFi or Flume.
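For the full-file case the cron'ed glue code can stay very small. A sketch, assuming each server's share is mounted under /mnt/web_logs/<server> (the mount layout and HDFS root are assumptions based on the paths in the question):

import os
import subprocess

MOUNT_ROOT = "/mnt/web_logs"   # assumption: \\server01\web_log mounted as /mnt/web_logs/server01, etc.
HDFS_ROOT = "/tmp/iis_logs"

for server in sorted(os.listdir(MOUNT_ROOT)):
    local_dir = os.path.join(MOUNT_ROOT, server)
    hdfs_dir = "{0}/{1}".format(HDFS_ROOT, server)
    subprocess.call(["hadoop", "fs", "-mkdir", "-p", hdfs_dir])
    for name in sorted(os.listdir(local_dir)):
        # without -f, -put refuses to overwrite existing files, so re-running the script is harmless
        subprocess.call(["hadoop", "fs", "-put",
                         os.path.join(local_dir, name), "{0}/{1}".format(hdfs_dir, name)])

Run it from cron on the edge node and you end up with the /tmp/iis_logs/serverXX/xxxxx.log layout from the question.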
Created 02-03-2016 01:47 PM
@Andrea Squizzato As @Artem Ervits suggested, I'd start with Flume. Flume can handle multiple channels, and I think the design you're looking for is the usual consolidation setup: an agent per log source feeding an aggregating agent that writes to HDFS.
Alternatively, you can avoid the aggregation and write directly into the HDFS folders (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink). You may also want to consider writing directly into a Hive sink (http://flume.apache.org/FlumeUserGuide.html#hive-sink).
Created 02-03-2016 03:19 PM
Thank you all. I see there are a lot of options, so let me add more details:
- The web servers are already under load and I don't want to add any workload to them (an agent or similar)
- I need to load the previous day's logs, so I can do it in a scheduled way
- I created a Hive external table and I want to keep only the last 30 days of logs
I have no Python skills and I would like to implement something simple.
Perhaps a shell script or a PowerShell script could be the way.
What about Oozie?
Thank you all.
Created 02-03-2016 03:22 PM
@Andrea Squizzato Flume and NiFi can both upload old logs, and you can control whether you want to purge the logs afterwards. A PowerShell or shell script will work too, and you will be able to execute shell scripts inside NiFi processors with the next release as well. Trust me, you won't regret looking at NiFi; it makes things a lot simpler than the script, load, script, schedule cycle. Oozie will do it on a schedule basis, and NiFi has cron capabilities as well. NiFi works on a continuous stream of data, while Oozie and Sqoop are schedule based.
Created 02-14-2016 05:04 PM
Thank you all. In the end we solved it with a PowerShell script that creates the Pig and Hive scripts and executes them through their shells. @Artem Ervits I'm interested in NiFi, but I didn't find any info about NiFi on Windows.
Created 02-14-2016 05:45 PM
@Andrea Squizzato It's a JVM program and Windows is supported; here's the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html