Import data from multiple servers

Explorer

Hi all,

I need to import Windows IIS web logs from multiple servers into HDFS.

Each server has a file share that exposes its logs folder. For example:

\\server01\web_log\00001.log

\\server01\web_log\00002.log

\\server02\web_log\00001.log

...

I would like to use Pig to import the data into Hadoop, and the final result on HDFS should be:

/tmp/iis_logs/server01/00001.log

/tmp/iis_logs/server01/00002.log

/tmp/iis_logs/server02/00001.log

What is the best approach to accomplish this?

Thank you in advance.

Andrea

1 ACCEPTED SOLUTION

Explorer

Thank you all. In the end we solved it with a PowerShell script that creates Pig and Hive files and executes them in their shells. @Artem Ervits I'm interested in NiFi, but I didn't find any info about NiFi on Windows.

9 REPLIES

Master Mentor

@Andrea Squizzato

You cannot use Pig for data import. Look into the WebHDFS option.
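To make the WebHDFS suggestion more concrete, here is a minimal Python sketch that only builds the URL for the first step of a WebHDFS file upload (`op=CREATE`). The host name, user name, and the helper `webhdfs_create_url` are hypothetical, and port 50070 is assumed as the NameNode HTTP port (Hadoop 3 uses 9870 by default), so adjust these for your cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(namenode, hdfs_path, user, port=50070, overwrite=False):
    """Build the WebHDFS URL for step 1 of a file upload (op=CREATE).

    namenode, user and port are placeholders for illustration only.
    """
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    # quote() keeps the slashes of the HDFS path and escapes anything unsafe.
    return f"http://{namenode}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


url = webhdfs_create_url(
    "namenode.example.com", "/tmp/iis_logs/server01/00001.log", "andrea"
)
print(url)
```

As far as I know, the upload itself is a two-step exchange: an HTTP PUT to this URL without a body is answered with a redirect to a DataNode, and the file content is then sent with a second PUT to the redirect location. That part is left out here because it needs a live cluster.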

Master Mentor
@Andrea Squizzato

Take a look at Apache NiFi or Flume. You can watch a directory, and it will upload your files automatically when they appear.

Master Guru

There are different options; it depends on how you want to do it. I often end up with a bit of Python glue code on the edge node.

There is not really a "best" way to do it.

I have used:

- Flume (good if you want to merge all files into a log stream and perhaps filter events)

- WebHDFS (good if you want to upload files as is but cannot access an edge node)

- a folder mounted on the edge node, with a shell script running in cron

This is perhaps the easiest. For a secure mount there is sshfs, and you can just run the hadoop fs -put commands in the shell script.

- rsync to sync a folder to an edge node, with a Python program there to pick up the files

The file logic was more difficult, so Python was better than shell.

- rsync to copy a log folder, with a Python script to load files incrementally

Since the log files were supposed to load incrementally, the Python script kept an offset with tell() for each file and uploaded only the new results.
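The offset idea from the last option can be sketched roughly as follows. This is not the original script, only an illustration: the helper names `read_new_bytes` and `collect_increments` are made up, the offsets are held in a plain dictionary (a real job would persist them between runs), and the upload step is left out.

```python
import os


def read_new_bytes(path, offset):
    """Return (data, new_offset) with everything appended since `offset`."""
    # Assumption: a file smaller than the saved offset was rotated or
    # truncated, so reading restarts from the beginning.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
        # tell() reports the position after the read, i.e. the next offset.
        return data, f.tell()


def collect_increments(paths, offsets):
    """Read the new part of each file and update `offsets` in place."""
    increments = {}
    for path in paths:
        data, offsets[path] = read_new_bytes(path, offsets.get(path, 0))
        if data:
            increments[path] = data
    return increments
```

Each run then only has to upload the returned increments and save the updated offsets for the next run.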

My tip:

If you can, mount the log folders on the edge node and use the Hadoop client API for full file loads.

If you want incremental loads and pre-processing before HDFS, look at NiFi or Flume.
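For the full-file case, a small Python sketch of the glue code could look like this. It assumes a hypothetical layout in which each server's share is mounted as `<mount_root>/<server>/` on the edge node, uses the `hadoop fs -put` command mentioned above rather than the client API, and assumes the target directories already exist in HDFS. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_put_commands(mount_root, hdfs_root="/tmp/iis_logs"):
    """Return one 'hadoop fs -put' command per log file under mount_root.

    Assumed (hypothetical) layout: <mount_root>/<server>/<file>.log
    """
    commands = []
    for server_dir in sorted(Path(mount_root).iterdir()):
        if not server_dir.is_dir():
            continue
        for log_file in sorted(server_dir.glob("*.log")):
            # The server name becomes the HDFS sub-folder, as in the question.
            target = f"{hdfs_root}/{server_dir.name}/{log_file.name}"
            commands.append(["hadoop", "fs", "-put", str(log_file), target])
    return commands


def run_all(commands):
    """Run the commands one by one and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_put_commands("/mnt/iis"))` from cron would then copy `server01/00001.log` to `/tmp/iis_logs/server01/00001.log`, and so on; `/mnt/iis` is only a placeholder for the real mount point.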

avatar

@Andrea Squizzato As @Artem Ervits suggested, I'd start with Flume. Flume can handle multiple channels. I think this is the design you're looking for:

[Image: 2016-02-03-07-42-02.png]

Alternatively, you can avoid the aggregation and write directly into the HDFS folders with the HDFS sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink). You may also want to consider writing directly into a Hive sink (http://flume.apache.org/FlumeUserGuide.html#hive-sink).

Explorer

Thank you all. I see there are a lot of options, so I have to add more details:

- The web servers are stressed and I don't want to add any workload (agent or similar)

- I need to load the previous day's logs, so I can do it in a "scheduled" way

- I created a Hive external table and I want to keep only the last 30 days of logs

I have no Python skills and I would like to implement something simple.

Perhaps a shell script or a PowerShell script is the way to go.

What about Oozie?

Thank you all.

Master Mentor

@Andrea Squizzato Flume and NiFi can both upload old logs, and you can control whether you want to purge the logs afterwards. A PowerShell or shell script will work too. You will also be able to execute shell scripts inside NiFi processors with the next release. Trust me, you won't regret looking at NiFi. It makes things a lot simpler than script, load, script, schedule, etc. Oozie will do it on a schedule basis, and NiFi has cron capabilities too. NiFi is a continuous stream of data, while Oozie and Sqoop are schedule based.

Explorer

Thank you all. In the end we solved it with a PowerShell script that creates Pig and Hive files and executes them in their shells. @Artem Ervits I'm interested in NiFi, but I didn't find any info about NiFi on Windows.

Master Mentor

@Andrea Squizzato It's a JVM program and Windows is supported. Here's the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html