Import data from multiple servers
Labels: Apache Pig
Created 02-03-2016 01:07 PM
Hi all,
I need to import Windows IIS web logs from multiple servers into HDFS.
Each server has a file share that exposes its logs folder, for example:
\\server01\web_log\00001.log
\\server01\web_log\00002.log
\\server02\web_log\00001.log
...
I would like to use Pig to import the data into Hadoop, and the final result on HDFS should be:
/tmp/iis_logs/server01/00001.log
/tmp/iis_logs/server01/00002.log
/tmp/iis_logs/server02/00001.log
What is the best approach to accomplish this?
Thank you in advance.
Andrea
Created 02-03-2016 01:09 PM
You cannot use Pig to import data into HDFS; look into the WebHDFS option.
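For reference, a WebHDFS upload is a two-step REST call: ask the NameNode where to write, then stream the bytes to the DataNode it redirects you to. A minimal Python sketch, where the NameNode host/port, user name, and paths are placeholders and WebHDFS is assumed to be enabled on the cluster:

    import requests  # assumes the requests library is installed

    NAMENODE = "http://namenode.example.com:50070"   # placeholder NameNode host/port
    HDFS_PATH = "/tmp/iis_logs/server01/00001.log"   # target layout from the question
    LOCAL_FILE = r"\\server01\web_log\00001.log"     # UNC path to one IIS log

    # Step 1: CREATE against the NameNode; it answers with a 307 redirect
    # pointing at the DataNode that will receive the data.
    url = "{}/webhdfs/v1{}?op=CREATE&overwrite=true&user.name=hdfs".format(
        NAMENODE, HDFS_PATH)
    r = requests.put(url, allow_redirects=False)
    datanode_url = r.headers["Location"]

    # Step 2: send the file contents to the redirected DataNode location.
    with open(LOCAL_FILE, "rb") as f:
        requests.put(datanode_url, data=f)

The redirect dance is how WebHDFS keeps file data from flowing through the NameNode, and because it is plain HTTP it works from a Windows host with no Hadoop client installed.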
Created 02-03-2016 01:11 PM
Take a look at Apache NiFi or Flume. You can watch a directory, and it will upload your files automatically when they appear.
Created 02-03-2016 01:44 PM
There are different options, depending on how you want to do it. Often I end up with a bit of Python glue code on the edge node. There is not really a "best" way to do it. I have used:
- Flume (good if you want to merge all files into a log stream and perhaps filter events)
- WebHDFS (good if you want to upload files as-is but cannot access an edge node)
- A folder mounted on the edge node with a shell script running in cron. This is perhaps the easiest; for a secure mount there is sshfs, and you can just run hadoop fs -put commands in the shell script.
- rsync to sync a folder to an edge node, with a Python program there to pick up the files. The file logic was more involved, so Python was better than shell.
- rsync to copy a log folder, with a Python script to load the files incrementally. Since the log files were supposed to load incrementally, the Python script kept an offset with tell() for each file and uploaded only the new data (a sketch of that pattern follows at the end of this post).
My tip: if you can, mount the log folders on the edge node and use the Hadoop client for full-file loads. If you want incremental loads and pre-processing before HDFS, look at NiFi or Flume.
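To make the incremental pattern from that bullet concrete, here is a rough sketch; the log directory, offset file, and HDFS destination are placeholders, and the append step assumes append support is enabled on the cluster:

    import json
    import os
    import subprocess
    import tempfile

    LOG_DIR = "/mnt/server01/web_log"       # placeholder: mounted/rsynced log folder
    OFFSETS = "/var/tmp/iis_offsets.json"   # placeholder: per-file offsets from the last run
    HDFS_DIR = "/tmp/iis_logs/server01"     # destination layout from the question

    # Load the offsets recorded on the previous run (empty on the first run).
    offsets = {}
    if os.path.exists(OFFSETS):
        with open(OFFSETS) as f:
            offsets = json.load(f)

    for name in sorted(os.listdir(LOG_DIR)):
        start = offsets.get(name, 0)
        with open(os.path.join(LOG_DIR, name), "rb") as f:
            f.seek(start)               # resume where the previous run stopped
            new_data = f.read()
            offsets[name] = f.tell()    # remember the new offset for next time
        if not new_data:
            continue
        dst = "{}/{}".format(HDFS_DIR, name)
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(new_data)
        if start == 0:
            # First time we see this file: upload it whole.
            subprocess.check_call(["hadoop", "fs", "-put", "-f", tmp.name, dst])
        else:
            # Otherwise append only the bytes written since the last run.
            subprocess.check_call(["hadoop", "fs", "-appendToFile", tmp.name, dst])
        os.unlink(tmp.name)

    # Persist the offsets for the next scheduled run.
    with open(OFFSETS, "w") as f:
        json.dump(offsets, f)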
Created 02-03-2016 01:47 PM
@Andrea Squizzato As @Artem Ervits suggested, I'd start with Flume. Flume can handle multiple channels, and I think this is the design you're looking for.
Alternatively, you can avoid the aggregation tier and write directly into the HDFS folders with the HDFS sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink). You may also want to consider writing directly into a Hive sink (http://flume.apache.org/FlumeUserGuide.html#hive-sink).
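For reference, a single-agent Flume configuration for the direct-to-HDFS layout could look roughly like this; the agent name, spool directory, and capacities are placeholders, and it assumes the log share is mounted (or copied) into a local folder the agent can watch:

    # Hypothetical agent "a1": spooling-directory source -> memory channel -> HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Watch a local folder; spooldir expects files to be complete and
    # immutable once they appear there.
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /mnt/server01/web_log
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Write events straight into the target HDFS layout from the question.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /tmp/iis_logs/server01
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.rollInterval = 300
    a1.sinks.k1.channel = c1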
Created 02-03-2016 03:19 PM
Thank you all. I see there are a lot of options, so I have to add more details:
- The web servers are already under load and I don't want to add any extra workload on them (an agent or similar).
- I need to load the previous day's logs, so I can do it in a scheduled way.
- I created a Hive external table and I want to keep only the last 30 days of logs (see the retention sketch below).
I have no Python skills and I would like to implement something simple. Perhaps a shell script or a PowerShell script could be the way. What about Oozie?
Thank you all.
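As an illustration of the simple scheduled-script route, a rough Python sketch of the 30-day cleanup on the HDFS side; the paths are placeholders, and it shells out to the Hadoop CLI so nothing beyond the standard library is needed:

    import subprocess
    from datetime import datetime, timedelta

    HDFS_DIR = "/tmp/iis_logs"              # layout from the original question
    cutoff = datetime.now() - timedelta(days=30)

    # Recursive listing; each file line looks like:
    # -rw-r--r--  3 hdfs hdfs  1234 2016-02-03 13:07 /tmp/iis_logs/server01/00001.log
    listing = subprocess.check_output(["hadoop", "fs", "-ls", "-R", HDFS_DIR]).decode()
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue                         # skip directories and summary lines
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        if modified < cutoff:
            subprocess.check_call(["hadoop", "fs", "-rm", parts[7]])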
Created 02-03-2016 03:22 PM
@Andrea Squizzato Flume and NiFi can both upload old logs, and you can control whether you want to purge the logs afterwards. A PowerShell or shell script will work too. You will also be able to execute shell scripts inside NiFi processors with the next release. Trust me, you won't regret looking at NiFi; it makes things a lot simpler than script, load, script, schedule, and so on. Oozie will do it on a schedule basis, and NiFi has cron capabilities as well. NiFi is a continuous stream of data, while Oozie and Sqoop are schedule based.
Created 02-14-2016 05:04 PM
Thank you all. In the end we solved it with a PowerShell script that creates the Pig and Hive files and executes them through their shells. @Artem Ervits I'm interested in NiFi, but I couldn't find any info about NiFi on Windows.
Created 02-14-2016 05:45 PM
@Andrea Squizzato It's a JVM program and Windows is supported; here's the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
