How to read FTP server files and load them into HDFS as an incremental load using Python
Labels: Apache Hadoop
Created 08-18-2017 08:41 PM
2 Replies
Contributor
Created 08-19-2017 08:38 PM
I don't know Python, but if you don't specifically need to use Python, you can use NiFi.
NiFi has many processors for this purpose: you can fetch files from FTP, the local file system, or HDFS, and ingest them into HDFS.
I hope this helps.
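For example (a sketch, assuming the standard NiFi processors): ListFTP -> FetchFTP -> PutHDFS. ListFTP keeps state about which files it has already seen, so only new files flow through on each run, which covers the incremental part of the question.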
Expert Contributor
Created 08-21-2017 07:57 AM
@swathi thukkaraju
I'm not completely sure what you mean by 'incremental load format', but here are some hints:
- To read files from an FTP server, you can simply use the built-in Python module urllib, more specifically urlopen or urlretrieve.
- To write to HDFS you can:
  - Use an external library, like HdfsCLI.
  - Use the HDFS shell and call it from Python with subprocess (see the sketch after this list).
  - Mount HDFS with the HDFS NFS Gateway and simply write with the normal write() method. Beware: with this solution you won't be able to append!
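A minimal sketch of the shell-via-subprocess option, combined with urlretrieve for the FTP download. The URL and paths below are placeholders for illustration:

from urllib.request import urlretrieve
import subprocess

# Placeholder FTP URL and paths; adjust for your environment
ftp_url = 'ftp://user:password@ftp.example.com/path/data.csv'
local_path = '/tmp/data.csv'
hdfs_path = '/data/landing/data.csv'

# Download the file from the FTP server to local disk
urlretrieve(ftp_url, local_path)

# Copy it into HDFS; -appendToFile appends on repeated runs,
# use -put instead for a plain one-off copy
subprocess.run(['hdfs', 'dfs', '-appendToFile', local_path, hdfs_path], check=True)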
Here's an implementation using urlopen and HdfsCLI. To try it, first install HdfsCLI with pip install hdfs.
from urllib.request import urlopen
from hdfs import InsecureClient  # You can also use KerberosClient or a custom client

namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port'  # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'

client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)

ftp_address = 'your ftp address'
hdfs_path = 'where you want to write'

# Read the file from the FTP server
with urlopen(ftp_address) as response:
    content = response.read()

# You can also use append=True
# Further reference: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
with client.write(hdfs_path) as writer:
    writer.write(content)
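To make repeated runs incremental with this approach (a sketch, based on the append option in the HdfsCLI docs linked above): let the first run create the file, then append on later runs.

# Append on subsequent runs; the target file must already exist in HDFS
with client.write(hdfs_path, append=True) as writer:
    writer.write(content)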
