How to read FTP server files and load them into HDFS in incremental load format using Python



I don't know Python, but if you don't specifically need to use Python, you can use NiFi.

NiFi has many processors for this purpose.

You can get files from FTP, the file system, or HDFS, and ingest them into HDFS.

I hope this helps.

@swathi thukkaraju

I'm not completely sure what you mean by 'incremental load format', but here are some hints:

  • To read FTP server files you can simply use the built-in Python module urllib.request, more specifically urlopen or urlretrieve
  • To write to HDFS you can
    • Use an external library, like HdfsCLI
    • Use the HDFS shell and call it from Python with subprocess
    • Mount your HDFS with the HDFS NFS Gateway and simply write with the normal write() method. Beware that you won't be able to append with this solution!
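
For the HDFS shell option, a minimal sketch could look like the following. It assumes the hdfs command is on the PATH of the machine running Python. The helper names build_put_command and put_to_hdfs are made up for illustration and are not part of any library.

```python
import subprocess


def build_put_command(local_path, hdfs_path, overwrite=False):
    """Build the argument list for 'hdfs dfs -put' (hypothetical helper)."""
    command = ['hdfs', 'dfs', '-put']
    if overwrite:
        command.append('-f')  # -f replaces the destination if it already exists
    command += [local_path, hdfs_path]
    return command


def put_to_hdfs(local_path, hdfs_path, overwrite=False):
    """Copy a local file to HDFS; raises CalledProcessError if the shell command fails."""
    subprocess.run(build_put_command(local_path, hdfs_path, overwrite), check=True)
```

You would first download the file locally (for example with urlretrieve) and then call put_to_hdfs with the local path and the target HDFS path. If you need to append, the HDFS shell also has an -appendToFile command.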

Here's an implementation for you using urlopen and HdfsCLI. To try it, first install HdfsCLI with pip install hdfs.

from urllib.request import urlopen
from hdfs import InsecureClient

# You can also use KerberosClient or custom client
namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port' # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'
client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)

ftp_address = 'your ftp address'  # full URL of the file, starting with ftp://
hdfs_path = 'where you want to write'
with urlopen(ftp_address) as response:
    # Read the whole remote file into memory as bytes
    content = response.read()
    # You can also use append=True
    # Further reference:
    with client.write(hdfs_path) as writer:
        writer.write(content)
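
If by 'incremental load' you mean copying only the files that are not in HDFS yet, one possible approach is to compare the FTP directory listing with the HDFS directory listing and transfer the difference. The sketch below is only that, a sketch: it assumes ftp is a logged-in ftplib.FTP connection, client is an HdfsCLI client like the one above, the HDFS directory already exists, and the FTP directory contains only plain files. The function names select_new_files and incremental_load are hypothetical.

```python
import io
import posixpath


def select_new_files(remote_names, loaded_names):
    """Return the remote file names that have not been loaded yet, sorted."""
    loaded = set(loaded_names)
    return sorted(name for name in remote_names if name not in loaded)


def incremental_load(ftp, client, ftp_dir, hdfs_dir):
    """Copy files present in ftp_dir but missing from hdfs_dir; return their names."""
    ftp.cwd(ftp_dir)
    # ftp.nlst() lists names in the current FTP directory,
    # client.list() lists names in the HDFS directory
    new_files = select_new_files(ftp.nlst(), client.list(hdfs_dir))
    for name in new_files:
        buffer = io.BytesIO()
        # Download the file into memory; fine for small files only
        ftp.retrbinary('RETR ' + name, buffer.write)
        client.write(posixpath.join(hdfs_dir, name), data=buffer.getvalue())
    return new_files


# Possible usage (placeholders, not real addresses):
# from ftplib import FTP
# ftp = FTP('your ftp host')
# ftp.login('your ftp user', 'your ftp password')
# incremental_load(ftp, client, 'your ftp directory', 'your hdfs directory')
```

Running it repeatedly (for example from cron) would pick up only the files added since the previous run. It compares by file name only, so a file that changes on the FTP server under the same name would not be reloaded.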