How to read FTP server files and load them into HDFS incrementally using Python

 
2 REPLIES
Re: How to read FTP server files and load them into HDFS incrementally using Python

Explorer

I don't know Python, but if you don't specifically need to use Python, you can use NiFi.

NiFi has many processors for this purpose.

You can get files from FTP, a file system, or HDFS, and ingest them into HDFS.

I hope this helps.

Re: How to read FTP server files and load them into HDFS incrementally using Python

Expert Contributor
@swathi thukkaraju

I'm not completely sure what you mean by 'incremental load format', but here are some hints:

  • To read files from an FTP server you can use the built-in Python module urllib (urllib.request in Python 3), specifically urlopen or urlretrieve
  • To write to HDFS you can
    • Use an external library, like HdfsCLI
    • Use the HDFS shell and call it from Python with subprocess
    • Mount your HDFS with the HDFS NFS Gateway and write with the normal write() method. Beware that with this solution you won't be able to append!
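
The HDFS shell option from the list above could look roughly like the sketch below. It assumes the hdfs command is on the PATH of the machine running the script and that the file has already been downloaded to a local path. The helper names build_put_command and put_to_hdfs are made up for this illustration.

```python
import subprocess


def build_put_command(local_path, hdfs_path, overwrite=True):
    """Build the argument list for `hdfs dfs -put` (hypothetical helper)."""
    cmd = ['hdfs', 'dfs', '-put']
    if overwrite:
        cmd.append('-f')  # -f overwrites the destination if it already exists
    cmd += [local_path, hdfs_path]
    return cmd


def put_to_hdfs(local_path, hdfs_path, overwrite=True):
    """Copy a local file to HDFS by calling the HDFS shell."""
    # check=True raises CalledProcessError if the hdfs command exits with an error
    subprocess.run(build_put_command(local_path, hdfs_path, overwrite), check=True)
```

Calling put_to_hdfs('/tmp/data.csv', '/landing/data.csv') would then run hdfs dfs -put -f /tmp/data.csv /landing/data.csv. Both paths are placeholders.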

Here's an implementation using urlopen and HdfsCLI. To try it, first install HdfsCLI with pip install hdfs.

from urllib.request import urlopen
from hdfs import InsecureClient

# You can also use KerberosClient or custom client
namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port' # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'
client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)

ftp_address = 'your ftp address'  # full URL, e.g. 'ftp://host/path/to/file'
hdfs_path = 'where you want to write'
with urlopen(ftp_address) as response:
    # Reads the whole file into memory before writing it to HDFS
    content = response.read()
    # You can also use append=True
    # Further reference: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
    with client.write(hdfs_path) as writer:
        writer.write(content)
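
If "incremental load" means copying only the files that are not yet in HDFS, one possible approach is to compare the FTP directory listing with the HDFS directory listing and transfer only the difference. The sketch below is just that, a sketch: it assumes the target HDFS directory already exists, that file names are unique and never change once uploaded, and that the FTP directory contains only plain files. The function names select_new_files and incremental_load are made up for this example. The ftp argument is expected to be a connected ftplib.FTP object and client an HdfsCLI client like the one above.

```python
def select_new_files(remote_names, already_loaded):
    """Return the FTP file names that are not yet present in HDFS, sorted."""
    return sorted(set(remote_names) - set(already_loaded))


def incremental_load(ftp, client, ftp_dir, hdfs_dir):
    """Copy files that exist in ftp_dir but not in hdfs_dir; return their names."""
    ftp.cwd(ftp_dir)
    remote_names = ftp.nlst()               # names in the current FTP directory
    already_loaded = client.list(hdfs_dir)  # names already in the HDFS directory
    new_files = select_new_files(remote_names, already_loaded)
    for name in new_files:
        chunks = []
        # retrbinary calls chunks.append for every block it downloads
        ftp.retrbinary('RETR ' + name, chunks.append)
        # Holds the whole file in memory, like the script above
        client.write(hdfs_dir.rstrip('/') + '/' + name, data=b''.join(chunks))
    return new_files


# Possible usage (host, credentials and paths are placeholders):
# from ftplib import FTP
# from hdfs import InsecureClient
# ftp = FTP('your ftp host')
# ftp.login('your ftp user', 'your ftp password')
# client = InsecureClient('http://namenode:9870', user='your user name')
# incremental_load(ftp, client, '/outgoing', '/landing')
```

Running this on a schedule would pick up only the files added since the previous run. Files that change in place on the FTP server would need a different check, for example comparing sizes or modification times.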