Interacting with Hadoop HDFS using Python code

This post will go through the following:

  1. Introducing the Python “subprocess” module
  2. Running HDFS commands with Python
  3. Examples of HDFS commands from Python

1-Introducing the Python “subprocess” module

The Python “subprocess” module allows us to:

  • spawn new Unix processes
  • connect to their input/output/error pipes
  • obtain their return codes

To run UNIX commands we need to create a subprocess that runs the command. The recommended approach to invoking subprocesses is to use the convenience functions for all use cases they can handle. Alternatively, the underlying Popen interface can be used directly.
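
As a minimal sketch of the two approaches, the snippet below runs the same command once through a convenience function (subprocess.check_output) and once through Popen. The command itself, a Python one-liner started via sys.executable, is only a stand-in chosen so that the example runs without a Hadoop client; it is not part of the original article.

```python
import subprocess
import sys

# Stand-in command (an assumption for illustration): a Python one-liner
# that prints a word, so no Hadoop client is needed to try this out.
cmd = [sys.executable, '-c', 'print("hello")']

# Convenience function: returns stdout and raises CalledProcessError
# if the command exits with a non-zero return code.
convenience_out = subprocess.check_output(cmd, universal_newlines=True)

# Popen interface: more verbose, but exposes stdout, stderr and the
# return code separately.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
popen_out, popen_err = proc.communicate()

print(convenience_out.strip(), popen_out.strip(), proc.returncode)
```

The Popen form is the one used in the rest of this post, because an HDFS command that fails should hand back its return code and stderr instead of raising an exception.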

2-Running HDFS commands with Python

We will create a Python function called run_cmd that lets us run any Unix or Linux command, or in our case hdfs dfs commands, as a subprocess while capturing its stdout and stderr. The command is passed as a Python list of arguments rather than as a single string, so you don't have to parse or escape characters.

# import the python subprocess module
import subprocess


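def run_cmd(args_list):
    """
    Run a Linux command given as a list of arguments.
    Returns a tuple (return_code, stdout, stderr).
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    # universal_newlines=True returns stdout/stderr as text (str) on both
    # Python 2.7 and Python 3, so the output can be split into lines directly.
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err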
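
To illustrate why the list form avoids escaping problems, here is a small sketch that passes an argument containing spaces, quotes and a dollar sign. It uses a trimmed copy of run_cmd (with universal_newlines=True so that output comes back as text, which is an assumption of this sketch) and a stand-in Python one-liner that simply echoes its first argument; the file name is made up for illustration.

```python
import subprocess
import sys


def run_cmd(args_list):
    """Run a command given as a list; return (return_code, stdout, stderr)."""
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    s_output, s_err = proc.communicate()
    return proc.returncode, s_output, s_err


# Hypothetical file name with characters a shell would normally interpret.
tricky_name = "my file (v2) $HOME 'quoted'.txt"

# Stand-in for an HDFS command: a one-liner that prints its first argument.
echo_cmd = [sys.executable, '-c', 'import sys; print(sys.argv[1])', tricky_name]

ret, out, err = run_cmd(echo_cmd)
print(ret, out.rstrip('\n') == tricky_name)
```

Because no shell is involved, each list element reaches the child process as exactly one argument, unchanged.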

3-Examples of HDFS commands from Python

Run Hadoop ls command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.splitlines()
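
The listing lines can then be reduced to just the paths. The helper below is a sketch only: parse_ls_output is a hypothetical name, it assumes the output is text (str) in the usual eight-column layout of hdfs dfs -ls (permissions, replication, owner, group, size, date, time, path), and the sample listing is invented for illustration.

```python
def parse_ls_output(ls_text):
    """Return the path column from the text output of `hdfs dfs -ls`."""
    paths = []
    for line in ls_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "Found N items" header.
        if not line or line.startswith('Found '):
            continue
        # Split into at most 8 fields so that a path containing spaces
        # stays in one piece as the last field.
        fields = line.split(None, 7)
        if len(fields) == 8:
            paths.append(fields[7])
    return paths


# Invented sample output, used here instead of a live cluster.
sample = (
    "Found 2 items\n"
    "drwxr-xr-x   - hdfs hdfs          0 2019-01-01 10:00 /data/raw\n"
    "-rw-r--r--   3 hdfs hdfs       1024 2019-01-01 10:05 /data/readme.txt\n"
)
print(parse_ls_output(sample))
```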


Run Hadoop get command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])


Run Hadoop put command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])


Run Hadoop copyFromLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])


Run Hadoop copyToLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])
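
None of the calls above check the return code, so a failed transfer goes unnoticed. One possible way to handle that is sketched below. hdfs_put is a hypothetical helper, and the runner parameter (any function with the same interface as run_cmd) is an assumption added so the sketch can be tried with fake runners instead of a real cluster.

```python
def hdfs_put(local_file, hdfs_path, runner):
    """Upload local_file to hdfs_path; raise RuntimeError if the command fails."""
    ret, out, err = runner(['hdfs', 'dfs', '-put', local_file, hdfs_path])
    if ret != 0:
        raise RuntimeError('hdfs dfs -put failed with code {0}: {1}'.format(ret, err))
    return out


# Fake runners standing in for run_cmd (illustration only).
def fake_ok(args_list):
    return 0, '', ''


def fake_fail(args_list):
    return 1, '', 'simulated failure'


hdfs_put('local_file', 'hdfs_file_path', fake_ok)
try:
    hdfs_put('local_file', 'hdfs_file_path', fake_fail)
except RuntimeError as exc:
    print('Upload failed: {0}'.format(exc))
```

In real use, run_cmd would be passed as the runner.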


Run Hadoop remove file command in Python
Usage: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently
# remove the file (it goes to the trash if trash is enabled)
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
# remove the file permanently, bypassing the trash
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])


Run Hadoop remove directory (rm -r) command in Python
HDFS command to remove an entire directory and all of its content from HDFS.
Usage: hdfs dfs -rm -r <path>

(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])




Check if a file exists in HDFS
Usage: hadoop fs -test -[defsz] URI

Options:

-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-z: if the file is zero length, return 0.

Example:

hadoop fs -test -e filename


hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
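
The same check can be wrapped in a small helper that returns a boolean for any of the -test flags listed above. This is a sketch under assumptions: hdfs_test and make_fake_runner are hypothetical names, the runner parameter stands for any function with the interface of run_cmd, and the fake runner only imitates the -e flag so that the example runs without a cluster.

```python
VALID_TEST_FLAGS = 'defsz'


def hdfs_test(path, flag, runner):
    """Return True if `hdfs dfs -test -<flag> <path>` exits with code 0."""
    if len(flag) != 1 or flag not in VALID_TEST_FLAGS:
        raise ValueError('flag must be one of: {0}'.format(', '.join(VALID_TEST_FLAGS)))
    ret, _out, _err = runner(['hdfs', 'dfs', '-test', '-' + flag, path])
    return ret == 0


def make_fake_runner(existing_paths):
    """Build a stand-in for run_cmd that returns 0 only for known paths."""
    def fake_runner(args_list):
        path = args_list[-1]
        return (0 if path in existing_paths else 1), '', ''
    return fake_runner


runner = make_fake_runner({'/tmp'})
print(hdfs_test('/tmp', 'e', runner))
print(hdfs_test('/tmpo', 'e', runner))
```

Note that a non-zero return code only says the test did not succeed; other failures, such as an unreachable cluster, also produce a non-zero code, so stderr is worth inspecting when the result is unexpected.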

These simple but very powerful lines of code allow us to interact with HDFS programmatically, and they can easily be scheduled as cron jobs.

Comments
Contributor

In case you want to leverage structured results from HDFS commands or further reduce latency and overhead, also have a look at "snakebite", which is a pure Python implementation of HDFS client functionality:

https://github.com/spotify/snakebite

https://community.hortonworks.com/articles/26416/how-to-install-snakebite-in-hdp.html

Rising Star

Thanks for the comment, Michael. I wrote these commands for HDP environments using standard Python 2.7 where we cannot do a pip install of snakebite (i.e. HDP clusters are behind the firewall in a secure zone with no pip download allowed).

Explorer

This was an excellent and very useful article on interacting with Hadoop HDFS using Python. Thank you so much for gathering all this information in one post with examples; it will be extremely helpful for everyone.

Contributor

Hi All,

Here are all the steps for doing the same:

Link:
https://www.oreilly.com/library/view/hadoop-with-python/9781492048435/ch01.html

Thanks

HadoopHelp

New Contributor

Hi,

Thank you for this great article and the code snippets. They are really useful. I sometimes face one problem while executing the commands: it gives OSError: arguments list too long. Can you or someone else please help me resolve this problem? My run_cmd call looks as follows:

run_cmd(['/usr/bin/hdfs', 'dfs', '-copyToLocal', 'hdfs://namenode:port/path/to/file', 'localPath'])

Thanks in advance for any help.

Regards,

Wasif

Community Manager

@wasiftanveer ,

As this is an old article, you would have a better chance of receiving a useful response by starting a new thread. This will also provide you with the opportunity to provide details specific to your issue that could aid others in providing a more tailored answer to your question.

Regards,

Vidya

New Contributor

@VidyaSargur Thank you for the response and the suggestion. I will create a new thread for my problem.

Edit:

I have created my new question here

Thanks and regards,

Wasif