
Interacting with Hadoop HDFS using Python code

This post will go through the following:

  1. Introducing the Python “subprocess” module
  2. Running HDFS commands with Python
  3. Examples of HDFS commands from Python

1-Introducing the Python “subprocess” module

The Python “subprocess” module allows us to:

  • spawn new Unix processes
  • connect to their input/output/error pipes
  • obtain their return codes

To run UNIX commands, we create a subprocess that runs the command. The recommended approach is to use the module's convenience functions for all the use cases they can handle. For more control, the underlying Popen interface can be used directly.
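As a rough illustration of the two approaches, the sketch below runs the same command once through a convenience function (check_output) and once through Popen. The child command is an assumption chosen only so the example runs anywhere Python is installed; it is not an HDFS command.

```python
import subprocess
import sys

# A harmless child command used only for illustration: it prints one line.
# sys.executable is the Python interpreter that is running this script.
cmd = [sys.executable, '-c', 'print("hello from child")']

# Convenience function: waits for the command and returns its stdout.
# It raises subprocess.CalledProcessError if the exit code is non-zero.
out_simple = subprocess.check_output(cmd, universal_newlines=True)

# Popen interface: more control, here capturing stdout and stderr separately
# and reading the return code ourselves.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
out_popen, err_popen = proc.communicate()

print(out_simple.strip(), proc.returncode)
```

The Popen form is the one used in the rest of this post because it returns the exit code, stdout and stderr together without raising an exception.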

2-Running HDFS commands with Python

We will create a Python function called run_cmd that lets us run any Unix or Linux command, or in our case any hdfs dfs command, as a subprocess while capturing its stdout and stderr. The command is passed as a Python list of arguments rather than as a single string, so you don't have to parse or escape characters.

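# import the python subprocess module
import subprocess


def run_cmd(args_list):
    """
    Run a Linux command given as a list of arguments.

    Returns a tuple (return code, stdout, stderr).
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    # universal_newlines=True returns stdout/stderr as text instead of bytes,
    # so the output can be split on '\n' under Python 3 as well as 2.7.
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err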
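To see why a list of arguments needs no escaping, here is a small sketch. The path is hypothetical and deliberately contains a space and a semicolon. The child process is a stand-in for hdfs that simply echoes back the arguments it received.

```python
import subprocess
import sys

# Hypothetical path containing a space and a shell metacharacter.
tricky_path = '/data/my report; v2.txt'

# Stand-in for the real command: prints each argument it received on its own line.
child = 'import sys; print("\\n".join(sys.argv[1:]))'

proc = subprocess.Popen([sys.executable, '-c', child, '-ls', tricky_path],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
out, err = proc.communicate()

# Each list element arrives as exactly one argument, with no quoting needed.
received = out.splitlines()
print(received)
```

Because no shell is involved, the space and the semicolon are passed through unchanged as part of a single argument.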

3-Examples of HDFS commands from Python

Run Hadoop ls command in Python
ret, out, err = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.splitlines()
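If you need more than raw lines, the listing can be turned into structured records. The sketch below assumes the usual eight-column layout printed by hdfs dfs -ls (permissions, replication, owner, group, size, date, time, path) and plain byte sizes (no -h option). The function name parse_ls_output and the sample text are made up for illustration.

```python
def parse_ls_output(out):
    """Parse the text printed by `hdfs dfs -ls` into a list of dicts."""
    entries = []
    for line in out.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Skip the "Found N items" header and any line that is not a
        # full 8-column listing line.
        if len(fields) != 8:
            continue
        perms, repl, owner, group, size, date, time, path = fields
        entries.append({
            'is_dir': perms.startswith('d'),
            'permissions': perms,
            'owner': owner,
            'group': group,
            'size': int(size),
            'modified': date + ' ' + time,
            'path': path,
        })
    return entries


# Illustrative sample only, not output captured from a real cluster.
sample = (
    'Found 2 items\n'
    'drwxr-xr-x   - hdfs supergroup          0 2017-03-31 19:42 /tmp/logs\n'
    '-rw-r--r--   3 hdfs supergroup       1234 2017-03-31 19:40 /tmp/data.csv\n'
)
entries = parse_ls_output(sample)
for entry in entries:
    print(entry['path'], entry['size'], entry['is_dir'])
```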


Run Hadoop get command in Python
ret, out, err = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])


Run Hadoop put command in Python
ret, out, err = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])


Run Hadoop copyFromLocal command in Python
ret, out, err = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])


Run Hadoop copyToLocal command in Python
ret, out, err = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])


Run Hadoop remove file command in Python
The -skipTrash option removes the file permanently instead of moving it to the trash:
hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently

ret, out, err = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
ret, out, err = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])


Run Hadoop rm -r command in Python
This HDFS command removes an entire directory and all of its contents from HDFS.
Usage: hdfs dfs -rm -r <path>

ret, out, err = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
ret, out, err = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])
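Since a recursive remove with -skipTrash cannot be undone, it can help to build the argument list through a small helper that rejects obviously dangerous paths. The helper below is a hypothetical sketch: the name build_rm_args and the guard (absolute path, not the root directory) are assumptions for illustration, and the guard would need adjusting if you pass full hdfs:// URIs.

```python
def build_rm_args(hdfs_path, recursive=False, skip_trash=False):
    """Return the argument list for `hdfs dfs -rm`, refusing risky paths."""
    # Guard chosen for this sketch only: require an absolute path that is
    # not the root directory.
    if not hdfs_path.startswith('/') or hdfs_path.strip('/') == '':
        raise ValueError('refusing to remove path: %r' % hdfs_path)
    args = ['hdfs', 'dfs', '-rm']
    if recursive:
        args.append('-r')
    if skip_trash:
        args.append('-skipTrash')
    args.append(hdfs_path)
    return args


# The resulting list can be handed to a runner such as run_cmd.
print(build_rm_args('/tmp/old_dir', recursive=True, skip_trash=True))
```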




Check if a file exists in HDFS
Usage: hadoop fs -test -[defsz] URI

Options:

-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-z: if the file is zero length, return 0.

Example:

hadoop fs -test -e filename

hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
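The same check can be wrapped in a small helper that returns a boolean. This is a sketch under assumptions: the name hdfs_test and the optional runner parameter are hypothetical, and the stand-in runner below exists only so the example can be tried on a machine without an hdfs client.

```python
import subprocess


def hdfs_test(flag, hdfs_path, runner=None):
    """Return True if `hdfs dfs -test -<flag> <path>` exits with code 0."""
    if flag not in ('d', 'e', 'f', 's', 'z'):
        raise ValueError('unknown -test flag: %r' % flag)
    cmd = ['hdfs', 'dfs', '-test', '-' + flag, hdfs_path]
    if runner is None:
        # Default: really invoke the hdfs client (it must be on the PATH).
        runner = subprocess.call
    return runner(cmd) == 0


# Stand-in runner for illustration: pretends that only /tmp exists.
def fake_runner(cmd):
    return 0 if cmd[-1] == '/tmp' else 1


print(hdfs_test('e', '/tmp', runner=fake_runner))
print(hdfs_test('e', '/tmpo', runner=fake_runner))
```

On a real cluster you would call hdfs_test('e', path) without the runner argument.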

These simple but powerful lines of code let you interact with HDFS programmatically, and they can easily be scheduled as part of cron jobs.
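For a scheduled job it is useful to stop at the first failing command and pass its exit code on, so the scheduler can detect the failure. The sketch below is one possible shape. The function name run_steps, the example paths, and the stand-in runner are all hypothetical.

```python
import subprocess
import sys


def run_steps(steps, runner=subprocess.call):
    """Run each command in order; stop at the first failure.

    Returns 0 if every command succeeded, otherwise the failing exit code.
    """
    for cmd in steps:
        code = runner(cmd)
        if code != 0:
            sys.stderr.write('command failed (%d): %s\n' % (code, ' '.join(cmd)))
            return code
    return 0


# Hypothetical job: check a directory, then upload a file into it.
steps = [
    ['hdfs', 'dfs', '-test', '-d', '/data/incoming'],
    ['hdfs', 'dfs', '-put', 'local_file', '/data/incoming'],
]

# Stand-in runner so the sketch runs without a cluster: every command "succeeds".
def always_ok(cmd):
    return 0

exit_code = run_steps(steps, runner=always_ok)
print(exit_code)
# In a real script scheduled by cron you would end with:
#     sys.exit(run_steps(steps))
```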

Comments
New Contributor

In case you want structured results from HDFS commands or want to further reduce latency and overhead, also have a look at "snakebite", which is a pure Python implementation of HDFS client functionality:

https://github.com/spotify/snakebite

https://community.hortonworks.com/articles/26416/how-to-install-snakebite-in-hdp.html

Contributor

Thanks for the comment, Michael. I wrote these commands for HDP environments running standard Python 2.7 where we cannot pip install snakebite (i.e. the HDP clusters sit behind a firewall in a secure zone where pip downloads are not allowed).

New Contributor

This was an excellent and very useful article on interacting with Hadoop HDFS using Python. Thank you so much for gathering all this information in one post with examples; it will be extremely helpful for everyone.

Last update:
‎03-31-2017 07:42 PM