Created on 03-31-2017 07:42 PM
This post will go through how to interact with Hadoop HDFS from Python. The Python “subprocess” module allows us to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
To run a UNIX command we need to create a subprocess that runs it. The recommended approach is to use the module's convenience functions for all use cases they can handle; for anything more involved, the underlying Popen interface can be used directly.
We will create a Python function called run_cmd that lets us run any Unix or Linux command, or in our case hdfs dfs commands, as a pipeline, capturing stdout and stderr. The command and its arguments are passed as a Python list rather than a single string, so there is no need to parse or escape shell characters.
```python
# import the Python subprocess module
import subprocess

def run_cmd(args_list):
    """Run a Linux command and return (returncode, stdout, stderr)."""
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err
```
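Note that on Python 3, communicate() returns bytes, so the output usually needs decoding before splitting into lines. A quick sanity check with an ordinary shell command (assuming a POSIX environment with echo on the PATH; the helper is the same as above, minus the log line):

```python
import subprocess

def run_cmd(args_list):
    """Run a command and return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    return proc.returncode, s_output, s_err

ret, out, err = run_cmd(['echo', 'hello hdfs'])
print(ret)                    # 0 on success
print(out.decode().strip())   # hello hdfs
```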
Run Hadoop ls command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.split('\n')
```

Run Hadoop get command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])
```

Run Hadoop put command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])
```

Run Hadoop copyFromLocal command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])
```

Run Hadoop copyToLocal command in Python:

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])
```

Run Hadoop remove file command in Python. To bypass the trash, add -skipTrash (hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently):

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])
```

rm -r removes the entire directory and all of its content from HDFS. Usage: hdfs dfs -rm -r <path>

```python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])
```

Check if a file exists in HDFS. Usage: hadoop fs -test -[defsz] URI

Options:
-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-z: if the file is zero length, return 0.

Example: hadoop fs -test -e filename

```python
hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
```
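If you would rather fail fast than inspect the return code at every call site, a thin wrapper can raise on a non-zero exit. This is a sketch, not part of the original post: the name run_cmd_checked is hypothetical, and the helper is the same run_cmd minus the log line.

```python
import subprocess

def run_cmd(args_list):
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    return proc.returncode, s_output, s_err

def run_cmd_checked(args_list):
    """Run a command; raise RuntimeError (including stderr) if it fails."""
    ret, out, err = run_cmd(args_list)
    if ret != 0:
        raise RuntimeError('command {0} failed: {1}'.format(
            ' '.join(args_list), err.decode(errors='replace')))
    return out

# e.g. out = run_cmd_checked(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
```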
These simple but very powerful lines of code let us interact with HDFS programmatically and can easily be scheduled as cron jobs.
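As one way of wiring these calls into a scheduled job, the sketch below tests for a file and copies it locally only if it exists. The paths, the function name sync_report, and the dry_run switch are all hypothetical illustrations, not part of the original article; dry-run mode just returns the commands so the logic can be exercised without an HDFS client.

```python
import subprocess

def run_cmd(args_list):
    """Same helper as above: run a command, return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

HDFS_PATH = '/data/incoming/report.csv'   # hypothetical HDFS path
LOCAL_PATH = '/tmp/report.csv'            # hypothetical local path

def sync_report(dry_run=True):
    """Copy HDFS_PATH to LOCAL_PATH if it exists; in dry-run, just return the commands."""
    test_cmd = ['hdfs', 'dfs', '-test', '-e', HDFS_PATH]
    get_cmd = ['hdfs', 'dfs', '-copyToLocal', HDFS_PATH, LOCAL_PATH]
    if dry_run:
        return [test_cmd, get_cmd]
    ret, _, _ = run_cmd(test_cmd)
    if ret == 0:  # -test -e returns 0 when the path exists
        run_cmd(get_cmd)

for cmd in sync_report(dry_run=True):
    print('would run:', ' '.join(cmd))
```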
Created on 04-03-2017 01:41 PM
In case you want to leverage structured results from HDFS commands or further reduce latency / overhead, also have a look at "snakebite", which is a pure python implementation of HDFS client functionality:
https://github.com/spotify/snakebite
https://community.hortonworks.com/articles/26416/how-to-install-snakebite-in-hdp.html
Created on 04-03-2017 07:09 PM
Thanks for the comment, Michael. I wrote these commands for HDP environments using standard Python 2.7, where we cannot do a pip install of snakebite (i.e. HDP clusters are behind the firewall in a secure zone with no pip downloads allowed).
Created on 01-30-2018 10:21 AM
This was an excellent article on interacting with Hadoop HDFS using Python. Thank you so much for gathering all this information in one post with examples; it will be extremely helpful.
Created on 12-17-2019 07:54 AM
Hi All.
Here are all the steps for doing the same.
Link:
https://www.oreilly.com/library/view/hadoop-with-python/9781492048435/ch01.html
Thanks
HadoopHelp
Created on 05-26-2020 08:14 PM
Hi,
Thank you for this great article and the code snippets; they are really useful. I just face one problem sometimes while executing the commands: it gives OSError: argument list too long. Can you or someone please help me resolve this problem? My run_cmd call looks as follows:
run_cmd(['/usr/bin/hdfs', 'dfs', '-copyToLocal', 'hdfs://namenode:port/path/to/file', 'localPath'])
thanks for any help in advance.
Regards,
Wasif
Created on 05-26-2020 10:57 PM
As this is an old article, you would have a better chance of receiving a useful response by starting a new thread. This will also provide you with the opportunity to provide details specific to your issue that could aid others in providing a more tailored answer to your question.
Regards,
Vidya
Created on 05-26-2020 11:52 PM - last edited on 05-27-2020 12:31 AM by VidyaSargur
@VidyaSargur Thank you for the response and the suggestion; I will create a new thread for my problem.
Edit:
I have created my new question here.
Thanks and regards,
Wasif