Read/write HDFS files with a standalone Python script
Labels: Apache Oozie, Apache Spark, Cloudera Hue, HDFS
Created on 09-13-2017 04:26 AM - edited 09-16-2022 05:14 AM
Hello,
I have some standalone Python files that access data with the usual pattern:

    with open("filename") as f:
        for line in f:
            ...
I want to make the Python scripts runnable without changing too much of the code and, if possible, without extra dependencies. Right now I start the files as Spark programs from a workflow in Hue.
Are there built-in packages I can use? I tried to import pydoop and hdfs, but they are not installed.
My goal is to get these scripts to run and be able to read/write files on HDFS.
Thanks for the help.
Created 09-18-2017 06:35 AM
Hello Creaping,
As HDFS is not a standard Unix filesystem, you cannot read it with Python's native file I/O. Since HDFS is open source, there are plenty of connectors out there, and you can also access HDFS through the HttpFS/WebHDFS REST interface.
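For example, here is a rough, untested sketch of reading and writing over that REST interface with nothing but the requests library; the host, port, paths and user name are placeholders you would have to adapt to your cluster (HttpFS usually listens on port 14000, WebHDFS on the NameNode's HTTP port):

```python
import requests

# Placeholders -- adapt to your cluster and paths.
BASE = "http://httpfs-host.example.com:14000/webhdfs/v1"
USER = "myuser"

# Read a file: OPEN streams the file content back (following a redirect).
r = requests.get(BASE + "/user/myuser/input.txt",
                 params={"op": "OPEN", "user.name": USER})
r.raise_for_status()
for line in r.text.splitlines():
    print(line)

# Write a file: CREATE is a two-step PUT -- the first request returns a
# redirect Location, the second request sends the actual data there.
r1 = requests.put(BASE + "/user/myuser/output.txt",
                  params={"op": "CREATE", "user.name": USER, "overwrite": "true"},
                  allow_redirects=False)
r2 = requests.put(r1.headers["Location"], data="hello from a standalone script\n")
r2.raise_for_status()
```

As far as I know, the hdfs package you tried to import is essentially a convenience wrapper around this same WebHDFS API.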
If you need to process a large amount of data, none of that will be suitable, because the script itself still runs on a single machine. To solve that, you can rewrite your script with PySpark and use the Spark-provided utilities to manipulate the data. That will solve both HDFS access and the distribution of the workload for you.
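A minimal sketch of what that could look like; the paths and the per-line logic are just placeholders standing in for whatever your script currently does:

```python
from pyspark.sql import SparkSession

# Placeholder application name and HDFS paths -- adapt to your data.
spark = SparkSession.builder.appName("hdfs-read-write").getOrCreate()
sc = spark.sparkContext

# Read lines from HDFS as a distributed RDD instead of a local file handle.
lines = sc.textFile("hdfs:///user/myuser/input.txt")

# Whatever per-line processing the script did locally goes here.
processed = lines.map(lambda line: line.strip().upper())

# Write the result back to HDFS (Spark writes a directory of part files).
processed.saveAsTextFile("hdfs:///user/myuser/output")

spark.stop()
```

Submitted as a Spark action from your existing workflow in Hue, this replaces the local open() loop with a distributed read and write.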
Zsolt
Created 09-19-2017 06:11 AM
Hello Zsolt,
Thanks for the reply. The problem is that I don't have permission to install Python packages like pydoop.
I wasn't sure whether there is a native way, but I will ask the sysadmin to install some packages.
