Support Questions

GrazittiAPI · 04-20-2022

I typically upload CSV files into Cloudera Data Science Workbench (CDSW), but I wonder whether there is a way to read a CSV file programmatically from a shared server drive while running in YARN mode. With the code below, I get an error. Any tips?

df = spark.read.format('csv').load('Q:\\project\\data_folder\\file.csv', header=True)

The error I get is:

IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: Q:%5Cproject%5Cdata_folder/file.csv'

ask_bill_brooks · 04-20-2022

Hi @Data1701

According to the Java API documentation, a java.net.URISyntaxException is thrown when a string that was passed in could not be parsed as a URI reference.

The file you are attempting to read in might very well be available on your local area network from a shared server drive, but it isn't available via a valid URI, or at the very least, the URI you are referencing in your Spark code isn't a valid and accessible URI.
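To illustrate why the string is rejected, here is a minimal sketch that uses Python's urllib.parse as a stand-in. This is an assumption for illustration only: it is not the Java parser that Spark and Hadoop use, but it follows the same generic URI rule that the text before the first colon is treated as the scheme.

```python
from urllib.parse import urlsplit

# The Windows path from the question, written as a Python string literal.
windows_path = 'Q:\\project\\data_folder\\file.csv'

parts = urlsplit(windows_path)

# The drive letter is picked up as the URI scheme (where "https" or "hdfs" would normally go).
print('scheme:', parts.scheme)

# What is left does not start with '/', so it is a relative path.
# A URI that has a scheme but a relative path is what the exception is complaining about.
print('path:', parts.path)
print('path is absolute:', parts.path.startswith('/'))
```

In other words, the drive letter looks like a scheme, and the backslash-separated remainder looks like a relative path, which matches the "Relative path in absolute URI" wording in the error.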

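What your problem boils down to is that the file isn't available via a web server or any other location the cluster can reach, so the server that is running your Spark code can't retrieve it at the time your code executes. A drive letter such as Q: is a mapping that exists on your workstation, so the cluster nodes most likely have no such drive, and the string is interpreted as a URI rather than as a Windows path. That should also shed light on why you previously had to upload your .csv files into CDSW: uploading them was the way to ensure they could be found at runtime, since they were then in a well-known, accessible location.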

There are several valid approaches to addressing this, but the easiest solution, if you want to continue to use the code snippet you've written and shared here, is to place the file on some server that is accessible over the web (preferably via HTTPS) and refer to it using a fully-qualified URL. In order to do that, a functioning and secured web server will have to be available to you (you could set this up on your local workstation).

Let's assume you place the file on a web-accessible server somewhere local to your corporate network and the web-accessible directory path you place the file in turns out to be something like Data1701/project/data_folder/. Then you can change the assignment statement in your Spark code to this:


df = spark.read.format('csv').load('https://web.dept.yourcompany.com/Data1701/project/data_folder/file.csv', header=True)

…and the rest of your code should work, unchanged.
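If loading the URL directly does not work in your environment (whether spark.read can resolve https:// paths depends on the filesystem support in your Spark and Hadoop build), one fallback is to download the file on the driver and hand the rows to Spark. The following is only a sketch under stated assumptions: the helper names fetch_csv_rows and csv_url_to_dataframe are hypothetical, the file is assumed to be small enough to fit in driver memory, every row is assumed to have the same number of fields as the header, and all columns come back as strings.

```python
import csv
import io
import urllib.request


def fetch_csv_rows(url, encoding='utf-8'):
    """Download a CSV from a URL on the driver and return (header, rows)."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                          # first line holds the column names
    rows = [tuple(row) for row in reader if row]   # skip completely empty lines
    return header, rows


def csv_url_to_dataframe(spark, url):
    """Build a Spark DataFrame of string columns from a CSV fetched by URL.

    `spark` is an existing SparkSession; the data passes through the driver,
    so this is only suitable for small files.
    """
    header, rows = fetch_csv_rows(url)
    return spark.createDataFrame(rows, schema=header)
```

With the example URL above, the call would look like df = csv_url_to_dataframe(spark, 'https://web.dept.yourcompany.com/Data1701/project/data_folder/file.csv'). Because the values arrive as strings, you may want to cast columns afterwards.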

Bill Brooks, Community Moderator