spark-submit --packages com:databricks:spark-csv_2.10:1.2.0 task.py
I got an error that the class could not be found. My question is: how can I write a CSV file using DataFrames, along the lines of the sketch below?
Any suggestions on how I can solve the problem of writing a CSV or TAB-delimited file in the certification exam? I am pretty sure I failed it, since I could not write the file.
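For reference, this is roughly the kind of read/write I was attempting (a minimal sketch with placeholder paths and options, not my exact exam code):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-write-sketch")
sqlContext = SQLContext(sc)

# Load a CSV using the spark-csv package (Spark 1.6-era API).
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("file:///tmp/input.csv")

# Write it back out; a delimiter of "\t" gives a TAB-separated file.
df.write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("delimiter", "\t") \
    .save("file:///tmp/output")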
Replace the colon between com and databricks with a dot.
It works well with Spark 1.6.3:
/opt/spark/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 test.py
Where test.py is the following:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Python")
sqlContext = SQLContext(sc)
q = sqlContext.read.format("com.databricks.spark.csv").load("file:///tmp/ls.txt")
q.write.format("com.databricks.spark.csv").save("file:///tmp/ls2.txt")
16/12/21 18:51:16 INFO HadoopRDD: Input split: file:/tmp/ls.txt:0+640
16/12/21 18:51:16 INFO HadoopRDD: Input split: file:/tmp/ls.txt:640+641
16/12/21 18:51:16 INFO FileOutputCommitter: Saved output of task 'attempt_201612211851_0001_m_000000_1' to file:/tmp/ls2.txt/_temporary/0/task_201612211851_0001_m_000000
16/12/21 18:51:16 INFO SparkHadoopMapRedUtil: attempt_201612211851_0001_m_000000_1: Committed
16/12/21 18:51:16 INFO FileOutputCommitter: Saved output of task 'attempt_201612211851_0001_m_000001_2' to file:/tmp/ls2.txt/_temporary/0/task_201612211851_0001_m_000001
16/12/21 18:51:16 INFO SparkHadoopMapRedUtil: attempt_201612211851_0001_m_000001_2: Committed
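Note that save() writes a directory of part files (file:/tmp/ls2.txt above is a directory, as the _temporary paths in the log show). A quick sanity check is to read the result back with the same sqlContext, for example:

# Read the written directory back to verify the round trip.
q2 = sqlContext.read.format("com.databricks.spark.csv").load("file:///tmp/ls2.txt")
q2.show(5)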
It also depends on your configuration (are you running it locally or on YARN?). Please post the exact exception, your spark-defaults.conf, and your spark-env.sh.
If you have a connectivity issue, you can also try downloading the required jar files manually and using the --jars option of spark-submit:
/opt/spark/bin/spark-submit --jars /tmp/spark-csv_2.10-1.2.0.jar,/tmp/commons-csv-1.1.jar test.py
Where the two jar files are downloaded from the Maven Central repository:
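For example (assuming the standard Maven Central directory layout for these coordinates):

wget -P /tmp https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar
wget -P /tmp https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar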
Unfortunately, I can't post the conf file content or the .sh file content, as the error occurred during the certification exam.
Thank you, much appreciated. But the issue here is how I can handle the problem in the certification exam, since the usual way is not working. I tried to communicate with the certification officials about the issue, but in vain. I have paid the exam fee to take it again next week, but if writing the file does not work again, I don't know how I will handle it then.
I am able to get the same to work in the 3 environments I am working on...
Sorry, I had a typo here, but I did try with a "." rather than a ":" in the exam and still ended up getting a class-not-found error. I have been pretty confused since then; any help is greatly appreciated.
What is your Spark version?
It was 1.6.3, as far as I remember, when I took the exam...
I have the Hortonworks Sandbox with Spark version 1.6.0, and the same works flawlessly. I am clueless as to why it would not work in the certification exam... Thanks for taking the time to answer.
I think the error I got was almost the same as the one below.
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.2.0: not found]