Spark com.databricks.spark.csv doesn't work

Master Collaborator

Hi:

I am trying to use the com.databricks.spark.csv class, but it doesn't work. I am behind a proxy, so how can I download it? I tried:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar

It also doesn't work like this:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --packages com.databricks:spark-csv_2.11-1.4.0

So, any suggestions?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at scala.util.Try.orElse(Try.scala:82)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
        ... 14 more


>>> :: resolution report :: resolve 252496ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------


:: problems summary ::
:::: WARNINGS
                module not found: com.databricks#spark-csv_2.10;1.3.0

        ==== local-m2-cache: tried

          file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar

        ==== local-ivy-cache: tried

          /root/.ivy2/local/com.databricks/spark-csv_2.10/1.3.0/ivys/ivy.xml

        ==== central: tried

          https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar

        ==== spark-packages: tried

          http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: com.databricks#spark-csv_2.10;1.3.0: not found

                ::::::::::::::::::::::::::::::::::::::::::::::

:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom (java.net.ConnectException: Connection timed out)

        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.3.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1068)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


1 ACCEPTED SOLUTION

Master Collaborator

Hi:

I resolved it like this:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar
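
A note on the flag syntax: spark-submit's help text describes `--jars` as a comma-separated list, and a repeated `--jars` flag may replace the earlier value rather than add to it, so it is probably safer to pass all three jars in a single flag. A minimal sketch that builds such a command line from Python follows. The helper name `build_pyspark_command` is made up for illustration and is not part of Spark; the jar paths are the ones from the command above.

```python
import shlex


def build_pyspark_command(jars, master="yarn", deploy_mode="client"):
    """Return a pyspark argv with every jar in a single --jars flag.

    Hypothetical helper for illustration, not part of Spark.
    """
    if not jars:
        raise ValueError("at least one jar is required")
    return [
        "pyspark",
        "--master", master,
        "--deploy-mode", deploy_mode,
        # --jars takes one comma-separated value, with no spaces around the commas.
        "--jars", ",".join(jars),
    ]


jars = [
    "./spark-csv_2.11-1.4.0.jar",
    "./commons-csv-1.4.jar",
    "./univocity-parsers-2.2.1.jar",
]
command = build_pyspark_command(jars)

# Print a copy-pasteable shell command.
print(" ".join(shlex.quote(arg) for arg in command))
```

The executor options from the original command (`--num-executors`, `--executor-cores`, `--executor-memory`) can be appended to the list unchanged.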

5 REPLIES

@Roberto Sancho: From the trace, it looks like the connection is timing out. Can you check?

Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)

Master Collaborator

Hi, I am behind a proxy. Is there any special configuration needed in the Spark files?
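
One approach that is commonly suggested for this situation: `--packages` is resolved by Ivy inside the JVM that spark-submit starts, so in client mode the standard Java proxy system properties (`http.proxyHost`, `http.proxyPort`, `https.proxyHost`, `https.proxyPort`) can be handed to it through `spark.driver.extraJavaOptions`. The sketch below assembles such a command. It assumes a proxy at `proxy.example.com:8080`, which is a placeholder, and uses the module named in the resolution report; both helper functions are hypothetical. Note that `--packages` expects `groupId:artifactId:version`, separated by colons. Whether this is sufficient on a given HDP build has not been confirmed in this thread.

```python
def proxy_java_options(host, port):
    """Build JVM flags that route HTTP and HTTPS traffic through one proxy."""
    return " ".join(
        "-D{scheme}.{key}={value}".format(scheme=scheme, key=key, value=value)
        for scheme in ("http", "https")
        for key, value in (("proxyHost", host), ("proxyPort", port))
    )


def maven_coordinate(group_id, artifact_id, version):
    """Format a coordinate the way --packages expects: group:artifact:version."""
    return ":".join((group_id, artifact_id, version))


# Placeholder proxy (an assumption); substitute the real host and port.
java_options = proxy_java_options("proxy.example.com", 8080)

command = [
    "pyspark",
    "--master", "yarn",
    "--deploy-mode", "client",
    # In client mode these options apply to the JVM that resolves --packages.
    "--conf", "spark.driver.extraJavaOptions=" + java_options,
    "--packages", maven_coordinate("com.databricks", "spark-csv_2.10", "1.3.0"),
]
print(command)
```

When typing the command by hand, the whole `--conf` value needs quoting in the shell because it contains spaces.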

Master Collaborator

Hi:

I resolved it like this:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar

avatar

I was able to successfully test the CSV reader.

Please refer to an article I wrote on this:

https://community.hortonworks.com/articles/52866/hive-on-tez-vs-pyspark-for-weblogs-parsing.html

New Contributor

@Roberto Sancho

I tried the suggested solution, but I am getting the following error.

I ran pyspark like this:

pyspark --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar

Error:

=====

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at scala.util.Try.orElse(Try.scala:82)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
        ... 14 more
>>>
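
The `ClassNotFoundException` for `com.databricks.spark.csv.DefaultSource` means the driver's classloader never saw that class. One quick thing to rule out is whether the jar on disk actually contains it, since a jar is just a zip archive. Below is a small sketch using only the Python standard library. `jar_contains_class` is a hypothetical helper name, and the demo builds a throwaway jar so the snippet runs anywhere.

```python
import os
import tempfile
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar holds the .class entry for a fully qualified class name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Demo on a throwaway jar; for a real check, pass ./spark-csv_2.11-1.4.0.jar instead.
demo_dir = tempfile.mkdtemp()
demo_jar = os.path.join(demo_dir, "demo.jar")
with zipfile.ZipFile(demo_jar, "w") as jar:
    jar.writestr("com/databricks/spark/csv/DefaultSource.class", b"")

print(jar_contains_class(demo_jar, "com.databricks.spark.csv.DefaultSource"))
```

If the class is present in the real jar, the jar most likely never reached the classpath, and the way `--jars` was passed is the next thing to check: spark-submit documents it as a single comma-separated list.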