
spark com.databricks.spark.csv doesn't work

Super Collaborator

Hi:

I am trying to use the com.databricks.spark.csv data source, but it doesn't work. I am behind a proxy, so how can I download it? This doesn't work:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar

It also doesn't work like this:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --packages com.databricks:spark-csv_2.11-1.4.0

Any suggestions?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at scala.util.Try.orElse(Try.scala:82)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
        ... 14 more


>>> :: resolution report :: resolve 252496ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------


:: problems summary ::
:::: WARNINGS
                module not found: com.databricks#spark-csv_2.10;1.3.0

        ==== local-m2-cache: tried

          file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar

        ==== local-ivy-cache: tried

          /root/.ivy2/local/com.databricks/spark-csv_2.10/1.3.0/ivys/ivy.xml

        ==== central: tried

          https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar

        ==== spark-packages: tried

          http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....

          -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:

          http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....

                ::::::::::::::::::::::::::::::::::::::::::::::
                ::          UNRESOLVED DEPENDENCIES         ::
                ::::::::::::::::::::::::::::::::::::::::::::::
                :: com.databricks#spark-csv_2.10;1.3.0: not found
                ::::::::::::::::::::::::::::::::::::::::::::::

:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom (java.net.ConnectException: Connection timed out)
        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)
        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)
        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.3.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1068)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
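
A note on the two commands in the question: `--packages` takes a `groupId:artifactId:version` coordinate (a colon before the version, not a dash), and the Ivy resolution it triggers runs inside the spark-submit JVM, which behind a proxy needs the standard Java proxy properties. A sketch, with placeholder proxy host/port, assuming `SPARK_SUBMIT_OPTS` is honored by this Spark build's launcher scripts:

```shell
# Placeholders: replace proxy.example.com:8080 with your proxy.
# SPARK_SUBMIT_OPTS passes JVM options to the spark-submit process,
# where the Ivy dependency resolution for --packages happens.
export SPARK_SUBMIT_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"

# Coordinate format is group:artifact:version.
pyspark --master yarn --deploy-mode client \
  --num-executors 5 --executor-cores 1 --executor-memory 1G \
  --packages com.databricks:spark-csv_2.11:1.4.0
```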


1 ACCEPTED SOLUTION

Re: spark com.databricks.spark.csv doesn't work

Super Collaborator

Hi:

I resolved it by passing all of the jars in a single comma-separated --jars list:

pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar,./commons-csv-1.4.jar,./univocity-parsers-2.2.1.jar
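
With those jars on the classpath, reading a CSV from the pyspark shell looks roughly like this (Spark 1.6-era API, where `sqlContext` is predefined in the shell; the HDFS path and option values are placeholders):

```python
# Sketch: assumes the pyspark 1.6 shell, where sqlContext already exists.
# The path and option values are placeholders.
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("hdfs:///tmp/sample.csv")
df.printSchema()
```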
5 REPLIES

Re: spark com.databricks.spark.csv doesn't work

@Roberto Sancho: From the trace it looks like the connection is timing out. Can you check?

Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)

Re: spark com.databricks.spark.csv doesn't work

Super Collaborator

Hi: I am behind a proxy, so is there any special configuration needed in the Spark config files?

Re: spark com.databricks.spark.csv doesn't work

I was able to successfully test the CSV reader.

Please refer to an article I wrote on this:

https://community.hortonworks.com/articles/52866/hive-on-tez-vs-pyspark-for-weblogs-parsing.html

Re: spark com.databricks.spark.csv doesn't work

New Contributor

@Roberto Sancho

I tried the suggested solution, but I am getting the following error.

I ran pyspark like this.

pyspark --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar

Error:

=====

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
        at scala.util.Try.orElse(Try.scala:82)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
        ... 14 more
>>>
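
A likely cause of this failure: when `--jars` is passed several times, spark-submit appears to keep only the last occurrence, so the spark-csv jar itself never reaches the classpath, which matches the `ClassNotFoundException` for `com.databricks.spark.csv.DefaultSource`. The documented form is a single comma-separated list:

```shell
# All jars in one --jars flag, comma-separated with no spaces.
pyspark --jars ./spark-csv_2.11-1.4.0.jar,./commons-csv-1.4.jar,./univocity-parsers-2.2.1.jar
```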