Created 08-26-2016 08:49 AM
Hi:
I am trying to use the com.databricks.spark.csv class, but it doesn't work. I am behind a proxy, so how can I download it? I tried this:
pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar
It also doesn't work like this:
pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --packages com.databricks:spark-csv_2.11-1.4.0
So, any suggestions?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.0.0-169/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.0.0-169/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 14 more
>>>

:: resolution report :: resolve 252496ms :: artifacts dl 0ms
    :: modules in use:
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
    module not found: com.databricks#spark-csv_2.10;1.3.0
    ==== local-m2-cache: tried
      file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom
      -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:
      file:/root/.m2/repository/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar
    ==== local-ivy-cache: tried
      /root/.ivy2/local/com.databricks/spark-csv_2.10/1.3.0/ivys/ivy.xml
    ==== central: tried
      https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom
      -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:
      https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar
    ==== spark-packages: tried
      http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....
      -- artifact com.databricks#spark-csv_2.10;1.3.0!spark-csv_2.10.jar:
      http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0....
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: UNRESOLVED DEPENDENCIES ::
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: com.databricks#spark-csv_2.10;1.3.0: not found
    ::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.pom (java.net.ConnectException: Connection timed out)
    Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)
    Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)
    Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.... (java.net.ConnectException: Connection timed out)
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.3.0: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1068)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
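A side note on the --packages attempt: Spark documents the value of --packages as Maven coordinates of the form groupId:artifactId:version. The string com.databricks:spark-csv_2.11-1.4.0 has a hyphen where the last colon would be, so it has only two parts. Below is a minimal sketch of a format check; parse_coordinate is a hypothetical helper, not part of Spark, and the valid example reuses the coordinate that appears in the resolution report above.

```python
def parse_coordinate(coord):
    """Split a Maven coordinate into (group_id, artifact_id, version).

    Hypothetical helper for illustration only; raises ValueError when the
    string does not have exactly three non-empty, colon-separated parts.
    """
    parts = coord.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            "expected groupId:artifactId:version, got %r" % coord)
    return tuple(parts)


# Coordinate as it appears in the resolution report (with ':' separators).
print(parse_coordinate("com.databricks:spark-csv_2.10:1.3.0"))

# Coordinate as typed in the second command: only two parts.
try:
    parse_coordinate("com.databricks:spark-csv_2.11-1.4.0")
except ValueError as err:
    print(err)
```

Even with a well-formed coordinate, the download still needs network access to the repositories, so the connection timeouts in the trace remain a separate problem.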
Created 08-26-2016 11:27 AM
Hi:
I resolved it like this:
pyspark --master yarn --deploy-mode client --num-executors 5 --executor-cores 1 --executor-memory 1G --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar
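A note for anyone copying this command: the spark-submit help text describes --jars as a single comma-separated list, and as far as I know a repeated --jars flag replaces the earlier value instead of adding to it. If only one of the jars seems to be picked up, passing all of them in one value is worth trying. A minimal sketch that builds such an argument list follows; build_pyspark_args is a hypothetical helper, and the jar names are simply the ones from this thread.

```python
def build_pyspark_args(jars, extra=()):
    """Return a pyspark argument list with every jar in one --jars value.

    Hypothetical helper for illustration; it only builds the list and
    does not launch anything.
    """
    args = ["pyspark"]
    args.extend(extra)
    if jars:
        # --jars is documented as a comma-separated list of jar paths.
        args.extend(["--jars", ",".join(jars)])
    return args


jars = [
    "./spark-csv_2.11-1.4.0.jar",
    "./commons-csv-1.4.jar",
    "./univocity-parsers-2.2.1.jar",
]
cmd = build_pyspark_args(
    jars, extra=["--master", "yarn", "--deploy-mode", "client"])
print(" ".join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess if the session is started from a script.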
Created 08-26-2016 09:12 AM
@Roberto Sancho: From the trace, it looks like the connection is timing out. Can you check?
Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar (java.net.ConnectException: Connection timed out)
Created 08-26-2016 09:15 AM
Hi, I am behind a proxy. Is there any special configuration needed in the Spark files?
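One approach that is often suggested for --packages behind a proxy is to hand the standard JVM proxy properties (http.proxyHost, http.proxyPort, https.proxyHost, https.proxyPort) to the driver through spark.driver.extraJavaOptions, so the dependency resolver can reach the repositories. Whether this takes effect may depend on the Spark version and deploy mode, so treat the following as a sketch under that assumption. The host proxy.example.com and port 8080 are placeholders, and both function names are hypothetical.

```python
def proxy_java_options(host, port):
    """Build -D flags for the standard JVM HTTP and HTTPS proxy properties."""
    opts = []
    for scheme in ("http", "https"):
        opts.append("-D%s.proxyHost=%s" % (scheme, host))
        opts.append("-D%s.proxyPort=%d" % (scheme, port))
    return " ".join(opts)


def build_proxy_args(package, host, port):
    """Return a pyspark argument list that passes the proxy flags to the driver.

    Assumption: the driver JVM options are honoured during --packages
    resolution on the installed Spark version.
    """
    conf = "spark.driver.extraJavaOptions=" + proxy_java_options(host, port)
    return ["pyspark", "--conf", conf, "--packages", package]


# Placeholder proxy host and port; replace with the real proxy settings.
args = build_proxy_args(
    "com.databricks:spark-csv_2.10:1.3.0", "proxy.example.com", 8080)
print(args)
```

If the proxy cannot be used at all, downloading the jars on another machine and passing them with --jars avoids the network lookup entirely.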
Created 08-26-2016 10:58 PM
I was able to successfully test the CSV reader.
Please refer to an article I wrote on this:
https://community.hortonworks.com/articles/52866/hive-on-tez-vs-pyspark-for-weblogs-parsing.html
Created 09-12-2016 06:54 PM
I tried the suggested solution, but I am getting the following error.
I ran pyspark like this:
pyspark --jars ./spark-csv_2.11-1.4.0.jar --jars ./commons-csv-1.4.jar --jars ./univocity-parsers-2.2.1.jar
Error:
=====
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/readwriter.py", line 137, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 14 more
>>>
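The "Caused by" line says the JVM could not load com.databricks.spark.csv.DefaultSource. One way to narrow this down is to check whether the jar on disk actually contains that class. If it does, the cause is more likely in how the jar reaches the classpath, or possibly a Scala version mismatch (the file name says _2.11, while the resolution report earlier in this thread mentions spark-csv_2.10), than in the jar itself. A minimal sketch using only the Python standard library follows; jar_has_class is a hypothetical helper name.

```python
import os
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar holds the given fully qualified class.

    A jar is a zip archive, and a class a.b.C is stored as a/b/C.class.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Jar name taken from the command in this thread; skipped if not present.
jar_path = "./spark-csv_2.11-1.4.0.jar"
if os.path.exists(jar_path):
    print(jar_has_class(jar_path, "com.databricks.spark.csv.DefaultSource"))
```

This only inspects the file; it does not tell whether Spark was given the jar, so it is best combined with a look at the exact --jars value that was passed.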