Member since: 12-27-2016
Posts: 73
Kudos Received: 34
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
| 17935 | 03-23-2018 09:21 PM
| 1040 | 02-05-2018 07:08 PM
| 5071 | 01-15-2018 07:21 PM
| 895 | 12-01-2017 06:35 PM
| 2864 | 03-09-2017 06:21 PM
12-11-2018
06:44 PM
Hi, @Arnaud Bohelay Starting with HDP 3.0, there are two database catalogs: the `hive` catalog (for all transactional tables) and the `spark` catalog (for HDP 2.6-style non-transactional tables). You can find the details here. - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
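As a rough illustration, a minimal sketch of reading from the `hive` catalog through the Hive Warehouse Connector; the table name is hypothetical, and the HWC settings (HiveServer2 JDBC URL, etc.) are assumed to be configured already on your cluster.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession (spark-shell's `spark`).
val hive = HiveWarehouseSession.session(spark).build()

// Hypothetical transactional table, for illustration only.
val df = hive.executeQuery("SELECT * FROM my_transactional_table LIMIT 10")
df.show()
```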
11-30-2018
06:48 PM
Please check the configuration. In HDP, STS (the Spark Thrift Server) doesn't use Derby by default. If the configuration is correct, STS should connect to the Hive Metastore Service (which runs on MySQL/Postgres) instead.
09-07-2018
03:58 PM
Yes. Those three configurations are the same for Spark 1.6.3 and Spark 2.x. And you already found the solution yourself: `spark.history.fs.cleaner.enabled=true`.
08-24-2018
08:29 PM
@Manikandan Jeyabal Are you using the official Apache Spark? The new ORC vectorized reader was added in Apache Spark 2.3.0. Please see SPARK-16060.
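If you are on Apache Spark 2.3.0 or later, a minimal sketch of opting in to the new reader (the settings are standard Spark 2.3 options, the path is a placeholder):

```scala
// Use the native ORC implementation and its vectorized reader (Spark 2.3+).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Hypothetical path, for illustration only.
spark.read.orc("/tmp/some_orc_data").show()
```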
03-26-2018
01:03 AM
Great! Thank you for sharing your experience, too. Your summary and understanding are correct. For Hive, the Hive 1.2.1 ORC writer and reader are quite old, so naturally they have some bugs; in general, they will still read new data correctly. For the best performance and safety, the latest Hive is recommended; Hive 2.3.0 is the first release to use Apache ORC. As for the Apache ORC library, Apache Spark 2.3 was released with Apache ORC 1.4.1. Please use the latest one, Apache ORC 1.4.3, if possible; there is a known issue, SPARK-23340.
03-24-2018
01:35 AM
Oh, is that so? I'll try to reproduce your situation. Could you share more information about your software stack? Apache Spark 2.3 on Hadoop 2.7 and Kafka? Could you confirm that you are using the new OrcFileFormat by setting `spark.sql.orc.impl=native`? The above bugs are fixed in the new OrcFileFormat only.
03-23-2018
09:21 PM
2 Kudos
Although it seems that you are hitting an output format issue, ORC has been tested properly since SPARK-22781. As one example, a `FileNotFoundException` might occur because of an empty DataFrame (SPARK-15474). There were more ORC issues before Apache Spark 2.3; please see SPARK-20901 for the full list.
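For reference, a minimal sketch of the empty-DataFrame case mentioned above; the path is a placeholder, and the failure mode described in the comments is the one tracked by SPARK-15474.

```scala
// An empty DataFrame produced by a filter that matches nothing.
val empty = spark.range(10).filter("id < 0")

// Writing it out and reading it back; before Spark 2.3 this round trip could fail
// (SPARK-15474), while 2.3+ is expected to handle it.
empty.write.mode("overwrite").orc("/tmp/empty_orc")
spark.read.orc("/tmp/empty_orc").count()
```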
03-23-2018
09:16 PM
1 Kudo
Hi, @Sanjay Gurnani Officially, the Apache Spark 2.2.1 Structured Streaming documentation doesn't cover ORC properly; the Apache Spark 2.3 documentation is the first to include it. - http://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html
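For example, a minimal Spark 2.3+ streaming-to-ORC sketch; the schema, input path, and checkpoint location below are placeholders.

```scala
import org.apache.spark.sql.types._

// Placeholder schema and paths, for illustration only.
val schema = new StructType().add("id", LongType).add("value", StringType)

val input = spark.readStream.schema(schema).json("/tmp/stream_input")

val query = input.writeStream
  .format("orc") // ORC file sink, available from Spark 2.3
  .option("path", "/tmp/orc_output")
  .option("checkpointLocation", "/tmp/orc_checkpoint")
  .start()
```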
03-06-2018
07:22 PM
If you can upgrade your cluster, you can use the above in HDP 2.6.4 with Spark 2.2.1, too.
03-06-2018
07:19 PM
1 Kudo
Hi, @Paul Hernandez. Do you want the following? Since you are using Spark 2.0 on HDP 2.5, I think you can install Apache Spark 2.3 there, too.
scala> spark.read.option("multiLine", "true").json("/tmp/data.json").select($"meta.filename", explode($"records")).select($"filename", $"col.time", explode($"col.grids")).select($"filename", $"time", $"col.gPt").select($"filename", $"time", $"gPt"(0), $"gPt"(1), $"gPt"(2), $"gPt"(3), $"gPt"(4)).show
+--------------------+--------------------+------+------+------+------+-----------+
| filename| time|gPt[0]|gPt[1]|gPt[2]|gPt[3]| gPt[4]|
+--------------------+--------------------+------+------+------+------+-----------+
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175| 13.55| 45.2|13.575|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175|13.575| 45.2| 13.6|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175| 13.6| 45.2|13.625|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175| 13.55| 45.2|13.575|4.545918E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175|13.575| 45.2| 13.6|4.545918E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175| 13.6| 45.2|13.625|4.545918E-7|
+--------------------+--------------------+------+------+------+------+-----------+
03-05-2018
07:02 PM
1 Kudo
In addition to that, Spark needs those options set before writing ORC in order to generate vectorizable ORC files. Otherwise, Spark will generate old Hive 1.2.1-style ORC files with dummy column names such as `col1`.
03-05-2018
06:55 PM
1 Kudo
Hi, @Jayadeep Jayaraman In Spark 2.2, this happens for ORC files whose file schema has dummy column names such as `col1` instead of your column `service_material_id`. Please check the file schema like the following.
hive --orcfiledump thefile.orc
The workaround on HDP 2.6.3 is to regenerate those files with Hive 2.x. BTW, it's fixed in Apache Spark 2.3. There were several more issues before 2.3; please see SPARK-20901
02-27-2018
04:10 PM
Hi, @prasad raju
Unfortunately, ORC doesn't support BZip2, so Hive and Spark don't, either.
- ORC Source Code
- HIVE-5067
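Since BZip2 isn't available, a minimal sketch with codecs ORC does support (`none`, `snappy`, `zlib`, `lzo`); the paths are placeholders.

```scala
// ORC supports codecs such as ZLIB and Snappy (but not BZip2); paths are hypothetical.
val df = spark.range(100).toDF("id")
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib")
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_snappy")
```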
02-14-2018
04:20 PM
1 Kudo
You are trying to create another SparkContext. Please use the existing one. In `spark-shell`, `sc` is the SparkContext which Spark created for you.
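A minimal sketch of reusing what `spark-shell` already provides instead of constructing a new context:

```scala
// `sc` (SparkContext) and `spark` (SparkSession) are already created by spark-shell.
val rdd = sc.parallelize(1 to 10)
val df  = spark.range(10).toDF("id")

// If other code needs a handle, getOrCreate() returns the existing session
// instead of constructing a second one.
import org.apache.spark.sql.SparkSession
val session = SparkSession.builder().getOrCreate()
```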
02-13-2018
04:45 AM
Thank you for confirming.
02-11-2018
09:57 PM
1 Kudo
Hi, @Mai Nakagawa You are using a mismatched jar file, as your first exception message shows, because the LLAP or Hive classes are not found. That document is about HDP 2.6.1 with Spark 2.1.1. Since HDP 2.6.3, `spark-llap` for Spark 2.2 is built in; please use it.
$ ls -al /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar
-rw-r--r-- 1 root root 61306448 Oct 30 02:39 /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar
02-11-2018
09:47 PM
Hi, @Paresh Baldaniya. This is unrelated to Spark SQL itself; if you used MySQL, the same issue would exist there. It's completely up to you; you may want to search for analytics tools that fit your needs.
02-11-2018
09:41 PM
What about trying standard SQL syntax first instead of Scala? Spark supports the `CASE WHEN` syntax in `sql()`:
CASE WHEN ... THEN ... ELSE ... END
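A minimal sketch; the view and column names are made up for illustration.

```scala
// Hypothetical data registered as a temporary view.
spark.range(5).toDF("id").createOrReplaceTempView("t")

spark.sql("""
  SELECT id,
         CASE WHEN id < 2 THEN 'small'
              WHEN id < 4 THEN 'medium'
              ELSE 'large'
         END AS bucket
  FROM t
""").show()
```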
02-11-2018
09:36 PM
Hi, @kiran Masani. It sounds like a configuration issue inside EMR. Please check whether `hive-site.xml` exists in the YARN container, too. Or, you can simply try HDP on EC2.
02-11-2018
09:32 PM
Could you give some example code to reproduce your problem?
02-11-2018
09:31 PM
Sorry, but could you elaborate on that? If you don't need the `stack` function in any situation, you simply don't need to use it.
02-06-2018
05:22 PM
It's the memory size for a Spark executor (worker), and there is additional overhead on top of it. You need to set a proper value yourself. Of course, in a YARN environment, the memory (plus overhead) must be smaller than the YARN container limit; that is why Spark shows you the error message. It's an application property: for normal Spark jobs, users are responsible, because each application can set its own `spark.executor.memory` with `spark-submit`. For Spark Thrift Server, admins should manage it properly when they adjust the YARN configuration. For more information, please see http://spark.apache.org/docs/latest/configuration.html#application-properties
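A rough sketch of where these knobs live when building a session programmatically; the values are placeholders, and the overhead property name differs by version (see the comments).

```scala
import org.apache.spark.sql.SparkSession

// Placeholder sizing: executor heap plus overhead must fit in the YARN container limit.
// Note: the overhead property is spark.yarn.executor.memoryOverhead (in MB) on Spark 2.2
// and spark.executor.memoryOverhead on 2.3+, so check your version.
val spark = SparkSession.builder()
  .appName("executor-memory-example")
  .config("spark.executor.memory", "4g")
  .config("spark.yarn.executor.memoryOverhead", "512")
  .getOrCreate()
```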
02-05-2018
07:08 PM
Hi, @Michael Bronson `spark.executor.memory` seems to be set to 10240. Please change it in Ambari, under `spark-thrift-conf`.
01-31-2018
05:08 PM
1 Kudo
In SPARK-20901 `Feature Parity for ORC with Parquet`, you can see the issue links marked as `is blocked by`. Among them, the following issues are what you want to look at for the ORC library:
- SPARK-21422 Depend on Apache ORC 1.4.0
- SPARK-22300 Update ORC to 1.4.1
In addition to that, the following turns Hive ORC tables into Spark data source tables so that they use Apache ORC 1.4.1 (see the sketch below):
- SPARK-22279 Turn on spark.sql.hive.convertMetastoreOrc by default
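A minimal sketch of flipping that switch yourself; the table name is hypothetical, and the default value of the setting differs by Spark version, so verify it against yours.

```scala
// Read Hive ORC tables through the Spark data source path instead of the old Hive reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT COUNT(*) FROM some_orc_table").show() // hypothetical Hive ORC table
```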
01-31-2018
03:43 AM
For Python, I checked the code.
- PySpark 2.2 doesn't have that: https://github.com/apache/spark/blob/branch-2.2/python/pyspark/sql/readwriter.py
- It's added in PySpark 2.3: https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L648-L649
I think you found a documentation error here.
01-31-2018
03:29 AM
Oh, my bad. I didn't try your command. You're right. For me, it works like the following in HDP 2.6.3 Scala Spark.
scala> spark.version
res5: String = 2.2.0.2.6.3.0-235
scala> Seq((1,2),(3,4)).toDF("a", "b").write.option("compression","zlib").mode("overwrite").format("orc").bucketBy(10, "a").sortBy("b").saveAsTable("xx")
scala> sql("select * from xx").show
+---+---+
| a| b|
+---+---+
| 3| 4|
| 1| 2|
+---+---+
01-30-2018
06:57 PM
BTW, Apache Spark doesn't guarantee ordering when reading back a sorted table; Spark reads the largest file in the directory first.
01-30-2018
06:55 PM
Hi, @Jayadeep Jayaraman. As you can see in the error message, `write` returns a `DataFrameWriter`. Sorting is supported on `Dataset`/`DataFrame` (via `sort`). Please try the following.
scala> spark.version
res7: String = 2.2.0.2.6.3.0-235
scala> df.sort("id").write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o2")
scala> df.sort($"id".desc).write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o1")
01-19-2018
06:32 PM
Hi, @Tu Nguyen Are you using HDP 2.6.3+? If so, you can try the SPARK-LLAP connector. It's designed for secure environments (Kerberos and Ranger), but it can read all Hive tables through LLAP. https://community.hortonworks.com/articles/101181/rowcolumn-level-security-in-sql-for-apache-spark-2.html For HDP 2.6.3+ with Spark 2.2, I didn't write an updated article, but it's almost the same, except that the SPARK-LLAP jar file is already built into HDP 2.6.3+; you don't need to download it.