Member since: 12-27-2016
Posts: 73
Kudos Received: 34
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
| 17935 | 03-23-2018 09:21 PM
| 1040 | 02-05-2018 07:08 PM
| 5071 | 01-15-2018 07:21 PM
| 895 | 12-01-2017 06:35 PM
| 2864 | 03-09-2017 06:21 PM
12-11-2018
06:44 PM
Hi, @Arnaud Bohelay Starting with HDP 3.0, there are two database catalogs: the `hive` catalog (for all transactional tables) and the `spark` catalog (for HDP 2.6-style non-transactional tables). You can find the details here. - https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
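As a rough illustration, a minimal sketch of reading from the `hive` catalog through the Hive Warehouse Connector; the table name is hypothetical, and the HWC settings (HiveServer2 JDBC URL, etc.) are assumed to be configured already on your cluster.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing SparkSession (spark-shell's `spark`).
val hive = HiveWarehouseSession.session(spark).build()

// Hypothetical transactional table, for illustration only.
val df = hive.executeQuery("SELECT * FROM my_transactional_table LIMIT 10")
df.show()
```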
11-30-2018
06:48 PM
Please check the configuration. In HDP, STS (the Spark Thrift Server) doesn't use Derby by default. If the configuration is correct, STS should connect to the Hive Metastore Service (which runs on MySQL/Postgres) instead.
09-07-2018
03:58 PM
Yes. Those three configurations are the same for Spark 1.6.3 and Spark 2.x. And you already found the solution yourself: `spark.history.fs.cleaner.enabled=true`.
08-24-2018
08:29 PM
@Manikandan Jeyabal Are you using the official Apache Spark? The new ORC vectorized reader was added in Apache Spark 2.3.0. Please see SPARK-16060.
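If you are on Apache Spark 2.3.0 or later, a minimal sketch of opting in to the new reader (the settings are standard Spark 2.3 options, the path is a placeholder):

```scala
// Use the native ORC implementation and its vectorized reader (Spark 2.3+).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Hypothetical path, for illustration only.
spark.read.orc("/tmp/some_orc_data").show()
```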
03-26-2018
01:03 AM
Great! Thank you for sharing your experience, too. Your summary and understanding are correct. For Hive, the Hive 1.2.1 ORC writer and reader are quite old, so naturally they have some bugs; in general, they will still read new data correctly. For the best performance and safety, the latest Hive is recommended; Hive 2.3.0 is the first release to use Apache ORC. As for the Apache ORC library, Apache Spark 2.3 was released with Apache ORC 1.4.1. Please use the latest one, Apache ORC 1.4.3, if possible; there is a known issue, SPARK-23340.
03-24-2018
01:35 AM
Oh, is that so? I'll try to reproduce your situation. Could you share more information about your software stack? Apache Spark 2.3 on Hadoop 2.7 and Kafka? Could you confirm that you are using the new OrcFileFormat by setting `spark.sql.orc.impl=native`? The above bugs are fixed in the new OrcFileFormat only.
03-23-2018
09:21 PM
2 Kudos
Although it seems that you are hitting an output format issue, ORC has been tested properly since SPARK-22781. As one example, a `FileNotFoundException` might occur because of an empty DataFrame (SPARK-15474). There were more ORC issues before Apache Spark 2.3; please see SPARK-20901 for the full list.
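For reference, a minimal sketch of the empty-DataFrame case mentioned above; the path is a placeholder, and the failure mode described in the comments is the one tracked by SPARK-15474.

```scala
// An empty DataFrame produced by a filter that matches nothing.
val empty = spark.range(10).filter("id < 0")

// Writing it out and reading it back; before Spark 2.3 this round trip could fail
// (SPARK-15474), while 2.3+ is expected to handle it.
empty.write.mode("overwrite").orc("/tmp/empty_orc")
spark.read.orc("/tmp/empty_orc").count()
```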
03-23-2018
09:16 PM
1 Kudo
Hi, @Sanjay Gurnani Officially, the Apache Spark 2.2.1 Structured Streaming documentation doesn't cover ORC properly; the Apache Spark 2.3 documentation is the first to include it. - http://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html
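For example, a minimal Spark 2.3+ streaming-to-ORC sketch; the schema, input path, and checkpoint location below are placeholders.

```scala
import org.apache.spark.sql.types._

// Placeholder schema and paths, for illustration only.
val schema = new StructType().add("id", LongType).add("value", StringType)

val input = spark.readStream.schema(schema).json("/tmp/stream_input")

val query = input.writeStream
  .format("orc") // ORC file sink, available from Spark 2.3
  .option("path", "/tmp/orc_output")
  .option("checkpointLocation", "/tmp/orc_checkpoint")
  .start()
```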
03-06-2018
07:22 PM
If you can upgrade your cluster, you can use the above in HDP 2.6.4 with Spark 2.2.1, too.
03-06-2018
07:19 PM
1 Kudo
Hi, @Paul Hernandez. Do you want the following? Since you are using Spark 2.0 on HDP 2.5, I think you can install Apache Spark 2.3 there, too.
scala> spark.read.option("multiLine", "true").json("/tmp/data.json").select($"meta.filename", explode($"records")).select($"filename", $"col.time", explode($"col.grids")).select($"filename", $"time", $"col.gPt").select($"filename", $"time", $"gPt"(0), $"gPt"(1), $"gPt"(2), $"gPt"(3), $"gPt"(4)).show
+--------------------+--------------------+------+------+------+------+-----------+
| filename| time|gPt[0]|gPt[1]|gPt[2]|gPt[3]| gPt[4]|
+--------------------+--------------------+------+------+------+------+-----------+
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175| 13.55| 45.2|13.575|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175|13.575| 45.2| 13.6|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:15:00Z|45.175| 13.6| 45.2|13.625|3.366295E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175| 13.55| 45.2|13.575|4.545918E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175|13.575| 45.2| 13.6|4.545918E-7|
|COSMODE_single_le...|2018-02-23T12:30:00Z|45.175| 13.6| 45.2|13.625|4.545918E-7|
+--------------------+--------------------+------+------+------+------+-----------+
03-05-2018
07:02 PM
1 Kudo
In addition to that, Spark needs those options set before writing ORC in order to generate vectorizable ORC files. Otherwise, Spark will generate old Hive 1.2.1-style ORC files with dummy column names such as `col1`.
03-05-2018
06:55 PM
1 Kudo
Hi, @Jayadeep Jayaraman In Spark 2.2, this happens for ORC files whose file schema has dummy column names such as `col1` instead of your column `service_material_id`. Please check the file schema like the following.
hive --orcfiledump thefile.orc
The workaround on HDP 2.6.3 is to regenerate those files with Hive 2.x. BTW, it's fixed in Apache Spark 2.3. There were several more issues before 2.3; please see SPARK-20901
02-27-2018
04:10 PM
Hi, @prasad raju
Unfortunately, ORC doesn't support BZip2, so Hive and Spark don't, either.
- ORC Source Code
- HIVE-5067
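Since BZip2 isn't available, a minimal sketch with codecs ORC does support (`none`, `snappy`, `zlib`, `lzo`); the paths are placeholders.

```scala
// ORC supports codecs such as ZLIB and Snappy (but not BZip2); paths are hypothetical.
val df = spark.range(100).toDF("id")
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib")
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_snappy")
```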
02-14-2018
04:20 PM
1 Kudo
You are trying to create another SparkContext. Please use the existing one. In `spark-shell`, `sc` is the SparkContext which Spark created for you.
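A minimal sketch of reusing what `spark-shell` already provides instead of constructing a new context:

```scala
// `sc` (SparkContext) and `spark` (SparkSession) are already created by spark-shell.
val rdd = sc.parallelize(1 to 10)
val df  = spark.range(10).toDF("id")

// If other code needs a handle, getOrCreate() returns the existing session
// instead of constructing a second one.
import org.apache.spark.sql.SparkSession
val session = SparkSession.builder().getOrCreate()
```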
02-13-2018
04:45 AM
Thank you for confirming.
02-11-2018
09:57 PM
1 Kudo
Hi, @Mai Nakagawa You are using a mismatched jar file, as your first exception message shows, because the LLAP or Hive classes are not found. That document is about HDP 2.6.1 with Spark 2.1.1. Since HDP 2.6.3, `spark-llap` for Spark 2.2 is built in; please use it.
$ ls -al /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar
-rw-r--r-- 1 root root 61306448 Oct 30 02:39 /usr/hdp/2.6.3.0-235/spark_llap/spark-llap-assembly-1.0.0.2.6.3.0-235.jar
02-11-2018
09:47 PM
Hi, @Paresh Baldaniya. This is unrelated to Spark SQL itself; if you used MySQL, the same issue would exist there. It's completely up to you; you may want to search for analytics tools that fit your needs.
02-11-2018
09:41 PM
What about trying standard SQL syntax first instead of Scala? Spark supports the `CASE WHEN` syntax in `sql()`:
CASE WHEN ... THEN ... ELSE ... END
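A minimal sketch; the view and column names are made up for illustration.

```scala
// Hypothetical data registered as a temporary view.
spark.range(5).toDF("id").createOrReplaceTempView("t")

spark.sql("""
  SELECT id,
         CASE WHEN id < 2 THEN 'small'
              WHEN id < 4 THEN 'medium'
              ELSE 'large'
         END AS bucket
  FROM t
""").show()
```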
02-11-2018
09:36 PM
Hi, @kiran Masani. It sounds like a configuration issue inside EMR. Please check whether `hive-site.xml` exists in the YARN container, too. Or, you can simply try HDP on EC2.
02-11-2018
09:32 PM
Could you give some example code to reproduce your problem?
02-11-2018
09:31 PM
Sorry, but could you elaborate on that? If you don't need the `stack` function in any situation, you simply don't need to use it.
02-06-2018
05:22 PM
It's the memory size for a Spark executor (worker), and there is additional overhead on top of it. You need to set a proper value yourself. Of course, in a YARN environment, the memory (plus overhead) must be smaller than the YARN container limit; that is why Spark shows you the error message. It's an application property: for normal Spark jobs, users are responsible, because each application can set its own `spark.executor.memory` with `spark-submit`. For Spark Thrift Server, admins should manage it properly when they adjust the YARN configuration. For more information, please see http://spark.apache.org/docs/latest/configuration.html#application-properties
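A rough sketch of where these knobs live when building a session programmatically; the values are placeholders, and the overhead property name differs by version (see the comments).

```scala
import org.apache.spark.sql.SparkSession

// Placeholder sizing: executor heap plus overhead must fit in the YARN container limit.
// Note: the overhead property is spark.yarn.executor.memoryOverhead (in MB) on Spark 2.2
// and spark.executor.memoryOverhead on 2.3+, so check your version.
val spark = SparkSession.builder()
  .appName("executor-memory-example")
  .config("spark.executor.memory", "4g")
  .config("spark.yarn.executor.memoryOverhead", "512")
  .getOrCreate()
```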
02-05-2018
07:08 PM
Hi, @Michael Bronson `spark.executor.memory` seems to be set to 10240. Please change it in Ambari, under `spark-thrift-conf`.
01-31-2018
05:08 PM
1 Kudo
In SPARK-20901 `Feature Parity for ORC with Parquet`, you can see the issue links marked as `is blocked by`. Among them, the following issues are what you want to look at for the ORC library:
- SPARK-21422 Depend on Apache ORC 1.4.0
- SPARK-22300 Update ORC to 1.4.1
In addition to that, the following turns Hive ORC tables into Spark data source tables so that they use Apache ORC 1.4.1 (see the sketch below):
- SPARK-22279 Turn on spark.sql.hive.convertMetastoreOrc by default
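A minimal sketch of flipping that switch yourself; the table name is hypothetical, and the default value of the setting differs by Spark version, so verify it against yours.

```scala
// Read Hive ORC tables through the Spark data source path instead of the old Hive reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT COUNT(*) FROM some_orc_table").show() // hypothetical Hive ORC table
```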
01-31-2018
03:43 AM
For Python, I checked the code.
- PySpark 2.2 doesn't have that: https://github.com/apache/spark/blob/branch-2.2/python/pyspark/sql/readwriter.py
- It's added in PySpark 2.3: https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L648-L649
I think you found a documentation error here.
01-31-2018
03:29 AM
Oh, my bad. I didn't try your command. You're right. For me, it works like the following in HDP 2.6.3 Scala Spark.
scala> spark.version
res5: String = 2.2.0.2.6.3.0-235
scala> Seq((1,2),(3,4)).toDF("a", "b").write.option("compression","zlib").mode("overwrite").format("orc").bucketBy(10, "a").sortBy("b").saveAsTable("xx")
scala> sql("select * from xx").show
+---+---+
| a| b|
+---+---+
| 3| 4|
| 1| 2|
+---+---+
01-30-2018
06:57 PM
BTW, Apache Spark doesn't guarantee ordering when reading back a sorted table; Spark reads the largest file in the directory first.
01-30-2018
06:55 PM
Hi, @Jayadeep Jayaraman. As you can see in the error message, `write` returns a `DataFrameWriter`. Sorting is supported on `Dataset`/`DataFrame` (via `sort`). Please try the following.
scala> spark.version
res7: String = 2.2.0.2.6.3.0-235
scala> df.sort("id").write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o2")
scala> df.sort($"id".desc).write.option("compression", "zlib").mode("overwrite").format("orc").saveAsTable("o1")
01-19-2018
06:32 PM
Hi, @Tu Nguyen Are you using HDP 2.6.3+? If so, you can try the SPARK-LLAP connector. It's designed for secure environments (Kerberos and Ranger), but it can read all Hive tables through LLAP. https://community.hortonworks.com/articles/101181/rowcolumn-level-security-in-sql-for-apache-spark-2.html For HDP 2.6.3+ with Spark 2.2, I didn't write an updated article, but it's almost the same, except that the SPARK-LLAP jar file is already built into HDP 2.6.3+; you don't need to download it.