Member since: 01-03-2017
Posts: 181
Kudos Received: 44
Solutions: 24

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2265 | 12-02-2018 11:49 PM |
| | 3122 | 04-13-2018 06:41 AM |
| | 2678 | 04-06-2018 01:52 AM |
| | 2965 | 01-07-2018 09:04 PM |
| | 6505 | 12-20-2017 10:58 PM |

07-31-2017 04:49 AM

Hi @mravipati,

Can you please check whether Dynamic Resource Allocation is enabled (spark.dynamicAllocation.enabled = true)? When it is enabled, Spark will request as many executors as the available cluster resources allow, which may be causing the problem.

This behaviour can be capped by setting spark.dynamicAllocation.maxExecutors = <max limit>. Please note that the driver also consumes a container, so you need to manage the memory allocation for both the executors and the driver. For instance, if the YARN minimum container size is 2 GB and each executor requests about 2 GB, around 4 GB per executor can end up being allocated once spark.yarn.executor.memoryOverhead is also accounted for. The following KB article explains in more detail why Spark takes more resources.
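A minimal sketch of how the cap might be applied, assuming Spark 2.x on YARN; the property names are the ones mentioned above, while the numeric values are placeholders rather than recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical values -- tune them to your cluster; they are not recommendations.
val spark = SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")        // required by dynamic allocation on YARN
  .config("spark.dynamicAllocation.maxExecutors", "10")   // upper bound on executors Spark may request
  .config("spark.executor.memory", "2g")
  .config("spark.yarn.executor.memoryOverhead", "512")    // off-heap overhead per executor, in MB
  .getOrCreate()

spark.range(1000).count()   // any job; the executor count now stays within the cap
spark.stop()
```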

07-25-2017 04:47 AM

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

Reducing the number of tasks may impact the overall performance (and note that the ALTER statement itself also runs a MapReduce job and consumes resources). After your insert, you can run an ALTER TABLE ... CONCATENATE statement to merge the small ORC files. More on this can be found in the ORC documentation: https://orc.apache.org/docs/hive-ddl.html

07-04-2017 03:28 AM

Hi @tariq abughofa, could you please check whether SELinux is disabled on the driver? It looks like it is preventing the new dynamic ports from accepting connections.

06-15-2017 12:31 AM (1 Kudo)

Hi @Abhijeet Rajput,

In response to handling the huge SQL: Spark uses lazy evaluation, which means you can split your code into multiple blocks and write it as multiple DataFrames. Everything is evaluated only at the end, and Spark uses the optimal execution plan it can build for the whole operation. Example:

var subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
var subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
... and so on.

On the other question, there is no difference between using the DataFrame-based API and SQL, as the same execution plan is generated for both; you can verify this from the DAG in the Spark UI while the job is executing.
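A runnable sketch of the same idea, assuming Spark 2.x (where createOrReplaceTempView plays the role of registerTempTable); the table and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder tables and columns, Spark 2.x.
val spark = SparkSession.builder().appName("split-big-sql").enableHiveSupport().getOrCreate()

val subquery1 = spark.sql(
  "SELECT a.c1, a.c2, b.c3 FROM tbl1 a JOIN tbl2 b ON a.c1 = b.c1")
subquery1.createOrReplaceTempView("res1")

val subquery2 = spark.sql(
  "SELECT r.c1, r.c2, t.c3 FROM res1 r JOIN tbl3 t ON r.c2 = t.c2")
subquery2.createOrReplaceTempView("res2")

// Nothing has executed yet: the plan across all the blocks is optimized as a
// whole and only runs when an action is called.
spark.table("res2").count()
```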

06-14-2017 07:08 AM

Hi @rahul gulati,

Apparently the number of partitions of your DataFrame / RDD is creating the issue. This can be controlled by adjusting the spark.default.parallelism parameter on the Spark context, or by using .repartition(<desired number>); a short sketch follows at the end of this reply.

When you run in spark-shell, please check the mode and the number of cores allocated for the execution, and adjust the value to whatever works for shell mode. Alternatively, you can observe the same from the Spark UI and come to a conclusion on the number of partitions.

From the Spark documentation on spark.default.parallelism:

- For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
- For operations like parallelize with no parent RDDs, it depends on the cluster manager:
  - Local mode: number of cores on the local machine
  - Others: total number of cores on all executor nodes or 2, whichever is larger
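A minimal sketch, assuming Spark 2.x; the DataFrame and the partition counts below are placeholders used only to show where the knobs sit:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; inspect first, then pick a partition count that matches
// the cores actually available to the shell or application.
val spark = SparkSession.builder()
  .appName("partition-tuning")
  .config("spark.default.parallelism", "8")   // default for RDD shuffle operations
  .getOrCreate()

val df = spark.range(0L, 1000000L)            // stand-in for the real DataFrame
println(df.rdd.getNumPartitions)              // current number of partitions

val tuned = df.repartition(8)                 // explicitly set the partition count
println(tuned.rdd.getNumPartitions)
```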

06-14-2017 04:33 AM (1 Kudo)

Hi @Jean-Sebastien Gourdet,

There are a couple of options available to reduce the shuffle (though not to eliminate it in some cases):

- Use broadcast variables. By using a broadcast variable you can eliminate the shuffle of the big table; however, you must broadcast the small table across all the executors, which may not be feasible in all cases, for example when both tables are big. (A sketch follows at the end of this reply.)
- The other alternative (and a good practice to implement) is predicate pushdown for Hive data. This filters, at the Hive level, only the data required for the computation and extracts a small amount of data. It may not avoid the shuffle completely, but it certainly speeds it up, since the amount of data pulled into memory is reduced significantly (in some cases). Use sqlContext.setConf("spark.sql.orc.filterPushdown", "true") if you are using ORC files, or spark.sql.parquet.filterPushdown in the case of Parquet files.
- The last, and not recommended, approach is to collapse to a single partition by applying .repartition(1) to the DataFrame. You avoid the shuffle, but you lose all the parallelism, as a single executor performs the whole operation.

On the other note, the shuffle will be quick if the data is evenly distributed on the key being used to join the tables.
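A minimal sketch of the first two options, assuming Spark 2.x with Hive support; the table names, filter, and join key are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Placeholder table names, filter, and join key.
val spark = SparkSession.builder()
  .appName("shuffle-reduction")
  .enableHiveSupport()
  .getOrCreate()

// Option 2: push the filter down to the ORC reader so less data reaches Spark.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
val big = spark.table("big_hive_table").filter("event_date = '2017-06-14'")

// Option 1: broadcast the small side so the big table is not shuffled for the join.
val small  = spark.table("small_dim_table")
val joined = big.join(broadcast(small), Seq("join_key"))

joined.count()
```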

05-25-2017 09:23 AM

Hi @Sridhar Babu,

Apparently there is a library incompatibility with the 2.11:1.3.0 and 2.11:1.4.0 builds. Please use the version com.databricks:spark-csv_2.10:1.4.0 instead.
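Once the matching package is on the classpath (for example via the --packages option of spark-shell or spark-submit), usage would look roughly like this; it assumes the sqlContext provided by a Spark 1.x shell, and the path and options are placeholders:

```scala
// Hypothetical usage of the spark-csv data source on Spark 1.x;
// the path and options are placeholders.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // infer column types instead of reading all strings
  .load("/tmp/example.csv")

df.printSchema()
```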

05-22-2017 03:27 AM (1 Kudo)

Hi @Mehdi Hosseinzadeh,

From the requirements perspective, the following is a simple approach that is in line with the technologies you proposed:

1. Read the data from HTTP using a Spark Streaming job and write it into Kafka.
2. Read and process the data from the Kafka topic as batches/streams and save it into HDFS as Parquet / Avro / ORC, etc. (sketched below).
3. Build external tables in Hive on top of the data processed in step 2, so that the data is available as soon as it lands in HDFS.

Accessing the data from external tables has been discussed here.
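A minimal sketch of step 2, assuming Spark 2.x Structured Streaming with the spark-sql-kafka-0-10 package available; the broker, topic, and HDFS paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder broker, topic, and paths.
val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "incoming_events")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw_event")

// Continuously write the records to HDFS as Parquet; a Hive external table
// can then be defined over /data/events (step 3).
val query = events.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()

query.awaitTermination()
```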

05-21-2017 10:20 PM

Hi @Sushant,

Glad that worked for you; in that case, can you please accept the answer?

In response to controlling the resource monitoring: not that I am aware of, but I believe you may not need to prevent one user from seeing the applications submitted by another user, as this does not expose any data (unless it is explicitly printed to STDOUT). On the other hand, you can manage access to the web UI with SPNEGO (authorization).

05-16-2017 10:11 PM

@Sudeep Mishra

Please pass the user keytab along with the spark-submit command:

--files /<key_tab_location>/<user_keytab.keytab>

This is because the executors are not authenticated to read data from the HBase region servers (or any other components). By passing the keytab, all the executors will have it and will be able to communicate.