Member since: 01-03-2017
Posts: 181
Kudos Received: 44
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1848 | 12-02-2018 11:49 PM
 | 2472 | 04-13-2018 06:41 AM
 | 2043 | 04-06-2018 01:52 AM
 | 2347 | 01-07-2018 09:04 PM
 | 5695 | 12-20-2017 10:58 PM
07-31-2017
04:49 AM
Hi @mravipati, can you please check whether Dynamic Resource Allocation is enabled (spark.dynamicAllocation.enabled = true)? When it is enabled, Spark requests as many executors as it can, depending on the resources available on the cluster, and this may be causing the problem.
This behaviour can be controlled by setting spark.dynamicAllocation.maxExecutors = <maximum number of executors>. Please note that the driver is also allocated a container, so you need to manage the memory allocations for both the executors and the driver. For instance, if the YARN minimum container size is 2 GB and you request 2 GB per executor, YARN allocates 4 GB per executor, because spark.yarn.executor.memoryOverhead is added to the request and the total is rounded up to a multiple of the minimum container size. The following KB explains more about why Spark takes more resources.
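The jump from 2 GB to 4 GB can be reproduced with a little arithmetic. The sketch below is only an illustration, not YARN's or Spark's actual code. It assumes the overhead defaults to the larger of 384 MB and 10% of the executor memory (the documented default for spark.yarn.executor.memoryOverhead in Spark 1.x/2.x) and that YARN rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb. The function names are hypothetical.

```python
import math

# Assumed defaults for spark.yarn.executor.memoryOverhead:
# max(384 MB, 10% of executor memory). Verify against your Spark version.
MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10


def executor_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory Spark asks YARN for per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead_mb


def yarn_container_mb(request_mb, min_allocation_mb):
    """Round a request up to the next multiple of the minimum container size.

    Assumption: the scheduler normalizes requests to multiples of
    yarn.scheduler.minimum-allocation-mb.
    """
    return math.ceil(request_mb / min_allocation_mb) * min_allocation_mb


request = executor_request_mb(2048)            # heap + overhead
container = yarn_container_mb(request, 2048)   # what YARN actually grants
print(request, container)
```

Under these assumptions, either lowering the minimum container size or sizing the executor so that heap plus overhead fits just under a multiple of it avoids paying for a mostly empty second increment.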
07-25-2017
04:47 AM
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
Reducing the number of tasks may hurt overall performance, so instead you can run the ALTER TABLE statement above after your insert (note that the ALTER statement also runs a MapReduce job and consumes resources). More on this can be found in the ORC documentation: https://orc.apache.org/docs/hive-ddl.html
07-04-2017
03:28 AM
Hi @tariq abughofa, could you please check whether SELinux is disabled on the driver host? It looks like it is blocking connections to the new dynamic ports.
06-15-2017
12:31 AM
1 Kudo
Hi @Abhijeet Rajput, regarding the huge SQL: Spark uses lazy evaluation, which means you can split your code into multiple blocks and write it using multiple DataFrames. Nothing is executed until an action is called, at which point Spark evaluates the whole chain using a single optimized execution plan. Example:
// condition1 ... condition5 are placeholders for your join conditions
val subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
// on Spark 2.x, prefer createOrReplaceTempView("res1")
subquery1.registerTempTable("res1")
val subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
// and so on
On the other question, there is no difference between using the DataFrame API and SQL, as the same execution plan is generated for both. You can validate this from the DAG shown in the Spark UI while the job is executing.
06-14-2017
07:08 AM
Hi @rahul gulati, apparently the number of partitions of your DataFrame / RDD is causing the issue. This can be controlled by adjusting the spark.default.parallelism parameter in the Spark configuration or by calling .repartition(<desired number>).
When you run in spark-shell, please check the mode and the number of cores allocated for the execution, and adjust the value to whichever works for the shell mode. Alternatively, you can observe the same from the Spark UI and decide on the number of partitions from there.
From the Spark website, on spark.default.parallelism:
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Others: total number of cores on all executor nodes or 2, whichever is larger
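The quoted rule for the no-parent-RDD case can be written as a small function. This is only a sketch of the two cases quoted above, not Spark's implementation; the function and parameter names are hypothetical, and any other cluster-manager-specific cases are left out.

```python
def default_parallelism(mode, local_cores=0, executor_cores_total=0):
    """Fallback spark.default.parallelism for operations with no parent RDD.

    Covers only the two cases quoted above (assumption: nothing overrides
    the setting explicitly).
    """
    if mode == "local":
        # Local mode: number of cores on the local machine
        return local_cores
    # Other cluster managers: total executor cores, but never fewer than 2
    return max(executor_cores_total, 2)


print(default_parallelism("local", local_cores=4))
print(default_parallelism("yarn", executor_cores_total=1))
print(default_parallelism("yarn", executor_cores_total=40))
```

This is why the same job can produce a different number of partitions in spark-shell than in a cluster run: the fallback depends on how many cores the session was given.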
06-14-2017
04:33 AM
1 Kudo
Hi @Jean-Sebastien Gourdet, there are a couple of options available to reduce the shuffle (though in some cases not eliminate it).
Using broadcast variables: by using a broadcast variable you can eliminate the shuffle of the big table, but you must broadcast the small table to all the executors. This may not be feasible in all cases, for example if both tables are big.
The other alternative (and a good practice to implement) is predicate pushdown for Hive data. This filters the data at the Hive level so that only the rows required for the computation are extracted. It may not avoid the shuffle completely, but it can speed it up, as the amount of data pulled into memory is reduced, in some cases significantly.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true") (if you are using ORC files; use spark.sql.parquet.filterPushdown in the case of Parquet files)
The last, and not recommended, approach is to pull everything into a single partition by calling .repartition(1) on the DataFrame. This avoids the shuffle in subsequent operations, but you lose all parallelism, as a single executor performs the work.
On a related note, the shuffle will be quick if the data is evenly distributed on the key being used to join the tables.
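To show why broadcasting the small table removes the need to shuffle the big one, here is a plain-Python illustration of a broadcast (map-side) hash join. It does not use Spark at all; the table contents, partition layout and function name are made up for the example.

```python
# Hypothetical small lookup table: in Spark this would be the broadcast
# variable, copied once to every executor.
small_table = {1: "US", 2: "DE"}

# Hypothetical partitions of the big table as (country_id, amount) rows.
# Each inner list stands for the rows already sitting on one executor.
big_partitions = [
    [(1, 10.0), (2, 5.0)],
    [(2, 7.5), (3, 1.0)],
]


def join_partition(rows, lookup):
    """Inner-join one partition against the broadcast lookup, locally."""
    return [(key, amount, lookup[key]) for key, amount in rows if key in lookup]


# Every partition is joined where it already lives; no rows of the big
# table move between partitions.
joined = [join_partition(part, small_table) for part in big_partitions]
print(joined)
```

The cost is that the whole small table must fit in memory on every executor, which is why this stops working when both tables are big.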
05-25-2017
09:23 AM
Hi @Sridhar Babu, apparently there is an incompatibility issue with the library versions 2.11:1.3.0 and 2.11:1.4.0. Please use version com.databricks:spark-csv_2.10:1.4.0 instead.
05-22-2017
03:27 AM
1 Kudo
Hi @Mehdi Hosseinzadeh, from the requirements perspective, the following is a simple approach that is in line with the technologies you proposed:
1. Read the data from HTTP using a Spark Streaming job and write it into Kafka.
2. Read and process the data from the Kafka topic as batches/stream and save it into HDFS as Parquet / Avro / ORC, etc.
3. Build external tables in Hive (on top of the data processed in step 2) so that the data is available as soon as it is placed in HDFS.
Accessing the data from external tables has been discussed here
05-21-2017
10:20 PM
Hi @Sushant, glad that worked for you; in that case, can you please accept the answer? Regarding control over resource monitoring: not that I am aware of, but I believe you may not need to prevent one user from seeing applications submitted by another user, as the UI does not contain any data (unless the application explicitly prints it to STDOUT). On the other hand, you can manage access to the web UI with SPNEGO authentication.
05-16-2017
10:11 PM
@Sudeep Mishra Please pass the user keytab along with the spark-submit command: --files /<key_tab_location>/<user_keytab.keytab>. This is needed because the executors are not authenticated to read data from the HBase RegionServers or any other components. When you pass the keytab, all the executors receive it and are able to communicate.