Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1542 | 06-21-2017 03:53 PM
 | 2198 | 03-14-2017 01:24 PM
 | 1458 | 01-25-2017 03:36 PM
 | 2460 | 12-20-2016 06:19 PM
 | 1153 | 12-14-2016 05:24 PM
05-25-2016
09:59 PM
1 Kudo
This is not true. Sqoop can import directly into a Snappy-compressed ORC table using HCatalog. Refer to my answer on how to do this.
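Something along these lines should work (the connection string, database, and table names here are made up for illustration):

```bash
# Hypothetical JDBC connection and table names; adjust for your environment.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --hcatalog-database default \
  --hcatalog-table orders_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')"
```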
05-25-2016
09:55 PM
1 Kudo
@Andrew Watson If you are also looking at multiple Python library versions, take a look at virtualenv. It makes managing multiple Python environments easier.
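As a rough sketch (the paths and package versions are only examples):

```bash
# Build two isolated Python environments with different library versions.
virtualenv /opt/envs/py-numpy110
/opt/envs/py-numpy110/bin/pip install numpy==1.10.4

virtualenv /opt/envs/py-numpy111
/opt/envs/py-numpy111/bin/pip install numpy==1.11.0

# Point PySpark at whichever environment a given job needs.
export PYSPARK_PYTHON=/opt/envs/py-numpy110/bin/python
```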
05-25-2016
09:41 PM
1 Kudo
I think it's spark.dynamicAllocation.initialExecutors that you can set per job. Try putting it in a properties file and passing it with --properties-file. I haven't tried this myself, so let me know how it works.
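An untested sketch of what I mean (the file name, class, and values are placeholders):

```bash
# Per-job Spark settings go in a properties file.
cat > my-job.properties <<'EOF'
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=10
EOF

# Pass the file when submitting this particular job.
spark-submit --properties-file my-job.properties --class com.example.MyJob my-job.jar
```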
05-25-2016
09:04 PM
If you are not using dynamic allocation, the job you submit will not start until it gets all of its resources: you are asking for N executors, so YARN will not let the job proceed until all N are allocated. If you are using dynamic allocation, then setting spark.dynamicAllocation.minExecutors to a higher value means the job gets scheduled only once minExecutors can be satisfied.
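To illustrate the two cases (executor counts, class, and jar names are placeholders):

```bash
# Static allocation: waits until all 50 requested executors are granted.
spark-submit --num-executors 50 --class com.example.MyJob my-job.jar

# Dynamic allocation: can be scheduled once the 5 minExecutors can be satisfied.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --class com.example.MyJob my-job.jar
```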
05-25-2016
08:53 PM
1 Kudo
It's not essential for all local accounts to have the same UID, though it will make maintenance easier. If you let Ambari create your local accounts, you may not get the same UIDs for local users across all nodes. If you want the same UIDs, it's better to create and manage the local users as part of your server configuration management process (like Puppet/Chef, if you have one).
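If you script it yourself, a minimal sketch (the UIDs, GIDs, and user names are only examples) is to run the same commands on every node before installing the cluster:

```bash
# Same UIDs/GIDs on every node so ownership stays consistent across hosts.
groupadd -g 5001 hadoop
useradd -u 5001 -g hadoop hdfs
useradd -u 5002 -g hadoop yarn
```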
05-25-2016
08:36 PM
If you think of Hadoop as HDFS and YARN, Spark can take advantage of both: it uses HDFS (storage that can be horizontally expanded by adding more nodes) by reading its input data from HDFS and writing the final processed data back to HDFS, and it uses YARN (compute that can be horizontally expanded by adding more nodes) by running as a YARN application. If you are looking for use cases, look at the MLlib algorithms, which cover a lot of use cases that can run on top of Spark.
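As a rough illustration of both points (the class, jar, and paths are hypothetical), a Spark job submitted to YARN that reads from and writes back to HDFS:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  wordcount.jar \
  hdfs:///data/input \
  hdfs:///data/output
```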
05-25-2016
08:25 PM
1 Kudo
If you are using Zeppelin for Spark, you can change JAVA_OPTS in zeppelin-env in the Zeppelin configs and add something like -Dspark.yarn.queue=my_zeppelin_queuename. You can add the MapReduce and Tez queues in JAVA_OPTS as well.
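For example (the queue name is a placeholder; this follows the JAVA_OPTS approach described above):

```bash
# zeppelin-env.sh: route Zeppelin's Spark, MapReduce, and Tez jobs to a specific queue.
export JAVA_OPTS="-Dspark.yarn.queue=my_zeppelin_queuename -Dmapreduce.job.queuename=my_zeppelin_queuename -Dtez.queue.name=my_zeppelin_queuename"
```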
05-25-2016
07:31 PM
1 Kudo
This will only work at the user level, not at the job level. So if the user has other jobs running and those already take up his percentage of the queue, the Spark job will start even before it can get that share.
05-25-2016
05:57 PM
2 Kudos
For ease of use, a local KDC for Hadoop service principals and AD for users is the best way. However, you need to secure your local KDC/Kerberos; if you can do that, there is no reason not to use a local KDC for Hadoop service principals. You may run into security policies that do not allow local Kerberos instances. You may also run into policies where you won't get AD credentials that have permission to create principals in an OU in AD, which is required if you want Ambari to create the principals for you directly. So which one to go with depends entirely on your company's security policies.
05-25-2016
03:45 PM
2 Kudos
I believe it is dfs.qjournal.start-segment.timeout.ms; the default is 20000 ms. However, there are other configs you may have to adjust as well, like dfs.qjournal.write-txns.timeout.ms. But you are better off fixing your infrastructure issues than changing these default values.
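If you do decide to raise them, a sketch of the hdfs-site.xml overrides (the values are illustrative, in milliseconds):

```xml
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>40000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>40000</value>
</property>
```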