Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1542 | 06-21-2017 03:53 PM
 | 2198 | 03-14-2017 01:24 PM
 | 1458 | 01-25-2017 03:36 PM
 | 2460 | 12-20-2016 06:19 PM
 | 1153 | 12-14-2016 05:24 PM
05-25-2016
09:59 PM
1 Kudo
This is not true. Sqoop can import directly into a Snappy-compressed ORC table using HCatalog. Refer to my answer on how to do this.
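Something along these lines should work (the connection string, database, and table names here are made up for illustration):

```bash
# Hypothetical JDBC connection and table names; adjust for your environment.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --hcatalog-database default \
  --hcatalog-table orders_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')"
```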
05-25-2016
09:55 PM
1 Kudo
@Andrew Watson If you are also looking at multiple Python library versions, take a look at virtualenv. It makes managing multiple Python environments easier.
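As a rough sketch (the paths and package versions are only examples):

```bash
# Build two isolated Python environments with different library versions.
virtualenv /opt/envs/py-numpy110
/opt/envs/py-numpy110/bin/pip install numpy==1.10.4

virtualenv /opt/envs/py-numpy111
/opt/envs/py-numpy111/bin/pip install numpy==1.11.0

# Point PySpark at whichever environment a given job needs.
export PYSPARK_PYTHON=/opt/envs/py-numpy110/bin/python
```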
05-25-2016
09:41 PM
1 Kudo
I think it's spark.dynamicAllocation.initialExecutors that you can set per job. Try putting it in a properties file and passing it with --properties-file. I haven't tried this myself, so let me know how it works.
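An untested sketch of what I mean (the file name, class, and values are placeholders):

```bash
# Per-job Spark settings go in a properties file.
cat > my-job.properties <<'EOF'
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=10
EOF

# Pass the file when submitting this particular job.
spark-submit --properties-file my-job.properties --class com.example.MyJob my-job.jar
```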
05-25-2016
09:04 PM
If you are not using dynamic allocation, the job you submit will not start until it gets all of its resources: you are asking for N executors, so YARN will not let the job proceed until all N are allocated. If you are using dynamic allocation, then setting spark.dynamicAllocation.minExecutors to a higher value means the job gets scheduled only once minExecutors can be satisfied.
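To illustrate the two cases (executor counts, class, and jar names are placeholders):

```bash
# Static allocation: waits until all 50 requested executors are granted.
spark-submit --num-executors 50 --class com.example.MyJob my-job.jar

# Dynamic allocation: can be scheduled once the 5 minExecutors can be satisfied.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --class com.example.MyJob my-job.jar
```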
05-25-2016
08:53 PM
1 Kudo
It's not essential for all local accounts to have the same UID, though it will make maintenance easier. If you let Ambari create your local accounts, you may not get the same UIDs for local users across all nodes. If you want the same UIDs, it's better to create and manage the local users as part of your server configuration management process (like Puppet/Chef, if you have one).
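If you script it yourself, a minimal sketch (the UIDs, GIDs, and user names are only examples) is to run the same commands on every node before installing the cluster:

```bash
# Same UIDs/GIDs on every node so ownership stays consistent across hosts.
groupadd -g 5001 hadoop
useradd -u 5001 -g hadoop hdfs
useradd -u 5002 -g hadoop yarn
```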
05-25-2016
08:36 PM
If you think of Hadoop as HDFS and YARN, Spark can take advantage of both: it uses HDFS (storage that can be horizontally expanded by adding more nodes) by reading its input data from HDFS and writing the final processed data back to HDFS, and it uses YARN (compute that can be horizontally expanded by adding more nodes) by running as a YARN application. If you are looking for use cases, look at the MLlib algorithms, which cover a lot of use cases that can run on top of Spark.
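As a rough illustration of both points (the class, jar, and paths are hypothetical), a Spark job submitted to YARN that reads from and writes back to HDFS:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  wordcount.jar \
  hdfs:///data/input \
  hdfs:///data/output
```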
05-25-2016
08:25 PM
1 Kudo
If you are using Zeppelin for Spark, you can change JAVA_OPTS in zeppelin-env in the Zeppelin configs and add something like -Dspark.yarn.queue=my_zeppelin_queuename. You can add the MapReduce and Tez queues in JAVA_OPTS as well.
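For example (the queue name is a placeholder; this follows the JAVA_OPTS approach described above):

```bash
# zeppelin-env.sh: route Zeppelin's Spark, MapReduce, and Tez jobs to a specific queue.
export JAVA_OPTS="-Dspark.yarn.queue=my_zeppelin_queuename -Dmapreduce.job.queuename=my_zeppelin_queuename -Dtez.queue.name=my_zeppelin_queuename"
```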
05-25-2016
07:31 PM
1 Kudo
This will only work at the user level, not at the job level. So if the user has other jobs running and those already take up his percentage of the queue, the Spark job will start even before it can get that share.
05-25-2016
05:57 PM
2 Kudos
For ease of use, a local KDC for Hadoop service principals and AD for users is the best way. However, you need to secure your local KDC/Kerberos; if you can do that, there is no reason not to use a local KDC for Hadoop service principals. You may run into security policies that do not allow local Kerberos instances. You may also run into policies where you won't get AD credentials that have permission to create principals in an OU in AD, which is required if you want Ambari to create the principals for you directly. So which one to go with depends entirely on your company's security policies.
05-25-2016
03:45 PM
2 Kudos
I believe it is dfs.qjournal.start-segment.timeout.ms; the default is 20000 ms. However, there are other configs you may have to adjust as well, like dfs.qjournal.write-txns.timeout.ms. But you are better off fixing your infrastructure issues than changing these default values.
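If you do decide to raise them, a sketch of the hdfs-site.xml overrides (the values are illustrative, in milliseconds):

```xml
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>40000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>40000</value>
</property>
```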