Member since: 01-18-2016
Posts: 169
Kudos Received: 32
Solutions: 21
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1637 | 06-27-2025 06:00 AM |
| | 1342 | 01-14-2025 06:30 PM |
| | 1861 | 04-06-2018 09:24 PM |
| | 2013 | 05-02-2017 10:43 PM |
| | 5203 | 01-24-2017 08:21 PM |
11-15-2016
07:26 PM
@Peter Kim I don't think you can control output file sizes other than by changing the number of mappers, as you are already doing, with one exception: if you are in direct mode you can use --direct-split-size 570000000, which splits the output into files of approximately 570 MB.
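For example, a direct-mode import might look something like this (the connection string, table, and target directory are placeholders, not from your setup):
sqoop import --connect jdbc:mysql://dbhost.example.com/mydb --table mytable --target-dir /user/me/mytable --direct --direct-split-size 570000000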
11-15-2016
06:27 PM
Hi @Marc Mazerolle. Sqoop runs by executing the JDBC connection in mappers on the data nodes, not on the edge node. That's what makes it so fast: multiple hosts pulling data in parallel. It also means the data nodes need to be able to reach your SQL Server on port 1433. I'm not sure about your port-forwarding workaround, but I assume you understand that using netcat (nc) for port forwarding is not a good solution for anything beyond a proof of concept, for multiple reasons. With netcat it would also only work for a single mapper, since netcat handles just one connection at a time; if you had multiple mappers, they couldn't all get through it. I think you need to open the port between your data nodes and the SQL Server.
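If you want to confirm reachability first, you can test from one of the data nodes (the SQL Server hostname below is just a placeholder):
nc -vz sqlserver.example.com 1433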
11-10-2016
05:59 PM
Your log has this error:
Diagnostics: File does not exist: hdfs://xxx-hdfs.abc.net:8020/user/yarn/.hiveJars/hive-exec-1.2.1.2.3.4.0-3485-bb59749376792da886f093283cc8bbdb78c69612f13abcbcedbef00717030c90.jar
Check that the /user/yarn directory exists and that its owner:group is yarn:hdfs.
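Something like this should show the state and fix the ownership if needed (run the chown as the HDFS superuser; the paths are the standard ones, adjust if your cluster differs):
hdfs dfs -ls /user
hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:hdfs /user/yarn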
11-04-2016
03:06 PM
+1 - @Misti Mordoch - This seems to have fixed the problem for someone else I was talking to. You can also check for out-of-sync hosts with the REST API: http://<AMBARI_HOST>:8080/api/v1/clusters/<CLUSTER_NAME>/stack_versions/1 (pass -u <USERNAME> if you call it with curl). Then check the "OUT_OF_SYNC" : [ ] list to make sure no hosts are listed as out of sync. Reinstalling services on any out-of-sync hosts, or removing and re-adding those hosts, will probably fix it.
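For example, with curl (the host, cluster name, and credentials are placeholders):
curl -u admin:admin http://ambari.example.com:8080/api/v1/clusters/MyCluster/stack_versions/1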
07-28-2016
07:55 PM
Someone had added two entries to spark-defaults.conf, spark.yarn.keytab and spark.yarn.principal, which caused spark-shell and pyspark to run as the "spark" user in YARN. Removing them fixed it.
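The offending lines in spark-defaults.conf looked roughly like this (the keytab path and principal here are examples, not the actual values):
spark.yarn.keytab /etc/security/keytabs/spark.headless.keytab
spark.yarn.principal spark-mycluster@EXAMPLE.COM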
07-20-2016
03:53 PM
How can we get pyspark to submit YARN jobs as the end user? We have data in a private directory (mode 700) that a user owns. He can select the data with HiveServer2's beeline, but with pyspark he gets permission denied because the job is submitted as the "spark" user instead of as the end user. This is a kerberized cluster with the Ranger Hive and HDFS plugins. He has access to the directory in question, just not through pyspark. He is mostly using Jupyter via JupyterHub, which uses PAM authentication, but I think he has also run this with bin/pyspark with the same results. Here is the code:
from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 4)
sc = SparkContext('yarn-client', 'myapp', conf=conf)
rdd = sc.textFile('/user/johndoe/.staging/test/student.txt')
rdd.cache()
rdd.count()
And the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/user/johndoe/.staging/test/student.txt":johndoe:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:305)
Labels:
- Apache Spark
07-20-2016
01:06 PM
1 Kudo
I may have been wrong about adding "/solr" to your zookeepers. I know I had to do that somewhere, but I think it was when starting Solr from the command line without the "bin/solr start" command. So, you can re-upload your config directory to a configset named "lab". That will create or overwrite the current configset (which is just a ZK directory holding your conf directory). The default configset name is the same as your collection.
./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -confname lab
If your configset is called lab, then do this to show the contents of the solrconfig.xml in zookeeper:
./zkcli.sh -zkhost localhost:2181 -cmd get /configs/lab/solrconfig.xml
I also recommend running the list command, which dumps everything in zookeeper: not just the file names but also the contents of the files. That's a bit too much, so pipe it to "less" and then search for your collection name as you would with vi (with / and ? to search). Then you'll see the path to your configs.
./zkcli.sh -zkhost localhost:2181 -cmd list | less
You will see something like this (my collection is called testcoll in this example):
/configs/testcoll/solrconfig.xml (0)
DATA: ...supressed...
/configs/testcoll/lang (38)
/configs/testcoll/lang/contractions_ga.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_hi.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_eu.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_sv.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/contractions_it.txt (0)
I hope that helps.
07-20-2016
12:59 PM
This will upload your config directory to a configset named "testcoll" (the default configset name is the same as your collection):
./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir ../../solr/configsets/data_driven_schema_configs/conf -confname testcoll
If your configset is called testcoll, then do this to show the contents of the solrconfig.xml in zookeeper:
./zkcli.sh -zkhost localhost:2181 -cmd get /configs/testcoll/solrconfig.xml
I also recommend running the list command, which dumps everything in zookeeper: not just the file names but also the contents of the files. That's a bit too much, so pipe it to "less" and then search for your collection name as you would with vi (with / and ? to search). Then you'll see the path to your configs.
07-20-2016
11:24 AM
@Saurabh Kumar You need to add "/solr" to the end of your zookeeper host:port, like this (you probably only need to list one of the zookeepers for the command):
./zkcli.sh -cmd upconfig -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -confname labs -z m1.hdp22:2181/solr
That command uploads the conf directory. I'd also suggest trying the list command ("-cmd list") to see what's in zookeeper; see the example below. It has been a while since I used it and I can't try it at the moment.
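Something along these lines, from memory and untested, reusing the same zookeeper address as above:
./zkcli.sh -cmd list -z m1.hdp22:2181/solr | less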
07-20-2016
03:46 AM
You have HDFS defined in two places: on the command line and also in the solrconfig.xml. I don't understand the one on the command line, since it does not include a port and HDPTSTHA does not look like a hostname, though it could be an HA nameservice (which would explain the missing port). You might try temporarily changing the one in your solrconfig.xml to something bogus to see whether it changes the reported error. Also, the create command says "Re-using existing configuration directory labs", which makes me wonder if it is reusing what is already in zookeeper, and perhaps that file does not match the one on your OS filesystem. The reported error has only one slash after "hdfs:/". Use Solr's zkcli.sh tool (which is different from the one that comes with Zookeeper) to get the contents of what's there, or do a getfile or an upconfig (to replace/update what's in zookeeper). Remember that Solr adds "/solr" to the ZK root except in embedded ZK mode.
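For example, to pull the solrconfig.xml Solr is actually using out of zookeeper and compare it with the one on disk (the zkhost, the configset name "labs", and the local paths are guesses based on your output, so adjust as needed):
./zkcli.sh -zkhost localhost:2181/solr -cmd getfile /configs/labs/solrconfig.xml /tmp/solrconfig_from_zk.xml
diff /tmp/solrconfig_from_zk.xml /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml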