Member since: 01-18-2016
Posts: 169
Kudos Received: 32
Solutions: 21
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1637 | 06-27-2025 06:00 AM |
| | 1342 | 01-14-2025 06:30 PM |
| | 1861 | 04-06-2018 09:24 PM |
| | 2013 | 05-02-2017 10:43 PM |
| | 5203 | 01-24-2017 08:21 PM |
11-15-2016
07:26 PM
@Peter Kim I don't think you can control output file sizes other than by changing the number of mappers, as you are already doing, with one exception: if you are in direct mode you can use --direct-split-size 570000000, which splits the output into files of approximately 570 MB.
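For example, a direct-mode import might look something like this (the connection string, table, and target directory are placeholders, not from your setup):
sqoop import --connect jdbc:mysql://dbhost.example.com/mydb --table mytable --target-dir /user/me/mytable --direct --direct-split-size 570000000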
11-15-2016
06:27 PM
Hi @Marc Mazerolle. Sqoop runs by executing the JDBC connection in mappers on the data nodes, not on the edge node. That's what makes it so fast: multiple hosts pulling data in parallel. It also means the data nodes need to be able to reach your SQL Server on port 1433. I'm not sure about your port-forwarding workaround, but I assume you understand that using netcat (nc) for port forwarding is not a good solution for anything beyond a proof of concept, for multiple reasons. With netcat it would also only work for a single mapper, since netcat handles just one connection at a time; if you had multiple mappers, they couldn't all get through it. I think you need to open the port between your data nodes and the SQL Server.
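If you want to confirm reachability first, you can test from one of the data nodes (the SQL Server hostname below is just a placeholder):
nc -vz sqlserver.example.com 1433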
11-10-2016
05:59 PM
Your log has this error:
Diagnostics: File does not exist: hdfs://xxx-hdfs.abc.net:8020/user/yarn/.hiveJars/hive-exec-1.2.1.2.3.4.0-3485-bb59749376792da886f093283cc8bbdb78c69612f13abcbcedbef00717030c90.jar
Check that the /user/yarn directory exists and that its owner:group is yarn:hdfs.
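Something like this should show the state and fix the ownership if needed (run the chown as the HDFS superuser; the paths are the standard ones, adjust if your cluster differs):
hdfs dfs -ls /user
hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:hdfs /user/yarn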
11-04-2016
03:06 PM
+1 - @Misti Mordoch - This seems to have fixed the problem for someone else I was talking to. You can also check for out-of-sync hosts with the REST API: http://<AMBARI_HOST>:8080/api/v1/clusters/<CLUSTER_NAME>/stack_versions/1 (pass -u <USERNAME> if you call it with curl). Then check the "OUT_OF_SYNC" : [ ] list to make sure no hosts are listed as out of sync. Reinstalling services on any out-of-sync hosts, or removing and re-adding those hosts, will probably fix it.
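For example, with curl (the host, cluster name, and credentials are placeholders):
curl -u admin:admin http://ambari.example.com:8080/api/v1/clusters/MyCluster/stack_versions/1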
07-28-2016
07:55 PM
Someone had added two entries to spark-defaults.conf, spark.yarn.keytab and spark.yarn.principal, which caused spark-shell and pyspark to run as the "spark" user in YARN. Removing them fixed it.
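The offending lines in spark-defaults.conf looked roughly like this (the keytab path and principal here are examples, not the actual values):
spark.yarn.keytab /etc/security/keytabs/spark.headless.keytab
spark.yarn.principal spark-mycluster@EXAMPLE.COM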
07-20-2016
03:53 PM
How can we get pyspark to submit YARN jobs as the end user? We have data in a private directory (mode 700) that a user owns. He can select the data with HiveServer2's beeline, but with pyspark he gets permission denied because the job is submitted as the "spark" user instead of as the end user. This is a kerberized cluster with the Ranger Hive and HDFS plugins. He has access to the directory in question, just not through pyspark. He is mostly using Jupyter via JupyterHub, which uses PAM authentication, but I think he has also run this with bin/pyspark with the same results. Here is the code:
from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 4)
sc = SparkContext('yarn-client', 'myapp', conf=conf)
rdd = sc.textFile('/user/johndoe/.staging/test/student.txt')
rdd.cache()
rdd.count()
And the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/user/johndoe/.staging/test/student.txt":johndoe:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:305)
Labels:
- Apache Spark
07-20-2016
01:06 PM
1 Kudo
I may have been wrong about adding "/solr" to your zookeepers. I know I had to do that somewhere, but I think it was when starting Solr from the command line without the "bin/solr start" command. So, you can re-upload your config directory to a configset named "lab". That will create or overwrite the current configset (which is just a ZK directory holding your conf directory). The default configset name is the same as your collection.
./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -confname lab
If your configset is called lab, then do this to show the contents of the solrconfig.xml in zookeeper:
./zkcli.sh -zkhost localhost:2181 -cmd get /configs/lab/solrconfig.xml
I also recommend running the list command, which dumps everything in zookeeper: not just the file names but also the contents of the files. That's a bit too much, so pipe it to "less" and then search for your collection name as you would with vi (with / and ? to search). Then you'll see the path to your configs.
./zkcli.sh -zkhost localhost:2181 -cmd list | less
You will see something like this (my collection is called testcoll in this example):
/configs/testcoll/solrconfig.xml (0)
DATA: ...supressed...
/configs/testcoll/lang (38)
/configs/testcoll/lang/contractions_ga.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_hi.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_eu.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/stopwords_sv.txt (0)
DATA: ...supressed...
/configs/testcoll/lang/contractions_it.txt (0)
I hope that helps.
07-20-2016
12:59 PM
This will upload your config directory to a configset named "testcoll" (the default configset name is the same as your collection):
./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir ../../solr/configsets/data_driven_schema_configs/conf -confname testcoll
If your configset is called testcoll, then do this to show the contents of the solrconfig.xml in zookeeper:
./zkcli.sh -zkhost localhost:2181 -cmd get /configs/testcoll/solrconfig.xml
I also recommend running the list command, which dumps everything in zookeeper: not just the file names but also the contents of the files. That's a bit too much, so pipe it to "less" and then search for your collection name as you would with vi (with / and ? to search). Then you'll see the path to your configs.
07-20-2016
11:24 AM
@Saurabh Kumar You need to add "/solr" to the end of your zookeeper host:port, like this (you probably only need to list one of the zookeepers for the command):
./zkcli.sh -cmd upconfig -confdir /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -confname labs -z m1.hdp22:2181/solr
That command uploads the conf directory. I'd also suggest trying the list command ("-cmd list") to see what's in zookeeper; see the example below. It has been a while since I used it and I can't try it at the moment.
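Something along these lines, from memory and untested, reusing the same zookeeper address as above:
./zkcli.sh -cmd list -z m1.hdp22:2181/solr | less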
07-20-2016
03:46 AM
You have HDFS defined in two places: on the command line and also in the solrconfig.xml. I don't understand the one on the command line, since it does not include a port and HDPTSTHA does not look like a hostname, though it could be an HA nameservice (which would explain the missing port). You might try temporarily changing the one in your solrconfig.xml to something bogus to see whether it changes the reported error. Also, the create command says "Re-using existing configuration directory labs", which makes me wonder if it is reusing what is already in zookeeper, and perhaps that file does not match the one on your OS filesystem. The reported error has only one slash after "hdfs:/". Use Solr's zkcli.sh tool (which is different from the one that comes with Zookeeper) to get the contents of what's there, or do a getfile or an upconfig (to replace/update what's in zookeeper). Remember that Solr adds "/solr" to the ZK root except in embedded ZK mode.
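For example, to pull the solrconfig.xml Solr is actually using out of zookeeper and compare it with the one on disk (the zkhost, the configset name "labs", and the local paths are guesses based on your output, so adjust as needed):
./zkcli.sh -zkhost localhost:2181/solr -cmd getfile /configs/labs/solrconfig.xml /tmp/solrconfig_from_zk.xml
diff /tmp/solrconfig_from_zk.xml /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml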