Member since: 01-18-2016
Posts: 163
Kudos Received: 32
Solutions: 19
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1380 | 04-06-2018 09:24 PM
 | 1405 | 05-02-2017 10:43 PM
 | 3842 | 01-24-2017 08:21 PM
 | 23567 | 12-05-2016 10:35 PM
 | 6457 | 11-30-2016 10:33 PM
11-16-2016
02:22 AM
Can you check if Postgres is running?
[root@sandbox ~]# service postgresql status
postmaster (pid 524) is running...
11-16-2016
02:09 AM
Apparently there is a newer option, "-Dorg.apache.sqoop.splitter.allow_text_splitter=true", that allows splitting on a string column, though there is no guarantee it will split evenly.
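Roughly what that looks like on the command line, as a sketch only (the -D property has to come immediately after the tool name, before the tool-specific arguments; the connection string, table, column, and paths below are placeholders):
# Sketch: let Sqoop split on a text column; the splits may still be uneven.
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P \
  --table mytable \
  --split-by str_col \
  -m 4 \
  --target-dir /user/peter/mytable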
11-16-2016
01:56 AM
It looks like you're right about --direct-split-size being limited to PostgreSQL. So I think you'll need to rely on the number of mappers and a proper split-by column. What size files are you getting when you export with 4 mappers?

I noticed that in your original post you have --split-by STR, which makes me think you're trying to split by a string column. That is not possible, and I think it will produce an error with more than one mapper. But let's assume your split-by column is actually numeric. Do you know for sure that your data is evenly distributed by the split-by column? If the numeric field is not evenly distributed, you will end up with some larger and some smaller files.

Sqoop first does select min(<split-column>), max(<split-column>) from <table>, then divides that range by the number of mappers. For example, suppose you have 25 records in the database: the primary key field "id" has 24 records with ids 1 through 24, plus one other record with id=100. If we run 4 mappers, we get min 1 and max 100, divided into 4 id ranges, one per mapper. Each mapper writes the records in its id range, so we end up with ONE file containing 24 records, two empty files, and one file containing just the record with id=100.
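To make the arithmetic concrete, here is roughly how that 25-record example plays out; the connection string, table, and column names are purely illustrative:
# Sqoop's boundary query looks roughly like:
#   SELECT MIN(id), MAX(id) FROM mytable;   -- returns 1 and 100
# With 4 mappers, the id range 1..100 is cut into 4 equal slices:
#   mapper 1: 1     <= id < 25.75  -> 24 records -> one large file
#   mapper 2: 25.75 <= id < 50.5   -> 0 records  -> empty file
#   mapper 3: 50.5  <= id < 75.25  -> 0 records  -> empty file
#   mapper 4: 75.25 <= id <= 100   -> 1 record   -> tiny file
sqoop import --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P --table mytable --split-by id -m 4 \
  --target-dir /user/peter/mytable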
11-15-2016
07:26 PM
@Peter Kim I don't think you can control file sizes other than by changing the number of mappers, as you seem to be doing, with one exception: if you are in direct mode, you can use --direct-split-size 570000000, which will split the output into files of approximately 570 MB.
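A minimal sketch, assuming a direct-mode PostgreSQL import (the connection string, table, and target directory are placeholders; --direct-split-size only has an effect together with --direct):
# Sketch: split direct-mode output files at roughly 570 MB.
sqoop import \
  --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P \
  --table mytable \
  --direct \
  --direct-split-size 570000000 \
  --target-dir /user/peter/mytable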
11-15-2016
06:27 PM
Hi @Marc Mazerolle. Sqoop runs the JDBC connection inside mappers on the data nodes, not on the edge node; that's what makes it so fast: multiple hosts pulling data in parallel. That means the data nodes need to be able to reach your SQL Server on port 1433. I'm not sure about your port-forwarding workaround, but I assume you understand that using netcat (nc) for port forwarding is not a good solution for anything but a proof of concept, for multiple reasons. And with netcat I think it would only work for a single mapper, since netcat can only handle one connection at a time; with multiple mappers, they couldn't all get through netcat. I think you may need to open the port between your data nodes and the SQL Server.
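One quick way to verify the connectivity is a zero-I/O netcat scan from each data node (the hostname below is a placeholder for your SQL Server):
# Succeeds only if the data node can reach SQL Server on port 1433.
nc -zv sqlserver.example.com 1433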
11-10-2016
05:59 PM
Your log has this error:
Diagnostics: File does not exist: hdfs://xxx-hdfs.abc.net:8020/user/yarn/.hiveJars/hive-exec-1.2.1.2.3.4.0-3485-bb59749376792da886f093283cc8bbdb78c69612f13abcbcedbef00717030c90.jar
Check that the /user/yarn directory exists and that its owner:group is yarn:hdfs.
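A quick sketch of the checks, run as the hdfs superuser (the yarn:hdfs ownership matches what the error above expects; adjust if your layout differs):
# Check whether /user/yarn exists and who owns it.
hdfs dfs -ls /user | grep yarn
# If it is missing or owned incorrectly, create it and fix ownership:
hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:hdfs /user/yarn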
11-04-2016
03:06 PM
+1 - @Misti Mordoch - This seems to have fixed the problem for someone else I was talking to. You can also check for out-of-sync hosts with the REST API:
curl -u <USERNAME> http://<AMBARI_HOST>:8080/api/v1/clusters/<CLUSTER_NAME>/stack_versions/1
Then check for "OUT_OF_SYNC" : [ ] to make sure no hosts are listed as out of sync. Reinstalling services on the out-of-sync hosts, or removing and re-adding them, will probably fix it.
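If you just want a quick pass/fail, something like this works as a sketch (host, credentials, and cluster name are placeholders, and the exact JSON layout can vary by Ambari version):
# Pull the stack_versions resource and show the OUT_OF_SYNC host list.
curl -s -u admin:admin "http://ambari.example.com:8080/api/v1/clusters/MyCluster/stack_versions/1" \
  | grep -A 3 '"OUT_OF_SYNC"'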
07-28-2016
07:55 PM
Someone had added two entries to spark-defaults.conf, spark.yarn.keytab and spark.yarn.principal, which caused spark-shell and pyspark to run as the "spark" user in YARN. Removing them fixed it.
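For reference, a quick way to spot those entries (the conf path is the usual HDP location and the values shown are only illustrative):
# Look for the two properties that forced jobs to run as the "spark" user.
grep -E 'spark\.yarn\.(keytab|principal)' /etc/spark/conf/spark-defaults.conf
# Example of what was removed (illustrative values):
#   spark.yarn.keytab      /etc/security/keytabs/spark.headless.keytab
#   spark.yarn.principal   spark-mycluster@EXAMPLE.COM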
07-20-2016
03:53 PM
How can we get pyspark to submit YARN jobs as the end user? We have data in a private directory (700) that a user owns. He can select the data with HiveServer2's beeline, but when using pyspark he gets permission denied because the job is submitted as the "spark" user instead of as the end user. This is a kerberized cluster with Ranger Hive and HDFS plugins. He has access to the directory in question, just not with pyspark. He is mostly using Jupyter via JupyterHub, which uses PAM authentication, but I think he has also run this with bin/pyspark with the same results. Here is the code:
from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 4)
sc = SparkContext('yarn-client', 'myapp', conf=conf)
rdd = sc.textFile('/user/johndoe/.staging/test/student.txt')
rdd.cache()
rdd.count()
And the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/user/johndoe/.staging/test/student.txt":johndoe:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:305)
Labels:
- Apache Spark