Member since: 01-18-2016
Posts: 163
Kudos Received: 32
Solutions: 19
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1380 | 04-06-2018 09:24 PM
 | 1405 | 05-02-2017 10:43 PM
 | 3842 | 01-24-2017 08:21 PM
 | 23567 | 12-05-2016 10:35 PM
 | 6457 | 11-30-2016 10:33 PM
11-16-2016
02:22 AM
Can you check if Postgres is running?
[root@sandbox ~]# service postgresql status
postmaster (pid 524) is running...
11-16-2016
02:09 AM
Apparently there is a newer option, "-Dorg.apache.sqoop.splitter.allow_text_splitter=true", that allows splitting on a string column, though there is no guarantee it will split evenly.
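Roughly what that looks like on the command line, as a sketch only (the -D property has to come immediately after the tool name, before the tool-specific arguments; the connection string, table, column, and paths below are placeholders):
# Sketch: let Sqoop split on a text column; the splits may still be uneven.
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P \
  --table mytable \
  --split-by str_col \
  -m 4 \
  --target-dir /user/peter/mytable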
11-16-2016
01:56 AM
It looks like you're right about --direct-split-size being limited to PostgreSQL. So I think you'll need to rely on the number of mappers and a proper split-by column. What size files are you getting when you export with 4 mappers?

I noticed that in your original post you have --split-by STR, which makes me think you're trying to split by a string column. That is not possible, and I think it will produce an error with more than one mapper. But let's assume your split-by column is actually numeric. Do you know for sure that your data is evenly distributed by the split-by column? If the numeric field is not evenly distributed, you will end up with some larger and some smaller files.

Sqoop first does select min(<split-column>), max(<split-column>) from <table>, then divides that range by the number of mappers. For example, suppose you have 25 records in the database: the primary key field "id" has 24 records with ids 1 through 24, plus one other record with id=100. If we run 4 mappers, we get min 1 and max 100, divided into 4 id ranges, one per mapper. Each mapper writes the records in its id range, so we end up with ONE file containing 24 records, two empty files, and one file containing just the record with id=100.
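To make the arithmetic concrete, here is roughly how that 25-record example plays out; the connection string, table, and column names are purely illustrative:
# Sqoop's boundary query looks roughly like:
#   SELECT MIN(id), MAX(id) FROM mytable;   -- returns 1 and 100
# With 4 mappers, the id range 1..100 is cut into 4 equal slices:
#   mapper 1: 1     <= id < 25.75  -> 24 records -> one large file
#   mapper 2: 25.75 <= id < 50.5   -> 0 records  -> empty file
#   mapper 3: 50.5  <= id < 75.25  -> 0 records  -> empty file
#   mapper 4: 75.25 <= id <= 100   -> 1 record   -> tiny file
sqoop import --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P --table mytable --split-by id -m 4 \
  --target-dir /user/peter/mytable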
11-15-2016
07:26 PM
@Peter Kim I don't think you can control file sizes other than by changing the number of mappers, as you seem to be doing, with one exception: if you are in direct mode, you can use --direct-split-size 570000000, which will split the output into files of approximately 570 MB.
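A minimal sketch, assuming a direct-mode PostgreSQL import (the connection string, table, and target directory are placeholders; --direct-split-size only has an effect together with --direct):
# Sketch: split direct-mode output files at roughly 570 MB.
sqoop import \
  --connect jdbc:postgresql://dbhost.example.com:5432/mydb \
  --username myuser -P \
  --table mytable \
  --direct \
  --direct-split-size 570000000 \
  --target-dir /user/peter/mytable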
11-15-2016
06:27 PM
Hi @Marc Mazerolle. Sqoop runs the JDBC connection inside mappers on the data nodes, not on the edge node; that's what makes it so fast: multiple hosts pulling data in parallel. That means the data nodes need to be able to reach your SQL Server on port 1433. I'm not sure about your port-forwarding workaround, but I assume you understand that using netcat (nc) for port forwarding is not a good solution for anything but a proof of concept, for multiple reasons. And with netcat I think it would only work for a single mapper, since netcat can only handle one connection at a time; with multiple mappers, they couldn't all get through netcat. I think you may need to open the port between your data nodes and the SQL Server.
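One quick way to verify the connectivity is a zero-I/O netcat scan from each data node (the hostname below is a placeholder for your SQL Server):
# Succeeds only if the data node can reach SQL Server on port 1433.
nc -zv sqlserver.example.com 1433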
11-10-2016
05:59 PM
Your log has this error:
Diagnostics: File does not exist: hdfs://xxx-hdfs.abc.net:8020/user/yarn/.hiveJars/hive-exec-1.2.1.2.3.4.0-3485-bb59749376792da886f093283cc8bbdb78c69612f13abcbcedbef00717030c90.jar
Check that the /user/yarn directory exists and that its owner:group is yarn:hdfs.
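A quick sketch of the checks, run as the hdfs superuser (the yarn:hdfs ownership matches what the error above expects; adjust if your layout differs):
# Check whether /user/yarn exists and who owns it.
hdfs dfs -ls /user | grep yarn
# If it is missing or owned incorrectly, create it and fix ownership:
hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:hdfs /user/yarn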
11-04-2016
03:06 PM
+1 - @Misti Mordoch - This seems to have fixed the problem for someone else I was talking to. You can also check for out-of-sync hosts with the REST API:
curl -u <USERNAME> http://<AMBARI_HOST>:8080/api/v1/clusters/<CLUSTER_NAME>/stack_versions/1
Then check for "OUT_OF_SYNC" : [ ] to make sure no hosts are listed as out of sync. Reinstalling services on the out-of-sync hosts, or removing and re-adding them, will probably fix it.
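If you just want a quick pass/fail, something like this works as a sketch (host, credentials, and cluster name are placeholders, and the exact JSON layout can vary by Ambari version):
# Pull the stack_versions resource and show the OUT_OF_SYNC host list.
curl -s -u admin:admin "http://ambari.example.com:8080/api/v1/clusters/MyCluster/stack_versions/1" \
  | grep -A 3 '"OUT_OF_SYNC"'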
07-28-2016
07:55 PM
Someone had added two entries to spark-defaults.conf, spark.yarn.keytab and spark.yarn.principal, which caused spark-shell and pyspark to run as the "spark" user in YARN. Removing them fixed it.
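For reference, a quick way to spot those entries (the conf path is the usual HDP location and the values shown are only illustrative):
# Look for the two properties that forced jobs to run as the "spark" user.
grep -E 'spark\.yarn\.(keytab|principal)' /etc/spark/conf/spark-defaults.conf
# Example of what was removed (illustrative values):
#   spark.yarn.keytab      /etc/security/keytabs/spark.headless.keytab
#   spark.yarn.principal   spark-mycluster@EXAMPLE.COM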
07-20-2016
03:53 PM
How can we get pyspark to submit YARN jobs as the end user? We have data in a private directory (700) that a user owns. He can select the data with HiveServer2's beeline, but when using pyspark he gets permission denied because the job is submitted as the "spark" user instead of as the end user. This is a kerberized cluster with Ranger Hive and HDFS plugins. He has access to the directory in question, just not with pyspark. He is mostly using Jupyter via JupyterHub, which uses PAM authentication, but I think he has also run this with bin/pyspark with the same results. Here is the code:
from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 4)
sc = SparkContext('yarn-client', 'myapp', conf=conf)
rdd = sc.textFile('/user/johndoe/.staging/test/student.txt')
rdd.cache()
rdd.count()
And the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/user/johndoe/.staging/test/student.txt":johndoe:hdfs:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:305)
Labels:
- Apache Spark