Member since 05-23-2016 | 9 Posts | 5 Kudos Received | 0 Solutions
08-29-2017 05:22 PM
Hi,

We have a default S3 bucket, say A, which is configured in core-site.xml. If we access that bucket from Spark in client or cluster mode, it works fine. For another bucket, say B, which is not configured in core-site.xml, it works fine in client mode but fails in cluster mode with the exception below. As a workaround we pass a core-site.xml that points to bucket B's jceks file, and that works. Why is the following property not honored in cluster mode? Let me know if we need to set any other property for cluster mode (a sketch of a possible programmatic alternative is appended below).

spark.hadoop.hadoop.security.credential.provider.path

Client mode (working fine):

spark-submit --class <ClassName> --master yarn --deploy-mode client --files .../conf/hive-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

Cluster mode (not working):

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

Exception:

diagnostics: User class threw exception: java.nio.file.AccessDeniedException: s3a://<bucketname>/server_date=2017-08-23: getFileStatus on s3a://<bucketname>/server_date=2017-08-23: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 92F94B902D52D864), S3 Extended Request ID: 1Df3YxG5znruRbsOpsGhCO40s4d9HKhvD14FKk1DSt//lFFuEdXjGueNg5+MYbUIP4aKvsjrZmw=

Cluster mode (workaround that works):

Step 1: cp /usr/hdp/current/hadoop-client/conf/core-site.xml /<home_dir>/core-site.xml
Step 2: Edit core-site.xml and replace jceks://hdfs/.../default.jceks with jceks://hdfs/.../1.jceks
Step 3: Pass core-site.xml to the spark-submit command:

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml,../conf/core-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --verbose --queue default <jar_path>

Thanks,
Subacini
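[Editor's sketch] As a possible alternative to shipping a modified core-site.xml, one could try setting the credential provider path programmatically on the driver's Hadoop configuration before touching bucket B. This is only a sketch under the assumption that the property is simply not making it into the Hadoop configuration used on the cluster side; the jceks path and bucket/prefix names are placeholders, not the real values from this cluster, and whether it helps in cluster mode is exactly the open question above, so treat it as something to try rather than a confirmed fix.

// Sketch only: set the credential provider path on the driver's Hadoop
// configuration at runtime, then access bucket B through s3a.
import org.apache.spark.{SparkConf, SparkContext}

object BucketBAccessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bucket-b-access-sketch"))

    // Same Hadoop property that spark.hadoop.* is meant to set, applied directly.
    sc.hadoopConfiguration.set(
      "hadoop.security.credential.provider.path",
      "jceks://hdfs/path/to/1.jceks")   // placeholder path

    // Subsequent s3a reads resolve the access/secret keys through the provider above.
    val rdd = sc.textFile("s3a://bucket-b/some/prefix")   // placeholder bucket and prefix
    println(rdd.count())

    sc.stop()
  }
}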
Labels: Apache Spark
02-03-2017 06:18 PM
Thank you, Ed. String works. The issue is addressed in Spark 2.1.0: https://issues.apache.org/jira/browse/SPARK-17388

Thanks,
Subacini
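[Editor's sketch] For reference, a minimal sketch of the call that is expected to succeed once the fix from SPARK-17388 is available, assuming Spark 2.1.0 or later with Hive support enabled; the table and partition values are taken from the repro in the reply below, and the application name is hypothetical.

import org.apache.spark.sql.SparkSession

// Sketch only: on Spark 2.1.0+, dropping a date-typed partition from Spark SQL
// is expected to work directly, per SPARK-17388.
object DropDatePartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drop-date-partition-sketch")   // hypothetical app name
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("ALTER TABLE spark_4_test DROP IF EXISTS PARTITION (server_date='2016-10-10')")

    spark.stop()
  }
}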
01-31-2017 06:38 PM
Hi Ed, Khushbhu,

Thanks for your reply. Sorry about the logs; I was using two tables, spark_2_test and spark_3_test, and I get the same error in both.

From the Hive shell, execute the commands below:

CREATE EXTERNAL TABLE spark_4_test(name string, dept string) PARTITIONED BY (server_date date) LOCATION '/xxx/yyy/spark4';
insert into table spark_4_test partition(server_date='2016-10-23') values ('a','d1');
insert into table spark_4_test partition(server_date='2016-10-10') values ('a','d1');

From spark-shell, execute the drop partition command. It fails:

hiveCtx.sql("ALTER TABLE spark_4_test DROP IF EXISTS PARTITION (server_date ='2016-10-10')")

Note: If the PARTITIONED BY column is a string, it works fine (see the sketch below). In my case, it is a date type.

Thanks,
Subacini
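[Editor's sketch] A minimal sketch of the string-typed partition workaround noted above, intended to be pasted into spark-shell on Spark 1.5.x where `sc` is already defined. The table name spark_5_test_str and the HDFS location are hypothetical placeholders, not objects from this thread.

// Paste into spark-shell (Spark 1.5.x); `sc` is provided by the shell.
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Same layout as the repro table, but with the partition column declared as string.
hiveCtx.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS spark_5_test_str (name string, dept string)
  PARTITIONED BY (server_date string)
  LOCATION '/xxx/yyy/spark5'""")

hiveCtx.sql("ALTER TABLE spark_5_test_str ADD IF NOT EXISTS PARTITION (server_date='2016-10-10')")

// With a string-typed partition column, this drop succeeds from spark-shell.
hiveCtx.sql("ALTER TABLE spark_5_test_str DROP IF EXISTS PARTITION (server_date='2016-10-10')")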
01-27-2017 02:36 PM
1 Kudo
Hi,

When we execute a drop partition command on a Hive external table from spark-shell, we get the error below. The same command works fine from the Hive shell.

Spark version: 1.5.2

**************************************************
The partition exists, and the drop partition command works fine in the Hive shell. I had 3 partitions, issued the Hive drop partition command, and it succeeded.

hive> show partitions spark_2_test;
OK
server_date=2016-10-10
server_date=2016-10-11
server_date=2016-10-13

hive> ALTER TABLE spark_2_test DROP PARTITION (server_date='2016-10-13');
Dropped the partition server_date=2016-10-13
OK
Time taken: 0.217 seconds

hive> show partitions spark_2_test;
OK
server_date=2016-10-10
server_date=2016-10-11
****************************************************
Executing the same from spark-shell throws "partition not found" even though the partition is present:

scala> hiveCtx.sql("show partitions spark_2_test").collect().foreach(println);
[server_date=2016-10-10]
[server_date=2016-10-11]

scala> hiveCtx.sql("ALTER TABLE spark_2_test DROP PARTITION (server_date='2016-10-10')")
17/01/26 19:28:39 ERROR Driver: FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-10)
org.apache.hadoop.hive.ql.parse.SemanticException: Partition not found (server_date = 2016-10-10)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.addTableDropPartsOutputs(DDLSemanticAnalyzer.java:3178)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableDropParts(DDLSemanticAnalyzer.java:2694)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:278)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:451)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:440)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:278)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:233)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:270)
at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:440)
at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:430)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:561)
at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line69.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $line69.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $line69.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $line69.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $line69.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $line69.$read$$iwC$$iwC$$iwC.<init>(<console>:37)
at $line69.$read$$iwC$$iwC.<init>(<console>:39)
at $line69.$read$$iwC.<init>(<console>:41)
at $line69.$read.<init>(<console>:43)
at $line69.$read$.<init>(<console>:47)
at $line69.$read$.<clinit>(<console>)
at $line69.$eval$.<init>(<console>:7)
at $line69.$eval$.<clinit>(<console>)
at $line69.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:685)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: MetaException(message:Unable to find class: 㐀org.apache.hadoop.hive.ql.udf.generic.G
Serialization trace:
typeInfo (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc))
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_expr_result$get_partitions_by_expr_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_expr_result$get_partitions_by_expr_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_expr_result.read(ThriftHiveMetastore.java)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_expr(ThriftHiveMetastore.java:2277)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_expr(ThriftHiveMetastore.java:2264)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByExpr(HiveMetaStoreClient.java:1130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy44.listPartitionsByExpr(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByExpr(Hive.java:2289)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.addTableDropPartsOutputs(DDLSemanticAnalyzer.java:3176)
... 77 more
17/01/26 19:28:39 ERROR ClientWrapper:
======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table spark_3_test already exists)
OK
OK
OK
OK
FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-10)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Partition already exists: Partition(values:[2016-10-10], dbName:default, tableName:spark_3_test, createTime:0, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dept, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:2, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[dept], sortCols:[Order(col:dept, order:1)], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:null))
FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-10)
OK
OK
OK
FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-23)
OK
FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-23)
OK
FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-10)
======================
END HIVE FAILURE OUTPUT
======================
org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10006]: Partition not found (server_date = 2016-10-10)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:455)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:440)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:278)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:233)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:270)
at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:440)
at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:430)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:561)
at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)

Thanks,
Subacini
Labels: Apache Spark
08-08-2016 10:18 PM
Hi,

We are using Kafka 0.9 and delete.topic.enable=true is set. After we delete a topic, it shows "marked for deletion" but still appears in the topic list. After restarting the brokers, it is gone. Does that mean deleting a topic requires a broker restart, or is some configuration property missing?

Thanks in advance.
Labels: Apache Kafka
06-14-2016 06:22 PM
Hi Yvora, I didn't set this property and didn't face any permission issue. We are using hsftp and captured packets during transit. The data is encrypted and communication happens over the secure ports [50470, 50475]. Please confirm.
05-25-2016 05:02 PM
2 Kudos
Hi,

I am trying to transfer HDFS files securely between two clusters using:

hadoop distcp hsftp://<host1>:50470/srcPath hdfs://<host2>:8020/destPath

Per https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Sys_Admin_Guides/content/ref-bbf68907-3bb1-4ef9-ba4e-aecb737a0222.1.html: "HSFTP, uses HTTPS by default. This means that data will be encrypted in transit."

The source cluster is secured with SSL set up on all nodes, and dfs.http.policy is set to HTTP_AND_HTTPS. On the destination cluster we have the truststore of the source cluster. I understand that when we run the distcp hsftp command on the destination cluster, it talks to the source NameNode on port 50470, which is secure. Does that mean the actual data transfer between DataNodes is also secure? If so, can someone explain how it works?
Labels: Apache Hadoop