
HDFS [PathNotFound error]: blocks of the actual file being deleted


Hi,

We are using HDP (3.1.0.0-78) launched via Cloudbreak, and we use Spark over HDFS for our ETL. In this flow, Spark table data sometimes gets deleted, so the next Spark query either gets a PathNotFound error or reads empty data. Spark writes the files first to a job staging directory and renames them to the final destination after the job is completed (in the job commit).
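To trace a single part file through that flow, I pull its full trail from the NameNode audit log. A minimal sketch, assuming the default hdfs-audit.log on the NameNode host (the individual entries are quoted in the steps below):

# All audit events (create, rename, delete) for one part file
grep part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000 hdfs-audit.log | grep -E "cmd=(create|rename|delete)"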

I can see the block deletion logs in the corresponding DataNode for those files.

The deletion logs look something like this:

2019-11-29 08:38:39,387 INFO  impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling blk_1073817155_76331 replica FinalizedReplica, blk_1073817155_76331, FINALIZED

2019-11-29 08:38:39,402 INFO  impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-640580566-10.99.135.34-1575013055635 blk_1073817155_76331 URI file:/hadoopfs/fs1/hdfs/datanode/current/BP-640580566-10.99.135.34-1575013055635/current/finalized/subdir1/subdir6/blk_1073817155
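To find these on the DataNode side, I grep the DataNode logs for the block ID. The log path below is the default HDP location and may differ per cluster:

# Every replica event (receive, finalize, delete) for this block on a DataNode
grep blk_1073817155 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log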

 

The following are the logs for the file to which this block belongs.

1) The following entry in hdfs-audit.log shows the creation of the file in the task's temporary directory:

grep  part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000 hdfs-audit.log  | grep create

2019-11-29 08:12:50,927 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.20        cmd=create      src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/_temporary/0/_temporary/attempt_20191129081250_0185_m_000000_0/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000     dst=null        perm=livy:hdfs:rw-r--r--        proto=rpc       callerContext=SPARK_TASK_application_1575013160270_0010_1_JId_135_SId_185_0_TId_9888_0

 

2) The corresponding NameNode log entries for the HDFS block allocated to this file are:

2019-11-29 08:12:50,929 INFO  hdfs.StateChange (FSDirWriteFileOp.java:logAllocatedBlock(790)) - BLOCK* allocate blk_1073817155_76331, replicas=10.99.135.28:50010, 10.99.135.56:50010 for /apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/_temporary/0/_temporary/attempt_20191129081250_0185_m_000000_0/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000

2019-11-29 08:12:50,963 INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2865)) - DIR* completeFile: /apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/_temporary/0/_temporary/attempt_20191129081250_0185_m_000000_0/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000 is closed by DFSClient_attempt_20191129074726_0004_m_000000_0_-1065101445_33

 

3) Block creation logs on one of the DataNodes:

2019-11-29 08:12:50,931 INFO  datanode.DataNode (DataXceiver.java:writeBlock(738)) - Receiving BP-640580566-10.99.135.34-1575013055635:blk_1073817155_76331 src: /10.99.135.20:34394 dest: /10.99.135.28:50010

2019-11-29 08:12:50,962 INFO  DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1533)) - src: /10.99.135.20:34394, dest: /10.99.135.28:50010, bytes: 306, op: HDFS_WRITE, cliID: DFSClient_attempt_20191129074726_0004_m_000000_0_-1065101445_33, offset: 0, srvID: f1fb608e-f140-4736-ab22-2314840beea5, blockid: BP-640580566-10.99.135.34-1575013055635:blk_1073817155_76331, duration(ns): 17248141

2019-11-29 08:12:50,962 INFO  datanode.DataNode (BlockReceiver.java:run(1506)) - PacketResponder: BP-640580566-10.99.135.34-1575013055635:blk_1073817155_76331, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[10.99.135.56:50010] terminating

 

4) After the write and the block are complete, there are two rename commands for this file (hdfs-audit.log).

a) The first rename commits the file from the task directory to the job staging directory:

2019-11-29 08:12:50,967 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.20        cmd=rename      src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/_temporary/0/_temporary/attempt_20191129081250_0185_m_000000_0/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000     dst=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000        perm=livy:hdfs:rw-r--r--        proto=rpc       callerContext=SPARK_TASK_application_1575013160270_0010_1_JId_135_SId_185_0_TId_9888_0

b) The second rename moves the file from the job staging directory to the final destination:

2019-11-29 08:12:53,765 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.44        cmd=rename (options=[OVERWRITE])        src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000        dst=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000     perm=livy:hdfs:rw-r--r--        proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0010_1
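When the next query fails, one way I can check whether the final file still has its blocks is to run fsck against the destination path from the rename above (this is a verification step on my side, not something taken from the logs):

hdfs fsck /apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/part-00000-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36-c000 -files -blocks -locations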

 

 

5) After this I don't see any requests to delete this file or its parent folder.

But at "2019-11-29 08:38:39,387"  I can see Aysc block deletion logs in the corresponding datanodes

2019-11-29 08:38:39,387 INFO  impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling blk_1073817155_76331 replica FinalizedReplica, blk_1073817155_76331, FINALIZED
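To see what made the NameNode invalidate this block, I also grep the NameNode log for the block ID around that time. This is a rough sketch; the log path is the default HDP location and the exact message wording may differ by version:

# NameNode-side invalidation/deletion activity for this block
grep blk_1073817155 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | grep -iE "invalidat|delete"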

 

Here are all delete requests for this folder in hdfs-audit.log; there is no delete request for the above file at that time:

grep milestones_repartition hdfs-audit.log | grep delete

2019-11-29 08:12:53,733 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.44        cmd=delete      src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/_temporary  dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0010_1

2019-11-29 08:12:53,742 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.44        cmd=delete      src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11/-ext-10000/.spark-staging-3d064e1e-f2ac-4b3f-a47f-f8e67095ed36 dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0010_1

2019-11-29 08:12:53,789 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.44        cmd=delete      src=/apps/spark/warehouse/temporal_engine__hive_intermediate_15489__full.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-12-32_852_6021558610039812629-11        dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0010_1

2019-11-29 08:14:56,929 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.70        cmd=delete      src=/apps/spark/warehouse/temporal_engine__pre_dim_intermediate_15489.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-14-50_530_8634948911485589102-9/-ext-10000/_temporary      dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0081_1

2019-11-29 08:14:56,934 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.70        cmd=delete      src=/apps/spark/warehouse/temporal_engine__pre_dim_intermediate_15489.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-14-50_530_8634948911485589102-9/-ext-10000/.spark-staging-abfb6223-6d5f-4909-a0ab-e725b2629f23     dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0081_1

2019-11-29 08:14:56,972 INFO FSNamesystem.audit: allowed=true   ugi=livy (auth:SIMPLE)  ip=/10.99.135.70        cmd=delete      src=/apps/spark/warehouse/temporal_engine__pre_dim_intermediate_15489.db/milestones_repartition/.hive-staging_hive_2019-11-29_08-14-50_530_8634948911485589102-9    dst=null        perm=null       proto=rpc       callerContext=SPARK_APPMASTER_application_1575013160270_0081_1

 

This is not always reproducible. Please help me here; we are stuck with this intermittent issue.

 

Regards,

Satish Tvv.


Hi,

Can someone please provide at least some insight into where to look further, or should I raise this with the Apache Hadoop community? My suspicion is that the delete executes before the rename is complete, which seems unlikely, but it is the only explanation I can think of. I'm not able to validate this suspicion because the issue is not always reproducible; when it occurs, YARN usage goes to 100%, but not vice versa.
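To test that ordering hypothesis the next time it happens, I could pull every rename and delete in the audit log during the window leading up to the block invalidation. A sketch (the timestamp prefix is just the failure window from this run):

grep "^2019-11-29 08:3" hdfs-audit.log | grep -E "cmd=rename|cmd=delete" | grep milestones_repartition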

I appreciate your help.

 

Thanks,

Satish Tvv.


The issue is not with HDFS; it's with the Hive metastore. I posted the question here: https://community.cloudera.com/t5/Support-Questions/create-database-race-condition-in-Hive-metastore...