Member since: 06-01-2016
Posts: 5
Kudos Received: 0
Solutions: 0
01-23-2018
09:03 AM
In case this is useful for others.
HDFS got corrupted at some stage. I ran an fsck -delete, but ended up in an unstable situation: all the data directories were completely full on every node.
This is related to the block scanner, a DataNode facility that scans all blocks and performs the necessary verification.
By default it only runs every 3 weeks because of the disk and IO intensity of a full scan. So to reclaim those block pool files you have to trigger the block scanner, which is not possible from the command line.
One option is to set dfs.datanode.scan.period.hours to 1. You may also consider deleting the scanner.cursor files (rm -rf `locate scanner.cursor`) and then restarting the DataNode.
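A minimal sketch of those steps, assuming an Ambari-managed HDP cluster (paths and the restart method may differ in your environment):
# Sketch of the workaround above, to be repeated on each affected DataNode.
# 1. Set dfs.datanode.scan.period.hours = 1 in hdfs-site (default is 504 hours = 3 weeks),
#    e.g. through Ambari > HDFS > Configs.
# 2. Optionally clear the scanner cursor files so the scan starts over
#    ('locate' may need an 'updatedb' first).
rm -f $(locate scanner.cursor)
# 3. Restart the DataNode (from Ambari or with your usual method) to pick up the new period.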
http://hadoopinrealworld.com/datanode-block-scanner/
https://community.hortonworks.com/questions/6931/in-hdfs-why-corrupted-blocks-happens.html
https://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/
01-22-2018
02:37 PM
HDP 2.6.3.0-235. One of my Hadoop data directories is full on every cluster instance (the same drive every time), 100% usage. I have deleted almost all data in HDFS with -skipTrash plus an expunge. I even tried rebooting all the boxes, but the directory is still full on every cluster member. When I dive into the directory structure I can see that it is the HDFS block pool area.
>hdfs dfs -du /
45641 /app-logs
247478401 /apps
92202 /ats
950726849 /hdp
0 /livy-recovery
0 /livy2-recovery
0 /mapred
0 /mr-history
0 /project
5922 /spark-history
0 /spark2-history
2 /system
98729320 /tmp
981081678 /user
0 /webhdfs
>hdfs dfs -df /
Filesystem Size Used Available Use%
hdfs://X:8020 412794792448 186773504950 149000060339 45%
====
If I go down into the data directory I end up finding block pool files that are not known when you try to fsck them by blockId, while others are.
>cd /hadoop/hdfs/data/current/BP-1356934633-X.X.X.X-1513618933915/current/finalized/subdir0/subdir150/
>ls
blk_1073780387 blk_1073780392 blk_1073780395 blk_1073780463 blk_1073780475
blk_1073780387_39569.meta blk_1073780392_39574.meta blk_1073780395_39577.meta blk_1073780463_39645.meta blk_1073780475_39657.meta
>hdfs fsck -locations -files -blockId blk_1073780463
Connecting to namenode via http://X.X.X.X:50070/fsck?ugi=hdfs&locations=1&files=1&blockId=blk_1073780463+&path=%2F
FSCK started by hdfs (auth:X) from /X.X.X.X at Mon Jan 22 14:30:02 GMT 2018
Block blk_1073780463 does not exist
=====
Has anyone ever seen something like that? It sounds as if the file was deleted in the NameNode but not on the file system. Is there a command to run to check that integrity, and/or can I delete any blk_nnnnn file that is not known when doing fsck? Thanks in advance for your help.
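For reference, the check I am doing by hand amounts to something like the loop below (just a sketch, run from inside one of the finalized subdirs shown above); it flags the block files that fsck no longer recognises:
# Sketch: for every blk_<id> data file in the current directory (meta files skipped),
# ask fsck whether the NameNode still knows the block; flag the ones it does not.
for f in blk_*; do
  case "$f" in *.meta) continue ;; esac
  if hdfs fsck -blockId "$f" 2>/dev/null | grep -q "does not exist"; then
    echo "UNKNOWN TO NAMENODE: $f"
  fi
done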
Labels: Apache Hadoop
04-04-2017
12:18 PM
Faced a similar issue with HDP 2.5.3. The article is good, but it might be worth having Ambari itself run yum -y erase hdp-select; otherwise, each time the installation fails and you have to retry, you hit the issue again.
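Concretely, running the cleanup on every node before retrying could look like the sketch below, assuming a hosts.txt listing the cluster nodes and passwordless ssh as root (both are assumptions for the illustration, not something the article sets up):
# Hypothetical cleanup loop: erase hdp-select on every node before retrying the install.
# hosts.txt and passwordless root ssh are assumptions of this sketch.
for h in $(cat hosts.txt); do
  ssh "root@$h" "yum -y erase hdp-select"
done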
09-21-2016
01:14 PM
Is there a clean and good way to trigger the execution of a script or an Oozie workflow when a file has finished landing in HDFS on HDP? I can't use NiFi, so please don't answer NiFi. Going around the forums, I only found people saying "not available in the current HDFS API" or people building an Oozie job that polls a directory on a regular basis. The issue is that the more directories you have to watch, the more polling jobs you need, which is a waste of resources; polling also always adds a processing delay that has to be balanced against the unnecessary workload of a higher polling frequency. The best approach would be to be notified when a file is saved and to match it against a regexp. The idea below is definitely not enterprise class and is based on parsing the NameNode log; is there a better and cleaner way, and has anything been missed?
The idea is to monitor /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log for the *"INFO hdfs.StateChange (FSNamesystem.java:completeFile"*"is closed by"* messages. The code could look like the one below and detects that a file has been created in a given directory or directory tree.
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | while read -r line; do
case "$line" in
*"INFO hdfs.StateChange (FSNamesystem.java:completeFile"*"completeFile"*"is closed by"*)
v_filename=`echo $line | sed -e 's?^.* completeFile: \(.*\) is closed by.*?\1?' `
v_dirname=`dirname $v_filename`
echo File created [$v_filename] Dirname [$v_dirname]
#echo line $line
case "$v_dirname" in
"/data/ingest"* )
echo WATCH DIRECTORY directory $v_dirname : file $v_filename
#FileTriggerExec.sh $v_dirname $v_filename
;;
esac
;;
esac
done
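The commented-out FileTriggerExec.sh above is only a placeholder; a hypothetical sketch of it, handing the detected file to an Oozie workflow (the Oozie URL, job.properties path, and the input_file property are all made-up placeholders):
#!/bin/bash
# Hypothetical FileTriggerExec.sh: receives the directory and file detected by the watcher
# and submits an Oozie workflow with the file as a parameter. All paths/URLs are placeholders.
v_dirname="$1"
v_filename="$2"
oozie job -oozie http://oozie-host:11000/oozie \
  -config /etc/ingest/job.properties \
  -Dinput_file="$v_filename" -run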
Any comments, improvements, or a more industrial solution?
Labels: Apache Hadoop
06-01-2016
05:40 AM
Currently reading different papers and articles, I'm wondering if there is a well-known set of good tools/patterns to transfer, process, and land large ingest files & logs on HDFS. I saw this article, but apart from recommending NiFi, are there other solutions? Currently we use SFTP, but it is not a parallel FTP and may face performance issues depending on size and latency. I had a look at Flume, but unfortunately it sounds like a non-production idea to use Flume to transfer gzipped files: you have to use a blob deserializer that loads the whole file into memory. I'm a little surprised that nothing exists out of the box to chunk a file and send the data in parallel over several TCP connections. That kind of code likely exists for video transfer, and I'm wondering if someone somewhere in Apache has incorporated such code to transfer large log files and land them on HDFS. Any hints or opinions welcome: confirm that Flume isn't appropriate or provide a configuration for it (many people asking but no firm config so far), or suggest any other tools or patterns.
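To make the "chunk a file and send it in parallel" idea concrete, here is a rough sketch of what I have in mind (an illustration only, not something we run; source file, landing directory, and chunk size are placeholders):
# Hypothetical sketch: split a large local file into chunks and push the chunks to HDFS
# concurrently, so the transfer uses several connections instead of one.
SRC=/data/in/big_logs.gz          # placeholder source file
DST=/data/ingest/big_logs.gz.d    # placeholder HDFS landing directory
split -b 512M -d "$SRC" /tmp/chunk_
hdfs dfs -mkdir -p "$DST"
for f in /tmp/chunk_*; do
  hdfs dfs -put "$f" "$DST/$(basename "$f")" &
done
wait    # all chunk uploads done; downstream can reassemble or process the chunks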
Labels: Apache Flume