Member since: 01-25-2017
Posts: 396
Kudos Received: 28
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 830 | 10-19-2023 04:36 PM
 | 4355 | 12-08-2018 06:56 PM
 | 5450 | 10-05-2018 06:28 AM
 | 19822 | 04-19-2018 02:27 AM
 | 19844 | 04-18-2018 09:40 AM
06-12-2017
08:47 AM
Hi guys, is there an alternative to the --jars option of spark-submit in the Spark notebook in Hue?
05-24-2017
02:31 PM
For the recommission you can just add the DataNode back, and the NameNode will identify all the blocks that were previously present on this DataNode. Once the NameNode has identified this information, it will wipe out the third replica that it created during the DataNode decommission. You may have to run the HDFS balancer if you format the disks and then recommission the node to the cluster, which is not a best practice.
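A minimal sketch of the follow-up checks, assuming the default replication factor of 3 (the balancer threshold below is just an illustrative value):
```
# Confirm the recommissioned DataNode is live and reporting blocks again
hdfs dfsadmin -report

# Spot-check that the extra replicas created during decommission are being removed
hdfs fsck / -blocks -locations | tail -n 20

# Rebalance only if the disks were formatted before recommissioning
hdfs balancer -threshold 10
```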
05-11-2017
01:20 PM
1 Kudo
@mageru9 https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_rn_known_issues.html
05-09-2017
01:10 PM
@code0404 See how I'm doing this. I collect the aggregate metrics per pool, but you can just key on the application name or the user instead.
```
STARTDATE=`date -d " -1 day " +%s%N | cut -b1-13`
ENDDATE=`date +%s%N | cut -b1-13`
# Quote the URL, otherwise the shell treats '&' as a background operator
result=`curl -s "http://your-yarn-history-server:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "queue|coreSeconds" | awk -v DC="$DC" ' /queue/ { queue = $2 } /vcoreSeconds/ { arr[queue]+=$2 ; } END { for (x in arr) {print DC ".yarn." x ".cpums="arr[x]} } '
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "queue|memorySeconds" | awk -v DC="$DC" ' /queue/ { queue = $2 } /memorySeconds/ { arr1[queue]+=$2 ; } END { for (y in arr1) {print DC ".yarn." y ".memorySeconds="arr1[y]} } '
```
04-19-2017
02:39 PM
Man, it took a bit of trial and error. The issue with the first run is that it returns an empty line. I tried a few awk-specific ways to get around it, but they didn't work, so here is a hack. It also uses the shell variable within awk.
```
DC=PN
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
PN.hadoop.hdfs.archive.folderscount 9
PN.hadoop.hdfs.archive.filescount 103
PN.hadoop.hdfs.archive.size 928524788
PN.hadoop.hdfs.dae.folderscount 1
PN.hadoop.hdfs.dae.filescount 13
PN.hadoop.hdfs.dae.size 192504874
PN.hadoop.hdfs.schema.folderscount 1
PN.hadoop.hdfs.schema.filescount 14
PN.hadoop.hdfs.schema.size 45964
```
```
DC=VA
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
VA.hadoop.hdfs.archive.folderscount 9
VA.hadoop.hdfs.archive.filescount 103
VA.hadoop.hdfs.archive.size 928524788
VA.hadoop.hdfs.dae.folderscount 1
VA.hadoop.hdfs.dae.filescount 13
VA.hadoop.hdfs.dae.size 192504874
VA.hadoop.hdfs.schema.folderscount 1
VA.hadoop.hdfs.schema.filescount 14
VA.hadoop.hdfs.schema.size 45964
```
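A possibly cleaner variant of the same idea, offered only as an untested sketch: skip the "Found N items" header/empty line from `hdfs dfs -ls` before calling `hdfs dfs -count`, instead of chaining through `system()`.
```
# Sketch: print only directory paths, then count each one (same metric naming as above)
DC=PN
hdfs dfs -ls /lib/ | awk '/^d/ {print $8}' | while read -r dir; do
  hdfs dfs -count "$dir"
done | awk -v DC="$DC" '{
  gsub(/\/lib\//, DC ".hadoop.hdfs.", $4)
  print $4 ".folderscount", $1
  print $4 ".filescount", $2
  print $4 ".size", $3
}'
```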
04-10-2017
07:05 PM
Sorry for the late reply. I needed to test it after business hours. I set it to 2046 -> 1024 -> 512 -> 256 and found that network I/O drops as the value is reduced, so I'm sure the parameter is working. For anyone who refers to this ticket later: data transfer over the network will run at a very low speed if you set a low value for this parameter, so please adjust your job schedules accordingly.
04-02-2017
09:24 PM
1 Kudo
Thanks. Indeed, in my case the memory I assigned to the executor was overridden by the memory passed in the workflow, so the executors were running with 1 GB instead of 8 GB. I fixed it by passing the memory in the workflow XML.
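For reference, a hedged sketch of the equivalent spark-submit settings (the jar name and values are illustrative; in an Oozie workflow the same memory setting has to be carried in the Spark action's options rather than on the command line):
```
# Without an explicit setting, Spark executors fall back to the 1g default,
# which is what happened here when the workflow's value won over mine.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8g \
  your-app.jar
```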
03-21-2017
11:29 AM
I think it was probably unable to get enough memory because of other concurrently executing queries. This is somewhat counterintuitive, but if you set the mem_limit query option to an amount of memory that the query can reliably obtain, e.g. 2 GB, then when it hits that limit spill-to-disk will kick in and the query should be able to complete (albeit slower than running fully in-memory). We generally recommend that all queries run with a mem_limit set. You can configure a default mem_limit via the "default query options" config or by setting up memory-based admission control. We have some good docs about how to set up memory-based admission control here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_memory We're actively working on improving this so that it's more hands-off.
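A minimal sketch of setting the option for a single query from impala-shell (the host and table names are placeholders, and the 2 GB cap just mirrors the example above):
```
# Cap this query at 2 GB so it spills to disk instead of failing outright
impala-shell -i your-impalad-host -q "SET MEM_LIMIT=2g; SELECT count(*) FROM your_table;"

# Cluster-wide alternative: put mem_limit in the impalad default query options
# (e.g. -default_query_options='mem_limit=2g' or the equivalent Cloudera Manager
# setting), or use memory-based admission control per the doc linked above.
```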
03-05-2017
02:26 PM
When I checked the job/query that ran prior to the alert on the JournalNode (JN), I found one Hive query that runs over six months of data and recreates the Hive table from scratch, which accounted for a good percentage of the edit logs. I contacted the query owner and he reduced his running window from six months to two months, which solved the issue for us.