Member since: 01-25-2017
Posts: 396
Kudos Received: 28
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 830 | 10-19-2023 04:36 PM
 | 4355 | 12-08-2018 06:56 PM
 | 5450 | 10-05-2018 06:28 AM
 | 19822 | 04-19-2018 02:27 AM
 | 19844 | 04-18-2018 09:40 AM
06-12-2017
08:47 AM
Hi guys, is there an alternative to the --jars option of spark-submit in the Spark notebook in Hue?
05-24-2017
02:31 PM
For the recommission you can just add the DataNode back, and the NameNode will identify all the blocks that were previously present on this DataNode. Once the NameNode has identified this information, it will wipe out the third replica that it created during the DataNode decommission. You may have to run the HDFS balancer if you format the disks and then recommission the node to the cluster, which is not a best practice.
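A minimal sketch of the follow-up checks, assuming the default replication factor of 3 (the balancer threshold below is just an illustrative value):
```
# Confirm the recommissioned DataNode is live and reporting blocks again
hdfs dfsadmin -report

# Spot-check that the extra replicas created during decommission are being removed
hdfs fsck / -blocks -locations | tail -n 20

# Rebalance only if the disks were formatted before recommissioning
hdfs balancer -threshold 10
```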
05-11-2017
01:20 PM
1 Kudo
@mageru9 https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_rn_known_issues.html
05-09-2017
01:10 PM
@code0404 See how I'm doing this. I collect the aggregate metrics per pool, but you can just key on the application name or the user instead.
```
STARTDATE=`date -d " -1 day " +%s%N | cut -b1-13`
ENDDATE=`date +%s%N | cut -b1-13`
# Quote the URL, otherwise the shell treats '&' as a background operator
result=`curl -s "http://your-yarn-history-server:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "queue|coreSeconds" | awk -v DC="$DC" ' /queue/ { queue = $2 } /vcoreSeconds/ { arr[queue]+=$2 ; } END { for (x in arr) {print DC ".yarn." x ".cpums="arr[x]} } '
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "queue|memorySeconds" | awk -v DC="$DC" ' /queue/ { queue = $2 } /memorySeconds/ { arr1[queue]+=$2 ; } END { for (y in arr1) {print DC ".yarn." y ".memorySeconds="arr1[y]} } '
```
04-19-2017
02:39 PM
Man, it took a bit of trial and error. The issue with the first run is that it returns an empty line. I tried a few awk-specific ways to get around it, but they didn't work, so here is a hack. It also uses the shell variable within awk.
```
DC=PN
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
PN.hadoop.hdfs.archive.folderscount 9
PN.hadoop.hdfs.archive.filescount 103
PN.hadoop.hdfs.archive.size 928524788
PN.hadoop.hdfs.dae.folderscount 1
PN.hadoop.hdfs.dae.filescount 13
PN.hadoop.hdfs.dae.size 192504874
PN.hadoop.hdfs.schema.folderscount 1
PN.hadoop.hdfs.schema.filescount 14
PN.hadoop.hdfs.schema.size 45964
```
```
DC=VA
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
VA.hadoop.hdfs.archive.folderscount 9
VA.hadoop.hdfs.archive.filescount 103
VA.hadoop.hdfs.archive.size 928524788
VA.hadoop.hdfs.dae.folderscount 1
VA.hadoop.hdfs.dae.filescount 13
VA.hadoop.hdfs.dae.size 192504874
VA.hadoop.hdfs.schema.folderscount 1
VA.hadoop.hdfs.schema.filescount 14
VA.hadoop.hdfs.schema.size 45964
```
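A possibly cleaner variant of the same idea, offered only as an untested sketch: skip the "Found N items" header/empty line from `hdfs dfs -ls` before calling `hdfs dfs -count`, instead of chaining through `system()`.
```
# Sketch: print only directory paths, then count each one (same metric naming as above)
DC=PN
hdfs dfs -ls /lib/ | awk '/^d/ {print $8}' | while read -r dir; do
  hdfs dfs -count "$dir"
done | awk -v DC="$DC" '{
  gsub(/\/lib\//, DC ".hadoop.hdfs.", $4)
  print $4 ".folderscount", $1
  print $4 ".filescount", $2
  print $4 ".size", $3
}'
```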
04-10-2017
07:05 PM
Sorry for the late reply. I needed to test it after business hours. I set it to 2046 -> 1024 -> 512 -> 256 and found that network I/O drops as the value is reduced, so I'm sure the parameter is working. For anyone who refers to this ticket later: data transfer over the network will run at a very low speed if you set a low value for this parameter, so please adjust your job schedules accordingly.
04-02-2017
09:24 PM
1 Kudo
Thanks. Indeed, in my case the memory I assigned to the executor was overridden by the memory passed in the workflow, so the executors were running with 1 GB instead of 8 GB. I fixed it by passing the memory in the workflow XML.
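For reference, a hedged sketch of the equivalent spark-submit settings (the jar name and values are illustrative; in an Oozie workflow the same memory setting has to be carried in the Spark action's options rather than on the command line):
```
# Without an explicit setting, Spark executors fall back to the 1g default,
# which is what happened here when the workflow's value won over mine.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8g \
  your-app.jar
```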
03-21-2017
11:29 AM
I think it was probably unable to get enough memory because of other concurrently executing queries. This is somewhat counterintuitive, but if you set the mem_limit query option to an amount of memory that the query can reliably obtain, e.g. 2 GB, then when it hits that limit spill-to-disk will kick in and the query should be able to complete (albeit slower than running fully in-memory). We generally recommend that all queries run with a mem_limit set. You can configure a default mem_limit via the "default query options" config or by setting up memory-based admission control. We have some good docs about how to set up memory-based admission control here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_memory We're actively working on improving this so that it's more hands-off.
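A minimal sketch of setting the option for a single query from impala-shell (the host and table names are placeholders, and the 2 GB cap just mirrors the example above):
```
# Cap this query at 2 GB so it spills to disk instead of failing outright
impala-shell -i your-impalad-host -q "SET MEM_LIMIT=2g; SELECT count(*) FROM your_table;"

# Cluster-wide alternative: put mem_limit in the impalad default query options
# (e.g. -default_query_options='mem_limit=2g' or the equivalent Cloudera Manager
# setting), or use memory-based admission control per the doc linked above.
```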
03-05-2017
02:26 PM
When I checked the job/query that ran prior to the alert on the JournalNode (JN), I found one Hive query that runs over six months of data and recreates the Hive table from scratch, which accounted for a good percentage of the edit logs. I contacted the query owner and he reduced his running window from six months to two months, which solved the issue for us.