Member since: 03-11-2016
Posts: 73
Kudos Received: 16
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1050 | 08-21-2019 06:03 AM
 | 34827 | 05-24-2018 07:55 AM
 | 4563 | 04-25-2018 08:38 AM
 | 6282 | 01-23-2018 09:41 AM
 | 1995 | 10-11-2017 09:44 AM
12-12-2017
09:35 AM
@Gaurav Parmar Here is the documentation of the Cluster Applications API you are using. As you can see under "Query Parameters Supported", you can use four parameters to list jobs for a particular time frame:
- startedTimeBegin
- startedTimeEnd
- finishedTimeBegin
- finishedTimeEnd
All of them are specified in milliseconds since the epoch, so you have to convert your time interval to Unix timestamps. For example, last week runs from 2017-12-04 00:00:01 = 1512345601000 to 2017-12-10 23:59:59 = 1512950399000. To list all the applications that were started and finished within that week, you can use: http://hostname:8088/ws/v1/cluster/apps?startedTimeBegin=1512345601000&finishedTimeEnd=1512950399000&states=FINISHED
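If you would rather compute the timestamps programmatically, here is a minimal Python sketch; the hostname, port and dates are placeholders, not taken from your setup:
# Convert a date range to epoch milliseconds and query the RM REST API.
from datetime import datetime
from urllib.request import urlopen

def to_epoch_ms(s):
    # Note: the naive datetime is interpreted in the machine's local timezone
    return int(datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timestamp() * 1000)

started_after = to_epoch_ms("2017-12-04 00:00:01")
finished_before = to_epoch_ms("2017-12-10 23:59:59")

# hostname:8088 is a placeholder for your ResourceManager address
url = ("http://hostname:8088/ws/v1/cluster/apps"
       "?startedTimeBegin={0}&finishedTimeEnd={1}&states=FINISHED"
       .format(started_after, finished_before))
print(urlopen(url).read())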
11-29-2017
02:45 PM
1 Kudo
@Joe Karau What is the exact HDP version you are using? In 2.6 the -filters option should be available to exclude certain files. It is documented as "The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. Supports regular expressions specified by java.util.regex.Pattern." However, it's questionable whether the filtering happens before the exception. Can you give it a try? If it doesn't work, unfortunately I think the easiest way to fix this is to specify only the "correct" files to be copied.
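If it helps, here is a rough, untested sketch of how the -filters file could be wired up from Python; the patterns, paths and hostnames below are made-up examples, not taken from your setup:
# Hypothetical example: write an exclusion file (one java.util.regex.Pattern
# per line) and run distcp with -filters via subprocess.
import subprocess

filters_path = "/tmp/distcp-filters.txt"
with open(filters_path, "w") as f:
    f.write(".*\\.tmp$\n")        # exclude temporary files
    f.write(".*/_COPYING_.*\n")   # exclude files still being written

# Depending on your version/setup, the filters path may need to be a fully qualified URI.
subprocess.check_call([
    "hadoop", "distcp",
    "-filters", filters_path,
    "hdfs://source-nn:8020/data",
    "hdfs://target-nn:8020/data",
])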
10-11-2017
09:44 AM
@Saikiran Parepally I don't think queue-level preemption metrics exist. However, they are fairly easy to calculate from the app-level metrics:
curl http://YOUR_RM_ADDRESS.com:8088/ws/v1/cluster/apps > /tmp/apps
queues=$(cat /tmp/apps | jq '.apps.app[].queue' | sort -u)
for queue in $queues; do
  echo $queue
  metrics="preemptedResourceMB preemptedResourceVCores numNonAMContainerPreempted numAMContainerPreempted"
  for metric in $metrics; do
    printf "%30s: " $metric
    # Sum the metric over all apps that ran in this queue
    cat /tmp/apps | jq -r ".apps.app[] | select(.queue == $queue) .$metric" | paste -s -d+ - | bc
  done
done
Most likely there are more efficient ways to do this calculation in higher-level programming languages, or if you are a jq expert.
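For example, here is a Python sketch under the same assumptions (the REST response saved to /tmp/apps as above, the same four metric names):
# Aggregate the app-level preemption metrics per queue from the saved JSON dump.
import json
from collections import defaultdict

METRICS = ["preemptedResourceMB", "preemptedResourceVCores",
           "numNonAMContainerPreempted", "numAMContainerPreempted"]

with open("/tmp/apps") as f:
    apps = json.load(f)["apps"]["app"]

totals = defaultdict(lambda: defaultdict(int))
for app in apps:
    for metric in METRICS:
        totals[app["queue"]][metric] += app.get(metric, 0)

for queue, metrics in totals.items():
    print(queue)
    for name, value in metrics.items():
        print("%30s: %s" % (name, value))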
09-25-2017
02:42 PM
@raouia Based on your result.png, you are actually using Python 3 in Jupyter. You need the parentheses after print in Python 3 (but not in Python 2). To make sure, run this in your notebook:
import sys
print(sys.version)
08-23-2017
09:05 AM
1 Kudo
@pbarna I think the Java API should be the fastest. Something along these lines:
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirBenchmark {
    // Adjust the URI to point at your NameNode
    private static final String hdfsUri = "hdfs://your-namenode:8020";
    private static FileSystem fs;

    static class DirectoryThread extends Thread {
        private final int from;
        private final int count;
        private static final String basePath = "/user/d";

        public DirectoryThread(int from, int count) {
            this.from = from;
            this.count = count;
        }

        @Override
        public void run() {
            // Each thread creates its own slice of the directories
            for (int i = from; i < from + count; i++) {
                Path path = new Path(basePath + i);
                try {
                    fs.mkdirs(path);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        fs = FileSystem.get(URI.create(hdfsUri), new Configuration());
        long startTime = System.currentTimeMillis();
        int threadCount = 8;
        Thread[] threads = new Thread[threadCount];
        int total = 1000000;
        int countPerThread = total / threadCount;
        for (int j = 0; j < threadCount; j++) {
            Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
            thread.start();
            threads[j] = thread;
        }
        for (Thread thread : threads) {
            thread.join();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Total: " + (endTime - startTime) + " milliseconds");
    }
}
Obviously, use as many threads as you can. But still, this takes 1-2 minutes, so I wonder how @bkosaraju could "complete in few seconds with your code".
08-21-2017
07:57 AM
@swathi thukkaraju I'm not completely sure what you mean by 'incremental load format', but here are some hints.
To read FTP server files you can simply use the built-in Python module urllib, more specifically urlopen or urlretrieve.
To write to HDFS you can:
- Use an external library, like HdfsCLI
- Use the HDFS shell and call it from Python with subprocess
- Mount your HDFS with HDFS NFS Gateway and simply write with the normal write() method. Beware that with this solution you won't be able to append!
Here's an implementation for you using urlopen and HdfsCli. To try it, first install HdfsCli with pip install hdfs.
from urllib.request import urlopen
from hdfs import InsecureClient

# You can also use KerberosClient or a custom client
namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port'  # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'
client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)

ftp_address = 'your ftp address'
hdfs_path = 'where you want to write'

# Download the FTP file into memory
with urlopen(ftp_address) as response:
    content = response.read()

# You can also use append=True
# Further reference: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
with client.write(hdfs_path) as writer:
    writer.write(content)
08-17-2017
06:38 AM
The default queue's AM limit is 6144 MB, so the default queue's capacity must be 7 GB (with a maximum-am-resource-percent of 0.9, a 6 GB queue would have a 5 GB limit and an 8 GB queue a 7 GB limit). Since default.capacity = 60, the whole cluster's capacity is roughly 100 / 60 * 7 ≈ 11.7 GB, which could indicate 12 or 13 GB in total, but the latter would be very unusual. Did you manage to overcome your issue with any of my suggestions?
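Just to make the back-of-the-envelope arithmetic explicit, here is a tiny sketch; it assumes the AM limit is roughly the queue capacity times maximum-am-resource-percent (0.9 here) rounded down to whole GBs, which is a simplification of the Capacity Scheduler's exact formula:
# Rough check, not the exact Capacity Scheduler computation
MAX_AM_RESOURCE_PERCENT = 0.9

for queue_capacity_gb in (6, 7, 8):
    am_limit_mb = int(queue_capacity_gb * MAX_AM_RESOURCE_PERCENT) * 1024
    print(queue_capacity_gb, "GB queue ->", am_limit_mb, "MB AM limit")

# Only the 7 GB queue gives the observed 6144 MB limit; with default.capacity = 60%
# that points to about 100 / 60 * 7 of total cluster memory.
print("estimated cluster memory:", round(100 / 60 * 7, 1), "GB")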
08-07-2017
11:05 AM
@Karan Alang Based on this information I assume you have 12 GB of memory, the minimum allocation is set to 1024 MB, and the default queue has a configured capacity of 60%, i.e. 7.2 GB. The AM limit is 6 GB (7.2 * 0.9, rounded down to whole GBs), and it is full, probably because three other AMs are running. Please correct me if I'm wrong! To get more memory, you might try these things:
- Add more memory to the cluster 😛
- Increase the maximum-capacity of the default queue, so that it can use more resources when the LLAP queue doesn't use them
- Increase the maximum-am-resource-percent of the default queue to 1
- Decrease the minimum-allocation-mb: this way the other AMs (and containers) might use fewer resources (e.g. if you need 1.2 GB - just for the sake of the example - then with the default 1 GB minimum allocation you still need to get a 2 GB container)
- Kill other applications from the queue or wait until they finish
08-07-2017
05:59 AM
@Karan Alang Could you please share (at least the relevant part of) your Capacity Scheduler config and tell us how much memory your default queue in total should have? Based on your error, your default queue's AM limit is in fact exceeded.
07-20-2017
07:36 AM
@Bala Vignesh N V Your problem statement can be interpreted in two ways. The first (and for me more logical) way is that a movie has multiple genres, and you want to count how many movies each genre has:
genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()
We map each line into multiple output items (genres), that's why we use flatMap: first we split each line by ',' and take the 3rd column, then we split the genres by '|' and emit them. This gives you:
'Adventure': 2,
'Animation': 1,
'Children': 2,
'Comedy': 4,
'Drama': 1,
'Fantasy': 2,
'Romance': 2
Your 'SQL' query (select genres, count(*)) suggests another approach: counting the combinations of genres, for example movies that are Comedy AND Romance. In that case you can simply use:
genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()
This gives you:
'Adventure|Animation|Children|Comedy|Fantasy': 1,
'Adventure|Children|Fantasy': 1,
'Comedy': 1,
'Comedy|Drama|Romance': 1,
'Comedy|Romance': 1
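For reference, both snippets assume MovieLens-style lines where the third comma-separated column is the pipe-separated genre list. A hypothetical sample, just to illustrate the expected format (sc is your SparkContext):
# Made-up sample data in the expected format; in a real job you would use
# movies = sc.textFile('movies.csv') instead of parallelize().
sample = [
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy",
    "2,Jumanji (1995),Adventure|Children|Fantasy",
    "3,Grumpier Old Men (1995),Comedy|Romance",
]
movies = sc.parallelize(sample)
print(movies.flatMap(lambda line: line.split(',')[2].split('|')).countByValue())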