Member since: 03-11-2016
Posts: 73
Kudos Received: 16
Solutions: 16
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1019 | 08-21-2019 06:03 AM |
| | 34229 | 05-24-2018 07:55 AM |
| | 4464 | 04-25-2018 08:38 AM |
| | 6174 | 01-23-2018 09:41 AM |
| | 1945 | 10-11-2017 09:44 AM |
12-12-2017
09:35 AM
@Gaurav Parmar Here is the documentation of the Cluster Applications API you are using. As you can see under "Query Parameters Supported", to list jobs for a particular time frame you can use 4 parameters: startedTimeBegin, startedTimeEnd, finishedTimeBegin, finishedTimeEnd. All the parameters are specified in milliseconds since the epoch, so you have to convert your time interval to a Unix timestamp. For example, last week is 2017-12-04 00:00:01 = 1512345601000 to 2017-12-10 23:59:59 = 1512950399000. To list all the applications that started and finished in that interval you can use:
http://hostname:8088/ws/v1/cluster/apps?startedTimeBegin=1512345601000&finishedTimeEnd=1512950399000&states=FINISHED
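If you prefer to do the conversion programmatically, here is a small Python sketch of the same call (the host name and port are placeholders, and it assumes the timestamps are meant in UTC):
from datetime import datetime, timezone
from urllib.request import urlopen

def epoch_ms(year, month, day, hour=0, minute=0, second=0):
    # Milliseconds since the epoch, which is what the API expects
    return int(datetime(year, month, day, hour, minute, second,
                        tzinfo=timezone.utc).timestamp() * 1000)

started_time_begin = epoch_ms(2017, 12, 4, 0, 0, 1)      # 1512345601000
finished_time_end = epoch_ms(2017, 12, 10, 23, 59, 59)   # 1512950399000

url = ('http://hostname:8088/ws/v1/cluster/apps'
       '?startedTimeBegin={}&finishedTimeEnd={}&states=FINISHED'
       .format(started_time_begin, finished_time_end))

with urlopen(url) as response:
    apps_json = response.read()   # JSON listing of the matching applications
print(apps_json[:200])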
11-29-2017
02:45 PM
1 Kudo
@Joe Karau What is the exact HDP version you are using? In 2.6 the -filters option should be available to exclude certain files. It is documented as "The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. Supports regular expressions specified by java.util.regex.Pattern." However, it's questionable whether the filtering happens before the exception. Can you give it a try? If it doesn't work, unfortunately I think the easiest way to fix this is to specify only the "correct" files to be copied.
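If you want to give -filters a try, here is a rough sketch of how it could be driven from Python (the source/target paths, NameNode addresses and exclude patterns are made-up placeholders, and I'm assuming distcp can read the filters file from the local filesystem):
import subprocess

# One regular expression per line; paths matching any pattern are excluded.
# These patterns are just examples, not taken from your setup.
with open('/tmp/distcp-filters.txt', 'w') as f:
    f.write('.*\\.tmp$\n')      # e.g. skip temporary files
    f.write('.*/_WIP/.*\n')     # e.g. skip work-in-progress directories

subprocess.run([
    'hadoop', 'distcp',
    '-filters', '/tmp/distcp-filters.txt',
    'hdfs://source-nn:8020/data',
    'hdfs://target-nn:8020/data',
], check=True)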
10-11-2017
09:44 AM
@Saikiran Parepally I don't think queue-level preemption metrics exist. However, they are fairly easy to calculate from the app-level metrics:
curl http://YOUR_RM_ADDRESS.com:8088/ws/v1/cluster/apps > /tmp/apps
queues=$(cat /tmp/apps | jq '.apps.app[].queue' | sort -u)
for queue in $queues; do
echo $queue
metrics="preemptedResourceMB preemptedResourceVCores numNonAMContainerPreempted numAMContainerPreempted"
for metric in $metrics; do
printf "%30s: " $metric
cat /tmp/apps | jq -r ".apps.app[] | select(.queue == $queue) .$metric" | paste -s -d+ - | bc
done
done
Most likely there are more efficient ways to do this calculation in higher-level programming languages, or if you are a jq expert.
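For example, a rough Python equivalent of the shell script above (YOUR_RM_ADDRESS is the same placeholder):
import json
from collections import defaultdict
from urllib.request import urlopen

METRICS = ['preemptedResourceMB', 'preemptedResourceVCores',
           'numNonAMContainerPreempted', 'numAMContainerPreempted']

with urlopen('http://YOUR_RM_ADDRESS.com:8088/ws/v1/cluster/apps') as response:
    apps = json.load(response)['apps']['app']

# Sum the app-level preemption metrics per queue
totals = defaultdict(lambda: defaultdict(int))
for app in apps:
    for metric in METRICS:
        totals[app['queue']][metric] += app.get(metric, 0)

for queue, metrics in totals.items():
    print(queue)
    for metric in METRICS:
        print('%30s: %d' % (metric, metrics[metric]))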
09-25-2017
02:42 PM
@raouia Based on your result.png, you are actually using Python 3 in Jupyter, and in Python 3 you need the parentheses after print (in Python 2 you don't). To make sure, run this in your notebook:
import sys
print(sys.version)
08-23-2017
09:05 AM
1 Kudo
@pbarna I think the Java API should be the fastest. Something like this (hdfsUri is a placeholder for your NameNode URI):
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelMkdirs {

    private static FileSystem fs;

    static class DirectoryThread extends Thread {
        private final int from;
        private final int count;
        private static final String basePath = "/user/d";

        public DirectoryThread(int from, int count) {
            this.from = from;
            this.count = count;
        }

        @Override
        public void run() {
            // Each thread creates its own slice of the directories
            for (int i = from; i < from + count; i++) {
                Path path = new Path(basePath + i);
                try {
                    fs.mkdirs(path);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String hdfsUri = "hdfs://your-namenode:8020"; // replace with your NameNode URI
        Configuration conf = new Configuration();
        fs = FileSystem.get(URI.create(hdfsUri), conf);

        long startTime = System.currentTimeMillis();
        int threadCount = 8;
        Thread[] threads = new Thread[threadCount];
        int total = 1000000;
        int countPerThread = total / threadCount;
        for (int j = 0; j < threadCount; j++) {
            Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
            thread.start();
            threads[j] = thread;
        }
        for (Thread thread : threads) {
            thread.join();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Total: " + (endTime - startTime) + " milliseconds");
    }
}
Obviously, use as many threads as you can. But still, this takes 1-2 minutes; I wonder how @bkosaraju could "complete in few seconds with your code".
08-21-2017
07:57 AM
@swathi thukkaraju I'm not completely sure what you mean by 'incremental load format', but here are some hints:
To read FTP server files you can simply use the built-in Python module urllib, more specifically urlopen or urlretrieve.
To write to HDFS you can:
- use an external library, like HdfsCLI,
- use the HDFS shell and call it from Python with subprocess (a rough sketch of this route follows the example below), or
- mount your HDFS with the HDFS NFS Gateway and simply write with the normal write() method. Beware that using this solution you won't be able to append!
Here's an implementation for you using urlopen and HdfsCli. To try it, first install HdfsCli with pip install hdfs.
from urllib.request import urlopen
from hdfs import InsecureClient
# You can also use KerberosClient or custom client
namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port' # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'
client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)
ftp_address = 'your ftp address'
hdfs_path = 'where you want to write'
with urlopen(ftp_address) as response:
content = response.read()
# You can also use append=True
# Further reference: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
with client.write(hdfs_path) as writer:
writer.write(content)
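For completeness, the subprocess route mentioned above could look roughly like this (a sketch; it assumes the hdfs client is on the PATH and that the file has already been downloaded to a local path first):
import subprocess

local_path = '/tmp/downloaded_from_ftp.csv'   # file already fetched from the FTP server
hdfs_path = '/where/you/want/to/write/downloaded_from_ftp.csv'

# -f overwrites the target if it already exists
subprocess.run(['hdfs', 'dfs', '-put', '-f', local_path, hdfs_path], check=True)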
08-17-2017
06:38 AM
The default queue's AM limit is 6144 MB -> the default queue's capacity must be 7 GB (for 6 GB the limit would be 5 and for 8 GB it would be 7, with a maximum-am-resource-percent of 0.9). Since default.capacity = 60, the whole cluster's capacity is roughly 100 / 60 * 7 GB, which could indicate 12 or 13 GB in total, but the latter would be very unusual. Did you manage to overcome your issue with any of my suggestions?
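For reference, the arithmetic above in a few lines of Python (my assumption here is that both the queue capacity and the AM limit are rounded down to whole GBs; the real CapacityScheduler rounds to multiples of the minimum allocation):
def am_limit_mb(cluster_gb, queue_capacity_pct=0.6, max_am_pct=0.9):
    queue_gb = int(cluster_gb * queue_capacity_pct)   # e.g. 12 GB * 0.6 -> 7 GB
    return int(queue_gb * max_am_pct) * 1024          # 7 GB * 0.9 -> 6 GB = 6144 MB

for cluster_gb in (10, 11, 12, 13):
    print(cluster_gb, am_limit_mb(cluster_gb))   # only 12 and 13 GB give 6144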
08-07-2017
11:05 AM
@Karan Alang Based on this information I assume you have 12 GB of memory and the minimum allocation is set to 1024 MB; the default queue has a configured capacity of 60%, i.e. 7 GB. The AM limit is 6 GB (7.2 * 0.9, rounded down to whole GBs), and it is full, probably because three other AMs are running. Please correct me if I'm wrong! To get more memory, you might try these things:
- Add more memory to the cluster 😛
- Increase the maximum-capacity of the default queue, so that it can use more resources when the LLAP queue doesn't use them.
- Increase the maximum-am-resource-percent of the default queue to 1.
- Decrease the minimum-allocation-mb: this way the other AMs (and containers) might use fewer resources (e.g. if you need 1.2 GB - just for the sake of the example - then with the default 1 GB minimum allocation you still need to get a 2 GB container; see the sketch below).
- Kill other applications from the queue or wait until they finish.
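A minimal sketch of that rounding (my simplified assumption: requests are rounded up to the next multiple of yarn.scheduler.minimum-allocation-mb, which is not the literal YARN code):
import math

def allocated_mb(requested_mb, minimum_allocation_mb=1024):
    return math.ceil(requested_mb / minimum_allocation_mb) * minimum_allocation_mb

print(allocated_mb(1228))                               # a ~1.2 GB request becomes a 2048 MB container
print(allocated_mb(1228, minimum_allocation_mb=512))    # with a 512 MB minimum it would be 1536 MB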
08-07-2017
05:59 AM
@Karan Alang Could you please share (at least the relevant part of) your Capacity Scheduler config and tell us how much memory your default queue should have in total? Based on your error, your default queue's AM limit is in fact exceeded.
07-20-2017
07:36 AM
@Bala Vignesh N V Your problem statement can be interpreted in two ways. The first (and for me more logical) way is that a movie has multiple genres, and you want to count how many movies each genre has:
genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()
We map each line into multiple output items (genres), that's why we use flatMap. First we split each line by ',' and take the 3rd column, then we split the genres by '|' and emit them. This gives you:
'Adventure': 2,
'Animation': 1,
'Children': 2,
'Comedy': 4,
'Drama': 1,
'Fantasy': 2,
'Romance': 2
Your 'SQL' query (select genres, count(*)) suggests another approach: counting the combinations of genres, for example movies that are Comedy AND Romance. In that case you can simply use:
genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()
This gives you:
'Adventure|Animation|Children|Comedy|Fantasy': 1,
'Adventure|Children|Fantasy': 1,
'Comedy': 1,
'Comedy|Drama|Romance': 1,
'Comedy|Romance': 1