Member since: 03-11-2016
Posts: 73
Kudos Received: 16
Solutions: 16

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 545 | 08-21-2019 06:03 AM
 | 24963 | 05-24-2018 07:55 AM
 | 2826 | 04-25-2018 08:38 AM
 | 4165 | 01-23-2018 09:41 AM
 | 1143 | 10-11-2017 09:44 AM
08-21-2019
06:03 AM
1 Kudo
This might be caused by NIFI-5525. Check for double quotes in your CSV. Either remove them or update NiFi to >=1.8.0.
06-01-2018
08:27 AM
@RAUI wholeTextFile() is not part of the HDFS API. I'm assuming you're using Spark, which I'm not too familiar with, so I suggest you post a separate question about this on HCC.
05-31-2018
02:48 PM
1 Kudo
@RAUI No, it won't create it, the target directory must exist. However, if the target directory doesn't exist, it won't throw an exception, it will only indicate the error via the return value (as described in the documentation). So 1) you should create the target directory before you call rename() and 2) you should check the return value, like this:
fs.mkdirs(new Path("/your/target/path"));
boolean result = fs.rename(
    new Path("/your/source/path/your.file"),
    new Path("/your/target/path/your.file"));
if (!result) {
    ...
}
05-24-2018
07:55 AM
@RAUI The answer is no. Renaming is the way to move files on HDFS: FileSystem.rename(). Actually, this is exactly what the HDFS shell command "-mv" does as well; you can check it in the source code. If you think about it, it's pretty logical: when you move a file on the distributed file system, you don't really move any blocks of the file, you just update the file's "path" metadata in the NameNode.
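For reference, the equivalent shell command looks like this (the paths are just placeholders):
hdfs dfs -mv /user/you/source/your.file /user/you/target/your.file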
05-17-2018
08:43 AM
You can figure that out from the RM UI, or the RM/application logs.
05-17-2018
08:27 AM
@Himanshu Kukreja It's hard to figure out from your screenshot what kind of applications these are. I'd recommend digging deeper and finding out as much about them as you can: more application info, application logs, etc. Then you should be able to stop them from spawning.
05-15-2018
07:49 AM
@Dinesh Chitlangia Unfortunately the native build on OS X is broken by HDFS-13403 at this moment on trunk. You have two options:
If you don't need the native build, you can simply build Hadoop without the -Pnative option.
The build issue is fixed by HDFS-13534, but it's not merged yet (at the time of writing this answer). You can either wait until it gets merged, or apply it manually:
wget https://issues.apache.org/jira/secure/attachment/12922534/HDFS-13534.001.patch
git apply HDFS-13534.001.patch
04-25-2018
08:38 AM
1 Kudo
@Manikandan Jeyabal Your question is not quite clear to me. If you really want to fetch data from the YARN Resource Manager REST API in Java, all you need to do is open an HttpURLConnection and get the data from any endpoint. E.g.:
URL url = new URL("http://" + rmHost + ":8088/ws/v1/cluster/apps");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
... // read and process your data
conn.disconnect();
But there is a much easier solution to get data from the RM in Java: YarnClient, which is basically a Java API for YARN.
YarnClient yarnClient = YarnClient.createYarnClient();
Configuration conf = new YarnConfiguration();
conf.set("yarn.resourcemanager.hostname", "your RM hostname");
yarnClient.init(conf);
yarnClient.start();
for (ApplicationReport applicationReport : yarnClient.getApplications()) {
    System.out.println(applicationReport.getApplicationId());
}
03-08-2018
09:55 AM
@Jon Page Could you please provide more information about what kind of "job" you are trying to run through YARN? Are you using Spark? A custom native YARN app? Distributed Shell? Knit? Nevertheless, you could run a simple distributed shell app to see which python version YARN picks up:
yarn jar path/to/hadoop-yarn-applications-distributedshell.jar -jar path/to/hadoop-yarn-applications-distributedshell.jar -shell_command python -shell_args -V
Or you can check the same with the framework you are using.
01-23-2018
02:01 PM
@Anton P I'm glad it works. I'm not sure how exactly the "fair" ordering policy works within one queue, but preemption only happens between queues. I assume it will try to give resources to the applications/users in the same queue equally, but once a container is running it will not preempt it. If you would like to achieve that, you should consider creating sub-queues.
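If you go the sub-queue route, a minimal sketch of the Capacity Scheduler properties could look like this (the queue names and percentages are just placeholders):
yarn.scheduler.capacity.root.yourqueue.queues=sub1,sub2
yarn.scheduler.capacity.root.yourqueue.sub1.capacity=50
yarn.scheduler.capacity.root.yourqueue.sub2.capacity=50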
01-23-2018
09:41 AM
@Anton P You are doing everything just fine, this is by design. The "Ordering Policy" can indeed only be set for leaf queues, because it defines the ordering policy between applications in the same queue. So it has nothing to do with your use case. "I try to run two Yarn queues where if only one queue is active it will consume all the resources and once a job will arrive to the second queue Yarn will preempt some of the resources of the first queue to start the second job." To achieve this, you need to configure your queues like this (I think you already did this):
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.maximum-capacity=100
... and enable preemption (as described in the article you attached; the relevant properties are sketched below). This will let the first application in the first queue use all the resources until the second job arrives in the second queue; then the resources will be divided equally between the two queues. Hope this makes everything clear, give it a try 🙂
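A rough sketch of the preemption-related properties for yarn-site.xml (I'm going from memory here, so double-check them against the article and your HDP version):
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy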
01-03-2018
06:44 PM
@Jon Page You can't move just the usercache directory, but you can move its parent: set yarn.nodemanager.local-dirs to a different location.
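For example (the path is just a placeholder), in yarn-site.xml:
yarn.nodemanager.local-dirs=/new/disk/hadoop/yarn/local
The NodeManagers need a restart to pick up the change.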
01-03-2018
09:08 AM
@Nishant Verma In the first case your 'dummy' user can't authenticate to the Timeline Server to get the Delegation Token. In the second case you missed an important step: you need to kinit with your spark/hdfs user to get a TGT. You need to use a password or a keytab file to do so.
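For example, with a keytab (the keytab path and principal are just placeholders for your environment):
kinit -kt /etc/security/keytabs/spark.headless.keytab spark@YOUR.REALM
You can verify the resulting TGT with klist.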
12-12-2017
04:48 PM
@Amithesh Merugu Try to use the IP address of the NameNode. And also add the port (default is 8020).
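Something like this (the IP address is just a placeholder):
hdfs://10.0.0.1:8020/user/username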
12-12-2017
10:35 AM
@Amithesh Merugu Use this method: copyFromLocalFile(Path src, Path dst). The first parameter is a path on your local disk (in your example /tmp/files) and the second is the HDFS path (hdfs://user/username). The documentation doesn't make it clear, but the source can be a directory, in which case its whole content is copied to HDFS.
FileSystem fs = FileSystem.get(hdfsUri, conf);
fs.copyFromLocalFile(new Path("/tmp/files"), new Path("/user/username"));
12-12-2017
09:35 AM
@Gaurav Parmar Here is the documentation of the Cluster Applications API you are using. As you can see under "Query Parameters Supported", to list jobs for a particular time frame you can use 4 parameters:
startedTimeBegin
startedTimeEnd
finishedTimeBegin
finishedTimeEnd
All the parameters are specified in milliseconds since the epoch, so you have to convert your time interval to a Unix timestamp. For example, last week is 2017-12-04 00:00:01 = 1512345601000 to 2017-12-10 23:59:59 = 1512950399000. To list all the applications that were started and finished in that week, you can use:
http://hostname:8088/ws/v1/cluster/apps?startedTimeBegin=1512345601000&finishedTimeEnd=1512950399000&states=FINISHED
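If you need to compute such timestamps, GNU date can do the conversion for you, for example:
date -u -d "2017-12-04 00:00:01" +%s
This prints 1512345601; multiply by 1000 to get the milliseconds value 1512345601000.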
11-29-2017
02:45 PM
1 Kudo
@Joe Karau What is the exact HDP version you are using? In 2.6 the -filters option should be available to exclude certain files. It is documented as "The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. Supports regular expressions specified by java.util.regex.Pattern." However, it's questionable whether the filtering happens before the exception. Can you give it a try? If it doesn't work, unfortunately I think the easiest way to fix this is to specify only the "correct" files to be copied.
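If the option is available in your version, the invocation would look roughly like this, assuming a local patterns file with one java.util.regex pattern per line (the paths and the pattern are just examples):
echo '.*\.tmp$' > /tmp/exclude-patterns.txt
hadoop distcp -filters /tmp/exclude-patterns.txt hdfs://source-nn:8020/src/dir hdfs://target-nn:8020/dst/dir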
10-11-2017
09:44 AM
@Saikiran Parepally I don't think queue-level preemption metrics exist. However, they are fairly easy to calculate from the app-level metrics:
curl http://YOUR_RM_ADDRESS.com:8088/ws/v1/cluster/apps > /tmp/apps
queues=$(cat /tmp/apps | jq '.apps.app[].queue' | sort -u)
for queue in $queues; do
echo $queue
metrics="preemptedResourceMB preemptedResourceVCores numNonAMContainerPreempted numAMContainerPreempted"
for metric in $metrics; do
printf "%30s: " $metric
cat /tmp/apps | jq -r ".apps.app[] | select(.queue == $queue) .$metric" | paste -s -d+ - | bc
done
done
Most likely there are more efficient ways to do this calculation in a higher-level programming language, or in jq itself if you are a jq expert.
09-25-2017
02:42 PM
@raouia Based on your result.png, you are actually using Python 3 in Jupyter; you need the parentheses after print in Python 3 (and not in Python 2). To make sure, run this in your notebook:
import sys
print(sys.version)
09-11-2017
07:52 AM
@jpj To the best of my knowledge, MAPREDUCE-6304 is only part of HDP-2.6.2.0. Please check this document to upgrade.
08-23-2017
09:05 AM
1 Kudo
@pbarna I think the Java API should be the fastest:
FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);
class DirectoryThread extends Thread {
    private int from;
    private int count;
    private static final String basePath = "/user/d";

    public DirectoryThread(int from, int count) {
        this.from = from;
        this.count = count;
    }

    @Override
    public void run() {
        for (int i = from; i < from + count; i++) {
            Path path = new Path(basePath + i);
            try {
                fs.mkdirs(path);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
long startTime = System.currentTimeMillis();
int threadCount = 8;
Thread[] threads = new Thread[threadCount];
int total = 1000000;
int countPerThread = total / threadCount;
for (int j = 0; j < threadCount; j++) {
    Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
    thread.start();
    threads[j] = thread;
}
for (Thread thread : threads) {
    thread.join();
}
long endTime = System.currentTimeMillis();
System.out.println("Total: " + (endTime - startTime) + " milliseconds");
Obviously, use as many threads as you can. But still, this takes 1-2 minutes; I wonder how @bkosaraju could "complete in few seconds with your code".
08-21-2017
08:06 AM
.*(?<!History)$
08-21-2017
07:57 AM
@swathi thukkaraju I'm not completely sure what you mean by 'incremental load format', but here are some hints.
To read FTP server files you can simply use the built-in Python module urllib, more specifically urlopen or urlretrieve.
To write to HDFS you can:
Use an external library, like HdfsCLI
Use the HDFS shell and call it from Python with subprocess
Mount your HDFS with HDFS NFS Gateway and simply write with the normal write() method. Beware that using this solution you won't be able to append!
Here's an implementation for you using urlopen and HdfsCli. To try it, first install HdfsCli with pip install hdfs.
from urllib.request import urlopen
from hdfs import InsecureClient
# You can also use KerberosClient or custom client
namenode_address = 'your namenode address'
webhdfs_port = 'your webhdfs port' # default for Hadoop 2: 50070, Hadoop 3: 9870
user = 'your user name'
client = InsecureClient('http://' + namenode_address + ':' + webhdfs_port, user=user)
ftp_address = 'your ftp address'
hdfs_path = 'where you want to write'
with urlopen(ftp_address) as response:
    content = response.read()
# You can also use append=True
# Further reference: https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
with client.write(hdfs_path) as writer:
    writer.write(content)
08-17-2017
06:38 AM
The default queue's AM limit is 6144 MB -> the default queue's capacity must be 7 GB (for 6 GB the limit would be 5 GB and for 8 GB it would be 7 GB, with a maximum-am-resource-percent of 0.9). Since default.capacity = 60, the whole cluster's capacity is about 100 / 60 * 7 ≈ 11.7 GB, which could indicate 12 or 13 GB in total, but the latter would be very unusual. Did you manage to overcome your issue with any of my suggestions?
08-09-2017
11:46 AM
@Dennis Hude Unfortunately I don't think there is; Hadoop is mainly designed for Linux clusters. But if you're interested, you can definitely try to write a ResourceCalculatorProcessTree implementation for Mac, or just open a Jira ticket for it and see if someone else is interested.
08-07-2017
11:05 AM
@Karan Alang Based on this information I assume you have 12 GBs of memory and the minimum allocation is set to 1024 MB; the default queue has a configured capacity of 60%, i.e. 7 GBs. The AM limit is 6 GBs (7.2 * 0.9 rounded to GB), and it is full; probably three other AMs are running. Please correct me if I'm wrong! To get more memory, you might try these things (see the property sketch below):
Add more memory to the cluster 😛
Increase the maximum-capacity of the default queue, so that it can use more resources when the LLAP queue doesn't use them
Increase the maximum-am-resource-percent of the default queue to 1
Decrease the minimum-allocation-mb: this way the other AMs (and containers) might use less resources (e.g. if you need 1.2 GBs - just for the sake of the example - then with the default 1 GB minimum allocation you still need to get a 2 GB container)
Kill other applications from the queue or wait until they finish
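A rough sketch of the settings mentioned above (the values are only examples, adjust them to your cluster; the first two belong to the Capacity Scheduler config, the last one to yarn-site.xml):
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.maximum-am-resource-percent=1
yarn.scheduler.minimum-allocation-mb=512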
08-07-2017
05:59 AM
@Karan Alang Could you please share (at least the relevant part of) your Capacity Scheduler config and tell us how much memory your default queue should have in total? Based on your error, your default queue's AM limit is in fact exceeded.
08-06-2017
07:50 AM
@Paul Yang You can simply use the UI of the JobHistory server remotely, can't you?
08-02-2017
09:41 AM
1 Kudo
@Dennis Hude
What operating system are you using? CPU_MILLISECONDS, PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES are collected by ResourceCalculatorProcessTree which has two implementations: ProcfsBasedProcessTree (for Linux that uses /proc) and WindowsBasedProcessTree. You can check the syslogs from your application containers to see which one is used, grep for this message: "Using ResourceCalculatorProcessTree :". Not having an initiated ResourceCalculatorProcessTree instance explains your 0 values. If you have one, there might be a problem with that which requires further investigation (the container logs can help with that, too).
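For example, on a NodeManager host something like this should find it (the log directory is just a guess, check your yarn.nodemanager.log-dirs setting):
grep -r "Using ResourceCalculatorProcessTree" /hadoop/yarn/log/application_*/container_*/syslog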
08-01-2017
07:11 AM
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_release-notes/content/upgrading_parent.html