Member since: 10-01-2015
Posts: 3933
Kudos Received: 1150
Solutions: 374
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2792 | 05-03-2017 05:13 PM
 | 2339 | 05-02-2017 08:38 AM
 | 2486 | 05-02-2017 08:13 AM
 | 2527 | 04-10-2017 10:51 PM
 | 1151 | 03-28-2017 02:27 AM
03-13-2017
08:17 PM
1 Kudo
@Sunile Manjee one idea is to leverage WebHCat for this: https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat#WebHCatUsingWebHCat-ErrorCodesandResponses

# this will execute a Hive query and save the result to an HDFS directory called "output" in your home directory
curl -s -d execute="select+*+from+sample_08;" \
 -d statusdir="output" \
 'http://localhost:50111/templeton/v1/hive?user.name=root'

# if you ls the directory, it will have two files, stderr and stdout
hdfs dfs -ls output

# if the job succeeded, you can cat the stdout file and view the results
hdfs dfs -cat output/stdout

When you invoke the job, you will get a response with a job id. You can also use the WebHDFS API to check that the output directory exists and that there is no error log; in that case the job succeeded.

curl -i "http://sandbox.hortonworks.com:50070/webhdfs/v1/user/root/output/?op=LISTSTATUS"

Another idea is to leverage Oozie to wire the jobs together. Once a job completes, you can use the SLA monitoring features of Oozie to check whether it finished, or send an email (SLA is not needed for that). Whichever way you go, you can have NiFi watch these events, either from a JMS topic in ActiveMQ if you intend to use SLA, or from the email alert. https://community.hortonworks.com/articles/83787/apache-ambari-workflow-manager-view-for-apache-ooz-1.html

Probably an even better idea is to query ATS via its REST API: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/TimelineServer.html I think this is the most sane approach: query ATS for the finished job and get its status. So once you know the job ID (there are ways to get it, one of them being my first example), the second processor can query ATS for the completion state.
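If it helps, here's a rough sketch of checking the job once you have its id from the submit response. Saving the response to submit.json, the "id" field name, and using jq to parse it are assumptions on my part; the host and port match the sandbox example above.

#!/bin/bash
# Sketch: poll WebHCat for the status of the Hive job submitted above.
# Assumes the submit response was saved to submit.json with an "id" field
# (hypothetical) and that jq is installed on the client.
job_id=$(jq -r '.id' submit.json)

# WebHCat exposes job information under /templeton/v1/jobs/<jobid>
curl -s "http://localhost:50111/templeton/v1/jobs/${job_id}?user.name=root"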
03-13-2017
06:11 PM
2 Kudos
@Amit Panda here's a slightly modified script from a Stack Overflow thread:

#!/bin/bash
usage="Usage: dir_diff.sh [directory] [days]"
if [[ $# -ne 2 ]]
then
echo $usage
exit 1
fi
now=$(date +%s)
hadoop fs -ls -R $1 | grep "^d" | while read f; do
dir_date=`echo $f | awk '{print $6}'`
difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60 ) ))
if [ $difference -gt $2 ]; then
echo $f;
fi
done
I don't have files older than 10 days on my HDFS, so I execute with a 1-day argument like so:

sudo sh dir_diff.sh /tmp 1
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa/staging
drwxr-xr-x - hdfs hdfs 0 2017-03-11 15:39 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2017-03-11 15:39 /tmp/entity-file-history/active
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301/_tmp_space.db
drwxr-xr-x - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/tezsmokeinput
On my 2.5 Sandbox, it returns this:

sh dir_diff.sh /tmp 10
drwxr-xr-x - hdfs hdfs 0 2016-10-25 07:48 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2016-10-25 07:48 /tmp/entity-file-history/active
drwxrwxrwx - guest hdfs 0 2017-01-12 18:42 /tmp/freewheel
drwxrwxrwx - guest hdfs 0 2017-01-12 18:46 /tmp/freewheel/hdfs
drwx-wx-wx - ambari-qa hdfs 0 2016-10-25 07:51 /tmp/hive
drwx------ - ambari-qa hdfs 0 2016-10-25 08:09 /tmp/hive/ambari-qa
drwx------ - hive hdfs 0 2017-01-23 20:51 /tmp/hive/hive/_tez_session_dir
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b/_tmp_space.db
Once you get a list of those files, you can issue:

hdfs dfs -mv file newdir

We're adding some new Grafana dashboards in the next release of Ambari that can tell, with granularity, which users are on HDFS and what files they're creating. There's also an Activity Explorer dashboard you can check out in the latest Ambari + SmartSense for other HDFS file statistics, especially when you're looking for small files.
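As a rough sketch, you could also wire the script output straight into the move; the /tmp/archive target directory is an assumption of mine and would need to exist first:

#!/bin/bash
# Sketch: move every path reported by dir_diff.sh into an archive directory.
# Assumes ./dir_diff.sh is the script above, /tmp/archive already exists,
# and the HDFS path is the last field of each listing line.
target=/tmp/archive

./dir_diff.sh /tmp 10 | awk '{print $NF}' | while read -r path; do
    hdfs dfs -mv "$path" "$target"
done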
03-13-2017
02:17 PM
@Ali benchmarks depend on many factors, and your setup may differ from other deployments. In my experience, separating the two gives optimal performance, as you can see from one of the responses in the links I provided. It really depends on your volumes: to be cost-effective you can certainly colocate, but when your application becomes mission-critical you will regret that decision.
... View more
03-13-2017
12:44 PM
It is best to keep them separate. Here are two threads for you to review, with findings from the field: https://community.hortonworks.com/questions/10868/best-practices-for-storm-deployment-on-a-hadoop-cl.html https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
03-11-2017
02:44 AM
1 Kudo
Adding a core-site.xml with this property to the distcp sharelib does not take effect, and including it in the lib directory of the workflow doesn't take effect either.
03-11-2017
02:19 AM
@Predrag Minovic & @Venkat Ranganathan the only way this works between a secure and an insecure cluster is if core-site.xml has the property below on the secured cluster side.

<property>
    <name>ipc.client.fallback-to-simple-auth-allowed</name>
    <value>true</value>
</property>
Uploading a core-site.xml with this property in the lib folder of the workflow doesn't work, and passing it via <arg> or <java-opts> has no effect either; the job configuration would not update. The only way to make it work was to update core-site.xml globally on the HDFS side. Thank you both for your recommendations; eventually it was Predrag who suggested updating HDFS with the property. This is not optimal, but for lack of a better option it will do. It's best to have either two unsecured or two secured clusters.
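For reference, outside of Oozie the same fallback property can be passed to DistCp as a generic option on the command line; this is just a sketch with placeholder hostnames:

# Run DistCp from the secured cluster, allowing fallback to simple auth
# when talking to the unsecured cluster (hostnames are placeholders).
hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    hdfs://secure-nn:8020/user/centos/primary \
    hdfs://insecure-nn:8020/tmp/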
03-11-2017
01:00 AM
@Venkat Ranganathan I tried many different options, both in the configuration block and as part of the submission, i.e.

oozie job -D ipc.client.fallback-to-simple-auth-allowed=true -run

No luck. I read in the Oozie docs that I need the property oozie.launcher.mapreduce.job.hdfs-servers, but with it my jobs stopped getting submitted to YARN, hence I commented it out. I also added hadoop.proxyuser.oozie.hosts and hadoop.proxyuser.oozie.groups to the 2nd cluster as per the docs, no luck.

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!--
        <configuration>
            <property>
                <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                <value>${nameNode},${nameNode2}</value>
            </property>
        </configuration>
        -->
        <arg>-D</arg>
        <arg>ipc.client.fallback-to-simple-auth-allowed=true</arg>
        <arg>-overwrite</arg>
        <arg>${nameNode}/user/centos/primary</arg>
        <arg>${nameNode2}/tmp/</arg>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
Finally, I stumbled on this note, and I'm afraid I hit such a case: "IMPORTANT: The DistCp action may not work properly with all configurations (secure, insecure) in all versions of Hadoop." I take it that the case where one cluster is secured and the 2nd cluster is not is not supported by the DistCp action spec 0.2. https://oozie.apache.org/docs/4.2.0/DG_DistCpActionExtension.html
03-10-2017
07:17 PM
@Venkat Ranganathan That didn't work; I'm getting

Error: Could not find or load main class ipc.client.fallback-to-simple-auth-allowed=true

in the stderr log. My workflow as of now looks like so:

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <java-opts>-D ipc.client.fallback-to-simple-auth-allowed=true</java-opts>
        <arg>hdfs://aervits-hdp70:8020/tmp/hellounsecure</arg>
        <arg>hdfs://hacluster:8020/user/centos/</arg>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
03-10-2017
05:10 PM
@Venkat Ranganathan can you be more specific? I tried with a space character and I'm getting:

Error: E0701 : E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'java-opts'. One of '{"uri:oozie:distcp-action:0.2":arg}' is expected.

Is this passed as an <arg> or as <java-opts>?

<java-opts>-D ipc.client.fallback-to-simple-auth-allowed=true</java-opts>
03-10-2017
04:41 PM
1 Kudo
My environment requires that I pass -D ipc.client.fallback-to-simple-auth-allowed=true to the distcp command. In the distcp 0.2 action specification for Oozie 4.2, I see the java-opts option, but I can't seem to make the workflow run by passing this property. The only way I can imagine it working is if I put the property in core-site.xml, which in production clusters is not feasible. My workflow for reference:

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <arg>hdfs://aervits-hdp70:8020/tmp/hellounsecure</arg>
        <arg>hdfs://hacluster:8020/user/centos/</arg>
        <java-opts>-Dipc.client.fallback-to-simple-auth-allowed=true</java-opts>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
Labels:
- Apache Oozie