Member since: 10-01-2015
Posts: 3933
Kudos Received: 1150
Solutions: 374
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2792 | 05-03-2017 05:13 PM
 | 2339 | 05-02-2017 08:38 AM
 | 2486 | 05-02-2017 08:13 AM
 | 2527 | 04-10-2017 10:51 PM
 | 1151 | 03-28-2017 02:27 AM
03-13-2017
08:17 PM
1 Kudo
@Sunile Manjee one idea is to leverage WebHCat for this: https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat#WebHCatUsingWebHCat-ErrorCodesandResponses

# this will execute a Hive query and save the result to an HDFS directory called "output" in your home directory
curl -s -d execute="select+*+from+sample_08;" \
 -d statusdir="output" \
 'http://localhost:50111/templeton/v1/hive?user.name=root'

# if you ls the directory, it will have two files, stderr and stdout
hdfs dfs -ls output

# if the job succeeded, you can cat the stdout file and view the results
hdfs dfs -cat output/stdout

When you invoke the job, you will get a response with a job id. You can also use the WebHDFS API to check that the output directory exists and that there is no error log; in that case the job succeeded.

curl -i "http://sandbox.hortonworks.com:50070/webhdfs/v1/user/root/output/?op=LISTSTATUS"

Another idea is to leverage Oozie to wire the jobs together. Once a job completes, you can use the SLA monitoring features of Oozie to check whether it finished, or send an email (SLA is not needed for that). Whichever way you go, you can have NiFi watch these events, either from a JMS topic in ActiveMQ if you intend to use SLA, or from the email alert. https://community.hortonworks.com/articles/83787/apache-ambari-workflow-manager-view-for-apache-ooz-1.html

Probably an even better idea is to query ATS via its REST API: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/TimelineServer.html I think this is the most sane approach: query ATS for the finished job and get its status. So once you know the job ID (there are ways to get it, one of them being my first example), the second processor can query ATS for the completion state.
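If it helps, here's a rough sketch of checking the job once you have its id from the submit response. Saving the response to submit.json, the "id" field name, and using jq to parse it are assumptions on my part; the host and port match the sandbox example above.

#!/bin/bash
# Sketch: poll WebHCat for the status of the Hive job submitted above.
# Assumes the submit response was saved to submit.json with an "id" field
# (hypothetical) and that jq is installed on the client.
job_id=$(jq -r '.id' submit.json)

# WebHCat exposes job information under /templeton/v1/jobs/<jobid>
curl -s "http://localhost:50111/templeton/v1/jobs/${job_id}?user.name=root"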
03-13-2017
06:11 PM
2 Kudos
@Amit Panda here's a slightly modified script from a Stack Overflow thread:

#!/bin/bash
usage="Usage: dir_diff.sh [directory] [days]"
if [[ $# -ne 2 ]]
then
echo $usage
exit 1
fi
now=$(date +%s)
hadoop fs -ls -R $1 | grep "^d" | while read f; do
dir_date=`echo $f | awk '{print $6}'`
difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60 ) ))
if [ $difference -gt $2 ]; then
echo $f;
fi
done
I don't have files older than 10 days on my HDFS, so I execute with a 1-day argument like so:

sudo sh dir_diff.sh /tmp 1
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa
drwx------ - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/ambari-qa/staging
drwxr-xr-x - hdfs hdfs 0 2017-03-11 15:39 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2017-03-11 15:39 /tmp/entity-file-history/active
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/17c0213c-358a-4c89-b803-800762144a21/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae
drwx------ - hive hdfs 0 2017-03-11 15:42 /tmp/hive/hive/96049638-4aee-42cc-95f6-0652b3a66cae/_tmp_space.db
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301
drwx------ - hive hdfs 0 2017-03-11 15:41 /tmp/hive/hive/e4fe18d1-5cb4-4088-93ff-cf4aac410301/_tmp_space.db
drwxr-xr-x - ambari-qa hdfs 0 2017-03-11 15:41 /tmp/tezsmokeinput
On my 2.5 Sandbox, it returns this:

sh dir_diff.sh /tmp 10
drwxr-xr-x - hdfs hdfs 0 2016-10-25 07:48 /tmp/entity-file-history
drwxr-xr-x - yarn hadoop 0 2016-10-25 07:48 /tmp/entity-file-history/active
drwxrwxrwx - guest hdfs 0 2017-01-12 18:42 /tmp/freewheel
drwxrwxrwx - guest hdfs 0 2017-01-12 18:46 /tmp/freewheel/hdfs
drwx-wx-wx - ambari-qa hdfs 0 2016-10-25 07:51 /tmp/hive
drwx------ - ambari-qa hdfs 0 2016-10-25 08:09 /tmp/hive/ambari-qa
drwx------ - hive hdfs 0 2017-01-23 20:51 /tmp/hive/hive/_tez_session_dir
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b
drwx------ - hive hdfs 0 2017-01-16 16:03 /tmp/hive/hive/ff5fb9ba-01db-45d3-b924-e1bd6ee5203b/_tmp_space.db
Once you get a list of those files, you can issue:

hdfs dfs -mv file newdir

We're adding some new Grafana dashboards in the next release of Ambari that can tell, with granularity, which users are on HDFS and what files they're creating. There's also an Activity Explorer dashboard you can check out in the latest Ambari + SmartSense for other HDFS file statistics, especially when you're looking for small files.
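As a rough sketch, you could also wire the script output straight into the move; the /tmp/archive target directory is an assumption of mine and would need to exist first:

#!/bin/bash
# Sketch: move every path reported by dir_diff.sh into an archive directory.
# Assumes ./dir_diff.sh is the script above, /tmp/archive already exists,
# and the HDFS path is the last field of each listing line.
target=/tmp/archive

./dir_diff.sh /tmp 10 | awk '{print $NF}' | while read -r path; do
    hdfs dfs -mv "$path" "$target"
done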
03-13-2017
02:17 PM
@Ali benchmarks depend on many factors, and your setup may differ from other deployments. In my experience, separating the two gives optimal performance, as you can see from one of the responses in the links I provided. It really depends on your volumes: to be cost-effective you can certainly colocate, but when your application becomes mission-critical you will regret that decision.
... View more
03-13-2017
12:44 PM
It is best to keep them separate. Here are two threads for you to review, with findings from the field: https://community.hortonworks.com/questions/10868/best-practices-for-storm-deployment-on-a-hadoop-cl.html https://community.hortonworks.com/articles/550/unofficial-storm-and-kafka-best-practices-guide.html
03-11-2017
02:44 AM
1 Kudo
Adding a core-site.xml with this property to the distcp sharelib does not take effect, and including it in the lib directory of the workflow doesn't take effect either.
03-11-2017
02:19 AM
@Predrag Minovic & @Venkat Ranganathan the only way this works between a secure and an insecure cluster is if core-site.xml has the property below on the secured cluster side.

<property>
    <name>ipc.client.fallback-to-simple-auth-allowed</name>
    <value>true</value>
</property>
Uploading a core-site.xml with this property in the lib folder of the workflow doesn't work, and passing it via <arg> or <java-opts> has no effect either; the job configuration would not update. The only way to make it work was to update core-site.xml globally on the HDFS side. Thank you both for your recommendations; eventually it was Predrag who suggested updating HDFS with the property. This is not optimal, but for lack of a better option it will do. It's best to have either two unsecured or two secured clusters.
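For reference, outside of Oozie the same fallback property can be passed to DistCp as a generic option on the command line; this is just a sketch with placeholder hostnames:

# Run DistCp from the secured cluster, allowing fallback to simple auth
# when talking to the unsecured cluster (hostnames are placeholders).
hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    hdfs://secure-nn:8020/user/centos/primary \
    hdfs://insecure-nn:8020/tmp/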
03-11-2017
01:00 AM
@Venkat Ranganathan I tried many different options, both in the configuration block and as part of the submission, i.e.

oozie job -D ipc.client.fallback-to-simple-auth-allowed=true -run

No luck. I read in the Oozie docs that I need the property oozie.launcher.mapreduce.job.hdfs-servers, but with it my jobs stopped getting submitted to YARN, hence I commented it out. I also added hadoop.proxyuser.oozie.hosts and hadoop.proxyuser.oozie.groups to the 2nd cluster as per the docs, no luck.

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!--
        <configuration>
            <property>
                <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                <value>${nameNode},${nameNode2}</value>
            </property>
        </configuration>
        -->
        <arg>-D</arg>
        <arg>ipc.client.fallback-to-simple-auth-allowed=true</arg>
        <arg>-overwrite</arg>
        <arg>${nameNode}/user/centos/primary</arg>
        <arg>${nameNode2}/tmp/</arg>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
Finally, I stumbled on this note, and I'm afraid I hit such a case: "IMPORTANT: The DistCp action may not work properly with all configurations (secure, insecure) in all versions of Hadoop." I take it that the case where one cluster is secured and the 2nd cluster is not is not supported by the DistCp action spec 0.2. https://oozie.apache.org/docs/4.2.0/DG_DistCpActionExtension.html
03-10-2017
07:17 PM
@Venkat Ranganathan That didn't work; I'm getting

Error: Could not find or load main class ipc.client.fallback-to-simple-auth-allowed=true

in the stderr log. My workflow as of now looks like so:

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <java-opts>-D ipc.client.fallback-to-simple-auth-allowed=true</java-opts>
        <arg>hdfs://aervits-hdp70:8020/tmp/hellounsecure</arg>
        <arg>hdfs://hacluster:8020/user/centos/</arg>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
03-10-2017
05:10 PM
@Venkat Ranganathan can you be more specific? I tried with a space character and I'm getting:

Error: E0701 : E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'java-opts'. One of '{"uri:oozie:distcp-action:0.2":arg}' is expected.

Is this passed as an <arg> or as <java-opts>?

<java-opts>-D ipc.client.fallback-to-simple-auth-allowed=true</java-opts>
03-10-2017
04:41 PM
1 Kudo
My environment requires that I pass -D ipc.client.fallback-to-simple-auth-allowed=true to the distcp command. In the distcp 0.2 action specification for Oozie 4.2, I see the java-opts option, but I can't seem to make the workflow run by passing this property. The only way I can imagine it working is if I put the property in core-site.xml, which in production clusters is not feasible. My workflow for reference:

<action name="distcp_1">
    <distcp xmlns="uri:oozie:distcp-action:0.2">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <arg>hdfs://aervits-hdp70:8020/tmp/hellounsecure</arg>
        <arg>hdfs://hacluster:8020/user/centos/</arg>
        <java-opts>-Dipc.client.fallback-to-simple-auth-allowed=true</java-opts>
    </distcp>
    <ok to="end"/>
    <error to="kill"/>
</action>
Labels:
- Apache Oozie