Member since: 12-09-2015
Posts: 34
Kudos Received: 12
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3244 | 09-26-2016 06:28 PM |
| | 1562 | 12-11-2015 02:58 PM |
| | 2184 | 12-11-2015 02:50 PM |
12-22-2015
08:32 PM
Hey Balu, that worked. Perfectly, actually! The only thing I'm still curious about: is there a Falcon-based way to remove the _SUCCESS file after replication has completed? I know we can do it after some time, or after creating it in the process (and waiting a few minutes, perhaps), but if there's a Falcon method or tag I'd love to leverage that. Please let me know! And thanks for all the great help.
12-22-2015
07:58 PM
Balu, thanks for that answer! We knew of this tag, but weren't sure how to truly use it. I'm testing this now and will let you know asap! Thank you kindly.
12-21-2015
08:08 PM
Scenario: we have data that is ingested to the cluster via a Falcon process. It leverages a Falcon feed as an output to replicate the ingested data to a backup cluster. We'd like the feed not to replicate until the process has completed. We currently use a delay to semi-accomplish this, but it's not perfect.

Question: How can we (if at all) tell Falcon to wait until the process has completed before beginning replication? Currently our process.xml is as below, which tells the feed to start "now" but with a delay of 2 hours (the delay being specified in the feed.xml). To be clear, replication itself is working just fine; we're simply after a more elegant way for Falcon to replicate only after the process has confirmed completion. Is there a way?

Process:

<outputs>
    <output name="hdp0001-my-feed" feed="hdp0001-my-feed" instance="now(0,0)"/>
</outputs>

Feed:

<cluster name="primary-cluster" type="source" delay="hours(1)">
    <validity start="2015-12-04T09:30Z" end="2099-12-31T23:59Z"/>
    <retention limit="months(9999)" action="archive"/>
    <locations>
        <location type="data" path="/hdfs/data/path/to/my_table/"/>
    </locations>
</cluster>
Labels:
- Apache Falcon
- Apache Hadoop
12-11-2015
02:58 PM
1 Kudo
We have solved this problem in our environment in a number of different ways. With Falcon, we have Sqoop imports on a regular frequency that run during predetermined "not busy" time windows. We've also put logic in bash scripts run through Oozie that, if it isn't safe to run at the moment, will sleep or terminate for that instance and try again later. If you have a window, say 3am-5am, in which you could feasibly connect and pull data, you could set up a sleep/wait loop until either a specific time has passed or the system is available. Plenty of options; what you mentioned is definitely feasible!
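For illustration, here's a minimal sketch of that sleep/wait pattern (not our production script; the window hours, the availability check, and the import command are placeholders):

#!/usr/bin/env bash
# Sketch: wait for the 3am-5am window, check that the source system is reachable,
# then run the pull; give up for this instance if the window closes first.
while true; do
  hour=$((10#$(date +%H)))                  # current hour, forced to base 10
  if [ "$hour" -ge 3 ] && [ "$hour" -lt 5 ]; then
    if source_system_is_available; then     # placeholder: e.g. a ping or a test query
      run_sqoop_import                      # placeholder for the actual Sqoop import
      exit 0
    fi
  elif [ "$hour" -ge 5 ]; then
    echo "Window closed; terminating this instance"
    exit 1
  fi
  sleep 300                                 # re-check every 5 minutes
done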
12-11-2015
02:50 PM
1 Kudo
Have you looked into the CompressedStorage features in Hive? You should be able to use this (for Snappy, at least):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
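For context, a small hedged example of how those settings are typically used in a session (the table names here are made up):

-- With the settings above in effect, the files this query writes are Snappy-compressed.
INSERT OVERWRITE TABLE my_compressed_table
SELECT * FROM my_source_table;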
12-09-2015
06:29 PM
I have some notes but nothing formal cobbled together. Most stuff goes on hadoopsters.net when I learn about it. I'm hoping to publish some more on Falcon here very soon. The article feature makes it easy.
12-09-2015
06:19 PM
I've tried HDP on a Pi. Works pretty well. I'll add NiFi to my list of things to POC...
12-09-2015
06:11 PM
Yeah, you should be able to pass any properties to the workflow.xml from Falcon's process.xml. The block below goes before the <workflow> tag and after the <outputs> section, and the properties are then used in the workflow.xml as ${workflowName}, ${hiveDB}, ${queueName}, and so on. Like this:

<properties>
    <property name="workflowName" value="1234-my-workflow"/>
    <property name="rawDataDirectoryHDFS" value="/path/to/hdfs/files/"/>
    <property name="hiveDB" value="my_db"/>
    <property name="jobTracker" value="hdpcluster003.company.com:8050"/>
    <property name="nameNode" value="hdfs://MYHA:8020"/>
    <property name="queueName" value="dev"/>
</properties>
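To show the other side of that, here's a hedged sketch of a workflow.xml fragment (not from the original post; the action name, script name, and queue property are illustrative) where the values are simply referenced as EL variables:

<action name="run-hive-script">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapreduce.job.queuename</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>my_script.hql</script>
        <param>HIVE_DB=${hiveDB}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>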
12-09-2015
06:00 PM
1 Kudo
SQuirreL is quite good, and free. It's built in Java and works well with Hive/Beeline. Overall, we've had success with:
- Teradata Studio
- SQuirreL
- Aqua Data Studio

http://squirrel-sql.sourceforge.net/
12-09-2015
03:50 PM
You can point to it directly via its address, or you can do as @bvellanki (balu) mentioned and list its HA nameservice. For example, if the HA nameservice for your backup cluster is called DRHA, your address would be hdfs://DRHA:8020. See below:

<interface type="readonly" endpoint="hftp://DRHA.company.com:50070" version="2.2.0"/>
<interface type="write" endpoint="hdfs://DRHA.company.com:8020" version="2.2.0"/>

<!-- You can also do this, depending on preference -->
<interface type="readonly" endpoint="hftp://DRHA:50070" version="2.2.0"/>
<interface type="write" endpoint="hdfs://DRHA:8020" version="2.2.0"/>