Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3670 | 08-03-2016 04:44 PM |
| | 7210 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-12-2016
02:45 PM
@emaxwell LDAP would provide the authentication for Linux, Ambari, Hive, Hue, etc. What it wouldn't cover is authentication for the native APIs; that is correct. But if you work in an environment where you basically trust the users and don't have too-sensitive data, i.e. you just want to make sure they don't accidentally do something bad (as in a scientific environment), it is definitely still a possibility.
04-12-2016
11:56 AM
2 Kudos
You can also run the Hive script in a shell/SSH action, parse the output with your shell script, and emit some parameters that you then use in your Oozie flow (see my answer to that question): https://community.hortonworks.com/questions/24182/where-is-the-output-of-an-oozie-workflow-is-stored.html
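As a minimal sketch of that pattern (the query, property name, and value are illustrative assumptions, not from the linked answer), the shell action's script parses the Hive output and prints `key=value` pairs; Oozie's `<capture-output/>` element then makes them available to later actions:

```shell
#!/bin/sh
# Sketch of an Oozie shell-action script (names are illustrative).
# In the real action the value would come from Hive, e.g.:
#   ROW_COUNT=$(hive -S -e "SELECT COUNT(*) FROM my_table")
ROW_COUNT=42   # stand-in for the parsed Hive output

# With <capture-output/> configured on the action, Oozie picks up
# key=value pairs printed to stdout as action data.
echo "row_count=${ROW_COUNT}"
```

Later workflow nodes could then read the value with `${wf:actionData('my-shell-action')['row_count']}`.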
04-11-2016
07:46 PM
1 Kudo
Basically, syncing huge fact tables that are also updated is a pain in the neck.

1) Do it as you suggested: fully reload your tables every week, and run Sqoop jobs during the week based on the incremented ID. The problem is that in this case you do not get updates during the week. If that is acceptable, you can just sqoop into your daily table. Do you really get updates? In warehouse environments you normally have a fact table that is never updated and dimension tables that you can indeed reload completely. If that is your situation, you can continue using Sqoop as before, just using the increment field instead of the date (you only need to fix the old data).

2) If you want to pick up updates with Sqoop during the week, you will need something like a last-updated-at date. If you have that, you can look at the approach pbalasundaram wrote about. But I personally don't like this too much, since that view does a lot of processing and will make queries slower. If you can recreate the table every night based on the query from the article, you should do it; however, you need a short outage for this. (The good thing is that Hadoop is pretty good at writing multi-terabyte data once it is in the cluster, so you might be able to get it done at night and do a quick rename operation to swap it in.)

3) Tools like GoldenGate/IBM CDC are definitely an option as well. They monitor the transaction log of the source database and can, for example, insert into Kafka/HBase. Even slow change rates can add up to big volumes for continuous tasks. The problem here is not the speed of these tools but Hive updates, which are still very new and mostly usable for streaming inserts. So unless you want to switch to something like Apache Phoenix as your data store (which is OK for small aggregation queries over millions of rows, but definitely not for fully aggregating a terabyte-scale table), you would need to use CDC into Kafka and then write your own Storm/Spark Streaming app that takes the values from Kafka and pushes them into Hive. However, as mentioned, Hive ACID is currently very young and mostly good for streaming inserts. Inserting the new data might work well, and updating some old values may work as well, but a huge number of updates across a large time range would be asking for trouble. ACID tables also still have some limitations, but hopefully they will be much more stable in the near future; then this would be a valid option. http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
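For option 1, an ID-driven incremental import could look roughly like this; the connection string, table, and column names are placeholder assumptions, so adapt them and run the command on a node where Sqoop is installed:

```shell
# Sketch of an incremental Sqoop import keyed on the incremented ID
# column (connection, table, and column names are placeholders).
# On subsequent runs, --last-value would be the highest SALE_ID
# already imported (a saved "sqoop job" tracks this automatically).
SQOOP_CMD="sqoop import \
  --connect jdbc:mysql://dbhost/warehouse \
  --table FACT_SALES \
  --incremental append \
  --check-column SALE_ID \
  --last-value 0 \
  --target-dir /data/fact_sales"

# Print the assembled command for review before running it.
echo "$SQOOP_CMD"
```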
04-11-2016
03:51 PM
So at the moment it is not possible. There has been a JIRA open to provide this kind of functionality through Knox, but there are no fixed dates yet, so I wouldn't count on it for the near future. https://issues.apache.org/jira/browse/FALCON-1026
04-11-2016
02:25 PM
1 Kudo
I don't think that's possible; there is no PAM/LDAP authentication for the Falcon UI. I would like it as well.
04-11-2016
01:14 PM
I don't think there is any preferred tool. I have seen a lot of customers use GitHub + Jenkins, with Maven/sbt for builds (potentially with internal repositories).
04-11-2016
12:48 PM
Does it support Kerberos by now? It would be very nice to be able to use it instead of the os.system("hadoop ... ") commands I currently use.
04-11-2016
11:47 AM
1 Kudo
So it depends. In reality, a lot of AD teams will not even consider giving admin access to any outside tool (even if it's restricted to an OU), so option 2 is definitely used in practice; but yes, it is cumbersome.

Regarding 1 and 3: an MIT KDC has the advantage of not needing to touch the AD system. You can have your own Kerberos instance for all service users, and you then simply need to enable a trust from the MIT realm to the AD realm so that your business users can access the cluster too. Reasons for using an MIT KDC with a trust from AD:

- There are often restrictions on putting service users in the corporate AD.
- If the corporate AD somehow became inaccessible, the whole cluster would otherwise be stopped.
- You put less load on the corporate AD for large systems.
- For small clusters that are used by a single team, or if the cluster is purely automated (i.e. not directly accessible to lots of business users), you can use the MIT KDC alone and create local users for work, with no need for AD at all. This is the fastest and most pain-free way to set up a Kerberized cluster (using PAM authentication for Hive and local users in Hue and Ambari). However, this obviously falls down once you need to give access to a large number of business users.

Using AD directly also has big advantages:

- Normally your business users are already in AD and will stay there, so you need to create Hadoop-specific groups there and add your users to them anyway. The MIT KDC, while easier to set up, doesn't really serve any purpose that AD couldn't fulfil on its own.
- The AD team takes care of backup/DR/security, so you do not need to worry about that for your own MIT KDC.

In general, if the AD team is flexible and accessible, going with AD alone is preferable. You would do MIT KDC + AD trust if you expect problems with that and want to keep as much control as possible within the Hadoop team.
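As a minimal sketch of the MIT-KDC-plus-AD-trust layout (all realm and host names here are placeholder assumptions), the cluster's krb5.conf would declare both realms; the one-way trust itself is established by creating the matching cross-realm `krbtgt/HADOOP.EXAMPLE.COM@AD.EXAMPLE.COM` principal with the same password on both sides:

```
# /etc/krb5.conf (sketch; realm and host names are placeholders)
[libdefaults]
  default_realm = HADOOP.EXAMPLE.COM

[realms]
  HADOOP.EXAMPLE.COM = {
    kdc = kdc.hadoop.example.com
    admin_server = kdc.hadoop.example.com
  }
  AD.EXAMPLE.COM = {
    kdc = ad.example.com
  }

[domain_realm]
  .hadoop.example.com = HADOOP.EXAMPLE.COM
```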
04-08-2016
06:38 PM
1 Kudo
But all of them have failed. Does the queue also exist on the left side when you click on Scheduler?
04-08-2016
06:36 PM
Haha, no worries, and thanks for the flowers. Yes, it's a bit weird: apparently Phoenix takes hints only globally for the whole query (not for a sub-select) and looks for them right after the first keyword.