Member since
09-18-2015
191
Posts
81
Kudos Received
40
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2045 | 08-04-2017 08:40 AM
 | 5421 | 05-02-2017 01:18 PM
 | 1109 | 04-24-2017 08:35 AM
 | 1116 | 04-24-2017 08:21 AM
 | 1333 | 06-01-2016 08:54 AM
05-18-2018
04:48 PM
https://roaringelephant.org/2018/04/24/episode-85-dataworks-summit-community-showcase-exhibitor-soundbites/ This
is the final part of our coverage of the DataWorks Summit Berlin 2018.
Normally we would not have had an episode this week, since we were in
Berlin last week, but we had lightning interviews with the vendors in
the Community Expo Area and used that coverage to make this episode.
Play in new window | Download (Duration: 30:34 — 21.0MB) So
less of “Dave & Jhon” and more “ecosystem tech” snippets this time.
Even though this does stray a bit from our usual content, we still hope
it is useful. This was recorded in a hotel room and on the expo
floor, so the audio quality is not up to our usual standards; we hope
you’ll forgive us! Here is a timestamped list of the lightning interviews:
02:41 Hortonworks https://hortonworks.com/
06:28 Alation https://alation.com/
08:45 Arcadia Data https://www.arcadiadata.com/
11:12 Attunity https://www.attunity.com/
13:10 BlueMetrix https://www.bluemetrix.com/
15:27 BMW https://www.bmw.com
18:04 IBM https://www.ibm.com
19:54 Microsoft https://www.microsoft.com
22:15 Nutanix https://www.nutanix.com/
23:26 Syncsort https://www.syncsort.com
24:54 Synerscope http://www.synerscope.com/
27:05 Talend https://www.talend.com
27:59 Teradata https://www.teradata.com/
29:02 -Interview End-
05-18-2018
04:44 PM
https://roaringelephant.org/2018/04/19/episode-84-dataworks-summit-berlin-day-2-recap/ And with the end of day two of the 2018 DataWorks Summit in Berlin
comes the end of this year’s Europe Summit. But never fear, we have an
extra 90 minutes of DataWorks goodness for you to consume on your way
home.
Play in new window | Download (Duration: 1:30:26 — 62.3MB) No
real editing on this one; it was recorded in a hotel room, so the audio quality may not be up to our usual standards. We hope you’ll forgive us! Enjoy!
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
05-18-2018
04:42 PM
https://roaringelephant.org/2018/04/18/episode-83-dataworks-summit-berlin-day-1-recap/ Another year, another European DataWorks Summit, and yes, another
daily recap show from Jhon and Dave. We walk through the keynotes and
sessions we attended and give our thoughts and views. This should be
useful for anyone who wasn’t able to attend or those seeking to peek
into sessions they couldn’t make.
Play in new window | Download (Duration: 1:23:45 — 57.8MB) No
real editing on this one; it was recorded in a hotel room, so the audio quality may not be up to our usual standards. We hope you’ll forgive us! Enjoy!
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
04-11-2018
10:35 AM
https://roaringelephant.org/2018/04/10/episode-82-dataworks-summit-berlin-2018-preview/ Next
week is DataWorks Summit Berlin week! Your two hosts will be in
attendance and in this episode we go over the agenda and plan which
sessions we want to attend and why. Peppered throughout we add further
insights and experiences from previous years.
Play in new window | Download (Duration: 47:38 — 33.0MB) Unfortunately, Dave’s network was a little unstable and there are a couple of audio glitches in this episode. For
some session statistics, or if you need some help deciding which sessions you want to attend, you can use the dashboard we created:
DWS2018 Berlin dashboard (http://aka.ms/DWS2018) Click the screenshot above or go to http://aka.ms/DWS2018
to access the dashboard. It is a dynamic report: clicking on graph
elements (bars or pie slices) will apply filters on all the
visualizations and the session list. Use control-click to combine
filters. The Summit agenda is still seeing some small changes here
and there. We will try and keep the dashboard up to date, but make sure
you double-check with the official agenda! At some point the dashboard will disappear because it is no longer relevant. For future reference, here is a large version of the screenshot.
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
04-11-2018
10:33 AM
https://roaringelephant.org/2018/04/03/episode-81-roaring-news/ In
this installment of Big Data News, we talk about the recent Facebook
leak, how everybody is still doing it wrong (according to some at least)
and installing Hadoop “the old-fashioned way”. Also briefly covered is
Elastic’s X-Pack, now even more “open” than before, but still rather
closed it would seem.
Breaking News
Play in new window | Download (Duration: 26:19 — 18.3MB)
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
03-28-2018
08:01 AM
1 Kudo
NOTE: This was recorded before everything kicked off with Facebook and Cambridge Analytica. Interesting timing. https://roaringelephant.org/2018/03/27/episode-80-big-data-tracking/ Last June, Wolfie Christl published a 93-page report, “Corporate Surveillance in Everyday Life”, on big data tracking. Apart from the massive PDF that can be downloaded on the net, an extensive summary can be found on the Cracked Labs website. In this episode we go over the content and give our views on the subject. Podcast: Play in new window | Download (Duration: 51:25 — 35.6MB) If you want to follow along with us while we are discussing the different points in the online article, here is the link: http://crackedlabs.org/en/corporate-surveillance Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
08-04-2017
08:40 AM
2 Kudos
Hi @Alberto Ramon, three questions in one! Just as a hint: in the future you may get quicker responses if you break your questions down to a single question per post. Anyway, to answer your question, Metastore HA is more of an Active/Standby type pattern. From the documentation: "Failover Scenario: A Hive metastore client always uses the first URI to connect with the metastore server. If the metastore server becomes unreachable, the client randomly picks up a URI from the list and attempts to connect with that." For more information please look here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_hadoop-high-availability/content/ha-hive-use-and-failover.html I would not recommend using Metastore HA outside of its intended usage; there could be unforeseen consequences. Hive Metastore HA is compatible with Ranger and with Kerberised clusters. Hope that helps!
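As an illustration, the failover behaviour quoted above is driven by listing multiple metastore URIs in hive-site.xml via the hive.metastore.uris property; the hostnames below are placeholders, not values from this thread:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>
```

The client connects to the first URI in the list and only falls back to a randomly chosen alternative if that server becomes unreachable, which is why this behaves as Active/Standby rather than load balancing.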
05-02-2017
01:18 PM
Hi there @Duhit Choudhary, unfortunately you'll need to work closely with Ab Initio on this, as their documentation is only available to paying customers. Typically Ab Initio integration is used to manipulate files on HDFS, and it can write files out into Hive tables. There are three main methods of integrating Ab Initio:
1. Keep it as a completely separate cluster, passing files to and from HDP.
2. Couple the Ab Initio instances more tightly by installing them on edge nodes. This gives them more direct access to HDFS and brings them closer to being part of the cluster; HDP client tools and libraries are installed on the edge nodes for easier direct access. However, scaling this can be difficult depending on how you deploy Ab Initio.
3. Run Ab Initio on the HDP cluster itself. Ab Initio does have some support for YARN integration, but as yet it is not fully YARN certified, so your mileage may vary.
My utmost recommendation is to speak to Ab Initio, as they should be able to point you to the integration documentation that is not in the public domain. Good luck!
04-25-2017
06:18 PM
4 Kudos
There are several areas where a traditional RDBMS platform is used within an HDP environment: Ambari uses one to store the cluster configuration, Hive stores its metastore information, Oozie stores its jobs and config, and Ranger stores its policies. There is a range of DB options you can choose from for many different components; an example compatibility matrix is shown here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_support-matrices/content/ch_matrices-ambari.html#ambari_database
One element that is not very well documented is how much database space may be required if you're starting down the path of building a fairly large cluster. There are a number of reasons for this, the main one being that it genuinely varies with how the cluster is used. That said, I've gathered database, cluster and time metrics from a number of production environments used by Hortonworks customers and come up with a simple formula that should at least give you a rough order-of-magnitude estimate of the database size required for each major component. Two major variables seem to play a part in the calculations: the node count within the cluster, and the duration the cluster runs for. For simplicity's sake I'm using both in every calculation just to keep this article simple; while not strictly accurate, this should give you a rough estimate. Node count also acts as an indicator of environment complexity in these calculations. So, the numbers (MB per node, per month) are:
Ambari: 0.7 MB
Ranger: 0.5 MB
Oozie: 0.5 MB
Hive Metastore: 5 MB
Then all you need to do is take the number above and multiply it by the number of nodes in the cluster and the duration (in months) you want to calculate the cluster DB utilisation for. For example:
Ambari on a 100 node cluster over 2 years: 0.7 x 100 x 24 = 1680 MB, or approx 1.68 GB
Hive Metastore on a 75 node cluster over 1 year: 5 x 75 x 12 = 4500 MB, or approx 4.5 GB
Now, please remember that this is a very rough approximation, built from a handful of data points from a small set of customers with real-world clusters; don't take this simplistic estimate as a concrete promise. As always, your utilisation of the cluster can severely skew any of these statistics: for example, if you run thousands of jobs via Oozie every day, expect its database to grow significantly quicker, and similarly if you are making continuous config changes via the Ambari API.
However, I think the above is a reasonable start, and feedback would be very welcome. Once I've received some more feedback I'll look to get this into the formal Hortonworks documentation. Hope this helps.
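To make the arithmetic above concrete, here is a minimal sketch of the estimate in Python. The rates are the rough per-node, per-month figures from this article, not official Hortonworks numbers:

```python
# Rough DB sizing sketch for HDP component databases.
# Rates are MB per node, per month, taken from the article's estimates.
RATES_MB = {
    "ambari": 0.7,
    "ranger": 0.5,
    "oozie": 0.5,
    "hive_metastore": 5.0,
}

def estimate_db_size_mb(component: str, nodes: int, months: int) -> float:
    """Estimate database size in MB: rate x node count x months."""
    return RATES_MB[component] * nodes * months

# The two worked examples from the article:
print(estimate_db_size_mb("ambari", 100, 24))         # ~1680 MB (~1.68 GB)
print(estimate_db_size_mb("hive_metastore", 75, 12))  # ~4500 MB (~4.5 GB)
```

The same caveat applies here as to the formula: treat the output as an order-of-magnitude starting point, not a capacity promise.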
04-25-2017
09:22 AM
2 Kudos
Hi @J. D. Bacolod, please take a look at this HCC article for using the API to configure processors on the fly: https://community.hortonworks.com/articles/3160/update-nifi-flow-on-the-fly-via-api.html
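This is not from the linked article, just a hedged sketch of the general payload shape the NiFi REST API expects when updating a processor with PUT /nifi-api/processors/{id}; the processor id, revision version, and property name below are made-up placeholders:

```python
import json

def build_processor_update(processor_id: str, revision_version: int,
                           properties: dict) -> dict:
    """Build a JSON body for PUT /nifi-api/processors/{id}.

    NiFi uses the revision version for optimistic locking, so the current
    version must be fetched (via GET on the processor) before updating;
    only the properties included here are changed.
    """
    return {
        "revision": {"version": revision_version},
        "component": {
            "id": processor_id,
            "config": {"properties": properties},
        },
    }

# Hypothetical example: repoint a PutFile processor's output directory.
payload = build_processor_update(
    "01234567-89ab-cdef-0123-456789abcdef",  # placeholder processor id
    3,                                       # placeholder revision version
    {"Directory": "/data/out"},
)
print(json.dumps(payload, indent=2))
```

The HCC article above walks through the full GET-then-PUT sequence against a live flow.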
Hope that helps!