Member since
09-18-2015
191
Posts
81
Kudos Received
40
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2045 | 08-04-2017 08:40 AM
 | 5421 | 05-02-2017 01:18 PM
 | 1109 | 04-24-2017 08:35 AM
 | 1116 | 04-24-2017 08:21 AM
 | 1333 | 06-01-2016 08:54 AM
05-18-2018
04:48 PM
https://roaringelephant.org/2018/04/24/episode-85-dataworks-summit-community-showcase-exhibitor-soundbites/ This
is the final part of our coverage of the DataWorks Summit Berlin 2018.
Normally we would not have had an episode this week, since we were in
Berlin last week, but we had lightning interviews with the vendors in
the Community Expo Area and used that coverage to make this episode.
Play in new window | Download (Duration: 30:34 — 21.0MB) So
less of “Dave & Jhon” and more “ecosystem tech” snippets this time.
Even though this does stray a bit from our usual content, we still hope
it is useful. This was recorded in a hotel room and on the expo
floor, so the audio quality is not up to our usual standards; we hope
you’ll forgive us! Here is a timestamped list of the lightning interviews:
02:41 Hortonworks https://hortonworks.com/
06:28 Alation https://alation.com/
08:45 Arcadia Data https://www.arcadiadata.com/
11:12 Attunity https://www.attunity.com/
13:10 BlueMetrix https://www.bluemetrix.com/
15:27 BMW https://www.bmw.com
18:04 IBM https://www.ibm.com
19:54 Microsoft https://www.microsoft.com
22:15 Nutanix https://www.nutanix.com/
23:26 Syncsort https://www.syncsort.com
24:54 Synerscope http://www.synerscope.com/
27:05 Talend https://www.talend.com
27:59 Teradata https://www.teradata.com/
29:02 -Interview End-
05-18-2018
04:44 PM
https://roaringelephant.org/2018/04/19/episode-84-dataworks-summit-berlin-day-2-recap/ And with the end of day two of the 2018 DataWorks Summit in Berlin
comes the end of this year’s Europe Summit. But never fear, we have an
extra 90 minutes of DataWorks goodness for you to consume on your way
home.
Play in new window | Download (Duration: 1:30:26 — 62.3MB) No
real editing on this one; it was recorded in a hotel room, so the audio quality may not be up to our usual standards. We hope you’ll forgive us! Enjoy!
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
05-18-2018
04:42 PM
https://roaringelephant.org/2018/04/18/episode-83-dataworks-summit-berlin-day-1-recap/ Another year, another European DataWorks Summit, and yes, another
daily recap show from Jhon and Dave. We walk through the keynotes and
sessions we attended and give our thoughts and views. This should be
useful for anyone who wasn’t able to attend or those seeking to peek
into sessions they couldn’t make.
Play in new window | Download (Duration: 1:23:45 — 57.8MB) No
real editing on this one; it was recorded in a hotel room, so the audio quality may not be up to our usual standards. We hope you’ll forgive us! Enjoy!
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
04-11-2018
10:35 AM
https://roaringelephant.org/2018/04/10/episode-82-dataworks-summit-berlin-2018-preview/ Next
week is DataWorks Summit Berlin week! Your two hosts will be in
attendance and in this episode we go over the agenda and plan which
sessions we want to attend and why. Peppered throughout we add further
insights and experiences from previous years.
Play in new window | Download (Duration: 47:38 — 33.0MB) Unfortunately, Dave’s network was a little unstable and there are a couple of audio glitches in this episode. For
some session statistics, or if you need some help deciding which sessions you want to attend, you can use the dashboard we created:
DWS2018 Berlin dashboard (http://aka.ms/DWS2018) Click the screenshot above or go to http://aka.ms/DWS2018
to access the dashboard. It is a dynamic report: clicking on graph
elements (bars or pie slices) will apply filters on all the
visualizations and the session list. Use control-click to combine
filters. The Summit agenda is still seeing some small changes here
and there. We will try and keep the dashboard up to date, but make sure
you double-check with the official agenda! At some point the dashboard will disappear because it is no longer relevant. For future reference, here is a large version of the screenshot.
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
04-11-2018
10:33 AM
https://roaringelephant.org/2018/04/03/episode-81-roaring-news/ In
this installment of Big Data News, we talk about the recent Facebook
leak, how everybody is still doing it wrong (according to some at least)
and installing Hadoop “the old-fashioned way”. Also briefly covered is
Elastic’s X-Pack, now even more “open” than before, but still rather
closed it would seem.
Breaking News
Play in new window | Download (Duration: 26:19 — 18.3MB)
Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
03-28-2018
08:01 AM
1 Kudo
NOTE: This was recorded before everything kicked off with Facebook and Cambridge Analytica. Interesting timing. https://roaringelephant.org/2018/03/27/episode-80-big-data-tracking/ Last June, Wolfie Christl published a 93-page report, “Corporate Surveillance in Everyday Life”, on big data tracking. Apart from the massive PDF that can be downloaded on the net, an extensive summary can be found on the Cracked Labs website. In this episode we go over the content and give our views on the subject. Podcast: Play in new window | Download (Duration: 51:25 — 35.6MB) If you want to follow along with us while we are discussing the different points in the online article, here is the link: http://crackedlabs.org/en/corporate-surveillance Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
08-04-2017
08:40 AM
2 Kudos
Hi @Alberto Ramon, three questions in one! Just as a hint: in the future you may get quicker responses if you break your questions down to a single question per post. Anyway, to answer your question, Metastore HA is more of an Active/Standby type pattern. From the documentation: "Failover Scenario: A Hive metastore client always uses the first URI to connect with the metastore server. If the metastore server becomes unreachable, the client randomly picks up a URI from the list and attempts to connect with that." For more information please look here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_hadoop-high-availability/content/ha-hive-use-and-failover.html I would not recommend using Metastore HA outside of its intended usage; there could be unforeseen consequences. Hive Metastore HA is compatible with Ranger and with Kerberised clusters. Hope that helps!
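As an illustration, the failover behaviour quoted above is driven by listing multiple metastore URIs in hive-site.xml via the hive.metastore.uris property; the hostnames below are placeholders, not values from this thread:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>
```

The client connects to the first URI in the list and only falls back to a randomly chosen alternative if that server becomes unreachable, which is why this behaves as Active/Standby rather than load balancing.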
05-02-2017
01:18 PM
Hi there @Duhit Choudhary, unfortunately you'll need to work closely with Ab Initio on this, as their documentation is only available to paying customers. Typically Ab Initio integration is used to manipulate files on HDFS, and it can write files out into Hive tables. There are three main methods of integrating Ab Initio:
1. Keep it as a completely separate cluster, passing files to and from HDP.
2. Couple the Ab Initio instances more tightly by installing them on edge nodes. This gives them more direct access to HDFS and brings them closer to being part of the cluster; HDP client tools and libraries are installed on the edge nodes for easier direct access. However, scaling this can be difficult depending on how you deploy Ab Initio.
3. Run Ab Initio on the HDP cluster itself. Ab Initio does have some support for YARN integration, but as yet it is not fully YARN certified, so your mileage may vary.
My utmost recommendation is to speak to Ab Initio, as they should be able to point you to the integration documentation that is not in the public domain. Good luck!
04-25-2017
06:18 PM
4 Kudos
There are several areas where a traditional RDBMS platform is used within an HDP environment: Ambari uses one to store the cluster configuration, Hive stores its metastore information, Oozie stores its jobs and config, and Ranger stores its policies. There is a range of DB options you can choose from for many different components; an example compatibility matrix is shown here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_support-matrices/content/ch_matrices-ambari.html#ambari_database
One element that is not very well documented is how much database space may be required if you're starting down the path of building a fairly large cluster. There are a number of reasons for this, the main one being that it genuinely varies with how the cluster is used. That said, I've gathered database, cluster and time metrics from a number of production environments used by Hortonworks customers and come up with a simple formula that should at least give you a rough order-of-magnitude estimate of the database size required for each major component. Two major variables seem to play a part in the calculations: the node count within the cluster, and the duration the cluster runs for. For simplicity's sake I'm using both in every calculation just to keep this article simple; while not strictly accurate, this should give you a rough estimate. Node count also acts as an indicator of environment complexity in these calculations. So, the numbers (MB per node, per month) are:
Ambari: 0.7 MB
Ranger: 0.5 MB
Oozie: 0.5 MB
Hive Metastore: 5 MB
Then all you need to do is take the number above and multiply it by the number of nodes in the cluster and the duration (in months) you want to calculate the cluster DB utilisation for. For example:
Ambari on a 100 node cluster over 2 years: 0.7 x 100 x 24 = 1680 MB, or approx 1.68 GB
Hive Metastore on a 75 node cluster over 1 year: 5 x 75 x 12 = 4500 MB, or approx 4.5 GB
Now, please remember that this is a very rough approximation, built from a handful of data points from a small set of customers with real-world clusters; don't take this simplistic estimate as a concrete promise. As always, your utilisation of the cluster can severely skew any of these statistics: for example, if you run thousands of jobs via Oozie every day, expect its database to grow significantly quicker, and similarly if you are making continuous config changes via the Ambari API.
However, I think the above is a reasonable start, and feedback would be very welcome. Once I've received some more feedback I'll look to get this into the formal Hortonworks documentation. Hope this helps.
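To make the arithmetic above concrete, here is a minimal sketch of the estimate in Python. The rates are the rough per-node, per-month figures from this article, not official Hortonworks numbers:

```python
# Rough DB sizing sketch for HDP component databases.
# Rates are MB per node, per month, taken from the article's estimates.
RATES_MB = {
    "ambari": 0.7,
    "ranger": 0.5,
    "oozie": 0.5,
    "hive_metastore": 5.0,
}

def estimate_db_size_mb(component: str, nodes: int, months: int) -> float:
    """Estimate database size in MB: rate x node count x months."""
    return RATES_MB[component] * nodes * months

# The two worked examples from the article:
print(estimate_db_size_mb("ambari", 100, 24))         # ~1680 MB (~1.68 GB)
print(estimate_db_size_mb("hive_metastore", 75, 12))  # ~4500 MB (~4.5 GB)
```

The same caveat applies here as to the formula: treat the output as an order-of-magnitude starting point, not a capacity promise.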
04-25-2017
09:22 AM
2 Kudos
Hi @J. D. Bacolod, please take a look at this HCC article for using the API to configure processors on the fly: https://community.hortonworks.com/articles/3160/update-nifi-flow-on-the-fly-via-api.html
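This is not from the linked article, just a hedged sketch of the general payload shape the NiFi REST API expects when updating a processor with PUT /nifi-api/processors/{id}; the processor id, revision version, and property name below are made-up placeholders:

```python
import json

def build_processor_update(processor_id: str, revision_version: int,
                           properties: dict) -> dict:
    """Build a JSON body for PUT /nifi-api/processors/{id}.

    NiFi uses the revision version for optimistic locking, so the current
    version must be fetched (via GET on the processor) before updating;
    only the properties included here are changed.
    """
    return {
        "revision": {"version": revision_version},
        "component": {
            "id": processor_id,
            "config": {"properties": properties},
        },
    }

# Hypothetical example: repoint a PutFile processor's output directory.
payload = build_processor_update(
    "01234567-89ab-cdef-0123-456789abcdef",  # placeholder processor id
    3,                                       # placeholder revision version
    {"Directory": "/data/out"},
)
print(json.dumps(payload, indent=2))
```

The HCC article above walks through the full GET-then-PUT sequence against a live flow.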
Hope that helps!