Member since: 09-18-2015
Posts: 191
Kudos Received: 81
Solutions: 40
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 2045 | 08-04-2017 08:40 AM |
| | 5421 | 05-02-2017 01:18 PM |
| | 1109 | 04-24-2017 08:35 AM |
| | 1116 | 04-24-2017 08:21 AM |
| | 1334 | 06-01-2016 08:54 AM |
04-24-2017
11:11 AM
Hi @Avijeet Dash Generally speaking NiFi will handle that absolutely fine; I've seen it used to move very large video files with no issues. You'll need to ensure that the nodes have the disk, CPU and memory to support the file sizes you're interested in, but otherwise no major concerns! Hope that helps!
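To make that a little more concrete: for large files the main thing to size is NiFi's content repository, which is where the file bodies actually live. A hypothetical `nifi.properties` fragment sketching the usual layout (the paths are made-up examples, not recommendations):

```properties
# nifi.properties (fragment) - paths here are hypothetical examples
# The content repository holds the actual file bodies; give it a large,
# fast disk of its own when moving big files.
nifi.content.repository.directory.default=/data1/nifi/content_repository
# Keep the flowfile (metadata) and provenance repositories on separate
# spindles so bulk content I/O doesn't starve them.
nifi.flowfile.repository.directory=/data2/nifi/flowfile_repository
nifi.provenance.repository.directory=/data3/nifi/provenance_repository
```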
04-24-2017
08:35 AM
Hi @Peter Kim Rack awareness is only used by HDFS to ensure it accurately places replicas of data off-rack; therefore it only needs to have the datanodes listed. Hope that helps!
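For illustration, rack awareness is typically wired up by pointing `net.topology.script.file.name` in `core-site.xml` at a script that maps datanode addresses to rack paths. A minimal sketch (the subnets and rack names are made up):

```shell
#!/bin/bash
# Hypothetical rack topology script. Hadoop invokes it with one or more
# datanode IPs/hostnames as arguments and expects one rack path per argument.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
}

for node in "$@"; do
  resolve_rack "$node"
done
```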
04-24-2017
08:21 AM
Hi @Sanaz Janbakhsh I'm not sure I understand your first question. Ideally the ZooKeeper instances will run on separate hardware from the Kafka nodes; however, it is possible to co-locate them as long as the nodes have enough bandwidth, and preferably dedicated spindles for ZooKeeper. In terms of overall best practices, I'd suggest you download the slides and watch the video from this session at the recent DataWorks Summit Munich: Apache Kafka Best Practices. Session video: https://www.youtube.com/watch?v=maD_7ZdyuAU Session slides: https://www.slideshare.net/HadoopSummit/apache-kafka-best-practices
I hope this helps!
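If you do co-locate, the key point about dedicated spindles is isolating ZooKeeper's transaction log, which is latency-sensitive. A hypothetical `zoo.cfg` fragment (the paths are examples, adjust to your own disk layout):

```properties
# zoo.cfg (fragment) - example paths for illustration
# Snapshots can share a disk with the O/S if needed.
dataDir=/var/lib/zookeeper
# The transaction log is fsync'd on every write; put it on its own
# spindle, away from Kafka's log.dirs.
dataLogDir=/zk-txlog/zookeeper
```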
06-15-2016
11:46 AM
1 Kudo
Hi @Uday Vakalapudi Typically you will always be better off with multiple machines (scale out) rather than a smaller number of large machines (scale up). If you consider the way that Hadoop works, jobs are effectively distributed across the whole cluster and all the resources can be utilised simultaneously. This is the opposite of what virtualisation typically handles, which is multiple machines with different workloads and different workload profiles (I/O, CPU, memory). My short suggestion: if you're just looking at a test/dev/pilot system, then multiple VMs are fine, but for production, consider scaling out on bare metal. Hope that helps.
06-01-2016
12:03 PM
1 Kudo
Hi @Kaliyug Antagonist You can use whatever filesystem you like for the O/S partitions etc; our recommendations are primarily targeted at the HDFS data drives. I wouldn't use ext3 for anything any more; ext4 and XFS have moved forward as the primary default options now. To address your options one by one:

1) No, don't do this.
2) Perfectly acceptable; take care that data drives are mounted with the recommended mount options (noatime etc).
3) Also perfectly acceptable. I see more people using XFS everywhere now, ext4 less so, but the deltas are relatively small; I'd go with whichever option you're more comfortable with as an organisation.
4) I wouldn't recommend that. If you're happy using XFS, use it everywhere; it just makes things easier. But see point 2) about mount options for data drives.
5) You can absolutely use LVM for your O/S partitions; just ideally don't use it for datanode and log directories.

Hope that helps!
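As an example of point 2), an `/etc/fstab` entry for an HDFS data drive might look like the line below (the device and mount point are placeholders for illustration):

```shell
# /etc/fstab (fragment) - device and mount point are placeholders
# noatime stops the kernel writing an access-time update to the inode
# on every read of an HDFS block file, which matters at datanode scale.
/dev/sdb1  /grid/0  xfs  defaults,noatime  0 0
```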
06-01-2016
09:01 AM
All great points, I've always wished that our tutorials came with a short "release notes" link for each one too.
06-01-2016
08:58 AM
Hi @Sree Venkata. To add to Neeraj's already excellent answer and to follow up on your comment, NiFi now *does* support kerberised clusters. There is also now an RDBMS connector, although I'd still say: use Sqoop if you're transferring very large chunks of RDBMS data and want it parallelised across the whole Hadoop cluster; use NiFi if you've got smaller chunks to transfer that can be parallelised over a smaller NiFi cluster. Hope that (in combination with Neeraj's answer) fulfils your requirements.
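For the large-chunk case, Sqoop gets its cluster-wide parallelism by splitting the table on a column across multiple mappers. A hypothetical invocation, just as a sketch (host, database, table and column names are all made up):

```shell
# Hypothetical example - connection details and table are placeholders
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/raw/orders
```

Each of the 8 mappers pulls its own range of `order_id` values, so the transfer runs in parallel across the Hadoop cluster rather than through a single connection.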
06-01-2016
08:54 AM
2 Kudos
Hi @x name. I don't believe there is anything pre-built within Flume to do exactly what you need. Flume itself is certainly production ready and has been in constant use by a very wide range of people for a long time now; it just hasn't evolved much past that point. It also starts to struggle with significant load under the kinds of scenarios you're discussing unless it's very carefully managed. You've already identified the toolset that I'd probably recommend for your requirement, which is NiFi, and you've also identified another article, so I won't go into that any further. As for other tools or patterns, I've seen people build some of their own ingest frameworks using a combination of scripts and things like WebHDFS, or indeed a lot of custom code on top of Kafka. However, with the way that the technology is stacking up now, unless you have a strong reason not to, NiFi solves all the issues you bring up and is easy to use as well; I'd strongly recommend it. If you do find something else please do add a comment here; likewise, if you try NiFi and get stuck at all, don't hesitate to fire over another question! Hope that helps.
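For reference, the "scripts plus WebHDFS" pattern mentioned above is just the standard two-step REST write: ask the NameNode where to put the file, then PUT the bytes to the DataNode it redirects you to. Hostnames, port and path below are placeholders:

```shell
# Placeholders: namenode host/port and target HDFS path
# Step 1: the NameNode replies with a 307 redirect whose Location
# header points at a DataNode write URL.
curl -i -X PUT \
  "http://namenode:50070/webhdfs/v1/data/in/events.log?op=CREATE&overwrite=true"

# Step 2: send the file body to the URL from that Location header.
curl -i -X PUT -T events.log "<Location-header-URL-from-step-1>"
```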
05-31-2016
08:39 AM
1 Kudo
Hi @rogier werschkull. You're correct, those steps currently wouldn't be tracked as lineage, depending on exactly how the data is manipulated. Some of the Hive lineage may be tracked depending on how those tools integrate with the data via the Hive service, for example. SAS and an increasing number of partners, customers and community members are part of the Data Governance Initiative (DGI). You can reasonably expect those members of the DGI to be first in the queue to have their solutions more integrated with Atlas for shared metadata exchange. Hope that helps.
05-26-2016
12:07 PM
Are all services started? Have you tried restarting the ones that have issues?