Member since
09-18-2015
191
Posts
81
Kudos Received
40
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2054 | 08-04-2017 08:40 AM |
| | 5425 | 05-02-2017 01:18 PM |
| | 1112 | 04-24-2017 08:35 AM |
| | 1116 | 04-24-2017 08:21 AM |
| | 1344 | 06-01-2016 08:54 AM |
04-24-2017
11:11 AM
Hi @Avijeet Dash. Generally speaking, NiFi will handle that absolutely fine; I've seen it used to move very large video files with no issue. You'll need to ensure that the nodes have enough disk, CPU and memory to support the file sizes you're interested in, but otherwise there are no major concerns! Hope that helps!
04-24-2017
08:35 AM
Hi @Peter Kim. Rack Awareness is only used by HDFS to ensure it accurately places replicas of data off-rack, so it only needs to have the DataNodes listed. Hope that helps!
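As a rough illustration of listing only the DataNodes, here is a minimal sketch of a rack topology script of the kind HDFS can call through the `net.topology.script.file.name` property. The IP addresses, rack names and the `resolve_racks` helper are hypothetical and only there for the example; your own mapping will differ.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS rack topology script (hypothetical host-to-rack mapping)."""
import sys

# Hypothetical mapping: only DataNode addresses need an entry here.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}

# Anything not listed falls back to a single default rack.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per input host, in the same order."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # The script receives one or more IPs/hostnames as arguments
    # and prints one rack path for each of them.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

Hosts that are not DataNodes simply resolve to the default rack, which is why they don't need to be listed.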
04-24-2017
08:21 AM
Hi @Sanaz Janbakhsh. I'm not sure I understand your first question. Ideally the ZooKeeper instances will be running on separate hardware to the Kafka nodes; however, it is possible to co-locate them as long as you have enough bandwidth on the nodes, and preferably dedicate spindles to ZK. In terms of best practices overall, I'd suggest you download the slides and watch the video from this session at the recent DataWorks Summit Munich: Apache Kafka Best Practices.
Session video: https://www.youtube.com/watch?v=maD_7ZdyuAU
Session slides: https://www.slideshare.net/HadoopSummit/apache-kafka-best-practices
I hope this helps!
06-15-2016
11:46 AM
1 Kudo
Hi @Uday Vakalapudi. Typically you will always be better off with multiple machines (scale out) rather than a smaller number of large machines (scale up). If you consider the way that Hadoop works, jobs are effectively distributed across the whole cluster and all the resources can be utilised simultaneously. This is the opposite of what virtualisation typically handles, which is multiple machines with different workloads and different workload profiles (I/O, CPU, memory). My short suggestion would be: if you're just looking at a test/dev/pilot system, then multiple VMs are fine, but for production, consider scale out on bare metal. Hope that helps.
06-01-2016
12:03 PM
1 Kudo
Hi @Kaliyug Antagonist. You can use whatever filesystem you like for the O/S filesystems etc.; our recommendations are primarily targeted at the HDFS data drives. I wouldn't use ext3 for anything any more; ext4 and XFS are the primary default options now. So, to try and address your options one by one:
1) No, don't do this.
2) Perfectly acceptable. Take care that data drives are mounted with the recommended mount options (noatime etc.).
3) Also perfectly acceptable. I see more people using XFS everywhere now and ext4 less so, but the deltas are relatively small. I'd go with whichever option you're more comfortable with as an organisation.
4) I wouldn't recommend that. If you're happy using XFS, use it everywhere, as it just makes things easier, but see point 2) about mount options for data drives.
5) You can absolutely use LVM for your O/S partitions; just ideally don't use it for DataNode and log directories.
Hope that helps!
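For the mount options mentioned in point 2), a small check along these lines can flag data drives that are not mounted with noatime. This is only a sketch: the `mounts_missing_option` helper, the `/grid/0` and `/grid/1` mount points and the sample text are hypothetical, and the text is assumed to follow the `/proc/mounts` layout (device, mount point, filesystem type, options).

```python
def mounts_missing_option(mounts_text, data_dirs, option="noatime"):
    """Return the data directories whose mount entry lacks the given option.

    A directory that does not appear as a mount point at all is also reported.
    """
    options_by_mount = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Fields: device, mount point, filesystem type, options, ...
            options_by_mount[fields[1]] = set(fields[3].split(","))
    return [d for d in data_dirs if option not in options_by_mount.get(d, set())]


# Hypothetical sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / xfs rw,relatime 0 0
/dev/sdb1 /grid/0 xfs rw,noatime 0 0
/dev/sdc1 /grid/1 ext4 rw,relatime 0 0
"""

print(mounts_missing_option(SAMPLE, ["/grid/0", "/grid/1"]))  # prints ['/grid/1']
```

On a real node you could pass in the contents of `/proc/mounts` and your own list of data directories instead of the sample.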
06-01-2016
09:01 AM
All great points, I've always wished that our tutorials came with a short "release notes" link for each one too.
06-01-2016
08:58 AM
Hi @Sree Venkata. To add to Neeraj's already excellent answer and to follow your comment, NiFi now *does* support kerberised clusters. There is also now an RDBMS connector, although I'd still say: use Sqoop if you're transferring very large chunks of RDBMS data and you want it parallelised across the whole Hadoop cluster, and use NiFi if you've got smaller chunks to transfer that can be parallelised over a smaller NiFi cluster. Hope that (in combination with Neeraj's answer) fulfils your requirements.
06-01-2016
08:54 AM
2 Kudos
Hi @x name. I don't believe there is anything pre-built within Flume to do exactly what you need. Flume itself is certainly production ready and has been in constant use by a very wide range of people for a long time now; it just hasn't evolved past that point very much. It also starts to struggle with significant load under the kind of scenarios you're discussing unless it's very carefully managed. You've already identified the tool set that I'd probably recommend for your requirement, which is NiFi. You've also identified another article, so I won't go into that any further. As for other tools or patterns, I've seen people build some of their own ingest frameworks using a combination of scripts and things like WebHDFS, or indeed a lot of custom code on top of Kafka. However, with the way that the technology is stacking up now, unless you have a strong reason not to, I'd strongly recommend NiFi: it solves all the issues you bring up and is easy to use as well. If you do find something else, please do add a comment here. Likewise, if you try NiFi and you get stuck at all, don't hesitate to fire over another question! Hope that helps.
05-31-2016
08:39 AM
1 Kudo
Hi @rogier werschkull. You're correct, those steps currently wouldn't be tracked as lineage, depending on exactly how the data is manipulated. Some of the Hive lineage may be tracked, depending on how those tools integrate with the data (via the Hive service, for example). SAS and an increasing number of partners, customers and community members are part of the Data Governance Initiative (DGI). You can reasonably expect those members of the DGI to be first in the queue to have their solutions more integrated into Atlas for the shared metadata exchange. Hope that helps.
05-26-2016
12:07 PM
Are all services started? Have you tried restarting the ones that have issues?