Member since: 07-06-2021
Posts: 3
Kudos Received: 0
Solutions: 0
07-22-2021
10:40 AM
Thank you for the insights! Would you consider a custom Kafka producer and an HDFS Kafka consumer to be an adequate replacement for Flume/NiFi? At a glance it looks like I could avoid the separate ingest stage entirely and just go with Kafka.
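To make the idea concrete, here is a rough sketch of the record framing such a custom producer might use before publishing binary samples to Kafka. This is pure stdlib Python; the actual Kafka send call in the trailing comment assumes the kafka-python package and a hypothetical broker/topic, so treat it as an illustration rather than a working setup:

```python
import struct
import time

# Minimal framing for one binary sample: a big-endian header of
# (timestamp: f64, name length: u32, payload length: u32),
# followed by the UTF-8 stream name and the raw payload bytes.
HEADER = ">dII"
HEADER_SIZE = struct.calcsize(HEADER)  # 16 bytes


def frame_record(stream_name: str, payload: bytes, ts: float = None) -> bytes:
    """Pack one sample into a self-describing byte record."""
    name = stream_name.encode("utf-8")
    if ts is None:
        ts = time.time()
    return struct.pack(HEADER, ts, len(name), len(payload)) + name + payload


def unframe_record(record: bytes):
    """Inverse of frame_record; the HDFS-writing consumer would do this."""
    ts, name_len, payload_len = struct.unpack_from(HEADER, record, 0)
    name = record[HEADER_SIZE:HEADER_SIZE + name_len].decode("utf-8")
    start = HEADER_SIZE + name_len
    payload = record[start:start + payload_len]
    return name, ts, payload


# A producer would then publish each framed record, e.g. with kafka-python
# (assumed dependency; broker address and topic name are hypothetical):
#   KafkaProducer(bootstrap_servers="broker:9092").send(
#       "raw-streams", value=frame_record("bpm.sensor.01", data))
```

The point is that Kafka treats values as opaque bytes, so binary data needs some agreed framing (or a schema registry) between producer and consumer.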
07-22-2021
02:22 AM
Is use of Flume in 2021 still the way to go? I tried searching for insights on this topic and did not find much. The latest release is from 2019, there is little or no activity on GitHub, and there are lots of unmerged pull requests, too.

I'm planning on coding a custom source for Flume. My data is binary (not text), unstructured, arriving from a legacy control system. I do not plan (ATM) on having any filtering or processing applied to the data stream. The data stream will then be routed through a memory channel to HDFS. FWIW, the environment I will be working in is closed and I cannot use a cloud-based solution; the data rates would also be too big to pipe through the internet.

Looking for alternatives, there are lots of suggestions that sometimes do not feel like Flume alternatives (for example, Apache NiFi, Spark, and the like), but then again, I'm new to this "big data" ecosystem and I might be mistaken. It might be that I could go with Kafka and skip Flume altogether, I guess, but I need to educate myself more about the options. Thank you for the input!
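For reference, the topology I have in mind would look roughly like this in Flume configuration terms. The agent name and the source class are placeholders for the custom source I'd write; the channel and sink settings are just a sketch:

```properties
# Hypothetical single-agent pipeline: custom source -> memory channel -> HDFS sink
a1.sources = legacy-src
a1.channels = mem-ch
a1.sinks = hdfs-sink

# Placeholder class name for the custom source to be written
a1.sources.legacy-src.type = com.example.flume.LegacyControlSystemSource
a1.sources.legacy-src.channels = mem-ch

a1.channels.mem-ch.type = memory
a1.channels.mem-ch.capacity = 10000

a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.channel = mem-ch
a1.sinks.hdfs-sink.hdfs.path = /data/streams/%Y-%m-%d
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Note that a memory channel trades durability for speed: events in flight are lost if the agent dies, which may or may not be acceptable for this data.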
Labels:
Apache Flume
07-06-2021
05:56 AM
I'm looking at a possible use of the Hadoop ecosystem for high-rate, high-volume scientific data. The data that needs to be stored arrives in streams updating ~20 times a second, with each update delivering anywhere from a single value to an array of 500k values (could be integers, doubles, ..). Streams have a name, the data is binary (not text), and it comes with timestamps. There could be millions of such streams to handle. I would look into storing these input streams in HDFS with the help of Avro.

From the client perspective I would preferably like to work with Python (not really looking for SQL-like access at the moment). The user should be able to query for data by stream name and fetch data from different time slices.

Assuming I would be able to scale up the node count and storage space as required, is this use case something the Hadoop ecosystem would be good at? Are there any use cases like this out there? Any benchmarks I can look at? Thank you in advance!
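To sketch what I mean by storing with Avro, a record per sample might look something like this. The record and field names are just my assumption of what would be needed; the real schema would depend on the control system's data types:

```json
{
  "type": "record",
  "name": "StreamSample",
  "fields": [
    {"name": "stream", "type": "string"},
    {"name": "timestamp_ns", "type": "long"},
    {"name": "values", "type": {"type": "array", "items": "double"}}
  ]
}
```

Partitioning the HDFS layout by stream name and time window would then make the "fetch a time slice of one stream" query a matter of reading a few files rather than scanning everything.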
Labels:
Apache Hadoop