Member since: 12-14-2015 · Posts: 8 · Kudos Received: 6 · Solutions: 0
04-29-2016 02:28 PM · 1 Kudo
Hi, we have an HDP 2.3.2 cluster (around 50 nodes). Many of our jobs process millions of records a day (sometimes as many as a billion records a day), and we need to assign a unique ID (UUID) to each of these records. We are looking at java.util.UUID.randomUUID() for this. From the documentation and Wikipedia we see that randomUUID() is a good fit, but there is a very small chance that duplicates can be generated. I checked the available entropy on our machines and it is >150.

While we can be reasonably sure that randomUUID() will work for now, is there guidance on when *not* to use randomUUID()? We don't want to go through a centralized service for ID generation, as that would create a bottleneck. Are there other alternatives for generating UUIDs in a Hadoop cluster? We have looked at Snowflake, Flake and FauxFlake, but are not yet convinced they will work for us.

Any pointers on this will be appreciated. Thanks, Raga
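For reference, a minimal sketch of the randomUUID() approach, together with the standard birthday-bound estimate of the duplicate risk. The class and method names here are my own, not from any library; the 2^122 figure comes from the 122 random bits in a version-4 UUID.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UuidCheck {

    // Generate n random (version-4) UUIDs and count how many are distinct.
    static int distinctCount(int n) {
        Set<UUID> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(UUID.randomUUID());
        }
        return seen.size();
    }

    // Birthday-bound approximation of the probability of at least one
    // collision among n version-4 UUIDs: roughly n(n-1)/2 pairs, each
    // colliding with probability 1/2^122, so p ~= n^2 / 2^123.
    static double collisionProbability(double n) {
        return (n * n) / Math.pow(2, 123);
    }

    public static void main(String[] args) {
        // With overwhelming probability all one million UUIDs are distinct.
        System.out.println(distinctCount(1_000_000));
        // At a billion records per day, the estimated daily collision
        // probability is on the order of 1e-19.
        System.out.println(collisionProbability(1e9));
    }
}
```

At these odds, a single day's billion records are vastly more likely to be lost to hardware failure than to a UUID collision; the practical caveat is only that the figure assumes a properly seeded CSPRNG on every node.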
Labels:
- Apache Hadoop
01-11-2016 06:56 PM · 1 Kudo
Hi, the following link says that the Kafka Source and Kafka Sink are supported in the Flume that comes with HDP 2.3: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_HDP_RelNotes/content/new-features-230.html But the JIRA (https://issues.apache.org/jira/browse/FLUME-2242) referenced in the Hortonworks docs says that this feature is only available from Flume 1.6, while HDP 2.3.2 ships Flume 1.5.2. Can someone confirm that the Kafka Source and Kafka Sink are indeed available in the Flume that comes with HDP 2.3.2? Thanks, Raga
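For context, if the Kafka source class has in fact been backported into the HDP Flume build, the agent configuration would look roughly like the following. This is a sketch only; the agent, channel, topic, and host names are hypothetical, and the property names are those documented for the Flume 1.6 Kafka source.

```properties
# Hypothetical agent layout; assumes org.apache.flume.source.kafka.KafkaSource
# is present in the HDP 2.3.2 Flume build (the point of the question above).
agent.sources  = kafka-src
agent.channels = mem-ch
agent.sinks    = hdfs-sink

agent.sources.kafka-src.type             = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.zookeeperConnect = zk-host:2181
agent.sources.kafka-src.topic            = example-topic
agent.sources.kafka-src.groupId          = flume
agent.sources.kafka-src.channels         = mem-ch

agent.channels.mem-ch.type = memory
```

A quick way to verify without documentation is to check whether the class resolves on the cluster, e.g. `grep` for `KafkaSource` in the Flume lib jars that HDP installs.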
Labels:
- Apache Flume
- Apache Kafka
12-14-2015 03:25 PM
Hi, is Spark Streaming (maybe along with Spark SQL) suited for interactive querying, i.e. for generating reporting dashboards in Tableau? We are building a data lake holding all of our organization's data (as Avro-formatted files), and we need to create dashboards and reports in Tableau from the data in the lake. The challenge is that some of these reports have to process millions of records and have strict load-time requirements (sometimes as strict as <10 seconds per report). Right now we are forced to maintain an Oracle data mart (populated from the data lake), from which Tableau pulls the data to generate the reports. We want to avoid a separate data mart, so we are looking at connecting Tableau directly to the Hadoop data lake. Plain Spark batch jobs are ruled out because the reports need to be interactive, but it was good to know (from yesterday's Hortonworks webinar on Spark) that Spark Streaming can be used here.

Are there any similar example use cases that you can point me to? Thanks,
Raga
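As background, one pattern discussed for this setup (interactive BI over a Hadoop data lake without a separate mart) is to cache the Avro data as a Spark SQL table and let Tableau query it through the Spark Thrift JDBC/ODBC server, rather than going through Spark Streaming. The sketch below uses Spark 1.x era Java APIs and assumes the spark-avro package is on the classpath; the HDFS path and table name are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DashboardCache {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dashboard-cache");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Avro files from the data lake (path is hypothetical).
        DataFrame events = sqlContext.read()
            .format("com.databricks.spark.avro")
            .load("hdfs:///datalake/events");

        // Register and cache the table so repeated dashboard queries
        // hit memory instead of rescanning HDFS.
        events.registerTempTable("events");
        sqlContext.cacheTable("events");

        // Tableau would then connect via the Spark Thrift server
        // (JDBC/ODBC) and run SQL like this against the cached table.
        DataFrame summary = sqlContext.sql(
            "SELECT region, COUNT(*) AS cnt FROM events GROUP BY region");
        summary.show();
    }
}
```

Whether the <10 second target is met depends on whether the working set fits in cluster memory; this is a sketch of the approach, not a benchmark.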
Labels:
- Apache Hadoop
- Apache Spark