Member since: 12-14-2015 · Posts: 8 · Kudos Received: 6 · Solutions: 0
04-29-2016 02:28 PM · 1 Kudo
Hi, we have an HDP 2.3.2 cluster (around 50 nodes). Many of our jobs process millions of records a day (sometimes as many as a billion records a day), and we need to assign a unique ID (UUID) to each of these records. We are looking at java.util.UUID.randomUUID() for this. From the documentation and Wikipedia we see that randomUUID() is a good fit, but there is a very small chance that duplicates can be generated. I checked the available entropy on our machines and it is >150.

While we can be reasonably sure that randomUUID() will work for now, is there guidance on when *not* to use randomUUID()? We don't want to go through a centralized service for ID generation, as that would create a bottleneck. Are there other alternatives for generating UUIDs in a Hadoop cluster? We have looked at Snowflake, Flake and FauxFlake, but are not yet convinced they will work for us.

Any pointers on this will be appreciated. Thanks, Raga
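For reference, a minimal sketch of the randomUUID() approach, together with the standard birthday-bound estimate of the duplicate risk. The class and method names here are my own, not from any library; the 2^122 figure comes from the 122 random bits in a version-4 UUID.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UuidCheck {

    // Generate n random (version-4) UUIDs and count how many are distinct.
    static int distinctCount(int n) {
        Set<UUID> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(UUID.randomUUID());
        }
        return seen.size();
    }

    // Birthday-bound approximation of the probability of at least one
    // collision among n version-4 UUIDs: roughly n(n-1)/2 pairs, each
    // colliding with probability 1/2^122, so p ~= n^2 / 2^123.
    static double collisionProbability(double n) {
        return (n * n) / Math.pow(2, 123);
    }

    public static void main(String[] args) {
        // With overwhelming probability all one million UUIDs are distinct.
        System.out.println(distinctCount(1_000_000));
        // At a billion records per day, the estimated daily collision
        // probability is on the order of 1e-19.
        System.out.println(collisionProbability(1e9));
    }
}
```

At these odds, a single day's billion records are vastly more likely to be lost to hardware failure than to a UUID collision; the practical caveat is only that the figure assumes a properly seeded CSPRNG on every node.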
Labels:
- Apache Hadoop
01-11-2016 06:56 PM · 1 Kudo
Hi, the following link says that the Kafka Source and Kafka Sink are supported in the Flume that comes with HDP 2.3: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_HDP_RelNotes/content/new-features-230.html But the JIRA (https://issues.apache.org/jira/browse/FLUME-2242) referenced in the Hortonworks docs says that this feature is only available from Flume 1.6, while HDP 2.3.2 ships Flume 1.5.2. Can someone confirm that the Kafka Source and Kafka Sink are indeed available in the Flume that comes with HDP 2.3.2? Thanks, Raga
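For context, if the Kafka source class has in fact been backported into the HDP Flume build, the agent configuration would look roughly like the following. This is a sketch only; the agent, channel, topic, and host names are hypothetical, and the property names are those documented for the Flume 1.6 Kafka source.

```properties
# Hypothetical agent layout; assumes org.apache.flume.source.kafka.KafkaSource
# is present in the HDP 2.3.2 Flume build (the point of the question above).
agent.sources  = kafka-src
agent.channels = mem-ch
agent.sinks    = hdfs-sink

agent.sources.kafka-src.type             = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.zookeeperConnect = zk-host:2181
agent.sources.kafka-src.topic            = example-topic
agent.sources.kafka-src.groupId          = flume
agent.sources.kafka-src.channels         = mem-ch

agent.channels.mem-ch.type = memory
```

A quick way to verify without documentation is to check whether the class resolves on the cluster, e.g. `grep` for `KafkaSource` in the Flume lib jars that HDP installs.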
Labels:
- Apache Flume
- Apache Kafka
12-14-2015 03:25 PM
Hi, is Spark Streaming (maybe along with Spark SQL) suited for interactive querying, i.e. for generating reporting dashboards in Tableau? We are building a data lake holding all of our organization's data (as Avro-formatted files), and we need to create dashboards and reports in Tableau from the data in the lake. The challenge is that some of these reports have to process millions of records and have strict load-time requirements (sometimes as strict as <10 seconds per report). Right now we are forced to maintain an Oracle data mart (populated from the data lake), from which Tableau pulls the data to generate the reports. We want to avoid a separate data mart, so we are looking at connecting Tableau directly to the Hadoop data lake. Plain Spark batch jobs are ruled out because the reports need to be interactive, but it was good to know (from yesterday's Hortonworks webinar on Spark) that Spark Streaming can be used here.

Are there any similar example use cases that you can point me to? Thanks,
Raga
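As background, one pattern discussed for this setup (interactive BI over a Hadoop data lake without a separate mart) is to cache the Avro data as a Spark SQL table and let Tableau query it through the Spark Thrift JDBC/ODBC server, rather than going through Spark Streaming. The sketch below uses Spark 1.x era Java APIs and assumes the spark-avro package is on the classpath; the HDFS path and table name are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DashboardCache {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dashboard-cache");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Avro files from the data lake (path is hypothetical).
        DataFrame events = sqlContext.read()
            .format("com.databricks.spark.avro")
            .load("hdfs:///datalake/events");

        // Register and cache the table so repeated dashboard queries
        // hit memory instead of rescanning HDFS.
        events.registerTempTable("events");
        sqlContext.cacheTable("events");

        // Tableau would then connect via the Spark Thrift server
        // (JDBC/ODBC) and run SQL like this against the cached table.
        DataFrame summary = sqlContext.sql(
            "SELECT region, COUNT(*) AS cnt FROM events GROUP BY region");
        summary.show();
    }
}
```

Whether the <10 second target is met depends on whether the working set fits in cluster memory; this is a sketch of the approach, not a benchmark.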
Labels:
- Apache Hadoop
- Apache Spark