Member since: 06-29-2016
Posts: 81
Kudos Received: 43
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 500 | 03-16-2016 08:26 PM
10-27-2017
11:15 AM
I have come across the DataPlane Service, which was announced recently. A few questions in that regard:
1. As far as I understand, it contains Atlas, Ranger, and Knox. Is this correct?
2. What is the motivation behind this new product? If I have HDP, I get Ranger, Knox, and Atlas anyway.
3. What is the applicability of this product? Can I use it with other Hadoop distributions?
4. Has the sandbox for DataPlane been released? Where can I find the download?
03-14-2017
03:34 PM
My question on using SAN as the backend storage for HDFS has three main parts:
1. Is it feasible to use SAN as the backend storage for HDFS?
2. What are the pros and cons of using SAN or NAS for HDFS?
3. Has it been tested for performance and perhaps other aspects?
Labels:
- Apache Hadoop
01-23-2017
01:52 PM
@Devin Pinkston For some reason the link does not work. But essentially, I would like to know Hortonworks' recommendation on using BlueData for HDP, where on that platform I also get the other services running (multi-tenancy).
01-23-2017
09:46 AM
I came across this and found BlueData's BDaaS interesting. What is Hortonworks' recommendation on running HDP on BlueData?
Labels:
- Hortonworks Data Platform (HDP)
01-06-2017
04:16 AM
1 Kudo
@Tom McCuch Thanks for the clarification. One other related question: in general, what advantages would Mesos bring over YARN, especially given that Hortonworks is making efforts to support HDP on Mesos? I mean, why care? Even with HDP in the cloud, YARN is still going to be the cluster manager.
01-05-2017
04:41 PM
1 Kudo
Is it possible to deploy the HDP Docker container on Mesos using Marathon? If so, where can I get the Docker images and the Marathon recipes? If it is not possible with the combination above, what are the options to deploy HDP on Mesos? How would it be better than running on YARN?
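For reference, the kind of deployment I am imagining is a Marathon app definition POSTed to Marathon's REST API; the sketch below uses a hypothetical hdp-sandbox image name and Marathon endpoint, and is not an official HDP-on-Mesos recipe.
# Hypothetical Marathon app definition for a single HDP sandbox container (sketch only).
cat > hdp-sandbox.json <<'EOF'
{
  "id": "/hdp-sandbox",
  "cpus": 4,
  "mem": 16384,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "example/hdp-sandbox:2.5", "network": "HOST" }
  }
}
EOF
# Submit the app definition to Marathon's REST API.
curl -X POST http://marathon-host:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d @hdp-sandbox.json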
Labels:
- Hortonworks Data Platform (HDP)
12-30-2016
09:58 AM
2 Kudos
My understanding, along with questions, is as follows.

AWS - HDCloud
Manual scaling using Ambari or the AWS UI is possible.
Auto scaling:
1. Is it possible to auto-scale in this option (while creating the cluster, can I set an auto-scaling group)?
1.1. In that case, how is the data rebalanced? i.e. if a new node is added, then compute may not gain data locality.

AWS - HDP on IaaS
Manual scaling using Ambari is possible.
Auto scaling without Cloudbreak:
2. Is it possible to auto-scale in this option (while creating the cluster, can I set an auto-scaling group)?
2.1. In that case, how is the data rebalanced? i.e. if a new node is added, then compute may not gain data locality.
Auto scaling with Cloudbreak:
Auto-scaling may be possible, but question 2.1 applies here as well.

Azure - HDInsight
Manual scaling using Ambari or the Azure UI is possible.
Auto scaling:
3. Is it possible to auto-scale in this option (while creating the cluster, can I set an auto-scaling group)?
3.1. In that case, how is the data rebalanced? i.e. if a new node is added, then compute may not gain data locality.

Azure - HDP in the Marketplace
Manual scaling using Ambari or the Azure UI is possible.
Auto scaling:
4. Is it possible to auto-scale in this option (while creating the cluster, can I set an auto-scaling group)?
4.1. In that case, how is the data rebalanced? i.e. if a new node is added, then compute may not gain data locality.

Azure - HDP on IaaS
Same questions as AWS - HDP on IaaS.
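For the rebalancing part of the question, the manual equivalent I know of is running the HDFS balancer after nodes are added; a minimal sketch (the threshold value is only an illustration):
# Redistribute existing blocks onto newly added DataNodes.
# -threshold 10 allows each DataNode's utilization to deviate up to 10% from the cluster average.
hdfs balancer -threshold 10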
Labels:
- Hortonworks Cloudbreak
12-30-2016
09:38 AM
@Tom McCuch One last question, which I got after reading your answer again: WASB in Azure is supported on both HDP on Azure IaaS and HDP in the Azure Marketplace. Does this mean that WASB is natively optimized in Hadoop 2.x? If so, would this also mean that any distribution with Hadoop 2.x deployed on Azure can use WASB for storage?
12-28-2016
03:09 PM
@Tom McCuch So to summarize, please correct as appropriate:
1. HDI 3.5 - WASB and ADLS
2. Pre-HDI 3.5 - only WASB
3. HDP on Azure IaaS - only WASB and HDFS on VHD
4. HDP from the Azure Marketplace - only WASB and HDFS on VHD
5. HDCloud 2.5 - S3 only
6. HDP on AWS IaaS - HDFS on ephemeral or EBS storage
12-27-2016
08:29 AM
@Tom McCuch Thanks. Can you also please talk a little bit about ADLS? Do you still recommend WASB over ADLS? And I am not clear on the parallelism factor for S3 and WASB. Are you saying that S3 does not offer parallelism and is suitable for a larger number of smaller files? What is your take on parallelism when it comes to WASB? And can I use WASB, ADLS, and S3 as the HDFS layer when I install HDP on Azure's IaaS (using Cloudbreak)?
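For context, my mental model is that once the connectors and credentials are configured, each store is addressed through a filesystem URI; a minimal sketch with placeholder account, container, and bucket names:
# WASB (needs fs.azure.account.key.<account>.blob.core.windows.net in core-site.xml)
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/data/
# ADLS (hypothetical store name; needs the ADL connector and OAuth credentials configured)
hadoop fs -ls adl://mystore.azuredatalakestore.net/data/
# S3 via the s3a connector (needs fs.s3a.access.key/fs.s3a.secret.key or an IAM role)
hadoop fs -ls s3a://my-bucket/data/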
12-22-2016
03:51 AM
5 Kudos
What are the storage options when deploying HDP in the cloud? My understanding is as follows.

1. Azure (HDInsight, HDP via Cloudbreak, HDP in the Marketplace)
WASB - What about parallelism here? i.e. if I store a file here and run a MapReduce job processing this file, would I achieve the same effect as I achieve with HDFS storage?
ADLS - Although not co-located, performance can be improved by means of parallelism.
HDFS itself - I can move the data to the edge node and then copy it into HDFS.
What are my options to move my data into WASB and ADLS? This thread suggests NiFi, but my requirement is ephemeral and a NiFi investment may not sell.

2. AWS (the questions below apply to HDCloud and HDP via Cloudbreak on AWS)
S3 - What about parallelism here? i.e. if I store a file here and run a MapReduce job processing this file, would I achieve the same effect as I achieve with HDFS storage?
HDFS itself - I can move the data to the edge node and then copy it into HDFS.

And out of these storage options, which one is better than the other, and for what reason?
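For the data movement question, the option I am aware of besides NiFi is DistCp against the cloud store URIs; a minimal sketch with placeholder paths and account names:
# Copy an HDFS directory into a WASB container (account key configured in core-site.xml)
hadoop distcp hdfs://namenode:8020/data/raw wasb://mycontainer@myaccount.blob.core.windows.net/data/raw
# Same idea for S3 using the s3a connector
hadoop distcp hdfs://namenode:8020/data/raw s3a://my-bucket/data/raw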
Labels:
- Hortonworks Data Platform (HDP)
12-21-2016
04:42 PM
@Greg Keys Thanks again. Hopefully the last set of questions:
1. With HDP in the Azure Marketplace, we cannot use the OS of our choice. With Cloudbreak, can we specify the OS?
2. Storage in Azure - are HDFS, WASB, and ADLS options for all deployment options of HDP: IaaS (Cloudbreak, Marketplace) and HDInsight?
3. With HDC can I choose the OS?
4. What are the storage options for HDCloud? Is it HDFS and S3 (same as for HDP on AWS IaaS through Cloudbreak)?
5. Can I deploy HDP via Cloudbreak in an AWS VPC similar to the way I can deploy in the AWS public cloud?
6. Can I deploy HDC on an AWS VPC?
7. What are my options to move data from on-premise to the AWS public cloud (S3, HDFS) and an AWS VPC (S3, HDFS)? (This may not be strictly an HDP question!)
8. What are my options to move data from on-premise to the Azure public cloud (WASB, ADLS, HDFS)?
9. Can I spin up HDInsight or HDP (Cloudbreak or Marketplace) in an Azure private cloud? (I assume that Azure offers two flavors of private cloud - one on-premise hosted and the other similar to a VPC.)
12-21-2016
02:15 PM
@Greg Keys Thanks a lot. A few follow-up questions:
1. Option 2 that I was talking about is what I see in the Azure portal. Please see the attachments hdponazure.png and hdponazure-clustercreation.png.
2. What about the "Data Lake Store" as a storage option for all of the options?
3. With respect to performance, my question was more around the issues due to compute and storage not being co-located.
4. And what is the purpose of HDCloud? Is it similar to Cloudbreak for AWS? Is it for HDP on AWS IaaS?
5. And the HDC that you mentioned above - is that an HDP-as-a-Service offering from AWS?
12-21-2016
11:26 AM
2 Kudos
I see three different options to deploy HDP on Azure:
1. HDInsight (built on top of HDP)
2. HDP as a Service
3. Deploying HDP on Azure's bare metal
In my understanding, 1 and 2 are managed services where control is limited when it comes to the choice of OS, etc. HDInsight has multiple cluster types (not sure what the rationale behind this is, though).
Questions:
1. What is the rationale behind having multiple cluster types for HDInsight?
2. Why are two services (1 and 2 above) offered? When to use which? (apart from this)
3. Are there any performance benchmarks done on HDInsight or HDP on Azure in a production situation?
4. What are the different storage types possible on the above services? At least on HDInsight I see that Blob storage and Data Lake Store are options, but both are external to the compute nodes. That may hit performance, hence I am curious about question 3, apart from the fact that the cluster runs on virtual machines.
5. What are the options to provision HDP on Azure bare metal nodes (option 3)? Does Cloudbreak help there?
Labels:
- Hortonworks Data Platform (HDP)
08-25-2016
08:50 AM
@Tom McCuch Thanks a lot for the views and inputs. It definitely helps.
08-22-2016
06:14 AM
@Tom McCuch Thanks again. Do you recommend that the data be sorted for the ORC optimization to work, or does it not really matter? And is there any benchmark volume with performance testing done for ad-hoc queries with the optimization mentioned above?
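For concreteness, the kind of sorted load I have in mind is the following Hive sketch (hypothetical table and column names; the data is sorted on the column most often used in predicates):
-- Sort within each partition on the predicate column so ORC min/max statistics can skip stripes.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE claims_orc PARTITION (load_date)
SELECT claim_id, customer_id, amount, load_date
FROM claims_staging
DISTRIBUTE BY load_date
SORT BY customer_id;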
08-18-2016
09:45 AM
@Tom McCuch Thanks for the detailed response. In terms of querying capabilities (from a BI tool, a CLI, or Hue), to achieve the faster query response required for operational reporting, one way is to structure the data (by means of partitioning etc.) for pre-defined queries. But for ad-hoc operational reporting queries, what is your take on an ODS in Hadoop to achieve the desired performance? One way is to restrict the volume of data (in addition to ORC format, Tez, etc.) in the ODS layer, as it is for operational needs anyway (so history may not be required). Please share your thoughts.
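For concreteness, the volume restriction I have in mind is partitioning the ODS tables by load date and dropping partitions that fall outside the operational window; a minimal Hive sketch with hypothetical names:
-- Hypothetical ODS table partitioned by load date, stored as ORC.
CREATE TABLE IF NOT EXISTS ods_policy (
  policy_id STRING,
  status STRING,
  premium DECIMAL(10,2))
PARTITIONED BY (load_date STRING)
STORED AS ORC;
-- Keep only the operational window by dropping older partitions.
ALTER TABLE ods_policy DROP IF EXISTS PARTITION (load_date < '2016-07-01');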
08-16-2016
09:48 AM
2 Kudos
I have been asked to build an ODS (operational data store) in Hadoop for an insurance client. In this regard, a few questions:
First of all, is it recommended to build the ODS in Hadoop? What are the pros and cons of building an ODS in Hadoop? Are there any best practices around this topic? The ODS should facilitate the operational reporting needs and should support ad-hoc queries.
Labels:
- Apache Hadoop
06-29-2016
01:00 PM
@Benjamin Leonhardi Thanks, makes sense
06-28-2016
10:13 PM
Data comes from multiple sources and is exposed in Hive tables for the users. A specific column is sensitive and needs to be given restricted access. If a user wants to join two such tables on the column that he does not have access to, what is the best approach to make it work? One option is to link the sensitive column with a generated key so that the user can join on the generated key. Is this a good idea, or is there a better one?
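To illustrate the generated-key option: one way could be to expose a deterministic hash of the sensitive column through views, so both tables can be joined without revealing the raw value (a sketch with hypothetical names; it assumes a Hive version where sha2() is available, otherwise a custom UDF would be needed):
-- Views hide the raw sensitive column and expose a derived join key instead.
CREATE VIEW customers_safe AS
SELECT customer_name, city, sha2(ssn, 256) AS ssn_key
FROM customers;
CREATE VIEW claims_safe AS
SELECT claim_id, claim_amount, sha2(ssn, 256) AS ssn_key
FROM claims;
-- Users without access to the ssn column can still join on the derived key.
SELECT c.customer_name, cl.claim_amount
FROM customers_safe c
JOIN claims_safe cl ON c.ssn_key = cl.ssn_key;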
Labels:
- Apache Hadoop
- Apache Hive
05-13-2016
01:53 PM
What does it mean for a Hive table in ORC or Avro format to have the field delimiter specified? Does Hive ignore it even if it is specified? For example:
CREATE TABLE IF NOT EXISTS T (
  C1 STRING,
  C2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");
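For reference, one way to see what Hive actually recorded for the table (SerDe, input/output formats, and table properties) is:
-- Shows the SerDe, storage formats, and table properties Hive stored for T.
DESCRIBE FORMATTED T;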
Labels:
- Apache Hive
04-11-2016
02:50 PM
Can a non-numeric column be specified as the --split-by key parameter? What are the potential issues in doing so?
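For concreteness, the kind of import I have in mind looks like the sketch below (the connection string, table, and column names are placeholders; newer Sqoop versions may also require the allow_text_splitter property to accept a text split column):
# Split the import on a string column; Sqoop generates per-mapper range predicates on it.
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --split-by customer_name \
  --num-mappers 4 \
  --target-dir /data/orders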
Labels:
- Apache Sqoop
03-30-2016
01:24 PM
@Benjamin Leonhardi I think I got it. It is still the same number of files, but with more reducers. In my mind, it was always just the buckets, not the partitions. So I thought it was 30 files (30 buckets and 40 partitions), but in fact it is still 1200 files in both cases; the optimized case just uses more reducers.
03-28-2016
01:21 PM
1 Kudo
@Benjamin Leonhardi On 3, by normal load I referred to your slides 14 and 15. As per those, if you have 30 buckets and 40 partitions, you would have 30 reducers in total (one reducer per bucket across all partitions). So it is only 30 files versus 1200 files in the optimized case. That is why I still wonder how it fixes the small-file problem (as per slide 16). At the same time, I understand the point about the performance and memory issues; it really is optimized in those two respects.
03-25-2016
09:47 PM
1 Kudo
@Benjamin Leonhardi, Thanks. Just one last set of questions:
1. Sort by only sorts within a reducer. So if you have 10 reducers, it ends up in 10 different ORC files. If you apply sort by on column C1, it may still happen that the same C1 value appears in 10 files unless you distribute by C1. But within each of those files, sorting may help to skip blocks. Am I right?
2. Does ORC maintain the index at the block level or the stripe level? (As per slide 6 it looks like the block level, but as per slide 4 it is at the stripe level.) If it is at the stripe level, it can skip a stripe, but if a stripe has to be read, does it have to read the entire stripe?
3. And on "Optimized", I understand it in terms of performance, but it still has more reducers than the normal load, so how does it fix the small-file problem?
4. Maybe PPD is only for the ORC format, but do the other concepts of partitioning, bucketing, and optimized loading apply to other formats as well?
03-25-2016
08:01 PM
@Joseph Niemiec You mentioned "Left outerjoin and test for null in the WHERE is probably better for scaling then UNION DISTINCT if you are worried about a reducer problem. Same join syntax as the example below..." How does a left outer join avoid the reducer (unless it is a map join)? Do you recommend a left outer join over union distinct? And on the point "We have found a fun case where if you try to use this to dedupe or clean.....", my understanding is that if a partition has 5 records which are duplicates (the initial master load already had them), there is no way to remove them unless a 6th record which is a duplicate of those 5 comes in with the staging load. Am I right? If so, what is your recommendation to remove duplicates in the initial load itself?
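For my own clarity, the left-outer-join pattern I understand from that comment is roughly the following sketch (hypothetical staging and master tables; only staging rows whose key is absent from the master survive):
-- Keep only staging records whose key does not already exist in the master table.
INSERT INTO TABLE master
SELECT s.*
FROM staging s
LEFT OUTER JOIN master m
  ON s.record_key = m.record_key
WHERE m.record_key IS NULL;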
03-25-2016
07:38 PM
1 Kudo
@Benjamin Leonhardi I went through your slides and have a few questions around them. Since it is related to bucketing and partitioning, I think it makes sense to continue in the same thread.
1. In dynamic partition (DP) loading, you used the term standard load. Does that mean setting the number of reducers to 0, or do you mean something else?
2. You mentioned that a larger number of writers and a large number of partitions lead to small files. The number of mappers is based on the number of blocks, and each mapper writes separate files to every partition. So irrespective of the number of partitions, a large number of mappers by itself can lead to small files. Am I right?
3. What is the default key used for distribution if you don't use the distribute by clause? What is the distribution key when there is more than one partition?
4. Slide 13 - to enable this kind of load, do you recommend just setting the number of reducers to be the same as the number of partitions? And any reducer can get data for any partition, which may lead to small files; is this what you mean by a hash conflict?
5. Slide 14 - one reducer for each bucket across all partitions leads to ORC writer memory issues. Why is this the case?
6. Optimized dynamic sorted partitioning - one reducer for each partition and bucket. From the point above, with 5 partitions and 4 buckets there would be only 4 reducers, but in the optimized case there are 20 reducers. The more reducers there are, the smaller the files are going to be, so how can this solve the small-file problem?
7. Sort by for PPD - the ORC index would anyway help to avoid reading some blocks. But when it comes to reading the block which has the predicate value, sorting helps performance only when the predicate value is reached quickly while reading the file. If the value happens to be at the end of the file, you still end up reading the whole file. So the performance improvement of PPD with sort by really depends, am I right?
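For my own reference, the optimized sorted dynamic partition load discussed in the slides would look roughly like this (hypothetical table and column names; the first setting is the Hive flag for this feature):
-- Let Hive sort and route rows so each partition/bucket is written by a dedicated reducer.
SET hive.optimize.sort.dynamic.partition=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales_orc PARTITION (sale_date)
SELECT customer_id, product_id, amount, sale_date
FROM sales_staging;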
03-22-2016
03:00 PM
@Joseph Niemiec It looks like this approach puts a restriction that the columns to be compared for duplication have to be the partition columns. Not all columns may qualify as partition columns. Even if I compute a hash of those columns, partitioning on that hash column may not be viable due to its high cardinality. Is there any other option apart from dynamic partition pruning?
03-21-2016
02:13 AM
5 Kudos
Problem statement: I have a huge history data set in HDFS from which I want to remove duplicates to begin with. In addition, the daily ingested data has to be compared with the history to remove duplicates, and the daily data may have duplicates within itself as well. Duplicates could mean either of the following:
1. If the keys in two records are the same, then they are duplicates.
2. It depends on a few columns: if those columns match, then they are duplicates.
Questions:
What is an optimized solution to remove duplicates in both these situations? Can we avoid the reducer at all? If so, what are the options? How would hashing help here? I see vague solutions around, but they are not very well documented and are hard to understand. I have already looked at this link, but it is not clear. Code samples would help.
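To make the ask concrete, the kind of code sample I am looking for would be something like the Hive sketch below (hypothetical table and column names; it keeps one record per business key using a window function, so it does involve a reduce phase):
-- Deduplicate the history and the daily load together, keeping one row per business key.
INSERT OVERWRITE TABLE history_dedup
SELECT key1, key2, attr1, attr2, load_ts
FROM (
  SELECT u.*,
         row_number() OVER (PARTITION BY key1, key2 ORDER BY load_ts DESC) AS rn
  FROM (
    SELECT * FROM history
    UNION ALL
    SELECT * FROM daily_ingest
  ) u
) ranked
WHERE rn = 1;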
Labels:
- Apache Hadoop
- Apache Hive
03-17-2016
02:48 PM
2 Kudos
@Artem Ervits The HCatOutputFormat class is in fact in the jar /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar. Actually, it is not about this specific class or jar to begin with; it is the command that I used. Changing it to the following works. Note that the program arguments come at the end. The link that you suggested made me try this. Also, the other jars mentioned in the above link need to be added, as it complained about other classes as well, one by one. Those jars may differ based on the distribution and version. In HDP 2.3.2 I did the following:
export HCAT_HOME=/usr/hdp/current/hive-webhcat
export HIVE_HOME=/usr/hdp/current/hive-client
export LIB_JARS=$HCAT_HOME/share/hcatalog/hive-hcatalog-core.jar,$HIVE_HOME/lib/hive-metastore.jar,$HIVE_HOME/lib/libthrift-0.9.2.jar,$HIVE_HOME/lib/hive-exec.jar,$HIVE_HOME/lib/libfb303-0.9.2.jar,$HIVE_HOME/lib/jdo-api-3.0.1.jar,$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar
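# Run the job with -libjars so the HCatalog/Hive client jars ship with it; note the program arguments come last.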
hadoop jar mr-hcat.jar <mainclass> -libjars ${LIB_JARS} mr_input_text mr_output_text