Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8744 | 03-31-2018 03:59 AM |
| | 2624 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
12-17-2016
01:33 AM
11 Kudos
Introduction

Many organizations have come to rely on Hadoop to deal with the ever-increasing quantities of data they gather. Today it is clear what problems Hadoop can solve, yet the cloud is still not the first choice for Hadoop deployments. The pros and cons of Hadoop in the cloud have been covered in multiple blogs and books, but the question keeps coming up in discussions with enterprises considering it. I thought it would be useful to collate a few pros and cons here, and to describe a pragmatic hybrid-cloud approach for organizations that have already made significant on-prem investments. For organizations adopting Hadoop for the first time, the cloud is probably the better bet, especially if they lack deep IT expertise and have a revenue opportunity that needs to be exploited immediately.

Pro Cloud

- Lack of space. You don't have space to keep racks of physical servers, along with the necessary power and cooling.
- Flexibility. It is much easier to reorganize instances, or to expand or contract your footprint, as business needs change. Everything is controlled through cloud provider APIs and web consoles. Changes can be scripted and applied manually, or even automatically and dynamically based on current conditions.
- New usage patterns. Cloud providers abstract computing resources away from physical configurations, so they can be managed in ways that are otherwise impractical. For example, individuals can have their own instances, clusters, and even networks, without much managerial overhead. The overall budget for CPU cores in your cloud provider account can be concentrated in a set of large instances, spread across a larger set of smaller instances, or some mixture, and can even change over time. When an instance malfunctions, instead of troubleshooting what went wrong, you can simply tear it down and replace it.
- Worldwide availability. The largest cloud providers have data centers around the world. You can use resources close to where you work, or close to where your customers are, for the best performance. You can set up redundant clusters, or even entire computing environments, in multiple data centers, so that if local problems occur in one data center, you can shift work elsewhere.
- Data retention restrictions. If you have data that is required by law to be stored within specific geographic areas, you can keep it in clusters hosted in data centers in those areas.
- Cloud provider features. Each major cloud provider offers an ecosystem of features supporting the core functions of computing, networking, and storage. To use those features most effectively, your clusters should run in that provider as well.
- Capacity. Very few customers tax the infrastructure of a major cloud provider. You can stand up large systems in the cloud that are nowhere near as easy to assemble, let alone maintain, on-prem.

Pro On-Prem

- Simplicity. Cloud providers start you off with reasonable defaults, but it is then up to you to figure out how all of their features work and when they are appropriate. It takes a lot of experience to become proficient at picking the right instance types and arranging networks properly.
- High levels of control. Beyond the general geographic locations of cloud provider data centers and the hardware specifications that providers publish, it is not possible to have exacting, precise control over your cloud architecture. You cannot tell exactly where the physical devices sit, what the devices near them are doing, or how data traverses the shared physical network. When the cloud provider has internal problems such as network outages, there is not much you can do but wait.
- Unique hardware needs. You cannot have cloud providers attach specialized peripherals or dongles to their hardware for you. If your application requires resources beyond what a cloud provider offers, you will need to host that part on-prem, away from your Hadoop clusters.
- Saving money. In the cloud you are still paying for the resources you use; the hope is that the provider's economies of scale make it cheaper to "rent" their hardware than to run your own. You will also still need people who understand system administration and networking to take care of your cloud infrastructure. Inefficient architectures can cost a lot of money in storage charges, data transfer costs, and idle instances.

Best of Both

Instead of running your clusters and associated applications completely in the cloud or completely on-prem, the overall system can be split between the two: a hybrid cloud. Data channels are established between the cloud and on-prem worlds to connect the components needed to perform work.

Examples

Suppose there is a large, existing on-prem data processing system, perhaps using Hadoop clusters, that works well. To expand its capacity for running new analyses, rather than adding more on-prem hardware, Hadoop clusters can be created in the cloud. The data needed for the analyses is copied up to the cloud clusters, analyzed there, and the results are sent back on-prem (a minimal copy sketch follows the references below). The cloud clusters can be brought up and torn down in response to demand, which helps keep costs down.

Now assume vast amounts of incoming data need to be centralized and processed. To avoid a single choke point where all of the raw data is sent, a set of cloud clusters can share the load, perhaps each in a geographic location convenient to where the data is generated. These clusters can pre-process the data, for example cleaning and summarizing it, to reduce the work the final centralized system must perform.

References

Moving Hadoop to the Cloud by Bill Havanki, O'Reilly Media, Inc., 2017.
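To make the first hybrid example concrete, here is a minimal sketch of the data channel, assuming the cloud clusters read from S3 via the s3a: scheme; the namenode address, bucket name, and paths are illustrative assumptions, not taken from the book or any particular deployment.

# Hedged sketch: copy an on-prem HDFS dataset to cloud object storage so a transient
# cloud cluster can analyze it; namenode, bucket, and paths are illustrative.
hadoop distcp \
  hdfs://onprem-nn:8020/warehouse/events/2016/12 \
  s3a://example-analytics-bucket/staging/events/2016/12

# When the cloud-side job finishes, pull the (much smaller) results back on-prem.
hadoop distcp \
  s3a://example-analytics-bucket/results/events-summary \
  hdfs://onprem-nn:8020/warehouse/results/events-summary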
12-17-2016
01:16 AM
6 Kudos
@Sampat Budankayala As you already know, Hive does not support hierarchical query constructs such as CONNECT BY. The bad news is that this is the general situation across similar tools in the Hadoop ecosystem. A self-join works if you know the number of levels, but the query gets quite ugly. If you need hierarchical queries against databases that don't support recursive subquery factoring, one very low-tech alternative is to encode the hierarchy into a separate key (see the sketch below). Of course, this only works if you control the table update process and can rewrite the key whenever a parent changes. The remaining option is to take the hierarchical data and import it into an RDBMS suited for CONNECT BY. *** If this response helped, please vote/accept best answer.
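A minimal sketch of the separate-key idea (a materialized path), assuming a hypothetical table emp with an id, a name, and a path column that stores the full ancestry of each row; the table, columns, and hive invocation are illustrative, not from the original question.

# Hedged sketch: materialized-path workaround for the missing CONNECT BY in Hive.
# Assumes a hypothetical table emp(id INT, name STRING, path STRING) where path holds
# the ancestry, e.g. '/1/4/9/' means row 9 is under 4, which is under 1.
hive -e "
SELECT id, name
FROM   emp
WHERE  path LIKE '%/4/%'   -- all descendants of node 4, at any depth
  AND  id <> 4;
"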
12-17-2016
12:07 AM
You did not specify the use case, but be aware of some limitations on bson files: https://github.com/mongodb/mongo-hadoop/wiki/Using-.bson-Files You may also want to connect PySpark to MongoDB directly. A good reference: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
12-17-2016
12:05 AM
5 Kudos
@Jane Becker The mongo-hadoop project connects Hadoop and Spark with MongoDB. You can download it from the releases page (https://github.com/mongodb/mongo-hadoop/releases) or build it yourself from https://github.com/mongodb/mongo-hadoop. If you decide to build it yourself, you can do it with gradlew using the following steps, then copy the jar into lib/:
wget -P /tmp/ https://github.com/mongodb/mongo-hadoop/archive/r1.5.1.tar.gz
mkdir mongo-hadoop
tar -xvzf /tmp/r1.5.1.tar.gz -C mongo-hadoop --strip-components=1
# Now build the mongo-hadoop-spark jars
cd mongo-hadoop
./gradlew jar
cd ..
cp mongo-hadoop/spark/build/libs/mongo-hadoop-spark-*.jar lib/
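A quick way to try the result from PySpark is to put the built jar on the classpath at launch; a minimal sketch, where the exact jar name assumes the r1.5.1 build from the steps above and should be adjusted to whatever the build actually produced.

# Hedged sketch: launch PySpark with the freshly built connector jar visible to both
# the driver and the executors (adjust the jar name to the version you built).
pyspark --jars lib/mongo-hadoop-spark-1.5.1.jar \
        --driver-class-path lib/mongo-hadoop-spark-1.5.1.jar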
12-16-2016
09:12 PM
@Gurpreet Singh @Greg Keys has provided the link you requested (ref REL_SUCCESS): http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
12-15-2016
03:44 AM
6 Kudos
@Boris Demerov The most common filesystem used with HBase is HDFS, but you are not locked into it: the FileSystem layer used by HBase has a pluggable architecture, so HDFS can be replaced with any other supported implementation. In fact, you could go as far as implementing your own filesystem, maybe even on top of another database.

The local filesystem actually bypasses Hadoop entirely, that is, you do not need an HDFS or any other cluster at all. It is all handled in the FileSystem class HBase uses to connect to the filesystem implementation. The supplied ChecksumFileSystem class is loaded by the client and uses local disk paths to store all the data. The beauty of this approach is that HBase is unaware that it is not talking to a distributed filesystem on a remote or colocated cluster, but is actually using the local filesystem directly. The standalone mode of HBase uses this feature to run HBase only (without HDFS).

The S3 FileSystem implementation provided by Hadoop supports three different modes: the raw (or native) mode, the block-based mode, and the newer AWS SDK based mode. The raw mode uses the s3n: URI scheme and writes the data directly into S3, similar to the local filesystem; you can see all the files in your bucket the same way as you would on your local disk. The s3: scheme is the block-based mode and was used to overcome S3's former maximum file size limit of 5 GB. That limit has since been changed, which makes the selection either more difficult or quite easy: opt for s3n: if you are not going to exceed 5 GB per file. The block mode emulates an HDFS filesystem on top of S3, which makes browsing the bucket content more difficult, since only the internal block files are visible and the HBase storage files are stored arbitrarily inside these blocks and strewn across them. Both of these filesystems use the external JetS3t open source Java toolkit to do the actual heavy lifting. A more recent addition is the s3a: scheme, which replaces the JetS3t block mode with an AWS SDK based one. It is closer to the native S3 API, can optimize certain operations for speed-ups, and integrates better overall than the existing implementations.

There are other filesystems as well. One to mention is QFS, the Quantcast File System: an open source, distributed, high-performance filesystem written in C++, with similar features to HDFS. You can find more information about it at the Quantcast website. There are also the Azure and Swift filesystems, which use the native APIs of Microsoft Azure Blob Storage and OpenStack Swift respectively, allowing Hadoop to store data in those systems.

Please evaluate carefully what your specific use case needs; note, however, that the majority of clusters in production today are based on HDFS. I will not repeat the advantages of HDFS here: the primary reasons it is so popular are its built-in replication, fault tolerance, and scalability. Any alternative filesystem should provide the same guarantees, because HBase implicitly assumes that data is stored reliably by the filesystem implementation. HBase has no added means to replicate data or maintain copies of its own storage files; that functionality must be provided by the filesystem.
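For a concrete sense of how the filesystem is selected, it comes down to the URI scheme in the hbase.rootdir property; a minimal sketch follows, where the bucket name, paths, and snippet location are illustrative assumptions.

# Hedged sketch: the URI scheme in hbase.rootdir selects the FileSystem implementation.
# Write the property to a snippet and merge it into hbase-site.xml; values are illustrative.
cat > /tmp/hbase-rootdir.xml <<'EOF'
<property>
  <name>hbase.rootdir</name>
  <!-- hdfs://namenode:8020/hbase for HDFS, file:///var/hbase for the local
       filesystem, or s3a://my-hbase-bucket/hbase for the AWS SDK based S3 mode -->
  <value>s3a://my-hbase-bucket/hbase</value>
</property>
EOF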
12-13-2016
11:59 PM
9 Kudos
@Andi Sonde As you pointed out, Kafka 0.10.0.0 supports rack awareness. KAFKA-1215 added a rack ID to the broker configuration: you specify that a broker belongs to a particular rack by adding the property broker.rack=my-rack-id to its config. The rack awareness feature spreads replicas of the same partition across different racks. This extends the guarantees Kafka provides for broker failure to cover rack failure, limiting the risk of data loss should all the brokers on a rack fail at once. The feature can also be applied to other broker groupings, such as availability zones in EC2.

Let's assume an example with six brokers. Brokers 0, 1, and 2 are on the same rack, and brokers 3, 4, and 5 are on a separate rack. Instead of picking brokers in the order 0 to 5, Kafka orders them 0, 3, 1, 4, 2, 5, so that each broker is followed by a broker from a different rack. In this case, if the leader for partition 0 is on broker 4, the first replica will be on broker 2, which is on a completely different rack. If the first rack goes offline, we know that we still have a surviving replica and therefore the partition is still available. This will be true for all replicas, so we have guaranteed availability in case of rack failure.
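A minimal sketch of enabling this, where the rack id, config path, topic name, and ZooKeeper address are illustrative assumptions:

# Hedged sketch: tag each broker with its rack (or EC2 availability zone) and restart it;
# the rack id and config file path are illustrative.
echo "broker.rack=rack-1" >> /etc/kafka/server.properties

# Once all brokers carry a rack id, newly created topics spread replicas across racks;
# verify the placement with the standard topic tool.
kafka-topics.sh --describe --topic my-topic --zookeeper zk1:2181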
*** If any response was helpful, please vote/accept best answer.
12-12-2016
05:48 PM
6 Kudos
@Jane Becker I had this question myself a few months ago. I did a bit of research and learned the following:
- Garbage First (G1) was introduced with Java 7. It is designed to automatically adjust to different workloads and provide consistent pause times for GC over the lifetime of the application. It also handles large heap sizes with ease, by segmenting the heap into smaller zones and not collecting over the entire heap in each pause.
- There are two configuration options for G1 that are relevant for performance: a) MaxGCPauseMillis - the preferred pause time for each garbage collection cycle. The default is 200 milliseconds; it is not a hard limit and G1 can exceed it. G1 will attempt to schedule the frequency of GC cycles, as well as the number of zones collected in each cycle, so that each cycle takes approximately 200 ms. b) InitiatingHeapOccupancyPercent - the percentage of the total heap (eden and old zone usage combined) that may be in use before G1 starts a new collection cycle. The default is 45%.
Last time I checked, the Kafka start script does not use the G1 collector and defaults to the New and Concurrent Mark-and-Sweep collectors. You may need to make the change manually via the KAFKA_JVM_PERFORMANCE_OPTS environment variable (see the sketch below). *** If this was helpful, please vote/accept answer.
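A minimal sketch of that manual change; the flag values and script paths are illustrative starting points, not tuned recommendations.

# Hedged sketch: export the variable before invoking the Kafka start script so the
# broker JVM uses G1; flag values are illustrative, tune them for your heap and workload.
# Paths are relative to the Kafka installation directory.
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
bin/kafka-server-start.sh config/server.properties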
12-12-2016
05:27 PM
5 Kudos
@Connor O'Neal A vm.swappiness value of 0 used to mean "don't swap unless out of memory". The meaning changed with Linux kernel version 3.5-rc1, and that change was backported to many distributions, including Red Hat as of kernel 2.6.32-303: the value 0 now means "never swap". It is for this reason that a value of 1 is now recommended (see the sketch below). *** If this helped, pls vote/accept answer.
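A minimal sketch of applying the recommendation on a node, using standard sysctl usage:

# Hedged sketch: set vm.swappiness to 1 immediately and persist it across reboots.
sudo sysctl -w vm.swappiness=1
echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf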
11-29-2016
10:04 PM
@Fernando Lopez Bello If you are thinking of Hue, then I recommend considering Ambari Views; they are very powerful. https://cwiki.apache.org/confluence/display/AMBARI/Views Zeppelin is also an awesome visualization tool: https://zeppelin.apache.org/