Member since: 03-16-2016 · 707 Posts · 1753 Kudos Received · 203 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8744 | 03-31-2018 03:59 AM |
| | 2624 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
11-28-2016
05:48 PM
@apappu You are using "code" blocks for non-code regular text. For example, you describe each step textually inside a code block; the same issue applies to the final note. It is text, not code. Also, the article should include a structure like: Problem Description, Assumptions, Steps, Conclusions. Could you clean up the article accordingly, run a spell check, and resubmit? Our articles need to be of publisher quality.
11-24-2016
04:00 AM
@ambud.sharma Voted up :). Before, it was counter-intuitive.
11-24-2016
02:10 AM
10 Kudos
Behavior

The number of cells returned to the client is normally filtered based on the table configuration; however, when using the RAW => true parameter, you can retrieve all of the versions kept by HBase, unless a major compaction or a flush-to-disk event happened in the meantime.

Demonstration

Create a table with a single column family:

```
create 't1', 'f1'
```

Configure it to retain a maximum version count of 3:

```
alter 't1', NAME => 'f1', VERSIONS => 3
```

Perform 4 puts:

```
put 't1','r1','f1:c1',1
put 't1','r1','f1:c1',2
put 't1','r1','f1:c1',3
put 't1','r1','f1:c1',4
```

Scan with RAW => true. I used VERSIONS => 100 as a catch-all; it could have been anything greater than 3 (the number of versions set previously). Unless specified, only the latest version is returned by the scan command.

```
scan 't1',{RAW=>true,VERSIONS=>100}
```

The above scan returns all four versions:

```
ROW   COLUMN+CELL
r1    column=f1:c1, timestamp=1479950685181, value=4
r1    column=f1:c1, timestamp=1479950685155, value=3
r1    column=f1:c1, timestamp=1479950685132, value=2
r1    column=f1:c1, timestamp=1479950627736, value=1
```

Flush to disk:

```
flush 't1'
```

Then scan:

```
scan 't1',{RAW=>true,VERSIONS=>100}
```

Three versions are returned:

```
ROW   COLUMN+CELL
r1    column=f1:c1, timestamp=1479952079260, value=4
r1    column=f1:c1, timestamp=1479952079234, value=3
r1    column=f1:c1, timestamp=1479952079209, value=2
```

Do four more puts:

```
put 't1','r1','f1:c1',5
put 't1','r1','f1:c1',6
put 't1','r1','f1:c1',7
put 't1','r1','f1:c1',8
```

Flush to disk:

```
flush 't1'
```

Scan:

```
scan 't1',{RAW=>true,VERSIONS=>100}
```

Six versions are returned:

```
ROW   COLUMN+CELL
r1    column=f1:c1, timestamp=1479952349970, value=8
r1    column=f1:c1, timestamp=1479952349925, value=7
r1    column=f1:c1, timestamp=1479952349895, value=6
r1    column=f1:c1, timestamp=1479952079260, value=4
r1    column=f1:c1, timestamp=1479952079234, value=3
r1    column=f1:c1, timestamp=1479952079209, value=2
```

Force a major compaction:

```
major_compact 't1'
```

Scan:

```
scan 't1',{RAW=>true,VERSIONS=>100}
```

Three versions are returned:

```
ROW   COLUMN+CELL
r1    column=f1:c1, timestamp=1479952349970, value=8
r1    column=f1:c1, timestamp=1479952349925, value=7
r1    column=f1:c1, timestamp=1479952349895, value=6
```

Conclusion

When deciding how many versions to retain, it is best to treat that number as the minimum version count available at a given time, not as a given constant. Until a flush to disk and a major compaction occur, the number of versions available can be higher than the number configured for the table.
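For readers who want to reproduce the demonstration outside the shell, below is a minimal sketch using the HBase Java client. It assumes an hbase-site.xml on the classpath and the HBase 1.x client API, where setRaw and setMaxVersions are the programmatic equivalents of the shell's RAW and VERSIONS options.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RawScanDemo {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("t1"))) {
            Scan scan = new Scan();
            scan.setRaw(true);          // equivalent of RAW => true in the shell
            scan.setMaxVersions(100);   // catch-all, same as VERSIONS => 100
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    for (Cell cell : result.rawCells()) {
                        System.out.printf("row=%s ts=%d value=%s%n",
                                Bytes.toString(CellUtil.cloneRow(cell)),
                                cell.getTimestamp(),
                                Bytes.toString(CellUtil.cloneValue(cell)));
                    }
                }
            }
        }
    }
}
```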
11-23-2016
11:29 PM
5 Kudos
@R c You state that your server has 12 cores (2 threads per core) and 32 GB RAM. You did not state whether your server is x86; if it is a Power7 CPU, then 12 cores is a lot of CPU power. I'll assume it is x86.

You then state that you would like 6 nodes with 4 vCPUs per node and 8-16 GB each. The RAM math does not add up: you would need 48-96 GB to meet that wish.

You did not mention anything about the type of storage you could use. Is it the server's internal storage, NAS, or SAN? How much can you use? For this type of cluster, you would need 25-30 GB for root and logs, plus separate storage for the data nodes, whatever you can afford, but not less than 50-100 GB per data node if you really want a development cluster capable of big data work.

If your resources are really 12 cores (24 vCPUs) and 32 GB RAM, that is really not a great RAM/vCPU ratio. A good rule of thumb is to have a minimum of 4 GB RAM for each vCPU; in your case, that means it would have been good to have 96 GB RAM. My response then would easily have been: 6 nodes as you wanted, each with 4 vCPUs and 16 GB RAM. I would never allocate less than 8 GB of RAM per node. If your RAM is really 32 GB, then maybe a 4-node cluster, which will be memory bound; HBase, Spark, and Hive LLAP may have some limitations. Keep in mind that your resources are a reasonable fit only for a very small POC/dev environment, but still usable.

+++ If any of the responses was helpful, please vote and accept best answer.
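To make the arithmetic explicit, here is a small sketch checking the proposed layout against the 4 GB per vCPU rule of thumb above; the node counts and sizes are the ones from the question, not recommendations.

```java
public class ClusterSizing {
    // Rule of thumb from above: at least 4 GB RAM per vCPU
    static final int MIN_GB_PER_VCPU = 4;

    public static void main(String[] args) {
        int nodes = 6, vcpusPerNode = 4;   // the desired layout from the question
        int availableRamGb = 32;           // what the server actually has

        int requiredGb = nodes * vcpusPerNode * MIN_GB_PER_VCPU; // 6*4*4 = 96 GB
        System.out.printf("Required RAM: %d GB, available: %d GB -> %s%n",
                requiredGb, availableRamGb,
                availableRamGb >= requiredGb ? "fits" : "does not fit");
    }
}
```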
11-22-2016
10:00 PM
@ambud.sharma Could you expand on point #4? Latency measures the amount of time between the start of an action and its completion; throughput is the total number of such actions that occur in a given amount of time. I doubt that lower latency is responsible for a throughput decrease in a direct correlation; there is more to it. Taken literally, point #4 would imply that NYSE trading is better run in batch mode. I believe you are omitting some other factor in the correlation, for example RESOURCES as a given constant, or per-tuple overhead, etc. Please elaborate to increase the value of this article.
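One way to frame my point with a formula: by Little's law, for a system in steady state,

$$\text{throughput} \;=\; \frac{\text{concurrency}}{\text{latency}}$$

so with resources (concurrency) held constant, lower per-tuple latency actually raises throughput. A latency/throughput trade-off only appears when something else changes, for example batching amortizing per-tuple overhead.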
11-21-2016
04:33 PM
@Daniel Scheiner See my response from Nov 11 at 05:21 PM. It states the same: the Cost Based Optimizer decides to use Tez for those queries where Tez can deliver faster than MR.
11-20-2016
07:52 PM
4 Kudos
@kishore sanchina HiveQL is a query language and Hive is an execution engine. BigSQL is just another execution engine, which can co-exist with Hive and leverage Hive's storage model and metastore. Tables created in Hive are visible to Big SQL and vice versa. The major difference? Until recently, the response would have been that Hive requires MapReduce while BigSQL uses a different approach leveraging memory; however, Hive now uses Tez and, even more recently, LLAP, so the difference is just that they are alternatives provided by the community vs. IBM. If you are looking at it from the functional point of view (SQL functions), then IBM's Big SQL provides a higher degree of ANSI SQL language compatibility; however, Hive is almost there, at lower cost and with larger community support. +++ If any response was helpful, please vote/accept best answer.
11-20-2016
07:37 PM
6 Kudos
@Leonid Fedotov LLAP is still in Tech Preview, and having more than one HiveServer2 Interactive instance is not currently supported. The HiveServer2 Interactive service is deployed to the host that you want to run it on; as-is, it is a single point of failure and it supports a single queue. Let's remember that a Tech Preview feature is meant to demonstrate the concept, but it usually falls short on enterprise-level features like HA, security, and resource management; those are to be added. Setting it up outside of Ambari is a long shot, and there are no documents describing such an approach. You would have to create multiple HiveServer2 instances and point HiveServer2 Interactive to those. I would not go down that path; I would wait for Ambari 2.5.0 to have two instances. If any response helped, please vote/accept best answer.

+++

It is likely that others may be confusing the HiveServer2 and HiveServer2 Interactive services, and it is important to understand the difference. I started to respond to your question before you and @jss had another pass at it; I will keep the content below for other HCC users.

HiveServer1 is a client-server model service, which allows users to connect using the Hive CLI interface and a Thrift client. It supports remote client connections, but only one client can connect at a time. It does not provide session management support and, because of the Thrift API, it does not provide concurrency control.

HiveServer2 is also a client-server model, and it allows many different clients, such as Thrift, to connect. HiveServer2 gives multi-client support, where many clients can connect at the same time. Authentication is much better, using Kerberos. It provides support for JDBC and ODBC driver connections. The Beeline CLI is used for connecting to HiveServer2.

HiveServer2 has been around for a while, and it does not have anything to do with Hive 2. The only thing that got better is that multiple interactive HiveServer2 instances can be created via the Ambari UI using recent versions of HDP, I believe post HDP 2.2. In HDP 2.2 and earlier, interactive queues can be set up at the command line; in HDP 2.3 and later, you can use Ambari (a GUI) to set up interactive queues.

Multiple HiveServer2 instances can be used for:
- Load balancing and high availability using ZooKeeper (see the sketch after this post)
- Running multiple applications with different settings

Because HiveServer2 uses its own settings file, using one instance for ETL operations and another for interactive queries is a common practice. All HiveServer2 instances can share the same metastore DB; consequently, setting up multiple HiveServer2 instances that have embedded metastores is a simple operation.

HiveServer2 Interactive is still a HiveServer2 instance; however, it is dedicated to the HiveServer2 Interactive service. Check this reference, section 8: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_hive-performance-tuning/content/ch_hive_llap.html
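As an illustration of the ZooKeeper-based load balancing mentioned in the list above, here is a minimal JDBC sketch. The ZooKeeper hosts are placeholders, and it assumes the Hive JDBC driver is on the classpath; the URL uses the standard HiveServer2 service-discovery syntax, which routes each client to one of the registered HiveServer2 instances.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Discovery {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum; ZooKeeper picks one of the
        // HiveServer2 instances registered under the given namespace,
        // providing load balancing and failover across instances.
        String url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;"
                + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```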
11-16-2016
10:55 PM
6 Kudos
@Jeeva Jeeva The multithreading programming model and the MapReduce programming model are based on fundamentally different principles, and each is meant to solve different kinds of data storage and processing problems. Multithreading is based on parallelization of processing, whereas Hadoop takes its power from parallelization of data. If you assume that the Hadoop ecosystem is only MapReduce and Spark batch, then your understanding is correct. However, the ecosystem also includes real-time streaming tools like Apache Storm, which uses multi-threading. Modern tools handle these programmatic needs for multi-threading by architecture/design; their focus is scalability by architecture and design rather than laborious programming effort. References: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/ https://www.safaribooksonline.com/blog/2014/01/06/multi-threading-storm +++ If it helped, pls vote/accept best answer
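To make the distinction concrete, here is a small self-contained Java sketch of "parallelization of data": the same function is applied to partitions of the data in parallel and the partial results are combined, which is the map/reduce idea that Hadoop scales out across machines, while threads remain the "parallelization of processing" primitive underneath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DataParallelism {
    public static void main(String[] args) throws Exception {
        List<Integer> data = IntStream.rangeClosed(1, 1_000_000).boxed()
                .collect(Collectors.toList());
        int partitions = 4;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        // Split the data and apply the same function to each partition
        // in parallel (the "map" step).
        List<Future<Long>> partials = new ArrayList<>();
        int chunk = data.size() / partitions;
        for (int i = 0; i < partitions; i++) {
            final List<Integer> part = data.subList(i * chunk,
                    i == partitions - 1 ? data.size() : (i + 1) * chunk);
            partials.add(pool.submit(
                    () -> part.stream().mapToLong(Integer::longValue).sum()));
        }

        // Combine the partial results (the "reduce" step).
        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();
        }
        System.out.println("sum = " + total);
        pool.shutdown();
    }
}
```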
11-16-2016
10:30 PM
@Arpit Jain When you CREATE TABLE AS SELECT ... into an ORC table, don't forget to CAST to the proper data types to match your target table. Some of the fields may be converted implicitly; others will not.
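A hypothetical illustration (the table and column names are made up), run over JDBC against a placeholder HiveServer2 URL, with explicit CASTs for the columns that should not be left to implicit conversion:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CtasWithCasts {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; assumes the Hive JDBC driver
        // is on the classpath.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Cast explicitly so the ORC table's column types match the
            // target schema instead of relying on implicit conversion.
            stmt.execute(
                "CREATE TABLE sales_orc STORED AS ORC AS " +
                "SELECT CAST(id AS BIGINT) AS id, " +
                "       CAST(amount AS DECIMAL(10,2)) AS amount, " +
                "       CAST(sale_ts AS TIMESTAMP) AS sale_ts " +
                "FROM sales_staging");
        }
    }
}
```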