Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8748 | 03-31-2018 03:59 AM |
| | 2626 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
10-11-2016
01:23 AM
4 Kudos
@Smart Solutions The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature rich: you give it a path in the HDFS structure and a destination cluster, and it will copy everything to the same path on the destination. If the copy fails, you will need to start it again, etc. The other method for maintaining a replica of your HDFS structure is Falcon. It offers more data movement options and lets you manage the lifecycle of all of the data on both sides more effectively. If you're moving Hive table structures, there is some more complexity in making sure the tables are created on the DR side, but moving the actual files is done the same way. You excluded distcp as an option; as such, I suggest looking at Falcon. Check this: http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon/ +++++++ If any response addressed your question, please vote and accept the best answer.
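For completeness, a minimal sketch of the path-to-path distcp copy described above; the NameNode addresses and the path are placeholders for your clusters:

```bash
# Copy /data/projects from the source cluster to the same path on the destination cluster
# (nn1/nn2 and the path are placeholders; -update re-syncs changed files on a re-run)
hadoop distcp -update hdfs://nn1:8020/data/projects hdfs://nn2:8020/data/projects
```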
10-11-2016
01:17 AM
6 Kudos
@Kumar Veerappan Assuming an HDFS replication factor > 1 (the default is 3), put the node under maintenance and stop the services running on the node. Once the server comes back up, start the services and take the node out of maintenance, in that order. Putting the node under maintenance before stopping the services eliminates the risk of alerts; starting the services before taking the node out of maintenance prevents the alerts as well. It is unlikely that your DataNode will fall far behind, but you may consider HDFS rebalancing to your threshold (the default is 10%). +++++ If any of the responses helped, please vote and accept the best answer.
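If you do decide to rebalance after the node rejoins, a minimal sketch using the default 10% threshold mentioned above:

```bash
# Rebalance HDFS until each DataNode's utilization is within 10% of the cluster average
hdfs balancer -threshold 10
```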
10-08-2016
01:25 AM
@Raja Sekhar Chintalapati Did you find any of the responses helpful? If so, please accept the best answer; if you solved the problem yourself in a way different from the responses in this thread, please post your solution and accept it.
10-08-2016
01:13 AM
16 Kudos
Introduction

The term “Load Testing” has evolved over the years, but the core meaning still comes down to making sure that your system can handle a predefined number of users at the same time. The load test is usually run over an extended period of time in a staggered manner, slowly ramping up the number of users until you hit a predefined maximum that is usually based on projected usage levels extrapolated from access logs, with a buffer for surges. The goal of load testing is to determine the number of users that the system can typically cope with; this is called the system’s “concurrency level” and gives you a hard number to work with when dealing with capacity planning, performance and optimization, stability, SLAs, etc.

JMeter was originally designed for testing web applications but has since expanded to other test functions, including databases via JDBC. As such, it can execute SQL queries against a given JDBC driver. JMeter allows you to define queries to execute against a Hive table. An instance of JMeter can run multiple threads of queries in parallel, and multiple instances of JMeter can spread clients across many nodes. The queries can also be parameterized with pseudo-random data in order to simulate all types of queries against a table. JMeter automates the execution of the queries in parallel. The results of the queries that JMeter ran are aggregated and analyzed together to provide an overall view of the performance. Mean and median are provided for a simple insight, as well as 90th, 95th, 99th and 99.9th percentiles to understand the execution tail. This approach is extremely useful for read-heavy workloads.

JMeter Setup for Hive Load Testing

These steps have been tested on HDP 2.4.2 and OS X and should work similarly on other Unix-like systems.

Step 1. Download, Install and Set Up JMeter
Pre-requisites: http://jmeter.apache.org/usermanual/get-started.html
Download JMeter from http://mirror.symnds.com/software/Apache//jmeter/binaries/apache-jmeter-3.0.tgz and unzip it to your preferred location.
Add the required Hive and logging jars from my repo (https://github.com/cstanca1/jmeter-hive-hdp) to your JMETER_HOME/lib/ext.
Start JMeter from the $JMETER_HOME/bin directory by running jmeter (on Unix-like systems).
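For reference, the same steps as a shell sketch; the download mirror is taken from this post, while the local paths and the location of the repo clone are placeholders:

```bash
# Download and unpack JMeter 3.0 (mirror URL from this post)
wget http://mirror.symnds.com/software/Apache//jmeter/binaries/apache-jmeter-3.0.tgz
tar -xzf apache-jmeter-3.0.tgz
export JMETER_HOME="$PWD/apache-jmeter-3.0"

# Copy the Hive and logging jars from the linked repo into lib/ext
# (/path/to/jmeter-hive-hdp is a placeholder for your local clone)
cp /path/to/jmeter-hive-hdp/*.jar "$JMETER_HOME/lib/ext/"

# Start JMeter in GUI mode
"$JMETER_HOME/bin/jmeter"
```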
In the JDBC Connection Configuration element, set the following:
Auto Commit = true
Database URL = jdbc:hive://hive_ip_address:10000/default
JDBC Driver Class = org.apache.hadoop.hive.jdbc.HiveDriver
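Before running the load test, it can help to confirm connectivity and the URL outside of JMeter. A minimal check with beeline, assuming HiveServer2 on the default port (see the Hive2 note in Step 2; adjust the URL accordingly if you are on the original Hive server):

```bash
# Quick connectivity check against HiveServer2 (hive_ip_address is a placeholder)
beeline -u "jdbc:hive2://hive_ip_address:10000/default" -e "SELECT 1;"
```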
Note: Setting AutoCommit=true in the JMeter configuration is a must; Hive does not support AutoCommit=false.

Step 2: Build a Database Test Plan

To build a database test plan, consult the instructions to build a database test plan. That example uses the MySQL driver, but the same approach applies to Hive; you will have to provide the Hive database URL. Note: if you use Hive2 Server, the URL uses hive2 instead of hive. The plan includes adding users, JDBC requests and a listener to view/store the test results.

Step 3: Build a JMeter Dashboard

JMeter supports dashboard report generation to get graphs and statistics from a test plan. To build a dashboard, follow the instructions to build a JMeter dashboard. The dashboard should include:
- a request summary graph showing the percentage of successful and failed transactions
- a statistics table providing a summary of all metrics per transaction, including 3 configurable percentiles
- an error table providing a summary of all errors and their proportion of the total requests
- zoomable charts where you can check/uncheck every transaction to show/hide it, covering response times over time, bytes throughput over time, latencies over time, hits per second, response codes per second, transactions per second, response time vs requests per second, latency vs requests per second, response time percentiles, active threads over time, times vs threads, and response time distribution

Step 4: Run JMeter

To run JMeter, run the jmeter script (for Unix), found in the bin directory. There are some additional scripts in the bin directory that you may find useful:
- jmeter - run JMeter (in GUI mode by default); defines some JVM settings which may not work for all JVMs
- jmeter-server - start JMeter in server mode (calls the jmeter script with appropriate parameters)
- jmeter.sh - very basic JMeter script (you may need to adapt JVM options like memory settings)
- mirror-server.sh - runs the JMeter Mirror Server in non-GUI mode
- shutdown.sh - run the Shutdown client to stop a non-GUI instance gracefully
- stoptest.sh - run the Shutdown client to stop a non-GUI instance abruptly

It may be necessary to edit the jmeter shell script if some of the JVM options are not supported by the JVM you are using. The JVM_ARGS environment variable can be used to override or set additional JVM options, and it overrides the HEAP settings in the script. For example: JVM_ARGS="-Xms1024m -Xmx1024m" jmeter -t test.jmx [etc.] (A non-GUI invocation that also generates the dashboard is sketched at the end of this post.)

Findings

1) I recently executed a Hive load test with JMeter and learned that a maximum of 10 connections is possible per YARN queue. That was a big eye opener, since the requirement was for 50 concurrent connections. Obviously, multiple queries can be submitted per connection, and, as resources allow, multiple YARN queues can be created and used to increase the number of connections.

2) Creating multiple queues to meet the 50-concurrent-connections requirement led to another finding: response time and scalability were impacted dramatically. If you assume N connections and M concurrent executions per connection, the more connections you open, the more overhead you incur and the more resources you spend to do less, more slowly. For example, assume N = 10 and M = 1,000; that would be 10,000 concurrent queries. For N = 1 and M = 10,000, that would also be 10,000 concurrent queries. The same queries with the same overall resources allocated had significantly better response times with fewer queues. As such, unless there is another reason for multiple queues, my advice is to limit a tenant application to one queue, and to limit the number of connections per queue so that an already-open connection is reused and existing resources are used more effectively.

3) Always question requirements for so many open connections. A proxy application can always use a single connection to serve multiple requests from multiple users.
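As referenced in Step 4, a sketch of a non-GUI run that also generates the Step 3 dashboard; hive_test_plan.jmx, results.jtl and dashboard_report/ are placeholder names:

```bash
# Run the test plan in non-GUI mode, log the samples, and generate the HTML dashboard
# (-e/-o require JMeter 3.0+; the file and folder names are placeholders)
jmeter -n -t hive_test_plan.jmx -l results.jtl -e -o dashboard_report/
```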
10-08-2016
01:07 AM
@Vaibhav Kumar No question that this could work; I am just a bit concerned about performance. ORDER BY is a very expensive operation involving a single reducer. Practically, you do one full table scan for that, and once you get those three records you do another table scan to get the attributes of those records. Look at an alternative here: https://community.hortonworks.com/questions/24667/hive-top-n-records-within-a-group.html If any response in this thread was helpful, don't forget to vote/accept the best response.
10-07-2016
09:14 PM
5 Kudos
@jean rivera I think that I finally found the reason: https://issues.apache.org/jira/browse/HIVE-14857?jql=text%20~%20%22select%20count%22 The ticket you filed is probably a duplicate. I know that this does not fix your issue right now, but if you find the response helpful, please vote/accept the best answer.
10-07-2016
06:25 PM
5 Kudos
@Vaibhav Kumar To reiterate what @mqureshi already noticed, your query does not seem functionally correct. 1=1 is true, but null=null is not true (it evaluates to NULL). Different story. If you use LIMIT row_count with ORDER BY, Hive, like MySQL and many other SQL-like engines, ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. The ROW_NUMBER function, followed by selecting the third row, is what you need. Your idea of an inner join will not scale for many records. If you have duplicates, then write your query to eliminate the duplicates or otherwise deal with them, but one would still wonder how you would determine the true third row when you have a duplicate; I don't see anything in the query you wrote dealing with that problem. I think there is still a lot of work to do on paper before even writing SQL.
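A minimal sketch of the ROW_NUMBER approach, run from the shell; the table and column names (my_table, id, sort_col) are hypothetical:

```bash
# Pick the third row by sort_col using a window function instead of a self-join
# (my_table, id and sort_col are placeholder names)
hive -e "
SELECT id, sort_col
FROM (
  SELECT id, sort_col,
         ROW_NUMBER() OVER (ORDER BY sort_col) AS rn
  FROM my_table
) t
WHERE rn = 3;
"
```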
10-05-2016
08:46 PM
5 Kudos
@hitaay Start here: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.3.0/bk_installation/content/ambari_install.html It answers 1, 2 and 3. Yes, you can limit what is collected and how often. Sensitive information can also be randomized. Regarding one of your concerns, be aware that Activity Analyzers deployed to the NameNodes in the cluster do not process any utilization data besides HDFS. Therefore, to process YARN, MapReduce, and Tez utilization data, another instance of the Activity Analyzer needs to be deployed to another node in the cluster, preferably a non-master node. If any of the responses was helpful, please vote and accept it as the best answer.
10-05-2016
08:40 PM
@hitaay https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html A single-tool solution is desirable, but it also comes with a price tag. Look at the link above. You can use a combination of HDFS snapshots and your standard point-in-time recovery methods for the databases used for the metadata. You can leverage that practice and avoid extra cost for something that is really not Hadoop specific. If any response from this thread helped, please vote/accept the best answer.
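For the HDFS snapshot part, a minimal sketch; the directory path and snapshot name are placeholders:

```bash
# Enable snapshots on the directory (run as an HDFS administrator), then take one
hdfs dfsadmin -allowSnapshot /data/warehouse
hdfs dfs -createSnapshot /data/warehouse backup_2016_10_05
```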
10-05-2016
06:24 PM
5 Kudos
@Ahmad Debbas
Set up the HDFS NFS Gateway and copy the SharePoint files. You could also use a basic script to PUT the files to HDFS; that would require an edge node that has access to the SharePoint repository and has the HDFS client installed. HDFS NFS Gateway: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html HDFS PUT: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put If you already use HDP and it is installed with Ambari, the HDFS NFS Gateway is just another service to add via Ambari. If the response was helpful, please vote/accept the best answer.
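For the scripted PUT option, a minimal sketch run from the edge node; the local export directory and the HDFS target path are placeholders:

```bash
# Create the target directory and upload the exported SharePoint files from the edge node
hdfs dfs -mkdir -p /data/sharepoint
hdfs dfs -put /mnt/sharepoint_export/* /data/sharepoint/
```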