Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8748 | 03-31-2018 03:59 AM |
| | 2626 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
10-11-2016
01:23 AM
4 Kudos
@Smart Solutions The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature rich: you give it a path in the HDFS structure and a destination cluster, and it will copy everything to the same path on the destination. If the copy fails, you will need to start it again, etc. The other method for maintaining a replica of your HDFS structure is Falcon. It offers more data movement options and lets you manage the lifecycle of all of the data on both sides more effectively. If you're moving Hive table structures, there is some more complexity in making sure the tables are created on the DR side, but moving the actual files is done the same way. You excluded distcp as an option; as such, I suggest looking at Falcon. Check this: http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon/ +++++++ If any response addressed your question, please vote and accept the best answer.
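For completeness, a minimal sketch of the path-to-path distcp copy described above; the NameNode addresses and the path are placeholders for your clusters:

```bash
# Copy /data/projects from the source cluster to the same path on the destination cluster
# (nn1/nn2 and the path are placeholders; -update re-syncs changed files on a re-run)
hadoop distcp -update hdfs://nn1:8020/data/projects hdfs://nn2:8020/data/projects
```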
10-11-2016
01:17 AM
6 Kudos
@Kumar Veerappan Assuming an HDFS replication factor > 1 (the default is 3), put the node under maintenance and stop the services running on the node. Once the server comes back up, start the services and take the node out of maintenance, in that order. Putting the node under maintenance before stopping the services eliminates the risk of alerts; starting the services before taking the node out of maintenance prevents the alerts as well. It is unlikely that your DataNode will fall far behind, but you may consider HDFS rebalancing to your threshold (the default is 10%). +++++ If any of the responses helped, please vote and accept the best answer.
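If you do decide to rebalance after the node rejoins, a minimal sketch using the default 10% threshold mentioned above:

```bash
# Rebalance HDFS until each DataNode's utilization is within 10% of the cluster average
hdfs balancer -threshold 10
```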
10-08-2016
01:25 AM
@Raja Sekhar Chintalapati Did you find any of the responses helpful? If so, please accept the best answer; if you solved the problem yourself in a way different from the responses in this thread, please post your solution and accept it.
10-08-2016
01:13 AM
16 Kudos
Introduction

The term “Load Testing” has evolved over the years, but the core meaning still comes down to making sure that your system can handle a predefined number of users at the same time. The load test is usually run over an extended period of time in a staggered manner, slowly ramping up the number of users until you hit a predefined maximum that is usually based on projected usage levels extrapolated from access logs, with a buffer for surges. The goal of load testing is to determine the number of users that the system can typically cope with; this is called the system’s “concurrency level” and gives you a hard number to work with when dealing with capacity planning, performance and optimization, stability, SLAs, etc.

JMeter was originally designed for testing web applications but has since expanded to other test functions, including databases via JDBC. As such, it can execute SQL queries against a given JDBC driver. JMeter allows you to define queries to execute against a Hive table. An instance of JMeter can run multiple threads of queries in parallel, and multiple instances of JMeter can spread clients across many nodes. The queries can also be parameterized with pseudo-random data in order to simulate all types of queries against a table. JMeter automates the execution of the queries in parallel. The results of the queries that JMeter ran are aggregated and analyzed together to provide an overall view of the performance. Mean and median are provided for a simple insight, as well as 90th, 95th, 99th and 99.9th percentiles to understand the execution tail. This approach is extremely useful for read-heavy workloads.

JMeter Setup for Hive Load Testing

These steps have been tested on HDP 2.4.2 and OS X and should work similarly on other Unix-like systems.

Step 1. Download, Install and Set Up JMeter
Pre-requisites: http://jmeter.apache.org/usermanual/get-started.html
Download JMeter from http://mirror.symnds.com/software/Apache//jmeter/binaries/apache-jmeter-3.0.tgz and unzip it to your preferred location.
Add the required Hive and logging jars from my repo (https://github.com/cstanca1/jmeter-hive-hdp) to your JMETER_HOME/lib/ext.
Start JMeter from the $JMETER_HOME/bin directory by running jmeter (on Unix-like systems).
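For reference, the same steps as a shell sketch; the download mirror is taken from this post, while the local paths and the location of the repo clone are placeholders:

```bash
# Download and unpack JMeter 3.0 (mirror URL from this post)
wget http://mirror.symnds.com/software/Apache//jmeter/binaries/apache-jmeter-3.0.tgz
tar -xzf apache-jmeter-3.0.tgz
export JMETER_HOME="$PWD/apache-jmeter-3.0"

# Copy the Hive and logging jars from the linked repo into lib/ext
# (/path/to/jmeter-hive-hdp is a placeholder for your local clone)
cp /path/to/jmeter-hive-hdp/*.jar "$JMETER_HOME/lib/ext/"

# Start JMeter in GUI mode
"$JMETER_HOME/bin/jmeter"
```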
In the JDBC Connection Configuration element, set the following:
Auto Commit = true
Database URL = jdbc:hive://hive_ip_address:10000/default
JDBC Driver Class = org.apache.hadoop.hive.jdbc.HiveDriver
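Before running the load test, it can help to confirm connectivity and the URL outside of JMeter. A minimal check with beeline, assuming HiveServer2 on the default port (see the Hive2 note in Step 2; adjust the URL accordingly if you are on the original Hive server):

```bash
# Quick connectivity check against HiveServer2 (hive_ip_address is a placeholder)
beeline -u "jdbc:hive2://hive_ip_address:10000/default" -e "SELECT 1;"
```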
Note: Setting AutoCommit=true in the JMeter configuration is a must; Hive does not support AutoCommit=false.

Step 2: Build a Database Test Plan

To build a database test plan, consult the instructions to build a database test plan. That example uses the MySQL driver, but the same approach applies to Hive; you will have to provide the Hive database URL. Note: if you use Hive2 Server, the URL uses hive2 instead of hive. The plan includes adding users, JDBC requests and a listener to view/store the test results.

Step 3: Build a JMeter Dashboard

JMeter supports dashboard report generation to get graphs and statistics from a test plan. To build a dashboard, follow the instructions to build a JMeter dashboard. The dashboard should include:
- a request summary graph showing the percentage of successful and failed transactions
- a statistics table providing a summary of all metrics per transaction, including 3 configurable percentiles
- an error table providing a summary of all errors and their proportion of the total requests
- zoomable charts where you can check/uncheck every transaction to show/hide it, covering response times over time, bytes throughput over time, latencies over time, hits per second, response codes per second, transactions per second, response time vs requests per second, latency vs requests per second, response time percentiles, active threads over time, times vs threads, and response time distribution

Step 4: Run JMeter

To run JMeter, run the jmeter script (for Unix), found in the bin directory. There are some additional scripts in the bin directory that you may find useful:
- jmeter - run JMeter (in GUI mode by default); defines some JVM settings which may not work for all JVMs
- jmeter-server - start JMeter in server mode (calls the jmeter script with appropriate parameters)
- jmeter.sh - very basic JMeter script (you may need to adapt JVM options like memory settings)
- mirror-server.sh - runs the JMeter Mirror Server in non-GUI mode
- shutdown.sh - run the Shutdown client to stop a non-GUI instance gracefully
- stoptest.sh - run the Shutdown client to stop a non-GUI instance abruptly

It may be necessary to edit the jmeter shell script if some of the JVM options are not supported by the JVM you are using. The JVM_ARGS environment variable can be used to override or set additional JVM options, and it overrides the HEAP settings in the script. For example: JVM_ARGS="-Xms1024m -Xmx1024m" jmeter -t test.jmx [etc.] (A non-GUI invocation that also generates the dashboard is sketched at the end of this post.)

Findings

1) I recently executed a Hive load test with JMeter and learned that a maximum of 10 connections is possible per YARN queue. That was a big eye opener, since the requirement was for 50 concurrent connections. Obviously, multiple queries can be submitted per connection, and, as resources allow, multiple YARN queues can be created and used to increase the number of connections.

2) Creating multiple queues to meet the 50-concurrent-connections requirement led to another finding: response time and scalability were impacted dramatically. If you assume N connections and M concurrent executions per connection, the more connections you open, the more overhead you incur and the more resources you spend to do less, more slowly. For example, assume N = 10 and M = 1,000; that would be 10,000 concurrent queries. For N = 1 and M = 10,000, that would also be 10,000 concurrent queries. The same queries with the same overall resources allocated had significantly better response times with fewer queues. As such, unless there is another reason for multiple queues, my advice is to limit a tenant application to one queue, and to limit the number of connections per queue so that an already-open connection is reused and existing resources are used more effectively.

3) Always question requirements for so many open connections. A proxy application can always use a single connection to serve multiple requests from multiple users.
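As referenced in Step 4, a sketch of a non-GUI run that also generates the Step 3 dashboard; hive_test_plan.jmx, results.jtl and dashboard_report/ are placeholder names:

```bash
# Run the test plan in non-GUI mode, log the samples, and generate the HTML dashboard
# (-e/-o require JMeter 3.0+; the file and folder names are placeholders)
jmeter -n -t hive_test_plan.jmx -l results.jtl -e -o dashboard_report/
```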
10-08-2016
01:07 AM
@Vaibhav Kumar No question that this could work; I am just a bit concerned about performance. ORDER BY is a very expensive operation involving a single reducer. Practically, you do one full table scan for that, and once you get those three records you do another table scan to get the attributes of those records. Look at an alternative here: https://community.hortonworks.com/questions/24667/hive-top-n-records-within-a-group.html If any response in this thread was helpful, don't forget to vote/accept the best response.
10-07-2016
09:14 PM
5 Kudos
@jean rivera I think that I finally found the reason: https://issues.apache.org/jira/browse/HIVE-14857?jql=text%20~%20%22select%20count%22 The ticket you filed is probably a duplicate. I know that this does not fix your issue right now, but if you find the response helpful, please vote/accept the best answer.
10-07-2016
06:25 PM
5 Kudos
@Vaibhav Kumar To reiterate what @mqureshi already noticed, your query does not seem functionally correct. 1=1 is true, but null=null is not true (it evaluates to NULL). Different story. If you use LIMIT row_count with ORDER BY, Hive, like MySQL and many other SQL-like engines, ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. The ROW_NUMBER function, followed by selecting the third row, is what you need. Your idea of an inner join will not scale for many records. If you have duplicates, then write your query to eliminate the duplicates or otherwise deal with them, but one would still wonder how you would determine the true third row when you have a duplicate; I don't see anything in the query you wrote dealing with that problem. I think there is still a lot of work to do on paper before even writing SQL.
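A minimal sketch of the ROW_NUMBER approach, run from the shell; the table and column names (my_table, id, sort_col) are hypothetical:

```bash
# Pick the third row by sort_col using a window function instead of a self-join
# (my_table, id and sort_col are placeholder names)
hive -e "
SELECT id, sort_col
FROM (
  SELECT id, sort_col,
         ROW_NUMBER() OVER (ORDER BY sort_col) AS rn
  FROM my_table
) t
WHERE rn = 3;
"
```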
10-05-2016
08:46 PM
5 Kudos
@hitaay Start here: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.3.0/bk_installation/content/ambari_install.html It answers 1, 2 and 3. Yes, you can limit what is collected and how often. Sensitive information can also be randomized. Regarding one of your concerns, be aware that Activity Analyzers deployed to the NameNodes in the cluster do not process any utilization data besides HDFS. Therefore, to process YARN, MapReduce, and Tez utilization data, another instance of the Activity Analyzer needs to be deployed to another node in the cluster, preferably a non-master node. If any of the responses was helpful, please vote and accept it as the best answer.
10-05-2016
08:40 PM
@hitaay https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html A single-tool solution is desirable, but it also comes with a price tag. Look at the link above. You can use a combination of HDFS snapshots and your standard point-in-time recovery methods for the databases used for the metadata. You can leverage that practice and avoid extra cost for something that is really not Hadoop specific. If any response from this thread helped, please vote/accept the best answer.
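For the HDFS snapshot part, a minimal sketch; the directory path and snapshot name are placeholders:

```bash
# Enable snapshots on the directory (run as an HDFS administrator), then take one
hdfs dfsadmin -allowSnapshot /data/warehouse
hdfs dfs -createSnapshot /data/warehouse backup_2016_10_05
```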
10-05-2016
06:24 PM
5 Kudos
@Ahmad Debbas
Set up the HDFS NFS Gateway and copy the SharePoint files. You could also use a basic script to PUT the files to HDFS; that would require an edge node that has access to the SharePoint repository and has the HDFS client installed. HDFS NFS Gateway: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html HDFS PUT: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put If you already use HDP and it is installed with Ambari, the HDFS NFS Gateway is just another service to add via Ambari. If the response was helpful, please vote/accept the best answer.
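For the scripted PUT option, a minimal sketch run from the edge node; the local export directory and the HDFS target path are placeholders:

```bash
# Create the target directory and upload the exported SharePoint files from the edge node
hdfs dfs -mkdir -p /data/sharepoint
hdfs dfs -put /mnt/sharepoint_export/* /data/sharepoint/
```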