Member since 10-06-2015 | 42 Posts | 23 Kudos Received | 4 Solutions
07-07-2016
01:45 PM
The primary design choice here is whether CPU scheduling (DRF) is needed. In clusters whose nodes have varying CPU capacities, throughput differences can cause network socket timeouts, so the timeout settings may need to be increased. Another aspect is to ensure that each node has enough memory headroom left after the YARN allocation to prevent CPU hangs on the less capable nodes (typically 80% for YARN and 20% for the OS and other services). Since one of the nodes has only 12 GB of RAM, you may also want to closely monitor the memory usage of processes that YARN is not aware of, especially the Ambari agent and Ambari Metrics, and check whether it grows over time.
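To make the 80/20 split concrete, here is a minimal sketch of the arithmetic in Python. The function name and the 64 GB node are hypothetical; only the 12 GB node and the 80% figure come from the post above. The result is the kind of value you might consider for `yarn.nodemanager.resource.memory-mb` on each node, subject to what else runs there.

```python
def yarn_memory_mb(node_ram_gb: float, yarn_fraction: float = 0.8) -> int:
    """Return the memory (in MB) to give YARN on one node.

    yarn_fraction is the share of RAM handed to YARN; the rest is
    headroom for the OS, Ambari agent, Ambari Metrics, and so on.
    """
    if not 0 < yarn_fraction < 1:
        raise ValueError("yarn_fraction must be between 0 and 1")
    return int(node_ram_gb * 1024 * yarn_fraction)


# Hypothetical heterogeneous cluster: only the 12 GB node is from the post.
nodes_gb = {"small-node": 12, "large-node": 64}

for name, ram_gb in nodes_gb.items():
    yarn_mb = yarn_memory_mb(ram_gb)
    headroom_mb = ram_gb * 1024 - yarn_mb
    print(f"{name}: YARN {yarn_mb} MB, headroom {headroom_mb} MB")
```

On the 12 GB node this leaves roughly 2.4 GB of headroom, which is why the non-YARN processes on that node are worth watching.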
07-05-2016
06:29 PM
1 Kudo
Hi, please check whether the sqoop list-tables command works with the same connection details. If it does not, there is a connectivity issue from the Sqoop client machine to the database server. The query that is failing here is not the query Sqoop uses to extract data, but the query it uses to collect metadata from the remote database. Sqoop then uses this metadata to create the Hadoop writers. So this appears to be a connectivity issue, and this can be confirmed by running the list-tables variant of sqoop first.
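As a rough sketch of that connectivity check, the snippet below builds the `sqoop list-tables` command line in Python so it can be printed or passed to `subprocess.run`. The JDBC URL, host, and user name are placeholders and not taken from the original question; `--connect`, `--username`, and `-P` (prompt for the password) are standard Sqoop arguments.

```python
import shlex
from typing import List


def sqoop_list_tables_cmd(jdbc_url: str, username: str) -> List[str]:
    """Build the argv for a Sqoop metadata-only connectivity check."""
    return [
        "sqoop", "list-tables",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of putting it on the command line
    ]


# Placeholder connection details: substitute the ones used by the failing import.
cmd = sqoop_list_tables_cmd("jdbc:mysql://db.example.com:3306/sales", "etl_user")
print(shlex.join(cmd))
```

If this command fails from the Sqoop client machine with the same connection details, the import will fail too, before any data is read.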
04-19-2016
05:32 PM
1 Kudo
The NiFi documentation seems to indicate transfer rates of around 50 MB/s to 100 MB/s: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/performance-expectations-and-characteristics-of-nifi.html NiFi is useful if there are several source databases from which data needs to be extracted very frequently, as it helps with monitoring and workflow maintenance. If some of this data needs to be routed to different tables based on a column's value, for instance, NiFi is a good choice, as Sqoop won't support this by default. NiFi is also a good choice if data needs to be moved to multiple destinations, for example landing data in HDFS while moving part of it to Kafka/Storm or Spark. NiFi can also schedule these flows easily, while with Sqoop scheduling has to be set up through crontab, Control-M, etc. Sqoop can use mappers in Hadoop for fault tolerance and parallelism, and may achieve better rates. If deduplication etc. is to be performed, NiFi becomes a choice for smaller data sizes. For large table loads, Sqoop is a good choice.
04-18-2016
03:59 PM
In the beeline command, please check that the Hive principal name is set correctly and matches the cluster settings. Also ensure that the Kerberos ticket is still valid.
!connect jdbc:hive2://sandbox.hortonworks.com:10000/default;principal=hive/_HOST@REALM.COM
04-11-2016
06:57 PM
1 Kudo
Please check whether this issue is a duplicate of the one discussed here: https://community.hortonworks.com/questions/12663/hdp-install-issues-about-hdp-select.html
04-11-2016
04:01 PM
1 Kudo
Hi, if the tables are large (say, multi-terabyte), then managing the table ingest through Sqoop and partitioned Hive tables is the best option from a performance standpoint. Although there are CDC tools like Oracle GoldenGate, which writes to HBase and handles frequent updates in near-real time, the number of regions per region server in HBase will grow very rapidly when there are large tables. The maximum throughput achievable is only around 10,000 transactions per second (TPS) for the near-real-time CDC replication processes. If CDC fails for a few days, the record changes that accumulated need to be applied and the system needs to catch up. Please check the four-step incremental update strategy for Hive for large table updates, as documented in the following link. This process merges the existing data in the tables with the new/changed data from the sources. http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
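To illustrate the merge idea only, here is a small Python sketch of how existing rows and new/changed rows can be reconciled by keeping the most recently modified version of each key. This is not the HiveQL from the linked blog. The column names `id` and `modified_date`, the function name, and the sample rows are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def reconcile(base: Iterable[Row], incremental: Iterable[Row],
              key: str = "id", modified: str = "modified_date") -> List[Row]:
    """Merge base and incremental rows, keeping the latest row per key."""
    latest: Dict[Any, Row] = {}
    # Base rows come first, so an incremental row with an equal or newer
    # timestamp replaces the base row for the same key.
    for row in list(base) + list(incremental):
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical sample data; ISO dates compare correctly as strings.
base_rows = [
    {"id": 1, "status": "open", "modified_date": "2016-04-01"},
    {"id": 2, "status": "open", "modified_date": "2016-04-01"},
]
changed_rows = [
    {"id": 2, "status": "closed", "modified_date": "2016-04-10"},  # update
    {"id": 3, "status": "open", "modified_date": "2016-04-10"},    # insert
]

merged = reconcile(base_rows, changed_rows)
for row in merged:
    print(row)
```

In Hive the same effect is normally achieved with a view or query over the base and incremental tables, so the merge runs in the cluster rather than on a client.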
12-07-2015
08:49 PM
Hi Neeraj, Trifacta seems to be a data wrangling tool. Does it also provide data quality measures out of the box (OOTB)?
12-07-2015
06:38 PM
1 Kudo
Hi, I am looking for best practices around data quality testing for Hive/Pig/Oozie-based ETL. The client is looking at tools like DataFlux data quality for Hadoop. If there are any alternative recommendations, please update this question.
11-04-2015
03:06 PM
After compacting the tables (major compaction per partition, HDP 2.2.4.2-2), we got the right number of Tez mappers, so this appears to be a bug related to compaction. Alex, yes, they are both submitting to the same queue. ACID transactions are broken until further notice. I am looking for more details on why they are broken. If you have details, please send me a note.
10-21-2015
03:14 PM
When a table is partitioned and bucketed and has transactions enabled, Tez launches only 2 map tasks, while MR jobs still launch 72 tasks (the table is about 17 GB). If transactions are not enabled, the query launches the correct number of Tez tasks. If there are any hints on why this may occur, please share.
Labels: Apache Hive, Apache Tez