Member since: 10-06-2015
Posts: 42
Kudos Received: 23
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1431 | 11-07-2016 10:11 PM |
| | 1021 | 04-19-2016 05:32 PM |
| | 2343 | 04-11-2016 06:57 PM |
| | 2040 | 11-04-2015 03:06 PM |
07-07-2016
01:45 PM
The primary design choice to make here is whether CPU scheduling (DRF) is needed or not. In clusters with varying CPU capacities, throughput differences between nodes can lead to network socket timeouts, so the timeout settings may need to be increased. Another aspect is to ensure that each node has enough memory headroom left after the YARN allocation to prevent hangs on the less capable nodes (typically around 80% for YARN and 20% for the OS and other services). Since one of the nodes has only 12 GB of RAM, you may also want to closely monitor the memory usage of processes that YARN is not aware of, especially the Ambari agent and Ambari Metrics, and check whether their footprint grows over time.
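A minimal sketch of how to spot-check this, assuming the 80/20 split above (roughly 9-10 GB handed to YARN on the 12 GB node) and standard Linux tooling:

```bash
# Hedged sketch: spot-check memory headroom on the 12 GB node.
# Assumes ~80% of RAM is given to YARN, leaving ~2-3 GB for the OS,
# the Ambari agent, and Ambari Metrics.
free -m                                  # total vs. available memory on the node
ps aux | grep -i ambari | grep -v grep   # watch the RSS column of the Ambari processes over time
```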
07-05-2016
06:29 PM
1 Kudo
Hi, please check whether the Sqoop list-tables option works against the same connection details. If it does not, there is a connectivity issue from the Sqoop client machine to the database server. The query failing here is not the query Sqoop uses to extract data, but the one it uses to collect metadata from the remote database; Sqoop then uses that metadata to create the Hadoop writers. So this appears to be a connectivity issue, which can be confirmed by running the list-tables variant of Sqoop first.
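A minimal sketch of that check, assuming an Oracle source; the JDBC URL and username are placeholders to be replaced with the same connection details used in the failing job:

```bash
# Hedged sketch: if this fails, the problem is connectivity from the Sqoop
# client to the database, not the data-extraction query itself.
sqoop list-tables \
  --connect "jdbc:oracle:thin:@//dbhost:1521/ORCL" \
  --username scott \
  -P
```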
04-19-2016
05:32 PM
1 Kudo
The NiFi documentation indicates transfer rates of roughly 50-100 MB/s: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/performance-expectations-and-characteristics-of-nifi.html NiFi is useful when data needs to be extracted very frequently from several source databases, since it helps with monitoring and flow maintenance. If some of the data needs to be routed to different tables based on a column value, for instance, NiFi is a good choice, as Sqoop does not support this by default. NiFi is also a good choice when data needs to go to multiple destinations, for example landing data in HDFS while sending part of it to Kafka, Storm, or Spark. NiFi can schedule these flows easily, whereas Sqoop has to be scheduled externally through cron, Control-M, or a similar tool. Sqoop uses Hadoop mappers for fault tolerance and parallelism and may achieve better transfer rates. If deduplication or similar processing is needed, NiFi is the better choice for smaller data sizes; for large table loads, Sqoop is a good choice.
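For the parallelism point, a minimal sketch of a bulk Sqoop load; the connection string, table, and column names are placeholders:

```bash
# Hedged sketch: -m controls the number of parallel map tasks and
# --split-by should name an evenly distributed key so the splits balance.
sqoop import \
  --connect "jdbc:mysql://dbhost:3306/sales" \
  --username etl -P \
  --table orders \
  --split-by order_id \
  -m 8 \
  --target-dir /data/raw/orders
```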
04-18-2016
03:59 PM
In the beeline command, please check that the Hive principal name is set correctly and matches the cluster settings. Also ensure that the Kerberos ticket is still valid. !connect jdbc:hive2://sandbox.hortonworks.com:10000/default;principal=hive/_HOST@REALM.COM
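A minimal sketch of the full check; the user and realm are placeholders:

```bash
# Hedged sketch: confirm the Kerberos ticket first, then connect with the
# principal that matches the cluster's Hive service principal.
klist                  # should show a non-expired ticket
kinit user@REALM.COM   # renew the ticket if it has expired
beeline -u "jdbc:hive2://sandbox.hortonworks.com:10000/default;principal=hive/_HOST@REALM.COM"
```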
04-11-2016
06:57 PM
1 Kudo
https://community.hortonworks.com/questions/12663/hdp-install-issues-about-hdp-select.html Please check whether this issue is a duplicate of the question in the link above.
04-11-2016
04:01 PM
1 Kudo
Hi, if the tables are large (say, multiple terabytes), then managing the table ingest through Sqoop and partitioned Hive tables is the best option from a performance standpoint. There are CDC tools like Oracle GoldenGate that write to HBase and handle frequent updates in near real time, but with large tables the number of regions per region server in HBase grows very rapidly, and the near-real-time CDC replication processes only achieve around 10,000 transactions per second. In case of a CDC failure lasting a few days, all of the accumulated record changes have to be applied before the system catches up. Please check the four-step incremental update strategy for large Hive table updates documented in the link below; this process merges the existing table data with the new/changed data from the sources. http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
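A minimal sketch of the reconcile step from that strategy, assuming placeholder names (base_table, incremental_table, id, modified_ts) and a $HIVE_JDBC_URL variable holding the cluster's JDBC URL:

```bash
# Hedged sketch: keep the newest version of each row across the existing
# base table and the freshly loaded incremental data.
beeline -u "$HIVE_JDBC_URL" -e "
CREATE VIEW IF NOT EXISTS reconcile_view AS
SELECT t.*
FROM (SELECT * FROM base_table
      UNION ALL
      SELECT * FROM incremental_table) t
JOIN (SELECT id, MAX(modified_ts) AS max_ts
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) sub
      GROUP BY id) latest
ON t.id = latest.id AND t.modified_ts = latest.max_ts;
"
```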
12-07-2015
08:49 PM
Hi Neeraj, Trifacta seems to be a data-wrangling tool; does it also provide data quality measures out of the box?
12-07-2015
06:38 PM
1 Kudo
Hi, I am looking for best practices around data quality testing for Hive/Pig/Oozie-based ETL. The client is looking at tools like DataFlux Data Quality for Hadoop. If there are any alternative recommendations, please add them to this question.
Labels:
11-04-2015
03:06 PM
After compacting the tables (major compaction per partition) on HDP 2.2.4.2-2, we got the right number of Tez mappers, so this appears to be a bug related to compaction. Alex, yes, they are both submitting to the same queue. ACID transactions are broken until further advisory; I am looking for more details on why they are broken. If you have details, please send me a note.
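For reference, a minimal sketch of a per-partition major compaction; the table name, partition spec, and $HIVE_JDBC_URL are placeholders:

```bash
# Hedged sketch: request a major compaction for one partition, then watch
# its progress in the compaction queue.
beeline -u "$HIVE_JDBC_URL" -e "
ALTER TABLE my_txn_table PARTITION (dt='2015-11-01') COMPACT 'major';
SHOW COMPACTIONS;
"
```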
10-21-2015
03:14 PM
When a table is partitioned, bucketed, and has transactions enabled on it, the number of map tasks launched by Tez is 2, while MR still launches 72 tasks (the table is about 17 GB). If transactions are not enabled, the query launches the correct number of Tez tasks. If there are any hints on why this may occur, please share.
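For context, a minimal sketch of the kind of table definition involved; the table, columns, and bucket count are placeholders, not the actual DDL:

```bash
# Hedged sketch: a partitioned, bucketed, ORC-backed table with ACID
# transactions enabled, matching the general setup described above.
beeline -u "$HIVE_JDBC_URL" -e "
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
"
```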
Labels:
- Apache Hive
- Apache Tez