About mqureshi

mqureshi · ‎07-19-2017

@Suhel How many users connect to your HiveServer2 concurrently? That determines how much memory you need. Per the Hortonworks recommendations, 20 concurrent users need only 6 GB, 10 concurrent connections need 4 GB, and a single connection needs 2 GB, so you definitely don't want to go below that.

When you allocate too much memory, you run into what are called "stop-the-world" garbage collection pauses. You can google more on this, but basically the JVM needs to move objects and update the references to them. If it moves an object before updating the references, and the running application accesses the object through an old reference, there is trouble. If it updates the references first and then moves the object, the updated references are wrong until the object has moved, and any access in the meantime causes problems. For both the CMS and Parallel collectors, the young generation collection algorithm is similar and it is stop-the-world, that is, the application is stopped while collection is happening. When you allocate too much memory, like 24 GB, the stop-the-world pause takes longer, hence your application fails.

Also, your metastore does not need the same amount of memory as HiveServer2. They are two different processes. If the metastore is also running into similar issues, you can set it to 8 GB or less; that's still a lot of memory for just the metastore.
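The sizing figures above can be sketched as a small lookup. This is only an illustration: the class and method names are hypothetical, the three data points (1 connection: 2 GB, 10 connections: 4 GB, 20 connections: 6 GB) are the ones quoted in the post, and rounding up to the next quoted tier is an assumption. Anything above 20 connections is outside the quoted figures, so the sketch refuses to guess.

```java
/** Hypothetical helper: maps concurrent HiveServer2 connections to the heap sizes quoted above. */
class HiveServer2HeapSizing {

    /**
     * Returns the quoted heap size in GB for the given number of concurrent connections.
     * Assumption: counts between the quoted data points round up to the next tier.
     */
    static int recommendedHeapGb(int concurrentConnections) {
        if (concurrentConnections < 1) {
            throw new IllegalArgumentException("need at least one connection");
        }
        if (concurrentConnections == 1) {
            return 2;
        }
        if (concurrentConnections <= 10) {
            return 4;
        }
        if (concurrentConnections <= 20) {
            return 6;
        }
        // The post quotes no figure beyond 20 users, so do not extrapolate.
        throw new IllegalArgumentException("no quoted figure above 20 connections");
    }

    public static void main(String[] args) {
        for (int users : new int[] {1, 10, 20}) {
            System.out.println(users + " connection(s): " + recommendedHeapGb(users) + " GB heap");
        }
    }
}
```

The resulting number would go into the HiveServer2 heap size setting; the metastore heap is configured separately, as noted above.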

mqureshi · ‎07-19-2017

@Bala Vignesh N V Why not use a filter like the following?

val header = data.first
val rows = data.filter(line => line != header)
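The same filter, sketched with plain Java collections rather than Spark so it runs without a cluster. The class name and sample rows are made up for illustration. One thing the sketch makes visible: comparing every line against the header also drops any data row that happens to be identical to the header.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for the RDD filter above, using a plain List instead of Spark. */
class HeaderFilter {

    /** Drops every line equal to the first line (the header), mirroring the filter on line != header. */
    static List<String> dropHeader(List<String> lines) {
        if (lines.isEmpty()) {
            return lines;
        }
        String header = lines.get(0);
        return lines.stream()
                .filter(line -> !line.equals(header))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data: a header row followed by two records.
        List<String> data = Arrays.asList("id,name", "1,alice", "2,bob");
        System.out.println(dropHeader(data));
    }
}
```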

mqureshi · ‎07-19-2017

@Jobin George On your new node, do you have a flow.xml.gz? If yes, can you delete it and try adding the node again?

mqureshi · ‎07-18-2017

Please see the following link. In your code, you'll need to do a "repartition". What I am trying to say is that if you force more data to the same reducer, you will create fewer files. Call the repartition function on a key so that all data for that key lands in the same partition. https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by

mqureshi · ‎07-18-2017

@Krishna S To use these components without HDFS, you need a file system that supports the Hadoop API. Some such systems are Amazon S3, WASB, EMC Isilon and a few others (these systems might not implement 100 percent of the Hadoop API, so please verify). You can also install Hadoop in standalone mode, which does not use HDFS. I am not sure NFS on its own supports the Hadoop API, but using the Hadoop NFS gateway you can mount HDFS as part of the client's local file system. Here is a link on using this feature. https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.htm

mqureshi · ‎07-18-2017

Also use DISTRIBUTE BY so that data for the same partition goes to the same reducer.

mqureshi · ‎07-18-2017

@Upendra N I think you probably realize that what makes SCD Type 2 difficult in Hadoop (Hive/Pig) is that you cannot update records. (With the new Hive ACID you can, but under the hood it's doing magic that you can also do yourself.) Rather than reprinting the process here, here is a link that describes implementing SCD Type 2 in Hadoop using Hive. Hope this helps. https://www.softserveinc.com/en-us/tech/blogs/process-slowly-changing-dimensions-hive/

mqureshi · ‎07-10-2017

@Prakhar Agrawal In your code you have only two properties. Where in your processor code are you handling the additional properties that you are getting the error for, for example "databaseName"? I don't see this property in your code.

static final PropertyDescriptor MyPropertyDescriptor = new PropertyDescriptor.Builder()
    .name("Print User Input")
    .description("It prints the user input")
    .required(true)
    .build();

static final PropertyDescriptor n = new PropertyDescriptor.Builder()
    .name("Num Rows to Print")
    .description("number of rows to be printed")
    .required(true)
    .build();

mqureshi · ‎07-10-2017

@Prakhar Agrawal Can you please share the code for the PropertyDescriptors in your custom processor and show how you are handling them in the "onTrigger()" method?

mqureshi · ‎07-07-2017

@Karan Alang I have not done this, but it seems that a collector is already there, at least by that name. Can you change the name and try it?

Last Visited	‎10-31-2017 03:17 AM

Member Since	‎06-07-2016 09:05 AM
Posts	923
Kudos received	310

Cloudera Community

Re: YARN recommended configuration

Re: How to resolve for NULL values when they are c...

Re: Why is spark has better speed than Hadoop

Re: Is it possible to assign Hadoop queues to Hado...

Re: Kafka NiFi HDF Installation

Re: Should the HiveServer2 Heap Size and Metastore...

Re: Removing header from CSV file through pyspark

Re: HDF 3.0 - Issue with Adding a new NiFi Node(s)...

Re: How to reduce the small problem in spark using...

Re: How to Install Hortonworks entire ecosystem wi...

Re: How to reduce the small problem in spark using...

Re: Best and Easy way to implement and create SCD2...

Re: I am getting error "not a supported property" ...

Re: I am getting error "not a supported property" ...

Re: Error in Kafka startup with the JMX exporter -...