Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3244 | 10-18-2017 10:19 PM |
 | 3604 | 10-18-2017 09:51 PM |
 | 13207 | 09-21-2017 01:35 PM |
 | 1327 | 08-04-2017 02:00 PM |
 | 1681 | 07-31-2017 03:02 PM |
09-26-2017
12:26 PM
1 Kudo
@Sheetal Sharma Not for data nodes. For some master-node processes like the Hive Metastore, yes. Also, use RAID for all OS disks; you don't want a node failure just because one OS disk fails. As for data nodes, HDFS already keeps three copies of the data on different machines, so you don't need RAID there. In fact, RAID will reduce performance, since RAID performance is determined by the slowest disk. The same goes for ZooKeeper and the Quorum Journal Manager: they run redundant processes on three different nodes with three different disks, so you don't need RAID for them either.
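If you want to confirm that the redundancy really lives in HDFS rather than in the disks, you can check a file's replication factor. A minimal sketch, assuming the hdfs CLI is on the PATH and using a hypothetical file path:

```python
# Print the HDFS replication factor for a file; "%r" is the stat format for replication.
# The path below is only an example.
import subprocess

out = subprocess.run(
    ["hdfs", "dfs", "-stat", "%r", "/data/example/part-00000"],
    capture_output=True, text=True, check=True,
)
print("replication factor:", out.stdout.strip())  # 3 by default on most clusters
```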
09-23-2017
06:41 PM
@Biswajit Chakraborty
You will use "Rolling filename Pattern" property which in your case can be set to "my-app-*.log. Another thing in your use case you will do is to specify "filesToTail" property. Use expression language to specify your files to tail. https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates
09-21-2017
01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS), while Spark is a compute engine that runs on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created alongside HDFS. So when Hadoop was created there were only two things: HDFS, where data is stored, and MapReduce, which was the only compute engine on HDFS. To understand why Spark is faster than MapReduce, you need to understand how both of them work.

When a MapReduce job starts, the first step is to read data from disk and run the mappers. The output of the mappers is written back to disk. Then the shuffle-and-sort step starts, reads the mapper output from disk, and after it completes, writes the result back to disk (there is also some network traffic when the keys for the reduce step are gathered on the same node, but that is true for Spark as well, so let's focus on the disk steps only). Finally the reduce step starts, reads the output of shuffle-and-sort, and writes the result back to HDFS. That's six disk accesses to complete the job, and most Hadoop clusters have 7200 RPM disks, which are very slow.

Now, here is how Spark works. Just as a MapReduce job needs mappers and reducers, Spark has two types of operations: transformations and actions. When you write a Spark job, it consists of a number of transformations and a few actions. When the job starts, Spark builds a DAG (directed acyclic graph) of the steps it is supposed to run. Suppose the first five steps are transformations: Spark remembers them in the DAG but does not go to disk to perform them. Then it encounters an action. At that point the job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps that result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. So, two disk accesses, and that is what makes Spark faster.

There are other things in Spark that make it faster than MapReduce. For example, its rich API lets you accomplish in one Spark job what might require two or more MapReduce jobs running one after the other; imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data and will be slow, but it may still not be as slow as MapReduce thanks to that richer API.
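To make the transformation/action distinction concrete, here is a minimal PySpark sketch (the HDFS paths are hypothetical, not from the original question). Nothing is read from disk until the final action runs.

```python
# Minimal PySpark sketch: transformations only record steps in the DAG;
# Spark reads the input and runs the pipeline in memory only when the
# action at the end is called, then writes the result once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input.txt")        # transformation: nothing read yet
words = lines.flatMap(lambda line: line.split())     # transformation
pairs = words.map(lambda word: (word, 1))            # transformation
counts = pairs.reduceByKey(lambda a, b: a + b)       # transformation (adds a shuffle stage)

# Action: only now does Spark read the input, run every step keeping intermediate
# results in memory, and write the output: two disk touches for the whole job.
counts.saveAsTextFile("hdfs:///data/output")
```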
09-20-2017
09:28 PM
@Pooja Kamle Check whether your metastore is running. Also check your MySQL process; it might be down.
09-20-2017
06:21 AM
@Pooja Kamle Is there something already running on port 9083? What is the output of "netstat -nlp | grep 9083"?
09-19-2017
01:55 PM
@sally sally Please increase the Minimum Number of Entries to something greater than 1 (I would start with 10). Also increase the Minimum Group Size. In your case, your first file looks like it is 72 KB while your Minimum Group Size is 10 KB, so one file alone satisfies the minimum group size condition; combined with a Minimum Number of Entries of 1, the merge condition is already satisfied.
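For illustration, here is a rough plain-Python model of how those two thresholds interact (a toy sketch, not NiFi's actual MergeContent logic; the flowfile sizes are made up):

```python
# Toy model: a bin becomes eligible to merge once BOTH minimums are met.
# With Minimum Number of Entries = 1 and Minimum Group Size = 10 KB,
# a single 72 KB flowfile satisfies both, so it merges alone.
min_entries = 1
min_group_size_kb = 10

queued_flowfiles_kb = [72, 30, 15]   # hypothetical flowfile sizes

bin_kb = []
for size in queued_flowfiles_kb:
    bin_kb.append(size)
    if len(bin_kb) >= min_entries and sum(bin_kb) >= min_group_size_kb:
        print(f"merge fires with {bin_kb} (total {sum(bin_kb)} KB)")
        bin_kb = []
```

With both minimums raised, the bin has to accumulate several flowfiles before a merge can fire.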
09-19-2017
04:27 AM
@Vijay Parmar Are you using Hive on Spark? These libraries are under Hive, and if you are not using Hive on Spark, your other applications should not be affected. Regardless, I am not asking you to delete them; just move them to resolve this issue, and you can restore them in the unlikely event that anything else is impacted.
09-19-2017
03:25 AM
1 Kudo
@Vijay Parmar Here is your issue:
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-examples-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-hdp-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
You need to get rid of the 2nd, 3rd, and 4th bindings. For now, just move those jars to a backup location that is not on the CLASSPATH, then run this job. Afterwards, figure out whether anything else is using these jar files (definitely not number 3, since it is just the examples jar). From the names, it seems you will probably never need them: they would only be required for Hive on Spark, which I am guessing you are not using since you are on HDP, which uses Tez and LLAP.
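Before moving anything, you could double-check which jars under that lib directory actually bundle an SLF4J binding. A small plain-Python helper like this (the path is taken from your log; adjust as needed) would list them:

```python
# List jars under the Hive lib directory that bundle an SLF4J StaticLoggerBinder,
# i.e. the candidates SLF4J is complaining about in the log above.
import glob
import zipfile

for jar in sorted(glob.glob("/usr/hdp/2.4.2.0-258/hive/lib/*.jar")):
    try:
        with zipfile.ZipFile(jar) as z:
            if "org/slf4j/impl/StaticLoggerBinder.class" in z.namelist():
                print(jar)
    except zipfile.BadZipFile:
        # Skip files that are not valid jar/zip archives.
        pass
```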
09-11-2017
02:54 AM
@Bhaskar Das So you want to know, once the mappers have completed and data is being transferred to the reducers, how many copies occur, right? After the mappers complete, data is sent to the reducers based on keys. Data for each key will land on one particular reducer and only that reducer, no matter which mapper it came from. One reducer may receive more than one key, but one key will always live on exactly one reducer.

So imagine the mappers output data on node 1, node 2, and node 3, and further assume there is a key "a" for which data is present in the mapper outputs on all three nodes. Imagine reducers running on each of the three nodes (three reducers in total), and suppose the data for key "a" is going to node 3. Then the data from node 1 and node 2 will be copied to node 3 as reducer input. In fact, the data from node 3 will also be copied into a folder where the reducer can pick it up (a local copy, unlike the over-the-network copies from node 1 and node 2). So really three copies occurred with 3 mappers and 1 reducer. If you follow this logic of how copies are driven by keys, you arrive at m*n copies.

Please see the picture in the following link (MapReduce data flow); it should visually answer what I have described above. Hope this helps. https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
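Here is a toy plain-Python sketch of that key-to-reducer routing (the mapper outputs are made up, and this is not Hadoop code) showing how the copy count is bounded by m*n:

```python
# Toy sketch: a partitioner maps each key to exactly one reducer, so every mapper
# that produced data for that key copies its segment to that one reducer.
# With m mappers and n reducers, that is at most m * n copies.
import zlib

mapper_outputs = {
    "node1": {"a": 3, "b": 1},
    "node2": {"a": 2, "c": 4},
    "node3": {"a": 5, "b": 2, "c": 1},
}
num_reducers = 3

def reducer_for(key):
    # Partitioner: the same key always goes to the same reducer,
    # regardless of which mapper produced it.
    return zlib.crc32(key.encode()) % num_reducers

# One copy per (mapper, reducer) pair that has data to send.
copies = {(mapper, reducer_for(key))
          for mapper, keys in mapper_outputs.items()
          for key in keys}

for mapper, reducer in sorted(copies):
    print(f"{mapper} -> reducer {reducer}")
print("total copies:", len(copies),
      "(bounded by m*n =", len(mapper_outputs) * num_reducers, ")")
```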
08-17-2017
10:01 PM
@Qi Wang Have you set up a truststore and then trusted SAM as an application that can connect to Ambari? I have not set this up myself, but not setting up a truststore and "trusting" SAM could be the reason for your error. Check the troubleshooting section in the following link: https://community.hortonworks.com/articles/39865/enabling-https-for-ambariserver-and-troubleshootin.html