Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3244 | 10-18-2017 10:19 PM |
 | 3604 | 10-18-2017 09:51 PM |
 | 13207 | 09-21-2017 01:35 PM |
 | 1327 | 08-04-2017 02:00 PM |
 | 1681 | 07-31-2017 03:02 PM |
09-26-2017
12:26 PM
1 Kudo
@Sheetal Sharma Not for data nodes. For some master-node processes like the Hive Metastore, yes. Also, use RAID for all OS disks; you don't want a node failure just because one OS disk fails. As for data nodes, HDFS already keeps three copies of the data on different machines, so you don't need RAID there. In fact, RAID will reduce performance, since RAID performance is determined by the slowest disk. The same goes for ZooKeeper and the Quorum Journal Manager: they run redundant processes on three different nodes with three different disks, so you don't need RAID for them either.
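If you want to confirm that the redundancy really lives in HDFS rather than in the disks, you can check a file's replication factor. A minimal sketch, assuming the hdfs CLI is on the PATH and using a hypothetical file path:

```python
# Print the HDFS replication factor for a file; "%r" is the stat format for replication.
# The path below is only an example.
import subprocess

out = subprocess.run(
    ["hdfs", "dfs", "-stat", "%r", "/data/example/part-00000"],
    capture_output=True, text=True, check=True,
)
print("replication factor:", out.stdout.strip())  # 3 by default on most clusters
```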
09-23-2017
06:41 PM
@Biswajit Chakraborty
You will use "Rolling filename Pattern" property which in your case can be set to "my-app-*.log. Another thing in your use case you will do is to specify "filesToTail" property. Use expression language to specify your files to tail. https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates
09-21-2017
01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS), while Spark is a compute engine that runs on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created alongside HDFS. So when Hadoop was created there were only two things: HDFS, where data is stored, and MapReduce, which was the only compute engine on HDFS. To understand why Spark is faster than MapReduce, you need to understand how both of them work.

When a MapReduce job starts, the first step is to read data from disk and run the mappers. The output of the mappers is written back to disk. Then the shuffle-and-sort step starts, reads the mapper output from disk, and after it completes, writes the result back to disk (there is also some network traffic when the keys for the reduce step are gathered on the same node, but that is true for Spark as well, so let's focus on the disk steps only). Finally the reduce step starts, reads the output of shuffle-and-sort, and writes the result back to HDFS. That's six disk accesses to complete the job, and most Hadoop clusters have 7200 RPM disks, which are very slow.

Now, here is how Spark works. Just as a MapReduce job needs mappers and reducers, Spark has two types of operations: transformations and actions. When you write a Spark job, it consists of a number of transformations and a few actions. When the job starts, Spark builds a DAG (directed acyclic graph) of the steps it is supposed to run. Suppose the first five steps are transformations: Spark remembers them in the DAG but does not go to disk to perform them. Then it encounters an action. At that point the job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps that result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. So, two disk accesses, and that is what makes Spark faster.

There are other things in Spark that make it faster than MapReduce. For example, its rich API lets you accomplish in one Spark job what might require two or more MapReduce jobs running one after the other; imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data and will be slow, but it may still not be as slow as MapReduce thanks to that richer API.
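To make the transformation/action distinction concrete, here is a minimal PySpark sketch (the HDFS paths are hypothetical, not from the original question). Nothing is read from disk until the final action runs.

```python
# Minimal PySpark sketch: transformations only record steps in the DAG;
# Spark reads the input and runs the pipeline in memory only when the
# action at the end is called, then writes the result once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input.txt")        # transformation: nothing read yet
words = lines.flatMap(lambda line: line.split())     # transformation
pairs = words.map(lambda word: (word, 1))            # transformation
counts = pairs.reduceByKey(lambda a, b: a + b)       # transformation (adds a shuffle stage)

# Action: only now does Spark read the input, run every step keeping intermediate
# results in memory, and write the output: two disk touches for the whole job.
counts.saveAsTextFile("hdfs:///data/output")
```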
09-20-2017
09:28 PM
@Pooja Kamle Check whether your metastore is running. Also check your MySQL process; it might be down.
09-20-2017
06:21 AM
@Pooja Kamle Is there something already running on port 9083? What is the output of "netstat -nlp | grep 9083"?
09-19-2017
01:55 PM
@sally sally Please increase the Minimum Number of Entries to something greater than 1 (I would start with 10). Also increase the Minimum Group Size. In your case, your first file looks like it is 72 KB while your Minimum Group Size is 10 KB, so one file alone satisfies the minimum group size condition; combined with a Minimum Number of Entries of 1, the merge condition is already satisfied.
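For illustration, here is a rough plain-Python model of how those two thresholds interact (a toy sketch, not NiFi's actual MergeContent logic; the flowfile sizes are made up):

```python
# Toy model: a bin becomes eligible to merge once BOTH minimums are met.
# With Minimum Number of Entries = 1 and Minimum Group Size = 10 KB,
# a single 72 KB flowfile satisfies both, so it merges alone.
min_entries = 1
min_group_size_kb = 10

queued_flowfiles_kb = [72, 30, 15]   # hypothetical flowfile sizes

bin_kb = []
for size in queued_flowfiles_kb:
    bin_kb.append(size)
    if len(bin_kb) >= min_entries and sum(bin_kb) >= min_group_size_kb:
        print(f"merge fires with {bin_kb} (total {sum(bin_kb)} KB)")
        bin_kb = []
```

With both minimums raised, the bin has to accumulate several flowfiles before a merge can fire.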
09-19-2017
04:27 AM
@Vijay Parmar Are you using Hive on Spark? These libraries are under Hive, and if you are not using Hive on Spark, your other applications should not be affected. Regardless, I am not asking you to delete them; just move them to resolve this issue, and you can restore them in the unlikely event that anything else is impacted.
09-19-2017
03:25 AM
1 Kudo
@Vijay Parmar Here is your issue:
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-examples-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hive/lib/spark-hdp-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
You need to get rid of the 2nd, 3rd, and 4th bindings. For now, just move those jars to a backup location that is not on the CLASSPATH, then run this job. Afterwards, figure out whether anything else is using these jar files (definitely not number 3, since it is just the examples jar). From the names, it seems you will probably never need them: they would only be required for Hive on Spark, which I am guessing you are not using since you are on HDP, which uses Tez and LLAP.
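Before moving anything, you could double-check which jars under that lib directory actually bundle an SLF4J binding. A small plain-Python helper like this (the path is taken from your log; adjust as needed) would list them:

```python
# List jars under the Hive lib directory that bundle an SLF4J StaticLoggerBinder,
# i.e. the candidates SLF4J is complaining about in the log above.
import glob
import zipfile

for jar in sorted(glob.glob("/usr/hdp/2.4.2.0-258/hive/lib/*.jar")):
    try:
        with zipfile.ZipFile(jar) as z:
            if "org/slf4j/impl/StaticLoggerBinder.class" in z.namelist():
                print(jar)
    except zipfile.BadZipFile:
        # Skip files that are not valid jar/zip archives.
        pass
```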
09-11-2017
02:54 AM
@Bhaskar Das So you want to know, once the mappers have completed and data is being transferred to the reducers, how many copies occur, right? After the mappers complete, data is sent to the reducers based on keys. Data for each key will land on one particular reducer and only that reducer, no matter which mapper it came from. One reducer may receive more than one key, but one key will always live on exactly one reducer.

So imagine the mappers output data on node 1, node 2, and node 3, and further assume there is a key "a" for which data is present in the mapper outputs on all three nodes. Imagine reducers running on each of the three nodes (three reducers in total), and suppose the data for key "a" is going to node 3. Then the data from node 1 and node 2 will be copied to node 3 as reducer input. In fact, the data from node 3 will also be copied into a folder where the reducer can pick it up (a local copy, unlike the over-the-network copies from node 1 and node 2). So really three copies occurred with 3 mappers and 1 reducer. If you follow this logic of how copies are driven by keys, you arrive at m*n copies.

Please see the picture in the following link (MapReduce data flow); it should visually answer what I have described above. Hope this helps. https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
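Here is a toy plain-Python sketch of that key-to-reducer routing (the mapper outputs are made up, and this is not Hadoop code) showing how the copy count is bounded by m*n:

```python
# Toy sketch: a partitioner maps each key to exactly one reducer, so every mapper
# that produced data for that key copies its segment to that one reducer.
# With m mappers and n reducers, that is at most m * n copies.
import zlib

mapper_outputs = {
    "node1": {"a": 3, "b": 1},
    "node2": {"a": 2, "c": 4},
    "node3": {"a": 5, "b": 2, "c": 1},
}
num_reducers = 3

def reducer_for(key):
    # Partitioner: the same key always goes to the same reducer,
    # regardless of which mapper produced it.
    return zlib.crc32(key.encode()) % num_reducers

# One copy per (mapper, reducer) pair that has data to send.
copies = {(mapper, reducer_for(key))
          for mapper, keys in mapper_outputs.items()
          for key in keys}

for mapper, reducer in sorted(copies):
    print(f"{mapper} -> reducer {reducer}")
print("total copies:", len(copies),
      "(bounded by m*n =", len(mapper_outputs) * num_reducers, ")")
```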
08-17-2017
10:01 PM
@Qi Wang Have you set up a truststore and then trusted SAM as an application that can connect to Ambari? I have not set this up myself, but not setting up a truststore and "trusting" SAM could be the reason for your error. Check the troubleshooting section in the following link: https://community.hortonworks.com/articles/39865/enabling-https-for-ambariserver-and-troubleshootin.html