Member since: 10-24-2015
Posts: 171
Kudos Received: 379
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1611 | 06-26-2018 11:35 PM
 | 2962 | 06-12-2018 09:19 PM
 | 1973 | 02-01-2018 08:55 PM
 | 853 | 01-02-2018 09:02 PM
 | 4894 | 09-06-2017 06:29 PM
09-27-2018
11:09 PM
@vamsi krishna, can you please check if the ResourceManager is up?
09-06-2018
06:45 PM
@Alaa Nabil, the stack trace posted above is related to the NodeManager trying to recover a container. Do you know what this application (application_1536072828773_1909) is and what its status was?
07-12-2018
08:40 PM
Are you running a Distributed Shell application? The default client timeout for DShell is 600 seconds. You can extend the client timeout by passing "-timeout <milliseconds>" in the application launch command.
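For example, a launch command with an extended timeout might look like the sketch below (the jar path is the usual HDP client location and the 1800000 ms value is only an illustration; adjust both for your cluster):
yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar \
  -shell_command "sleep 900" -num_containers 1 -timeout 1800000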
06-29-2018
09:21 PM
2 Kudos
@kanna k, you can find the location of the standby NameNode logs from hadoop-env.sh. Look for the HADOOP_LOG_DIR value to find the correct log location. Example: export HADOOP_LOG_DIR=/var/log/hadoop/$USER In this example, the standby NameNode log will be under the /var/log/hadoop/hdfs directory.
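A quick way to check (the paths below are the usual HDP defaults and may differ on your cluster; NameNode log files are typically named hadoop-hdfs-namenode-<hostname>.log):
grep HADOOP_LOG_DIR /etc/hadoop/conf/hadoop-env.sh
ls /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log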
06-26-2018
11:35 PM
3 Kudos
@Kumar Veerappan, is the umask set properly in your cluster? Refer to the article below for details. https://community.hortonworks.com/content/supportkb/150234/error-path-disk3hadoopyarnlocalusercachesomeuserap.html
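As a quick sanity check (illustrative only; 0022 is the commonly recommended value), you can print the effective umask of the service users on the affected node:
su - yarn -c umask
su - hdfs -c umask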
06-26-2018
09:57 PM
1 Kudo
Here's one more good thread on the HDFS small file problem. https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html
06-26-2018
06:33 PM
@kanna k, you can find the location of the standby NameNode logs from hadoop-env.sh. Look for the HADOOP_LOG_DIR value to find the correct log location. Example: export HADOOP_LOG_DIR=/var/log/hadoop/$USER In this example, the standby NameNode log will be under the /var/log/hadoop/hdfs directory.
06-26-2018
06:28 PM
@Moti Ben Ivgi, please look at the thread below; you may be hitting an issue with the host setup. https://community.hortonworks.com/questions/26802/data-node-process-not-starting-up.html
06-25-2018
10:43 PM
I have a YARN service app which has two components, Master and Worker. I restarted YARN services and launched the YARN service app. I'm noticing that the app launched by YARN only got the Master component; it did not start any Worker. Can someone please explain why this could happen and how to recover from it?
Tags: Hadoop Core, YARN
Labels: Apache YARN
06-12-2018
09:19 PM
2 Kudos
The zookeeper.out file contains the log for the ZooKeeper server. You can refer to the thread below to enable log rotation for ZooKeeper so the log files do not grow too large. https://community.hortonworks.com/questions/39282/zookeeper-log-file-not-rotated.html
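A minimal sketch of the usual approach, assuming the stock ZooKeeper log4j.properties and zookeeper-env settings (the size and count values are illustrative):
# in zookeeper-env.sh (or the Ambari zookeeper-env template), switch from the console appender to the rolling file appender
export ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
# in ZooKeeper's log4j.properties, cap the size and number of rolled log files
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=10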
06-07-2018
09:14 PM
@Dukool Sharma, a reduce-only job is not possible with MapReduce. The reducer requires intermediate data in the form of key-value pairs from the mapper, so it's not possible to run just reducers without mappers.
02-23-2018
01:03 AM
@Anurag Mishra, unfortunately there is no YARN CLI or REST API available to validate a user-queue mapping without launching an application.
02-02-2018
09:19 PM
2 Kudos
@Renuka Peshwani, Spark is a data processing engine built for fast, easy-to-use analytics. The Spark ecosystem consists of the components below.
Spark Core: Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
Spark SQL / DataFrames: Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).
Spark Streaming: Many applications need the ability to process and analyze not only batch data, but also streams of new data in real time. Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark's ease of use and fault tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
Spark Machine Learning (MLlib): Machine learning is a critical piece of mining Big Data for actionable insights. Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce). The library is usable in Java, Scala, and Python as part of Spark applications, so you can include it in complete workflows.
GraphX: GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale. It comes complete with a library of common algorithms.
02-01-2018
08:55 PM
1 Kudo
@Michael Bronson, missing data blocks can be related to data corruption. Use 'hdfs fsck <path> -list-corruptfileblocks -files -locations' to find out which replicas got corrupted. Then, to fix the issue, you can delete the corrupted blocks using 'hdfs fsck / -delete'. I hope you find the thread below useful for handling missing blocks. https://community.hortonworks.com/questions/17917/best-way-of-handling-corrupt-or-missing-blocks.html
02-01-2018
08:45 PM
1 Kudo
@Pritam Pal, big data is a concept that refers to handling very large amounts of data, whereas Hadoop is an open source framework that can be used for distributed storage and processing of big data datasets. Here's a good link that elaborates the difference between big data and Hadoop nicely: https://www.quora.com/What-is-the-difference-between-big-data-and-Hadoop
02-01-2018
08:41 PM
2 Kudos
@Pritam Pal, a combiner is used to summarize map output records with the same key. This reduces the data transferred across the network between the mapper and the reducer. Hadoop doesn't guarantee how many times a combiner function will be called for each map output key: it may not be executed at all, or it may be used once, twice, or more times depending on the size and number of output files generated by the mapper for each reducer. https://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
02-01-2018
07:47 PM
2 Kudos
@Pritam Pal, Hadoop is a combination of HDFS (data storage), YARN (application execution framework), and MapReduce (data processing engine). Thus, it is not fair to compare Hadoop and Spark directly. MapReduce and Spark are comparable because both of them are data processing engines. Here is a good link that compares MapReduce and Spark in detail: https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/
01-20-2018
09:16 PM
1 Kudo
@Nayeem Janmohamed, it seems that the DataNodes in this cluster are not in a good state. Can you please restart the DataNodes and then restart the History Server?
01-16-2018
07:55 PM
1 Kudo
@AA MJ, how many DataNodes do you have in the cluster? As per the message, it seems you only have one DataNode, and that DataNode is excluded. Ideally, you should have at least 3 DataNodes to manage replication correctly. Can you please check the contents of /etc/hadoop/conf/dfs.hosts.exclude? If the DataNode has been added to the exclude host list, this behavior is expected. If the DataNode is not in the exclude list and you are still noticing the replication issue, you can restart the DataNode and try again.
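To confirm from the command line (the exclude file path below is the usual HDP default):
cat /etc/hadoop/conf/dfs.hosts.exclude
# 'hdfs dfsadmin -report' lists live, dead and decommissioning DataNodes
hdfs dfsadmin -report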
01-05-2018
07:41 PM
2 Kudos
@pbarna, you can set mapreduce.job.queuename=myqueue for a MapReduce job. https://community.hortonworks.com/content/supportkb/49658/how-to-specify-queue-name-submitting-mapreduce-job.html
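For example, with the bundled examples jar (the jar path and the input/output directories below are illustrative):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.queuename=myqueue /tmp/wc-input /tmp/wc-output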
01-02-2018
09:02 PM
1 Kudo
There are multiple ways you can perform operations on HDFS. You can choose any of the approaches below as per your need; an example of the first and third approaches follows.
1) Command line: Most users use the command line to interact with HDFS. The HDFS CLI is easy to use and easy to automate with scripts. However, it requires the HDFS client to be installed on the host.
2) Java API: If you are familiar with Java and the Apache APIs, you can choose the Java API to communicate with the HDFS cluster.
3) WebHDFS: This is the REST API way of accessing HDFS. This approach does not require the HDFS client to be installed on the host, and you can use it to connect to a remote HDFS cluster too.
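For example, listing the same directory with the CLI and with WebHDFS (the host name, port, and path are illustrative; on HDP 2.x WebHDFS is usually served on the NameNode HTTP port 50070):
hdfs dfs -ls /user/myuser
curl -i "http://namenode.example.com:50070/webhdfs/v1/user/myuser?op=LISTSTATUS"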
12-22-2017
07:58 PM
1 Kudo
@kotesh banoth, you can set up user-queue mapping in the Capacity Scheduler. This will restrict access to the queues per user. Once this mapping is set, you can assign a priority to the respective queue. In the case mentioned above, the queue related to user1 should have the highest priority; this way applications started in this queue will execute first. User-queue mapping: http://tamastarjanyi.blogspot.com/2015/01/user-based-queue-mapping-for-capacity.html Queue priority setting: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-views/content/setting_yarn_queue_priorities.html
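A minimal sketch of the mapping property in capacity-scheduler.xml (the user and queue names are illustrative):
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:user1:queue1,u:user2:queue2</value>
</property>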
12-22-2017
07:34 PM
1 Kudo
@Amod Kulkarni, this issue is likely related to a mismatch in the Scala version. A few relevant links: https://stackoverflow.com/questions/25089852/what-is-the-reason-for-java-lang-nosuchmethoderror-scala-predef-arrowassoc-upo https://issues.apache.org/jira/browse/SPARK-5483
12-20-2017
07:33 PM
1 Kudo
@Michael Bronson, HDFS in this cluster is in safe mode; that's why the Timeline Server is failing to start. Kindly check the HDFS log to see why the NameNode is in safe mode. You can explicitly turn off safe mode by running "hdfs dfsadmin -safemode leave".
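You can confirm the current state with the standard dfsadmin option before and after turning it off:
hdfs dfsadmin -safemode get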
12-18-2017
09:39 PM
7 Kudos
@Fernando Lopez Bello, you can use the YARN node label feature to achieve your goal. 1) Add a node label such as "spark nodes" on the hosts where you want to keep Spark applications running. 2) Map the "spark nodes" node label to a YARN queue such as "SparkQueue". 3) Run your Spark applications from "SparkQueue". This way you can ensure that Spark applications will run only on the specific hosts you want; a brief sketch of the commands is below. Find a few useful links for the node labels feature here: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/NodeLabel.html https://community.hortonworks.com/articles/72450/node-labels-configuration-on-yarn.html
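A brief sketch of the commands (the host name and queue name are illustrative; the label is written as spark_nodes here since label names should not contain spaces, and the queue path assumes SparkQueue sits directly under root):
yarn rmadmin -addToClusterNodeLabels "spark_nodes(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "worker1.example.com=spark_nodes"
# then, in capacity-scheduler.xml, allow SparkQueue to use the label, for example:
# yarn.scheduler.capacity.root.SparkQueue.accessible-node-labels=spark_nodes
# yarn.scheduler.capacity.root.SparkQueue.default-node-label-expression=spark_nodes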
12-08-2017
08:34 PM
7 Kudos
@yassine sihi, these are two different concepts. HDFS can be deployed in two modes: 1) without HA and 2) with HA. In the non-HA mode, HDFS has a NameNode and a Secondary NameNode. The Secondary NameNode periodically checkpoints the NameNode metadata (it merges the edit log into the fsimage), so a reasonably recent copy of the file system metadata is always available; note that it does not serve requests and does not automatically take over if the NameNode fails. In HA mode, HDFS has two NameNodes: one acts as the active NameNode and the other as the standby NameNode. The standby NameNode keeps track of the active NameNode's activity by applying its edits, and it also performs the periodic checkpointing duties. In case of an active NameNode failure, the standby NameNode automatically takes control and becomes active, so users will not notice the NameNode failure. This way high availability is guaranteed.
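In an HA deployment, you can check which NameNode is currently active with the standard haadmin command (nn1/nn2 are the NameNode IDs defined by dfs.ha.namenodes.<nameservice> in hdfs-site.xml):
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2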
10-14-2017
06:53 PM
1 Kudo
@raouia, you can check the application status using the YARN UI or the YARN CLI. Run "yarn application -status application_1507646811049_0004" to get the status of the application. If the application is finished, you can use the "yarn logs -applicationId application_1507646811049_0004" command to get the logs for this application.
10-10-2017
06:36 PM
1 Kudo
@Nikita Kiselev, you can also use the YARN CLI to figure out the active/standby RM. You can find the RM IDs in yarn-site.xml; look for the yarn.resourcemanager.ha.rm-ids property. <property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property> Run "yarn rmadmin -getServiceState rm1" to find out the state of rm1. It will return active if rm1 is the active RM, or else it will return standby. You can run the same command to check rm2's status too (yarn rmadmin -getServiceState rm2).
10-04-2017
01:54 AM
1 Kudo
It looks like the start operation is trying to create /hdp/apps/2.6.2.0-205/mapreduce/mapreduce.tar.gz. Can you please check if /hdp/apps/2.6.2.0-205/mapreduce/mapreduce.tar.gz is present on HDFS? Ideally the MapReduce tarball should already be present on HDFS.
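To check, and to re-upload the tarball if it is missing (the local source path below is the usual HDP layout and is an assumption; run as the hdfs user):
hdfs dfs -ls /hdp/apps/2.6.2.0-205/mapreduce/mapreduce.tar.gz
hdfs dfs -put /usr/hdp/2.6.2.0-205/hadoop/mapreduce.tar.gz /hdp/apps/2.6.2.0-205/mapreduce/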
09-29-2017
01:33 AM
5 Kudos
@Roberto Sancho, currently YARN does not have a feature to time out an application. YARN-3813 is the Apache JIRA tracking this feature. For now, you will need to write a long-running script which monitors each application's state and kills it if it stays in the ACCEPTED state for too long; a rough sketch is below.
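A rough sketch of such a monitoring script (a sketch only, not a tested implementation; the 600-second threshold and the parsing of the yarn CLI output are assumptions you should verify against your Hadoop version):
#!/bin/bash
# Kill YARN applications that have stayed in ACCEPTED state longer than MAX_AGE_SECS.
MAX_AGE_SECS=600
while true; do
  now_ms=$(($(date +%s) * 1000))
  # Each data line of "yarn application -list" starts with the application id.
  yarn application -list -appStates ACCEPTED 2>/dev/null | awk '$1 ~ /^application_/ {print $1}' | while read -r app_id; do
    # Start-Time in "yarn application -status" is reported in epoch milliseconds.
    start_ms=$(yarn application -status "$app_id" 2>/dev/null | awk -F: '/Start-Time/ {gsub(/[^0-9]/, "", $2); print $2}')
    if [ -n "$start_ms" ] && [ $(( (now_ms - start_ms) / 1000 )) -gt "$MAX_AGE_SECS" ]; then
      echo "Killing $app_id: in ACCEPTED state for more than $MAX_AGE_SECS seconds"
      yarn application -kill "$app_id"
    fi
  done
  sleep 60
done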