Member since: 12-09-2015
Posts: 97
Kudos Received: 51
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1513 | 05-04-2016 06:00 AM |
| | 3257 | 04-11-2016 09:57 AM |
| | 1009 | 04-08-2016 11:30 AM |
06-27-2016
11:27 AM
I have a 22 GB file that is processed by a MapReduce job. The output is a 1 GB JSON file that I store on HDFS. Currently, I do not want to reduce the information in the output file, because it contains valuable information needed for my visualization (drill-down etc.). The problem is that this file is huge to read from HDFS and to use with charting tools on a web page. What should the strategy be here? My first thought is to go for a NoSQL store such as MongoDB or HBase, but I have other choices such as an RDBMS like Oracle. I understand that the choice depends on the nature of the data, but I would like to hear from experienced Hadoop users who might have faced a similar situation.
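One way to frame the NoSQL suggestion: stores such as HBase or MongoDB let the web page fetch only the records a chart needs, instead of reading the whole 1 GB file from HDFS. A minimal Python sketch of that keyed-access idea, assuming (not stated in the post) that the output is newline-delimited JSON; `bucket_records` and the key field are hypothetical names:

```python
import json
import os

def bucket_records(src_path, out_dir, key):
    """Split one large newline-delimited JSON file into per-key files so a
    charting page can fetch only the slice it needs, not the whole file."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        with open(src_path) as src:
            for line in src:
                rec = json.loads(line)
                k = str(rec[key])  # assumed drill-down key, e.g. a region field
                if k not in handles:
                    handles[k] = open(os.path.join(out_dir, k + ".json"), "w")
                handles[k].write(line)
    finally:
        for h in handles.values():
            h.close()
```

A NoSQL store gives you this keyed random access (plus indexing) natively; the sketch only illustrates the access pattern the chart tools would rely on.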
Labels:
- Apache Hadoop
05-10-2016
06:02 AM
@Ajay I have understood it now, but can you tell me how to view the mappers currently running for an application? Which links should I follow?
05-10-2016
03:39 AM
I have a job running in the cluster, but I am unable to see that job through the JobHistory UI. I can only see the job if I execute the command "hadoop job -list" at the Linux command prompt. I have observed that if I go to the ResourceManager UI I see a running application, but I do not see any jobs of that running application through the JobHistory UI. In the ResourceManager UI, I have also observed that the latest application I executed shows "ApplicationMaster" under the "Tracking UI" field, while the rest of the applications show "History" there. Is this the reason why I cannot see the jobs of this application under the JobHistory UI: because it is still associated with the ApplicationMaster?
Labels:
- Apache Ambari
05-10-2016
03:36 AM
@Predrag Minovic Thanks. I understand it now very well. The problem was indeed that the NodeManager was not available on the other two nodes. Also, I made a mistake in my MB calculation, due to which I misunderstood the process.
05-06-2016
11:26 AM
I have a 4-node cluster and am running a MapReduce job on it. The input file is a JSON file of size 1.53 GB. The Mapper task reads a JSON record and manipulates the text. I observed the following after I executed the job:

1) There are 15 Mapper tasks, which is correct (no issues here).
2) Only 1% of the job was processed in 50 minutes, which is very slow.
3) Only 4 mapper tasks are shown running.
4) Two mappers are running on Machine1 and the other two mappers are running on Machine2.
5) Mapper task 1 on Machine1 shows a total of 21627027 bytes read, and the number keeps increasing every few seconds.

Here is what I need to understand:

1) Why do only two nodes have all the mapper tasks running? Why are the other nodes not running any mapper?
2) If there is one mapper per 128 MB file block, why is the mapper task on Machine1 showing 21627027 bytes (21 MB) of data? (Edited: I had mentioned 21120 MB, which was a calculation mistake. The correct figure is 21 MB.)
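The one-mapper-per-128-MB-block expectation can be sanity-checked with the usual split estimate. A rough sketch, assuming a single splittable input file and a 128 MB block size; `num_splits` is a hypothetical helper, and multi-file or compressed inputs change the real count (which may be why 15 mappers appear here):

```python
import math

BLOCK = 128 * 1024 * 1024  # assumed default HDFS block size

def num_splits(file_size_bytes, block_size=BLOCK):
    # One map task per input split; for a single splittable file this is
    # roughly the number of blocks, i.e. ceil(file_size / block_size).
    return math.ceil(file_size_bytes / block_size)

print(num_splits(int(1.53 * 1024**3)))  # prints 13
```

As for point 2 above: the byte counter of a running mapper just grows as the task reads through its split, so seeing 21 MB mid-run on a ~128 MB split is progress, not an anomaly.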
Labels:
- Apache Hadoop
05-05-2016
10:47 AM
I found out what was occupying the non-DFS space: the log files under /var/log/hive. There were around 67 GB of log files! I removed them and the space has been reclaimed. Thanks for your help. (I used the command "du -kscx *", executed inside the log folder, to find the size of each subfolder.)
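The same hunt can be scripted. A small Python analogue of the `du -kscx *` check above, a sketch only: it sums apparent file sizes rather than allocated blocks, so totals can differ slightly from `du`:

```python
import os

def dir_size_kb(path):
    """Rough Python analogue of `du -k` for one directory tree: sum the
    sizes (in KB) of regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total // 1024
```

Running it over each subfolder of /var/log would surface a 67 GB hive log directory immediately.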
05-05-2016
10:21 AM
@Sagar Shimpi I have checked the NameNode UI. I observe that "Non DFS Used" shows 77.15 GB while "DFS Used" shows just 1.25 GB. 77.15 GB is very high compared to the other three nodes. My question is what to do next: how do I free up more space on this node? As for the versions, HDP is 2.4 and Ambari is 2.2.1.1.
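For reference, the figures in the NameNode UI fit together by a simple identity; the 100 GB capacity and 21.6 GB remaining below are assumed illustrative numbers, not values from the post:

```python
def non_dfs_used(configured_capacity, dfs_used, dfs_remaining):
    # Whatever capacity is neither HDFS block data nor free space shows
    # up as "Non DFS Used" (all arguments in the same unit, e.g. GB).
    return configured_capacity - dfs_used - dfs_remaining

# Hypothetical node: 100 GB capacity, 1.25 GB DFS used, 21.6 GB remaining
print(non_dfs_used(100.0, 1.25, 21.6))
```

So a large "Non DFS Used" points at ordinary files on the DataNode's disks (logs, temp files), which is exactly what the /var/log/hive discovery above confirmed.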
05-05-2016
09:50 AM
@Sagar Shimpi Thanks for pointing me to the "Rebalance HDFS" utility. After I clicked Rebalance HDFS, the progress bar quickly ended, reporting success. Shouldn't this be a long procedure, with lots of data being sent from one node to another to balance? How do I know when the process finishes, if it has not ended? After clicking that link, I do not see any change immediately.
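One plausible reason the run ended immediately: the HDFS balancer only moves blocks while some DataNode's DFS utilization deviates from the cluster average by more than the threshold (10% by default), and it ignores non-DFS usage entirely. A toy check of that rule, with made-up utilization figures:

```python
def over_threshold(utilizations_pct, threshold_pct=10.0):
    """Return the DFS utilizations that deviate from the cluster average
    by more than the balancer threshold; empty means nothing to move."""
    avg = sum(utilizations_pct) / len(utilizations_pct)
    return [u for u in utilizations_pct if abs(u - avg) > threshold_pct]

# If DFS usage is tiny and similar everywhere (the big consumer here was
# non-DFS log data), the balancer has nothing to move and exits at once:
print(over_threshold([1.5, 1.2, 1.1, 1.4]))  # prints []
```

That would match this cluster: DFS Used was only ~1.25 GB, so the 77 GB of non-DFS data was invisible to the balancer.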
05-05-2016
09:23 AM
I have a four-node cluster (HDP 2.4). On the Ambari Hosts page, I can see that the space consumption on one of the nodes is very high. First of all, I do not understand the cause of this. I would also like to understand how easy it is to distribute the data evenly across all the nodes, so that every node consumes an equal amount of DFS space.
Labels:
- Apache Hadoop
05-04-2016
06:00 AM
I found the issue and fixed it. The command "hostname -f" on Machine1 was giving a "hostname: Unknown host" error. After I fixed that and added the host again, it succeeded.
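The failing lookup is easy to reproduce outside Hadoop. A small Python check of the same forward resolution that `hostname -f` exercises, useful for verifying every cluster node before adding hosts:

```python
import socket

# Sanity-check that this host's fully qualified name resolves -- the
# lookup that produced "hostname: Unknown host" on Machine1.
fqdn = socket.getfqdn()
try:
    addr = socket.gethostbyname(fqdn)
    print(f"{fqdn} resolves to {addr}")
except socket.gaierror as exc:
    print(f"{fqdn} does not resolve: {exc}")
```

If the except branch fires, fix /etc/hosts or DNS for that node before retrying the Ambari host add.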