Member since: 07-01-2016
Posts: 26
Kudos Received: 5
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4814 | 02-22-2017 03:23 PM
09-19-2016 05:20 PM
2 Kudos
@srinivasa rao When you run "select * from <tablename>", Hive fetches the whole data set from the file as a FetchTask rather than a MapReduce task: it simply dumps the data as-is without doing anything to it, similar to "hadoop dfs -text <filename>".

However, this does not take advantage of true parallelism. For 1 GB it will not make a difference, but imagine a 100 TB table read by a single-threaded task on a cluster with 1000 nodes: a FetchTask is not a good use of parallelism. Tez provides options to split the data set and allow true parallelism; tez.grouping.max-size and tez.grouping.min-size are the split parameters.

Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html

If any of the responses were helpful, please don't forget to vote/accept the answer.
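As a sketch, the split parameters can be set per session before running the query. The byte values and table name below are illustrative placeholders, not tuning recommendations:

```sql
-- Illustrative values only: group splits between roughly 16 MB and 1 GB per task.
SET tez.grouping.min-size=16777216;
SET tez.grouping.max-size=1073741824;
-- mytable is a placeholder; with Tez splits grouped this way, the scan
-- is spread across tasks instead of running as a single FetchTask.
SELECT * FROM mytable;
```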
09-16-2016 05:31 AM
1 Kudo
1) Why does the Secondary NameNode explicitly copy the fsimage from the primary NameNode when it already holds the same copy?

There is no guarantee that the fsimage on the Secondary NameNode is exactly the same as the one on the primary NameNode. During the checkpoint period, corruption, crashes, or data loss may have occurred. It is safer to fetch the latest available fsimage from the primary NameNode and then merge the edit logs into it.

2) When a cluster is first set up, does the primary node have an fsimage, and if so, does it contain any data?

Yes. When a new NameNode is set up in a new cluster, it has an fsimage with no data in it, with a file name like fsimage_000000000, representing zero transactions.

3) Both the primary and Secondary NameNodes seem to maintain all the transaction logs. Is it required to maintain the same logs in both locations? If so, how many old transactions have to be kept, and is there a configuration for this?

By default, HDFS retains edit logs until the transaction count reaches 1 million; edit-log files beyond that threshold are removed.
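For reference, a sketch of the related hdfs-site.xml properties; the values shown are the commonly documented defaults, so verify them against your own Hadoop version:

```xml
<!-- Checkpoint every hour, or every 1 million transactions, whichever comes first. -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
<!-- Extra transactions retained in edit logs beyond what startup strictly needs;
     this is the "1 million" retention mentioned above. -->
<property>
  <name>dfs.namenode.num.extra.edits.retained</name>
  <value>1000000</value>
</property>
```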
08-16-2016 02:30 PM
1 Kudo
@srinivasa rao This is a limitation of Sqoop: sqoop import-all-tables only extracts tables from the default schema of the connecting user ID. The workaround is to import the tables one by one using sqoop import.
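A minimal sketch of the table-by-table workaround. The connect string, credentials, and table names are placeholders, and the `echo` makes it a dry run that only prints the commands; remove it to actually import:

```shell
# Placeholder table list; on a real system you would list the schema's tables.
TABLES="customers orders lineitems"
for t in $TABLES; do
  # Dry run: print the sqoop command for each table instead of executing it.
  echo sqoop import \
    --connect "jdbc:mysql://dbhost/salesdb" \
    --username dbuser -P \
    --table "$t" \
    --warehouse-dir /user/hive/warehouse
done
```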
08-01-2016 02:38 PM
1 Kudo
Hive was essentially a Java library that kicks off MapReduce jobs, so the hive CLI, for example, runs a full "server" inside the client. If you have multiple clients, each of them does its own SQL parsing, optimization, and so on. In Hive 1 there was a Thrift server that acted as a proxy for this, so a Thrift (data serialization framework) client could connect to it instead of doing all the computation locally.

None of that is relevant anymore: HiveServer2 has been the default for many years in all distributions and is a proper database server with concurrency, security, logging, and workload management. The hive CLI is still available, but it will be deprecated soon in favor of Beeline, a command-line client that connects to HiveServer2 through JDBC. This is desirable because the hive CLI punches a lot of holes in central Hive administration. So forget about the HiveServer1 Thrift server and Thrift client.
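For illustration, connecting with Beeline instead of the hive CLI might look like this; the host and user are hypothetical, and the `echo` keeps it a dry run that prints the command rather than connecting:

```shell
# Hypothetical HiveServer2 endpoint; 10000 is the conventional HS2 port.
HS2_URL="jdbc:hive2://hs2-host:10000/default"
# Dry run: print the Beeline invocation that would connect over JDBC.
echo beeline -u "$HS2_URL" -n hiveuser -e "show tables;"
```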
07-22-2016 01:22 PM
Please refer to http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//archive/mapreduce-osdi04.pdf, most importantly the implementation section.
07-26-2016 12:58 PM
Very neatly explained!
07-18-2016 10:27 PM
1 Kudo
Note that 'hdfs fsck / -files -blocks -locations' is a workaround; it can be slow on large clusters, so point fsck at a specific path rather than '/' when you only need one file's blocks. There is no efficient way to query the block locations of a file.
07-14-2016 04:03 PM
1 Kudo
I use this blog often when I forget the data movement between map → reduce:

"The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk."
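The three knobs from that quote, as they would appear in an MRv1-era mapred-site.xml. The values are the commonly cited defaults, so treat this as a reference sketch rather than tuning advice:

```xml
<!-- Fraction of the reducer's heap used to buffer map outputs during the copy phase. -->
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<!-- Usage threshold of that buffer at which an in-memory merge spills to disk. -->
<property>
  <name>mapred.job.shuffle.merge.percent</name>
  <value>0.66</value>
</property>
<!-- Or merge once this many map outputs have accumulated in memory. -->
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>1000</value>
</property>
```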