Member since: 07-01-2016
Posts: 26
Kudos Received: 5
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4814 | 02-22-2017 03:23 PM
09-19-2016 05:20 PM
2 Kudos
@srinivasa rao When you run "select * from <tablename>", Hive fetches the whole data set from the file as a FetchTask rather than a MapReduce task: it simply dumps the data as-is without doing anything to it, similar to "hadoop dfs -text <filename>".

However, this does not take advantage of true parallelism. For 1 GB it will not make a difference, but imagine a 100 TB table read by a single-threaded task on a cluster with 1000 nodes: a FetchTask is not a good use of parallelism. Tez provides options to split the data set and allow true parallelism; tez.grouping.max-size and tez.grouping.min-size are the split parameters.

Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html

If any of the responses were helpful, please don't forget to vote/accept the answer.
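As a sketch, the split parameters can be set per session before running the query. The byte values and table name below are illustrative placeholders, not tuning recommendations:

```sql
-- Illustrative values only: group splits between roughly 16 MB and 1 GB per task.
SET tez.grouping.min-size=16777216;
SET tez.grouping.max-size=1073741824;
-- mytable is a placeholder; with Tez splits grouped this way, the scan
-- is spread across tasks instead of running as a single FetchTask.
SELECT * FROM mytable;
```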
09-16-2016 05:31 AM
1 Kudo
1) Why does the Secondary NameNode explicitly copy the fsimage from the primary NameNode when it already holds the same copy?

There is no guarantee that the fsimage on the Secondary NameNode is exactly the same as the one on the primary NameNode. During the checkpoint period, corruption, crashes, or data loss may have occurred. It is safer to fetch the latest available fsimage from the primary NameNode and then merge the edit logs into it.

2) When a cluster is first set up, does the primary node have an fsimage, and if so, does it contain any data?

Yes. When a new NameNode is set up in a new cluster, it has an fsimage with no data in it, with a file name like fsimage_000000000, representing zero transactions.

3) Both the primary and Secondary NameNodes seem to maintain all the transaction logs. Is it required to maintain the same logs in both locations? If so, how many old transactions have to be kept, and is there a configuration for this?

By default, HDFS retains edit logs until the transaction count reaches 1 million; edit-log files beyond that threshold are removed.
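For reference, a sketch of the related hdfs-site.xml properties; the values shown are the commonly documented defaults, so verify them against your own Hadoop version:

```xml
<!-- Checkpoint every hour, or every 1 million transactions, whichever comes first. -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
<!-- Extra transactions retained in edit logs beyond what startup strictly needs;
     this is the "1 million" retention mentioned above. -->
<property>
  <name>dfs.namenode.num.extra.edits.retained</name>
  <value>1000000</value>
</property>
```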
08-16-2016 02:30 PM
1 Kudo
@srinivasa rao This is a limitation of Sqoop: sqoop import-all-tables only extracts tables from the default schema of the connecting user ID. The workaround is to import the tables one by one using sqoop import.
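A minimal sketch of the table-by-table workaround. The connect string, credentials, and table names are placeholders, and the `echo` makes it a dry run that only prints the commands; remove it to actually import:

```shell
# Placeholder table list; on a real system you would list the schema's tables.
TABLES="customers orders lineitems"
for t in $TABLES; do
  # Dry run: print the sqoop command for each table instead of executing it.
  echo sqoop import \
    --connect "jdbc:mysql://dbhost/salesdb" \
    --username dbuser -P \
    --table "$t" \
    --warehouse-dir /user/hive/warehouse
done
```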
08-01-2016 02:38 PM
1 Kudo
Hive was essentially a Java library that kicks off MapReduce jobs, so the hive CLI, for example, runs a full "server" inside the client. If you have multiple clients, each of them does its own SQL parsing, optimization, and so on. In Hive 1 there was a Thrift server that acted as a proxy for this, so a Thrift (data serialization framework) client could connect to it instead of doing all the computation locally.

None of that is relevant anymore: HiveServer2 has been the default for many years in all distributions and is a proper database server with concurrency, security, logging, and workload management. The hive CLI is still available, but it will be deprecated soon in favor of Beeline, a command-line client that connects to HiveServer2 through JDBC. This is desirable because the hive CLI punches a lot of holes in central Hive administration. So forget about the HiveServer1 Thrift server and Thrift client.
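For illustration, connecting with Beeline instead of the hive CLI might look like this; the host and user are hypothetical, and the `echo` keeps it a dry run that prints the command rather than connecting:

```shell
# Hypothetical HiveServer2 endpoint; 10000 is the conventional HS2 port.
HS2_URL="jdbc:hive2://hs2-host:10000/default"
# Dry run: print the Beeline invocation that would connect over JDBC.
echo beeline -u "$HS2_URL" -n hiveuser -e "show tables;"
```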
07-22-2016 01:22 PM
Please refer to http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//archive/mapreduce-osdi04.pdf, most importantly the implementation section.
07-26-2016 12:58 PM
Very neatly explained!
07-18-2016 10:27 PM
1 Kudo
Note that 'hdfs fsck / -files -blocks -locations' is a workaround; it can be slow on large clusters, so point fsck at a specific path rather than '/' when you only need one file's blocks. There is no efficient way to query the block locations of a file.
07-14-2016 04:03 PM
1 Kudo
I use this blog often when I forget the data movement between map → reduce:

"The map outputs are copied to the reduce task JVM's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent) or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk."
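The three knobs from that quote, as they would appear in an MRv1-era mapred-site.xml. The values are the commonly cited defaults, so treat this as a reference sketch rather than tuning advice:

```xml
<!-- Fraction of the reducer's heap used to buffer map outputs during the copy phase. -->
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<!-- Usage threshold of that buffer at which an in-memory merge spills to disk. -->
<property>
  <name>mapred.job.shuffle.merge.percent</name>
  <value>0.66</value>
</property>
<!-- Or merge once this many map outputs have accumulated in memory. -->
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>1000</value>
</property>
```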