Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Views | Posted
---|---
5336 | 08-12-2016 01:02 PM
2183 | 08-08-2016 10:00 AM
2565 | 08-03-2016 04:44 PM
5434 | 08-03-2016 02:53 PM
1396 | 08-01-2016 02:38 PM
08-15-2016
08:39 AM
LinkedIn? There is only one Benjamin Leonhardi there.
08-12-2016
10:10 PM
@jovan karamacoski I think you might want to contact us for a services engagement. I strongly suspect that what you want to achieve and what you are asking about are not compatible.

On Hadoop, normally whole files are hot, not specific blocks, and files are by definition widely distributed across nodes. So moving specific "hot" drives will not make you happy. Also, especially for writes, having some nodes with more network bandwidth than others doesn't sound like a winning combination: the slow nodes will be a bottleneck, since it's all linked together. That's how HDFS works.

If you want some files to be faster, you might want to look at HDFS storage tiering. Using that you could put "hot" data on fast storage like SSDs. You could also look at node labels to put specific applications on fast nodes with lots of CPU etc. But moving single drives? That will not make you happy. By definition HDFS will not care: one balancer run later and all your careful planning is gone.

Oh, and lastly, there is no online move of data nodes. You always need to stop a data node, change the storage layout and start it again. It will then send the updated block report to the Namenode.
08-12-2016
07:05 PM
1 Kudo
Just use doAs=false, make sure only the hive user can read the warehouse folder, and you are done. The Hive CLI can still start but cannot access anything.
08-12-2016
01:02 PM
1 Kudo
"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??"

InputSplits are simply the work assignments of the mappers. Say you have an input folder with two files:

/in/file1
/in/file2

Assume file1 has 200MB and file2 has 100MB (default block size 128MB). Then the InputFormat will by default generate 3 input splits (on the appmaster it's a function of the InputFormat):

InputSplit1: /in/file1:0:128000000
InputSplit2: /in/file1:128000001:200000000
InputSplit3: /in/file2:0:100000000

(By default one split = one block, but the InputFormat COULD do whatever it wants. It does this for example for small files, where it uses MultiFileInputSplits which span multiple files.)

"And how map works if the split spans over data blocks in two different data nodes??"

The mapper comes up (normally local to the block) and starts reading the file at the offset provided. HDFS is by definition global, and if you read non-local parts of a file they will be read over the network, but local is obviously more efficient. The mapper COULD read anything; the HDFS API makes it transparent. So NORMALLY the InputSplit generation is done in a way that this does not happen, so that data can be read locally, but that is not a necessary precondition. Often maps are non-local (you can see that in the resource manager) and then the mapper simply reads the data over the network. The API call is identical. Reading an HDFS file in Java is the same as reading a local file; it's just an extension to the Java FileSystem API.
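To make the split arithmetic concrete, here is a minimal Python sketch of the "one split per block" default described above. It is a toy model, not Hadoop's actual FileInputFormat code: the function name `compute_splits` is made up for illustration, it uses exact byte counts (128MB = 128 * 1024 * 1024) rather than the rounded numbers in the post, and it prints splits as path:start:length instead of path:start:end.

```python
from typing import Dict, List, Tuple

MB = 1024 * 1024


def compute_splits(file_sizes: Dict[str, int],
                   block_size: int = 128 * MB) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: return (path, start_offset, length) per block-sized chunk."""
    splits = []
    for path, size in file_sizes.items():
        offset = 0
        while offset < size:
            # The last chunk of a file is usually shorter than a full block.
            length = min(block_size, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits


# The example from the post: file1 has 200MB, file2 has 100MB.
splits = compute_splits({"/in/file1": 200 * MB, "/in/file2": 100 * MB})
for path, start, length in splits:
    print(f"{path}:{start}:{length}")
```

With the two files from the post this yields three splits: two for file1 (128MB and 72MB) and one for file2 (100MB).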
08-12-2016
09:24 AM
2 Kudos
Hello jovan, yes, you can simply move a folder. Data nodes are beautifully simple that way. We just did it on our cluster: stop HDFS, copy the folder to a new location and change the location in the Ambari configuration. Just try it with a single drive on a single node (using Ambari groups); you can run hadoop fsck / to check for under-replicated blocks after the test. A single drive will not lead to inconsistencies in any case. In general, data nodes do not care where the blocks are, as long as they still find the files with the right block id in the data folders. You can theoretically do it on a running cluster, but you need to use Ambari groups, do it one server at a time, and make sure you do it quickly so the Namenode doesn't start to schedule a large number of replica additions because of the missing data node (HDFS waits a bit before it fixes under-replication in case a data node just reboots).
08-11-2016
06:58 PM
Not sure what you mean. Do you want to know WHY blocks get under-replicated? There are different possibilities for a block to vanish, but by and large it's simple:

a) The block replica doesn't get written in the first place. This happens on a network or node failure during a write. HDFS will still report the write of a block as successful as long as one of the block replica writes was successful. So if, for example, the third designated datanode dies during the write process, the write is still successful but the block will be under-replicated. The write process doesn't care; it depends on the Namenode to schedule a copy later on.

b) The block replicas get deleted later. That can have lots of different reasons: a node dies, a drive dies, you delete a block file on the drive. Blocks after all are simple bog-standard Linux files with a name blk_xxxx, where xxxx is the block id. They can also get corrupted (HDFS does CRC checks regularly, and blocks that are corrupted will be replaced with a healthy copy). And many more ...

So perhaps you should be a bit more specific with your question?
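Case a) can be pictured with a tiny Python sketch. This is only a toy model of the rule described above, not HDFS client code: the function `write_block` and its parameters are invented for illustration, and the minimum of one successful replica is taken from the description in this post.

```python
from typing import List, Tuple


def write_block(replica_ok: List[bool],
                min_replicas: int = 1,
                target_replication: int = 3) -> Tuple[bool, bool]:
    """Hypothetical model: return (write_successful, block_under_replicated)."""
    written = sum(replica_ok)  # number of datanodes that stored the replica
    success = written >= min_replicas
    # A successful write with fewer replicas than the target is left for the
    # Namenode to repair later.
    under_replicated = success and written < target_replication
    return success, under_replicated


# Third datanode dies during the write: still a success, but under-replicated.
print(write_block([True, True, False]))
```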
08-11-2016
12:53 PM
1 Kudo
The namenode has a list of all the files, blocks and block replicas in memory: a gigantic hashtable. Datanodes send block reports to it to give it an overview of all the blocks in the system. Periodically the namenode checks whether all blocks have the desired replication level. If not, it schedules either block deletions (if the replication level is too high, which can happen if a node crashed and was re-added to the cluster) or block copies.
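As a rough illustration of that periodic check, here is a small Python sketch. It is a toy model under stated assumptions, not Namenode code: `plan_actions`, the block ids and the datanode names are all hypothetical, and block reports are modelled as a plain dict from datanode name to the set of block ids it holds.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def plan_actions(expected_blocks: Iterable[str],
                 block_reports: Dict[str, Set[str]],
                 target: int = 3) -> Dict[str, int]:
    """Hypothetical helper: map block id to replicas to add (>0) or delete (<0)."""
    counts: Counter = Counter()
    for blocks in block_reports.values():
        counts.update(blocks)  # each datanode contributes one replica per block
    actions = {}
    for block in expected_blocks:
        delta = target - counts[block]  # Counter returns 0 for unreported blocks
        if delta != 0:
            actions[block] = delta
    return actions


reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},
    "dn4": {"blk_1"},  # e.g. a crashed node that was re-added to the cluster
}
actions = plan_actions(["blk_1", "blk_2"], reports)
print(actions)
```

Here blk_1 has four replicas, so one deletion is scheduled, and blk_2 has only two, so one copy is scheduled.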
08-09-2016
10:39 AM
No, the expunge should happen immediately, although HDFS may take a bit until the datanodes actually get around to deleting the files; it shouldn't take long. So expunge doesn't help? Weird 🙂
08-09-2016
09:26 AM
2 Kudos
You see that line:
16/08/09 09:16:13 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.

By default HDFS uses a trash. You can bypass it with hadoop fs -rm -skipTrash or just empty the trash with hadoop fs -expunge
08-08-2016
10:00 AM
1 Kudo
Gopal and I gave a couple of tips in here to increase the parallelism (since Hive is normally not tuned for cartesian joins and creates too few mappers): https://community.hortonworks.com/questions/44749/hive-query-running-on-tez-contains-a-mapper-that-h.html#comment-45388

Apart from that, my second point still holds: you should add some pre-filtering to reduce the number of points you need to compare. There are a ton of different ways to do this: https://en.wikipedia.org/wiki/Spatial_database#Spatial_index

For example, you can put the points into grid cells and make sure that a data point in one grid cell cannot be closer than your max distance to any point of another, non-neighbouring grid cell.
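The grid idea can be sketched in a few lines of Python. This is a minimal illustration and not the Hive query from the thread: it assumes 2D points with plain Euclidean distance (not geographic coordinates), the function name `close_pairs` is made up, and the cell size is simply set to the max distance so that only the 3x3 neighbourhood of a cell has to be checked.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def close_pairs(a: List[Point], b: List[Point],
                max_dist: float) -> List[Tuple[Point, Point]]:
    """Hypothetical helper: all (p, q) with p in a, q in b, distance <= max_dist."""

    def cell(p: Point) -> Tuple[int, int]:
        # Cell size equals max_dist, so close points are at most one cell apart.
        return (math.floor(p[0] / max_dist), math.floor(p[1] / max_dist))

    grid: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for q in b:
        grid[cell(q)].append(q)

    pairs = []
    for p in a:
        cx, cy = cell(p)
        # Only the cell of p and its 8 neighbours can contain close points.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for q in grid.get((cx + dx, cy + dy), ()):
                if math.hypot(p[0] - q[0], p[1] - q[1]) <= max_dist:
                    pairs.append((p, q))
    return pairs


a = [(0.0, 0.0), (5.0, 5.0), (10.2, 0.1)]
b = [(0.5, 0.5), (5.9, 5.0), (20.0, 20.0), (9.5, 0.0)]
print(close_pairs(a, b, max_dist=1.0))
```

Instead of comparing every point of one set with every point of the other, each point is only compared with the points in its own and the neighbouring cells, which is the kind of pre-filter that cuts down a cartesian join.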