Member since
02-15-2019
9
Posts
1
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1527 | 02-07-2019 04:17 PM
02-02-2021
02:35 AM
You have to call hsync with the SyncFlag.UPDATE_LENGTH argument, so that the NameNode also updates the file length it has on record.
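A minimal sketch of that call, assuming the stream really comes from HDFS so the cast to HdfsDataOutputStream is valid (the path and payload are placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class HsyncUpdateLength {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at HDFS, so the stream created
        // below is really an HdfsDataOutputStream and the cast is valid.
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hsync-example.dat"); // placeholder path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("some payload".getBytes(StandardCharsets.UTF_8));
            // hsync() alone persists the bytes on the DataNodes; the
            // UPDATE_LENGTH flag additionally updates the file length
            // recorded on the NameNode, so it matches getPos().
            ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
            System.out.println("getPos()        = " + out.getPos());
            System.out.println("NameNode length = " + fs.getFileStatus(path).getLen());
        }
    }
}
```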
02-01-2021
11:38 AM
I have an application that uses FSDataOutputStream to write data to HDFS. To flush the data I call FSDataOutputStream's hflush function, and to obtain the number of bytes written I use its getPos function. For some reason, after hflush has been called, getPos returns the wrong file size most of the time (sometimes it is correct). My understanding is that when I call hflush and then call getPos, the file size in HDFS should be equal (in bytes) to what getPos returns, but getPos always returns something greater, as though half of the file is still stuck in some buffer and hasn't reached a physical disk.

I read about FSDataOutputStream's hsync function and started using it instead of hflush, because it guarantees that the data will not be buffered and will be written to disk. But the problem still persists. It is much rarer now, but about 10% of the time, when I call hsync and then getPos, the file size in HDFS is less than what getPos returns. Why is this happening, and how can I synchronize getPos with hsync?
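Here is a minimal sketch of the pattern I'm describing (the path and payload are placeholders):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushGetPosRepro {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hflush-repro.dat"); // placeholder path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("some payload".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // makes the data visible to new readers ...
            long pos = out.getPos(); // bytes written through this stream
            // ... but the length recorded on the NameNode is only updated
            // when the block completes, so it often lags behind getPos().
            long len = fs.getFileStatus(path).getLen();
            System.out.println("getPos=" + pos + ", NameNode length=" + len);
        }
    }
}
```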
Labels:
- HDFS
02-16-2019
08:34 PM
Here's the answer: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/td-p/86549
02-16-2019
08:34 PM
Hey guys, I have a 1.5 GB dataset and I am trying to write it into an external partitioned Hive table. I don't know what's going on, but it fails with "could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation." Here are the exceptions in the NameNode's log:

java.io.IOException: File /user/maria_dev/data2/.hive-staging_hive_2019-02-08_11-05-21_487_8982509445885981287-1/_task_tmp.-ext-10000/cityid=219/_tmp.000013_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3372)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3296)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
and:

2019-02-08 11:05:45,906 DEBUG net.NetworkTopology (NetworkTopology.java:chooseRandom(796)) - chooseRandom returning null
2019-02-08 11:05:45,906 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseLocalRack(547)) - Failed to choose from local rack (location = /default-rack); the second replica is not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:701)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:622)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:529)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:489)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:341)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:216)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:113)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:128)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1710)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3372)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3296)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
I don't understand what's happening. I have 70 GB of free space on HDFS, and the dataset is pretty small. What could be going wrong? Inserting into a non-partitioned table works, but inserting into a partitioned table doesn't. Here's my code for inserting:

insert into table partbrowserdata partition(cityid)
select /*column names omitted*/
from browserdata;

I tried this on all Hive execution engines: mr and tez (and also spark on my Cloudera cluster, where it fails too). By the way, the number of partitions it creates is about 300; maybe that's too much? I tried changing settings like hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode, but that did not help. I should also mention that I tried this on newly installed Hortonworks sandboxes (with 20 GB of RAM and 6 processor cores), versions 2.6 and 3.0, as well as on a Cloudera cluster, and it didn't work anywhere. It did work on MapR, though, probably because MapR uses a different file system? Please help.
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez
02-15-2019
08:03 PM
@Harsh J you are a genius! Thanks a lot!
02-15-2019
05:39 PM
1 Kudo
Hey guys, I have already asked this on multiple forums but never got a reply, so I thought I might get one here. I have a dataset of about 1 GB, and it has a "cityid" column with 324 unique values, so after partitioning I should get 324 folders in HDFS. But whenever I partition, it fails; you can see the exception messages here: https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html It's definitely an HDFS issue, because everything worked on MapR. What could possibly be the problem? By the way, I tried this on fresh installs of Hortonworks and Cloudera with default settings, so nothing was compromised. If you need any more details, please ask. Could this be a setup issue? Maybe I need to increase memory somewhere in HDFS?
Labels:
- Apache Hive
- HDFS
02-08-2019
11:08 AM
Hey, thanks so much!
02-07-2019
08:33 PM
In the picture attached to this post you can see my current log level value. It does not work: in /var/log/hadoop/hdfs/hadoop-hdfs-namenode-sandbox-hdp.hortonworks.com.log I can only see INFO and WARN messages.
Labels:
- Apache Hadoop
02-07-2019
04:17 PM
Are you familiar with user defined functions?