Member since
02-15-2019
9
Posts
1
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1527 | 02-07-2019 04:17 PM
02-02-2021
02:35 AM
You have to call hsync with the SyncFlag.UPDATE_LENGTH argument, so that the NameNode also updates the file length it has on record.
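A minimal sketch of that call, assuming the stream really comes from HDFS so the cast to HdfsDataOutputStream is valid (the path and payload are placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class HsyncUpdateLength {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at HDFS, so the stream created
        // below is really an HdfsDataOutputStream and the cast is valid.
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hsync-example.dat"); // placeholder path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("some payload".getBytes(StandardCharsets.UTF_8));
            // hsync() alone persists the bytes on the DataNodes; the
            // UPDATE_LENGTH flag additionally updates the file length
            // recorded on the NameNode, so it matches getPos().
            ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
            System.out.println("getPos()        = " + out.getPos());
            System.out.println("NameNode length = " + fs.getFileStatus(path).getLen());
        }
    }
}
```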
02-01-2021
11:38 AM
I have an application that uses FSDataOutputStream to write data to HDFS. To flush the data I call FSDataOutputStream's hflush function, and to obtain the number of bytes written I use its getPos function. For some reason, after hflush has been called, getPos returns the wrong file size most of the time (sometimes it is correct). My understanding is that when I call hflush and then call getPos, the file size in HDFS should be equal (in bytes) to what getPos returns, but getPos always returns something greater, as though half of the file is still stuck in some buffer and hasn't reached a physical disk.

I read about FSDataOutputStream's hsync function and started using it instead of hflush, because it guarantees that the data will not be buffered and will be written to disk. But the problem still persists. It is much rarer now, but about 10% of the time, when I call hsync and then getPos, the file size in HDFS is less than what getPos returns. Why is this happening, and how can I synchronize getPos with hsync?
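Here is a minimal sketch of the pattern I'm describing (the path and payload are placeholders):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushGetPosRepro {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hflush-repro.dat"); // placeholder path
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("some payload".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // makes the data visible to new readers ...
            long pos = out.getPos(); // bytes written through this stream
            // ... but the length recorded on the NameNode is only updated
            // when the block completes, so it often lags behind getPos().
            long len = fs.getFileStatus(path).getLen();
            System.out.println("getPos=" + pos + ", NameNode length=" + len);
        }
    }
}
```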
Labels:
- HDFS
02-16-2019
08:34 PM
Here's the answer: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/Why-can-t-I-partition-a-1-gigabyte-dataset-into-300/td-p/86549
02-16-2019
08:34 PM
Hey guys, I have a 1.5 GB dataset and I am trying to write it into an external partitioned Hive table. I don't know what's going on, but it fails with "could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation." Here are the exceptions in the NameNode's log:

java.io.IOException: File /user/maria_dev/data2/.hive-staging_hive_2019-02-08_11-05-21_487_8982509445885981287-1/_task_tmp.-ext-10000/cityid=219/_tmp.000013_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3372)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3296)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
and:

2019-02-08 11:05:45,906 DEBUG net.NetworkTopology (NetworkTopology.java:chooseRandom(796)) - chooseRandom returning null
2019-02-08 11:05:45,906 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseLocalRack(547)) - Failed to choose from local rack (location = /default-rack); the second replica is not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:701)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:622)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:529)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:489)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:341)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:216)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:113)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:128)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1710)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3372)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3296)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
I don't understand what's happening. I have 70 GB of free space on HDFS, and the dataset is pretty small. What could be going wrong? Inserting into a non-partitioned table works, but inserting into a partitioned table doesn't. Here's my code for inserting:

insert into table partbrowserdata partition(cityid)
select /*column names omitted*/
from browserdata;

I tried this on all Hive execution engines: mr and tez (and also spark on my Cloudera cluster, where it fails too). By the way, the number of partitions it creates is about 300; maybe that's too much? I tried changing settings like hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode, but that did not help. I should also mention that I tried this on newly installed Hortonworks sandboxes (with 20 GB of RAM and 6 processor cores), versions 2.6 and 3.0, as well as on a Cloudera cluster, and it didn't work anywhere. It did work on MapR, though, probably because MapR uses a different file system? Please help.
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez
02-15-2019
08:03 PM
@Harsh J you are a genius! Thanks a lot!
02-15-2019
05:39 PM
1 Kudo
Hey guys, I have already asked this on multiple forums but never got a reply, so I thought I might get one here. I have a dataset of about 1 GB, and it has a "cityid" column with 324 unique values, so after partitioning I should get 324 folders in HDFS. But whenever I partition, it fails; you can see the exception messages here: https://community.hortonworks.com/questions/238893/notenoughreplicasexception-when-writing-into-a-par.html It's definitely an HDFS issue, because everything worked on MapR. What could possibly be the problem? By the way, I tried this on fresh installs of Hortonworks and Cloudera with default settings, so nothing was compromised. If you need any more details, please ask. Could this be a setup issue? Maybe I need to increase memory somewhere in HDFS?
Labels:
- Apache Hive
- HDFS
02-08-2019
11:08 AM
Hey, thanks so much!
02-07-2019
08:33 PM
In the picture attached to this post you can see my current log level value. It does not work: in /var/log/hadoop/hdfs/hadoop-hdfs-namenode-sandbox-hdp.hortonworks.com.log I can only see INFO and WARN messages.
Labels:
- Apache Hadoop
02-07-2019
04:17 PM
Are you familiar with user defined functions?