Support Questions
Find answers, ask questions, and share your expertise

How to merge data in hdfs while using incremental import


Expert Contributor

I am using R to import data. Now, let's say I do it incrementally.

I am using the rhdfs package, so I can write to HDFS as follows:

writeToHDFS <- function(fileName){
  hdfs.init()
  # write the R object to a local CSV file first
  write.csv(get(fileName), paste(fileName, "csv", sep="."))
  # copy the local file into HDFS (this overwrites any existing file)
  hdfs.put(paste(fileName, "csv", sep="."), "/user/rstudio/")
}

Now, if a file with that name already exists, I want the new data appended to it rather than the file being overwritten, which is apparently what hdfs.put does. As a result, I end up with only the newly imported data instead of the new data appended to the old.

Is there a way to do it in R?

I then will keep this file mapped to a hive table.
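One possible sketch, assuming the hadoop client is on the PATH of the machine running R and the cluster permits appends: shell out from R to hadoop fs -appendToFile, which appends a local file to an existing HDFS file instead of replacing it. The function name appendToHDFS is illustrative, not part of rhdfs.

```r
# Sketch only: appends instead of overwriting by calling the HDFS CLI.
# Assumes the `hadoop` client is on PATH and the cluster allows appends.
appendToHDFS <- function(fileName){
  localFile <- paste(fileName, "csv", sep=".")
  # write the R object to a local CSV first
  write.csv(get(fileName), localFile, row.names = FALSE)
  target <- paste("/user/rstudio/", localFile, sep = "")
  # -appendToFile appends the local file to the HDFS target,
  # creating the target if it does not yet exist
  system(paste("hadoop fs -appendToFile", localFile, target))
}
```

Note that write.csv emits a header row every time, so repeated appends would interleave header lines into the data; for the incremental runs you would likely want to write the deltas without headers (e.g. write.table with col.names = FALSE) so the Hive table sees clean rows.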

1 REPLY 1

Re: How to merge data in hdfs while using incremental import

Explorer

One idea is to invoke a system command and run a script to merge the files together. Consider saving off the previous deltas in case you need to recover them later.

hadoop fs -text *_fileName.csv | hadoop fs -put - targetFilename.csv
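Expanding on that one-liner, a fuller merge cycle might also archive the old deltas rather than losing them. A sketch, assuming a local hadoop client and illustrative paths (the /user/rstudio locations and file names below are placeholders, not from the thread):

```shell
# 1. Merge all delta files for fileName into a single local file
hadoop fs -getmerge /user/rstudio/*_fileName.csv merged_fileName.csv

# 2. Archive the old deltas instead of deleting them outright,
#    so they can be recovered if the merge goes wrong
hadoop fs -mkdir -p /user/rstudio/archive
hadoop fs -mv /user/rstudio/*_fileName.csv /user/rstudio/archive/

# 3. Upload the merged file as the new target (-f overwrites it)
hadoop fs -put -f merged_fileName.csv /user/rstudio/targetFilename.csv
```

Since the merged file backs a Hive table, doing the replace as a single -put keeps the window during which the table sees partial data as small as possible.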
