What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files?

I've used the PutHDFS processor as I've started to understand how to deal with big data environments.

Up until now I've been putting very small files into HDFS, which seems to be bad practice architecturally. The HDFS block size defaults to 128 MB, and the Hadoop community recommendation seems to be that applications writing to HDFS should produce files that are gigabytes, or even terabytes, in size.
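To put rough numbers on why small files hurt, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb that each file and each block costs on the order of 150 bytes of NameNode heap; that figure and the function names are assumptions for illustration, not measurements from my cluster.

```python
BLOCK_SIZE = 128 * 1024 ** 2   # HDFS default block size (128 MB)
BYTES_PER_OBJECT = 150         # ASSUMED rule-of-thumb heap cost per file or block object


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large values."""
    return -(-a // b)


def namenode_objects(total_bytes, file_bytes, block_size=BLOCK_SIZE):
    """Return (files, blocks) needed to hold total_bytes in files of file_bytes each."""
    files = ceil_div(total_bytes, file_bytes)
    blocks_per_file = ceil_div(file_bytes, block_size)
    return files, files * blocks_per_file


def namenode_heap_estimate(total_bytes, file_bytes):
    """Rough NameNode heap (in bytes) needed to track the files and their blocks."""
    files, blocks = namenode_objects(total_bytes, file_bytes)
    return (files + blocks) * BYTES_PER_OBJECT


one_tib = 1024 ** 4
small = namenode_heap_estimate(one_tib, 10 * 1024)   # 1 TiB as 10 KiB files
large = namenode_heap_estimate(one_tib, 1024 ** 3)   # 1 TiB as 1 GiB files
print(f"10 KiB files: ~{small / 1024 ** 3:.1f} GiB of heap; 1 GiB files: ~{large / 1024 ** 2:.1f} MiB")
```

Under that assumption, the same terabyte costs the NameNode several orders of magnitude more memory when it arrives as tiny files.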

I'm trying to understand how to do this with NiFi. Part of my concern is for the data analysts: what is the best way to logically structure files so that they are appropriate for HDFS?

Currently the files that I write contain small JSON objects or lists. I use MergeRecord to intentionally make the files I write larger. However, my JSON objects accumulate fast, potentially thousands of records per second.

For the Big Data/NiFi experts, I'd appreciate any thoughts on the best way to use NiFi to stream large data objects into HDFS.


Super Guru

@David Sargrad


You can also consider using the MergeContent processor to create bigger files (by setting the Minimum Group Size and Maximum Group Size properties to, for example, 1 GB) and then store these files in an HDFS directory.
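To illustrate the idea behind size-based merging, here is a minimal Python sketch. It is not NiFi code and not how MergeContent is implemented; the function name `merge_by_size` and the newline delimiter are assumptions chosen for illustration. A bin is emitted once it reaches the minimum size and is never allowed to exceed the maximum, and whatever is left at the end is flushed, roughly what Max Bin Age does for a bin that never fills up.

```python
def merge_by_size(records, min_bytes, max_bytes):
    """Group byte-string records into newline-delimited bundles.

    A bundle is emitted as soon as it reaches min_bytes, and a record that
    would push the bundle past max_bytes starts a new bundle instead.
    """
    current, size = [], 0
    for rec in records:
        rec_size = len(rec) + 1  # +1 for the newline delimiter
        if current and size + rec_size > max_bytes:
            yield b"".join(r + b"\n" for r in current)
            current, size = [], 0
        current.append(rec)
        size += rec_size
        if size >= min_bytes:
            yield b"".join(r + b"\n" for r in current)
            current, size = [], 0
    if current:
        # Leftover records that never reached min_bytes are flushed at the end.
        yield b"".join(r + b"\n" for r in current)


# Hypothetical input: 1000 tiny JSON records merged into bundles of 1 to 2 KB.
records = [b'{"id": %d}' % i for i in range(1000)]
merged = list(merge_by_size(records, min_bytes=1024, max_bytes=2048))
print(len(records), "records ->", len(merged), "merged files")
```

In a real flow you would set the thresholds far higher (hundreds of MB to 1 GB) so that each merged file spans at least one HDFS block.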



If you have structured JSON files, create a Hive table on top of these files and run

insert overwrite table <same_hive_table> select * from <same_hive_table>;

With this method, Hive takes an exclusive lock on the directory until the overwrite has completed.


Option 3: Hadoop Streaming jar:

Store all files in one daily directory and then run a merge as a daily job (at midnight, for example) by using hadoop-streaming.jar as described in this link.
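As a rough picture of what such a daily merge does, here is a minimal sketch that uses a local directory as a stand-in for the daily HDFS directory. It is not the hadoop-streaming.jar job itself; the function name `compact_daily_dir`, the output file name, and the assumption that every small file ends with a newline are all hypothetical.

```python
import shutil
from pathlib import Path


def compact_daily_dir(day_dir, merged_name="merged.json"):
    """Concatenate every small file in day_dir into one file, then delete the parts.

    ASSUMPTION: each part is newline-delimited JSON and ends with a newline,
    so plain concatenation keeps one record per line.
    """
    day_dir = Path(day_dir)
    target = day_dir / merged_name
    parts = sorted(p for p in day_dir.iterdir()
                   if p.is_file() and p.name != merged_name)
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    # Remove the small files only after the merged file has been written.
    for part in parts:
        part.unlink()
    return target
```

On a cluster the same steps would run as a distributed job against HDFS paths, but the outcome is the same: one large file per day instead of thousands of small ones.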


Option 4: Hive using ORC files:

If you are planning to store the files as ORC, convert the JSON data to ORC format; you can then use the concatenate feature of ORC to create a big file by merging the small ORC files.


Option 5: Hive transactional tables:

With Hive transactional tables, you can insert data using the PutHiveStreaming processor (convert the JSON data to Avro and feed it to PutHiveStreaming). Based on the number of buckets you created in the transactional table, Hive stores all your data in those buckets (that is, in that many files in HDFS).

-> If you are reading this data from Spark, make sure your Spark version is able to read Hive transactional tables.


If you find any other efficient way to do this task, please mention the method so that we can learn from your experience. 🙂

Hi @Shu. Thank you very much for your thoughts. This is the kind of feedback that I was hoping for. I'll absolutely do my best to understand your recommendation. It sounds like I am not completely off-base in the way that I hope to use HDFS. It does sound like you are confirming that I must figure out how to accumulate large files, prior to driving them into HDFS. I will look at the tools and methods that you suggest.

Thanks for your insights

Super Guru

@David Sargrad

You can keep a date variable (sysdate, current_date, etc.) in your Spark job and then use that variable to build the HDFS directory path to read dynamically.
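A minimal sketch of that idea in Python, assuming a layout with one directory per day named yyyy-MM-dd under a base path. The base path, the directory naming, and the helper names are hypothetical; adjust them to however your flow actually names its daily directories.

```python
from datetime import date, timedelta


def daily_path(base_dir, day=None):
    """Build the directory path for one day, e.g. <base_dir>/2018-07-04."""
    day = day or date.today()
    return f"{base_dir.rstrip('/')}/{day:%Y-%m-%d}"


def yesterday_path(base_dir, today=None):
    """Path of the previous day's directory, useful for a job that runs after midnight."""
    today = today or date.today()
    return daily_path(base_dir, today - timedelta(days=1))


# Hypothetical usage inside a PySpark job (requires an existing SparkSession `spark`):
#   df = spark.read.json(yesterday_path("hdfs:///data/events"))
print(yesterday_path("hdfs:///data/events"))
```

The benefit is that the same job can be scheduled every day without edits: it computes the directory to read from the run date instead of having a hard-coded path.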

Hi @Shu. Could you please explain what sysdate, current_date, etc. would do for me in the Spark job? I don't fully understand how to use them or what benefits this technique would offer.

@Shu I like your idea of creating daily archives (Option 3 above). How do I ensure that the Spark jobs I create to process those daily files run on the DataNodes where the files are stored? Does YARN do this by default? I've not yet used YARN; I've only used HDFS. I am hoping to eventually use k8s (Kubernetes).