
What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files


New Contributor

I've used the PutHDFS processor as I've started to understand how to deal with big data environments.


Up until now I've been putting very small files into HDFS. This seems to be architecturally bad practice. The HDFS block size defaults to 128 MB, and the Hadoop community recommendation seems to be that applications writing to HDFS should write files that are gigabytes, or even terabytes, in size.


I'm trying to understand how to do this with NiFi. Part of my concern is for the data analysts: what is the best way to logically structure files so that they are appropriate for HDFS?


Currently the files that I am writing contain small JSON objects or lists. I use MergeRecord to intentionally make the files I write larger. However, my JSON objects accumulate fast: potentially thousands of JSON records per second.


For the Big Data/NiFi experts, I'd appreciate any thoughts on the best way to use NiFi to stream large data objects into HDFS.


Re: What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files

Super Guru

@David Sargrad

Option 1: NiFi

You can use the MergeContent processor to create bigger files (for example, by setting the Minimum Group Size and Maximum Group Size properties to around 1 GB) and then store the merged files in an HDFS directory.
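
To make the size-based merging idea concrete, here is a minimal Python sketch of accumulating records until a minimum batch size is reached. It only illustrates the concept and is not how NiFi implements MergeContent; the function name `merge_by_size` and the newline-delimited output are assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def merge_by_size(records: Iterable[bytes], min_bytes: int) -> Iterator[bytes]:
    """Yield newline-delimited batches of at least min_bytes each.

    Hypothetical helper: it mimics the idea of a minimum group size.
    The last batch may be smaller, because leftover records are flushed.
    """
    batch: List[bytes] = []
    size = 0
    for rec in records:
        batch.append(rec)
        size += len(rec) + 1  # +1 for the newline written after each record
        if size >= min_bytes:
            yield b"\n".join(batch) + b"\n"
            batch, size = [], 0
    if batch:
        # Flush whatever is left so that no record is dropped.
        yield b"\n".join(batch) + b"\n"
```

In a real flow the threshold would be closer to the HDFS block size or larger, and each yielded batch would correspond to one file written to HDFS.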

-

Option 2: Hive

If you have structured JSON files, create a Hive table on top of these files and run

insert overwrite <same_hive_table> select * from <same_hive_table>;

With this method, Hive takes an exclusive lock on the directory until the overwrite completes.

-

Option 3: Hadoop Streaming jar

Store all files in one daily directory, then run a merge as a scheduled daily job (at midnight, for example) using hadoop-streaming.jar, as described in this link.
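
One thing to decide for such a daily merge is how many output files to produce. In MapReduce each reduce task writes one output file, so the reducer count (the `mapreduce.job.reduces` property) controls the number of merged files. A rough sizing sketch, assuming you know the total size of the day's directory and have picked a target file size; the function name and the 1 GiB target are only illustrative:

```python
def reducers_for_target(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many output files (reducers) give files of about target size.

    Hypothetical helper: integer ceiling division, with a floor of one
    reducer so that an empty or tiny directory still produces one file.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


GIB = 2**30
# Example: a 10 GiB daily directory merged into files of about 1 GiB.
daily_reducers = reducers_for_target(10 * GIB, 1 * GIB)
```

The actual file sizes depend on how evenly the records are spread across reducers, so treat the result as a starting point.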

-

Option 4: Hive using ORC files

If you plan to store the data as ORC, convert the JSON data to ORC format. You can then use ORC's concatenate feature to merge the small ORC files into big ones.

-

Option 5: Hive transactional tables

With Hive transactional tables you can insert data using the PutHiveStreaming processor (convert the JSON data to Avro and feed it to PutHiveStreaming). Hive stores the data according to the buckets you defined on the transactional table, so the number of buckets determines the number of files in HDFS.

-> If you read this data from Spark, make sure your Spark version is able to read Hive transactional tables.

-

If you find another efficient way to do this, please share it so that we can all learn from your experience. :)

Re: What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files

New Contributor

Hi @Shu. Thank you very much for your thoughts. This is the kind of feedback that I was hoping for. I'll absolutely do my best to understand your recommendation. It sounds like I am not completely off-base in the way that I hope to use HDFS. It does sound like you are confirming that I must figure out how to accumulate large files, prior to driving them into HDFS. I will look at the tools and methods that you suggest.


Thanks for your insights

Re: What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files

Super Guru

@David Sargrad

You can keep a date variable (sysdate, current_date, etc.) in your Spark job and use it to build the HDFS directory path to read dynamically.
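
A minimal Python sketch of that idea, assuming the daily directories are named by date under a common base path. The base path `/data/events`, the `YYYY-MM-DD` layout, and the helper name `daily_dir` are all assumptions for illustration:

```python
from datetime import date


def daily_dir(base: str, day: date) -> str:
    """Build the HDFS directory path for one day's data (hypothetical layout)."""
    return f"{base.rstrip('/')}/{day:%Y-%m-%d}"


# Use today's date so the job does not need a hard-coded path.
todays_path = daily_dir("/data/events", date.today())

# In a PySpark job, assuming a SparkSession named `spark`, the path could
# then be read with something like:
#   df = spark.read.json(todays_path)
```

The benefit is that the same job can be scheduled every day and always picks up the matching day's directory.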

Re: What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files

New Contributor

Hi @Shu. Could you please explain what sysdate, current_date, etc. would do for me in the Spark job? I don't fully understand how to use them or what benefits this technique would offer.


Re: What is the best way to stream data to HDFS, accounting for the fact that HDFS is optimized for large files

New Contributor

@Shu I like your idea of creating daily archives (Option 3 above). How do I ensure that the Spark jobs I create to process those daily files run on the datanodes where the files are stored? Does YARN do this by default? I've not yet used YARN; I've only used HDFS. I am hoping to eventually use k8s (Kubernetes).