
How to decide the final output location?

Expert Contributor

I have a 22 GB file that is processed by a MapReduce job. The output is a 1 GB JSON file that I store on HDFS. I do not currently want to reduce the information in the output file, because it contains valuable detail needed for my visualization (drill-down etc.). The problem is that this file is too large to read from HDFS and consume in charting tools on a web page. What should the strategy be here? My first thought is to go for a NoSQL store such as MongoDB or HBase, but I have other choices too, such as an RDBMS like Oracle. I understand that the choice really depends on the nature of the data, but I would like to hear from experienced Hadoop users who may have faced a similar situation.

1 ACCEPTED SOLUTION

Master Guru

There are different options, depending on your access pattern.

Do you mostly run aggregations, but only over a small subset of columns?

Then use Hive with ORC. ORC is a columnar, compressed format, so only the columns you actually reference are read from disk. You would have to say goodbye to the JSON format, but that is fine as long as your data model is fairly flat (Hive supports lists and maps as well if it isn't). If your queries also restrict on a column, employ partitioning, sorting, and predicate pushdown so whole blocks of data can be skipped.
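For illustration, here is a minimal HiveQL sketch of that approach; the table and column names are hypothetical and would need to match your actual JSON schema:

```sql
-- Hypothetical ORC table, partitioned by a column the charts commonly filter on.
CREATE TABLE chart_data_orc (
    event_id BIGINT,
    category STRING,
    metric   DOUBLE
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- A query like this reads only the two referenced columns, and only
-- from the one partition selected by the WHERE clause.
SELECT category, SUM(metric) AS total
FROM chart_data_orc
WHERE event_date = '2016-01-15'
GROUP BY category;
```

You would populate the ORC table once from a staging table defined over the raw JSON output, after which the web page queries hit only the small slices it needs.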

Or do you mostly run lookups and aggregations over a small number of rows (thousands to millions)?

Then HBase with Phoenix sounds like a good choice.
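As a rough sketch of the HBase/Phoenix route (again, table and column names are hypothetical), the key design decision is the row key, since it determines which rows a query can scan efficiently:

```sql
-- Hypothetical Phoenix table over HBase; the composite row key
-- (event_date, event_id) makes date-bounded range scans cheap.
CREATE TABLE IF NOT EXISTS chart_data (
    event_date DATE   NOT NULL,
    event_id   BIGINT NOT NULL,
    category   VARCHAR,
    metric     DOUBLE,
    CONSTRAINT pk PRIMARY KEY (event_date, event_id)
);

-- The web page fetches only the narrow row range one chart needs,
-- instead of reading the whole 1 GB file.
SELECT category, SUM(metric) AS total
FROM chart_data
WHERE event_date = TO_DATE('2016-01-15')
GROUP BY category;
```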

