<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to process a large volume of data (e.g., 100 GB) in Apache Hadoop? - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121068#M34278</link>
    <description>&lt;P&gt;I want to process 100 GB of RFID data on Apache Hadoop. Can anyone explain how to do it using the Hortonworks Sandbox? Thanks in advance.&lt;/P&gt;</description>
    <pubDate>Tue, 21 Apr 2026 13:30:01 GMT</pubDate>
    <dc:creator>santoshdash_uu</dc:creator>
    <dc:date>2026-04-21T13:30:01Z</dc:date>
    <item>
      <title>How to process a large volume of data (e.g., 100 GB) in Apache Hadoop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121068#M34278</link>
      <description>&lt;P&gt;I want to process 100 GB of RFID data on Apache Hadoop. Can anyone explain how to do it using the Hortonworks Sandbox? Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 13:30:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121068#M34278</guid>
      <dc:creator>santoshdash_uu</dc:creator>
      <dc:date>2026-04-21T13:30:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to process a large volume of data (e.g., 100 GB) in Apache Hadoop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121069#M34279</link>
      <description>&lt;P style="margin-left: 40px;"&gt; &lt;A rel="user" href="https://community.cloudera.com/users/11743/santoshdashuu.html" nodeid="11743"&gt;@SANTOSH DASH&lt;/A&gt; You can process data in Hadoop using many different services. If your data has a schema, you can start by processing it with Hive. Full tutorial &lt;A href="http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/"&gt;here&lt;/A&gt;. My preference is to do ELT logic with Pig. Full tutorial &lt;A href="http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/"&gt;here&lt;/A&gt;. There are many ways to skin a cat here. A full list of tutorials is &lt;A href="http://hortonworks.com/tutorials/"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 10:50:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121069#M34279</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-07-11T10:50:01Z</dc:date>
    </item>
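The reply above points to Hive and Pig; a third common route on Hadoop is Hadoop Streaming, which lets you express the same map/reduce logic in plain Python. Below is a minimal sketch of a per-tag read count. The tab-delimited layout and the tag id being the first field are assumptions for illustration, not details from the thread.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (tag_id, 1) for each RFID read line.
    Assumes tab-delimited input with the tag id in the first field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1

def reducer(pairs):
    """Sum counts per tag id. Hadoop Streaming delivers mapper output
    to the reducer already sorted by key, which groupby relies on."""
    for tag, group in groupby(pairs, key=lambda kv: kv[0]):
        yield tag, sum(count for _, count in group)

if __name__ == "__main__":
    # Run as a Streaming mapper or reducer, e.g. (paths are illustrative):
    #   hadoop jar hadoop-streaming.jar -input /rfid -output /counts \
    #     -mapper "python3 this.py map" -reducer "python3 this.py reduce"
    if (sys.argv[1] if len(sys.argv) > 1 else "map") == "map":
        for tag, count in mapper(sys.stdin):
            print(f"{tag}\t{count}")
    else:
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for tag, total in reducer((k, int(v)) for k, v in pairs):
            print(f"{tag}\t{total}")
```

The same mapper/reducer pair also runs locally for testing by piping a sample file through `sort` between the two stages, which mimics the shuffle phase.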
    <item>
      <title>Re: How to process a large volume of data (e.g., 100 GB) in Apache Hadoop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121070#M34280</link>
      <description>&lt;P&gt;Regarding how, refer to Sunile. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark also works well ... &lt;/P&gt;&lt;P&gt;But the question is whether you really want to process 100 GB of data on the sandbox. The memory settings are tiny, there is a single drive, and data is not replicated ... If you do it like this, you can just use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32 GB of RAM each. That would give you a nice little environment and you could actually do some fast processing.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 17:31:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-process-large-volume-of-data-e-g-100-GB-in-Apache/m-p/121070#M34280</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-07-11T17:31:35Z</dc:date>
    </item>
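The last reply suggests that 100 GB can be handled with plain Python on a single machine if you stream the file instead of loading it. A minimal sketch of that idea, aggregating reads per tag with constant memory; the CSV layout and the `tag_id` column name are assumptions for illustration.

```python
import csv
from collections import Counter

def count_reads_per_tag(path, tag_column="tag_id"):
    """Stream a large RFID CSV once, keeping only per-tag counters
    in memory. The file is never fully loaded, so memory use depends
    on the number of distinct tags, not the file size."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row[tag_column]] += 1
    return counts
```

Because only the `Counter` lives in memory, this approach scales to files far larger than RAM, which is the reply's point about skipping the sandbox for a one-off job of this size.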
  </channel>
</rss>

