
How to process a large volume of data (e.g., 100 GB) in Apache Hadoop?

New Contributor

I want to process 100 GB of RFID data on Apache Hadoop. Can anyone explain how to do it using the Hortonworks Sandbox? Thanks in advance.

1 ACCEPTED SOLUTION

Master Guru

@SANTOSH DASH You can process data in Hadoop using many different services. If your data has a schema, you can start by processing the data with Hive. Full tutorial here. My preference is to do ELT logic with Pig. Full tutorial here. There are many ways to skin a cat here. The full list of tutorials is here.
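
To make the "data with a schema" route concrete, here is a minimal sketch. It is written with PySpark's SQL interface (also shipped with the Sandbox) rather than Hive itself so it fits in one runnable script; the same kind of query would work in Hive once the table is defined there. The HDFS path, column names, and aggregation are assumptions for illustration, not details from the original question.

```python
# Sketch: SQL-style processing of schema'd RFID reads with PySpark.
# The HDFS path, column names, and the aggregation are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rfid-reads").getOrCreate()

# Assume each RFID read is one CSV line: tag_id,reader_id,read_time
reads = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("hdfs:///data/rfid/reads/"))   # hypothetical location

reads.createOrReplaceTempView("rfid_reads")

# Reads per reader per day -- the sort of query you would also run in Hive.
daily_counts = spark.sql("""
    SELECT reader_id,
           to_date(read_time) AS read_date,
           COUNT(*)           AS reads
    FROM rfid_reads
    GROUP BY reader_id, to_date(read_time)
""")

# Write the much smaller aggregate back to HDFS as Parquet.
daily_counts.write.mode("overwrite").parquet("hdfs:///data/rfid/daily_counts/")
```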


2 REPLIES

Master Guru

Regarding how, refer to Sunile's answer. Pig is nice and flexible; Hive is good if you know SQL and your RFID data is already basically in a flat table format; Spark also works well ...

But the question is whether you really want to process 100 GB of data on the Sandbox. The memory settings are tiny, there is a single drive, and data is not replicated ... If that is the setup, you could just use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32 GB of RAM each. That would give you a nice little environment, and you could actually do some fast processing.
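
To make the "just use Python on a local machine" option concrete, here is a minimal sketch that reads a large RFID dump in chunks with pandas, so memory use stays bounded no matter how big the file is. The file name and column names are made up for illustration.

```python
# Sketch: processing a large RFID CSV locally in fixed-size chunks with pandas.
# File name and column names (tag_id, reader_id, read_time) are hypothetical.
import pandas as pd

chunks = pd.read_csv(
    "rfid_reads.csv",                 # hypothetical local dump of the RFID data
    usecols=["tag_id", "reader_id", "read_time"],
    parse_dates=["read_time"],
    chunksize=1_000_000,              # rows per chunk; tune to available RAM
)

# Aggregate each chunk, then combine the partial results.
partials = []
for chunk in chunks:
    counts = (chunk
              .groupby([chunk["read_time"].dt.date, "reader_id"])
              .size()
              .rename("reads"))
    partials.append(counts)

daily_counts = (pd.concat(partials)
                .groupby(level=[0, 1])
                .sum())

daily_counts.to_csv("daily_counts.csv")
```

This works for simple aggregations, but anything that needs joins or repeated passes over the full 100 GB is where a small real cluster (or at least Spark in local mode) starts to pay off.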