
Raw Historical data for Hadoop POC


Super Guru

Hey guys! Is there any website or other way to get a large volume of raw data (approx. 50-100 TB) for a Hadoop POC? Can we get historical data from the Twitter feed? There's no specific use case as of now; I just wanted to check whether we can download this much data.

2 REPLIES

Re: Raw Historical data for Hadoop POC

New Contributor

https://github.com/caesar0301/awesome-public-datasets should fit the bill - it's got large amounts of data from almost every field you can think of. 50-100TB is an awful lot of data though, especially to download. Have you considered writing a script to automatically generate the data locally given a smaller set of input data?
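
To make the "generate it locally" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name `amplify`, the three-column seed rows, and the +/-10% jitter are hypothetical, not part of any Hadoop tool. It repeatedly samples a small seed set, perturbs the numeric fields, and writes rows until a target size is reached.

```python
import io
import random

# Hypothetical seed data: (user, event, amount). Replace with rows from a real sample.
SEED_ROWS = [("alice", "login", 12.5), ("bob", "purchase", 99.0)]


def amplify(seed_rows, target_chars, out, seed=0):
    """Write comma-separated rows to `out` until at least `target_chars` characters are written.

    Each output row is a randomly chosen seed row, prefixed with a sequential id,
    with numeric fields jittered by up to +/-10%. Returns the number of rows written.
    Assumes fields contain no commas or newlines (no CSV quoting is done).
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    written = 0
    row_id = 0
    while written < target_chars:
        template = rng.choice(seed_rows)
        row = [row_id]
        for value in template:
            if isinstance(value, (int, float)):
                row.append(round(value * rng.uniform(0.9, 1.1), 2))
            else:
                row.append(value)
        line = ",".join(str(field) for field in row) + "\n"
        out.write(line)
        written += len(line)
        row_id += 1
    return row_id


# Small in-memory demo; for a real run, pass an open file and a much larger target.
demo = io.StringIO()
rows_written = amplify(SEED_ROWS, 200, demo)
```

A single process like this would be far too slow for tens of terabytes. In practice you would probably run many copies in parallel, each with its own `seed` and output file, and write directly to the cluster's storage.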


Re: Raw Historical data for Hadoop POC

As Ben said, I don't think downloading that much data is an option. OTOH, the TPC-DS benchmark includes a tool that can generate 100 TB as text files, and will also convert them into ORC files for Hive.
