Support Questions

How spark works to analyze huge databases


Hi,

I'm studying Spark because I've read some studies about it, and it seems great for processing large volumes of data. So I was thinking of experimenting with it by generating 100 GB of data with a benchmark like TPC and executing the queries with Spark on 2 nodes, but I have some doubts about how to do this.

Do I need to install Hadoop on two nodes to store the TPC tables, and then execute the queries with Spark against HDFS? And how can we create the TPC schema and store the tables in HDFS? Is that possible? Or is installing Hadoop not necessary and should we use Hive instead? I've been reading some articles about this but I'm getting a bit confused. Thanks for your attention!

1 ACCEPTED SOLUTION

Master Mentor
@Jan J

I wouldn't start with a 2-node cluster. Use a minimum of 3 to 5 nodes, even for a lab environment: 2 masters and 3 DataNodes.

You need to deploy a cluster first; use Ambari to deploy HDP.

You can generate the Hive data using the testbench: https://github.com/cartershanklin/hive-testbench

Then you can test Spark SQL.

There is also https://github.com/databricks/spark-perf for Spark performance tests.

Yes, you should start with Hadoop and take advantage of the distributed computing framework.
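
For reference, here is a minimal sketch of what the Spark SQL step could look like once the testbench data has been loaded into Hive. The database name tpcds_bin_partitioned_orc_100 is an assumption (replace it with whatever schema your testbench run actually created); the table and column names follow the standard TPC-DS schema.

    import org.apache.spark.sql.SparkSession

    object TpcdsSparkSqlSmokeTest {
      def main(args: Array[String]): Unit = {
        // enableHiveSupport() lets Spark read the tables registered in the Hive metastore
        val spark = SparkSession.builder()
          .appName("tpcds-sparksql-smoke-test")
          .enableHiveSupport()
          .getOrCreate()

        // Assumed database name -- adjust to the schema your testbench run created.
        spark.sql("USE tpcds_bin_partitioned_orc_100")

        // Simple aggregation over one of the generated TPC-DS fact tables.
        spark.sql(
          """SELECT ss_store_sk, COUNT(*) AS sales_cnt, SUM(ss_net_paid) AS net_paid
            |FROM store_sales
            |GROUP BY ss_store_sk
            |ORDER BY net_paid DESC
            |LIMIT 10""".stripMargin
        ).show()

        spark.stop()
      }
    }

You could run this with spark-submit, or paste the body of main into spark-shell on one of the cluster nodes.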


11 REPLIES

Master Mentor

@Jan J Please help me close the thread if this was useful.


Thanks for your help. I just don't understand why the link you shared, https://github.com/databricks/spark-perf, is needed. Can you explain? In the first step I install Hadoop, then I install Hive and create the schema. Then I can use Spark SQL to execute queries against the Hive schema, right? So why is that link necessary? Thanks again!
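
For illustration, here is a minimal sketch of the workflow described in this reply (Hive schema created first, then Spark SQL queries run against it), timing a single query by hand. The database name is again the assumed tpcds_bin_partitioned_orc_100 from the accepted solution, and the tables and columns follow the standard TPC-DS schema.

    import org.apache.spark.sql.SparkSession

    object ManualTpcdsTiming {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("manual-tpcds-timing")
          .enableHiveSupport()               // read the schema created in Hive
          .getOrCreate()

        // Hypothetical database name -- use the schema you created with hive-testbench.
        spark.sql("USE tpcds_bin_partitioned_orc_100")

        // Time one query by hand; a harness such as spark-perf automates repeated runs.
        val start = System.nanoTime()
        val rows = spark.sql(
          """SELECT d_year, SUM(ss_net_paid) AS total_paid
            |FROM store_sales
            |JOIN date_dim ON ss_sold_date_sk = d_date_sk
            |GROUP BY d_year
            |ORDER BY d_year""".stripMargin
        ).collect()
        val seconds = (System.nanoTime() - start) / 1e9

        println(f"query returned ${rows.length} rows in $seconds%.1f s")
        spark.stop()
      }
    }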