How To Process Data with Apache Pig tutorial SLOW

Contributor

Hello all -

Just a quick LOW PRIORITY question for anyone who has run the tutorial "How To Process Data with Apache Pig".

I created the script and am running the job as I write this. It has been running for 2 hours. Does this seem SLOW to anyone else?

I am running on a machine with an i7 processor and 16 GB of RAM, of which the Ambari Sandbox is using 8 GB. Are there other configuration options that should be set? That said, this already seems like a massive amount of resources in use.

1 ACCEPTED SOLUTION

Master Mentor

@Mike Vogt

Have you configured YARN queues?

There is a high probability that some other job is consuming all the resources.

Check the ResourceManager (RM) UI from Ambari.
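
If you would rather check from a script than from the RM UI, here is a minimal sketch that reads the ResourceManager REST API and lists running applications by allocated memory. The host and port (`localhost:8088`) are assumptions about a default Sandbox setup, and both function names are just illustrative.

```python
import json
from urllib.request import urlopen

# Assumption: the Sandbox ResourceManager web service is reachable here.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING"


def summarize_running_apps(payload):
    """Return (name, queue, allocated MB, allocated vcores) per app, largest first."""
    # The RM returns {"apps": null} when nothing is running.
    apps = (payload.get("apps") or {}).get("app", [])
    rows = [
        (a.get("name"), a.get("queue"), a.get("allocatedMB", 0), a.get("allocatedVCores", 0))
        for a in apps
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fetch_running_apps(url=RM_APPS_URL):
    """Fetch the list of RUNNING applications from the ResourceManager."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling `summarize_running_apps(fetch_running_apps())` on the Sandbox should show whether another application is holding most of the memory while the Pig job waits.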

11 REPLIES

Master Mentor

@Mike Vogt

Make sure the core components are up: HDFS, YARN, and MapReduce.
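
The same check can be scripted against the Ambari REST API. This is only a sketch under several assumptions: Ambari at `localhost:8080`, a cluster named `Sandbox`, `admin`/`admin` credentials, and the MapReduce service appearing under the name `MAPREDUCE2`. Adjust all of these for your install.

```python
import base64
import json
from urllib.request import Request, urlopen

# Assumptions: Ambari host/port, cluster name, and service names below.
AMBARI_URL = (
    "http://localhost:8080/api/v1/clusters/Sandbox/services"
    "?fields=ServiceInfo/state"
)
CORE_SERVICES = ("HDFS", "YARN", "MAPREDUCE2")


def services_not_started(payload, wanted=CORE_SERVICES):
    """Map each wanted service that is not STARTED to its reported state."""
    states = {
        item["ServiceInfo"]["service_name"]: item["ServiceInfo"].get("state")
        for item in payload.get("items", [])
    }
    return {
        name: states.get(name, "MISSING")
        for name in wanted
        if states.get(name) != "STARTED"
    }


def fetch_services(url=AMBARI_URL, user="admin", password="admin"):
    """Fetch service states from Ambari using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(url, headers={"Authorization": f"Basic {token}"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

An empty result from `services_not_started(fetch_services())` means all three services report as started; anything else names the service to look at in Ambari.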

Yep, my History Server was down and had to be manually started.

Master Mentor

@Lester Martin Thanks for testing and confirming. I think you should publish an article based on your comments.

I'm working with @Rafael Coss to make sure the instructions are extremely crisp, as I think there are a few things that could easily trip up a novice, and novices are who we are targeting with these tutorials.

Contributor

Your genius level skills shine through once again! Thanks very much!

I just ran this tutorial on my 16 GB i7 MacBook Pro (gave the VM 8 GB, just as you did) and could get it to run in 100 secs with MR and about 65 secs using Tez. I then ran the same script from the CLI and got those times down to about 60 and 25 secs on MR and Tez, respectively. I'm using the 2.3.2 Sandbox, and the only thing I had to do was start the History Server, which was showing up red in Ambari.
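
For anyone who wants to repeat the CLI comparison, here is a rough timing sketch. It assumes the `pig` launcher is on the PATH and relies on its `-x` option to pick the execution engine (`mapreduce` or `tez`); the helper names and any script filename you pass in are placeholders, not something from the tutorial.

```python
import subprocess
import time


def pig_command(script, engine):
    """Build a Pig CLI invocation; -x selects the execution engine."""
    if engine not in ("mapreduce", "tez"):
        raise ValueError(f"unsupported engine: {engine}")
    return ["pig", "-x", engine, script]


def time_command(cmd):
    """Run cmd to completion and return (exit code, wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start
```

Running `time_command(pig_command("your_script.pig", "tez"))` and then the same call with `"mapreduce"` gives wall-clock numbers comparable to the ones above. Run each more than once, since the first run on a cold Sandbox is likely to be slower.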

Master Mentor

Tez benefits from warm containers, so consecutive executions of the same script should be faster. I didn't know MR performed better from the CLI; can't explain that 🙂 @Lester Martin

It ~seems~ that the Ambari Views were adding about 30 seconds to the run times. Here are some of my notes on the timings; notice that the actual log-reported job times are pretty consistent between CLI and View runs.

Ran From    | Exec Eng | Job Time | Clock Time
Ambari View | MR       | 64 sec   | 103 sec
Ambari View | Tez      | 25 sec   | 63 sec
CLI         | MR       | 59 sec   | 61 sec
CLI         | Tez      | 25 sec   | 27 sec

Actual job times were consistent for each execution engine (Tez more than twice as fast), but the Ambari View ~seemed~ to add 30+ secs overall. I'm sure the extremely constrained HDP stack on a tiny little pseudo-cluster (aka the Sandbox) is a big factor in this (understandable).
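
As a quick check on the arithmetic, this small sketch subtracts the log-reported job time from the wall-clock time for each run. The numbers are copied from the notes above; nothing else is assumed.

```python
# (ran from, engine, job time in sec, clock time in sec), copied from the notes above
RUNS = [
    ("Ambari View", "MR", 64, 103),
    ("Ambari View", "Tez", 25, 63),
    ("CLI", "MR", 59, 61),
    ("CLI", "Tez", 25, 27),
]


def overheads(runs):
    """Return clock time minus job time for each (source, engine) pair."""
    return {(src, eng): clock - job for src, eng, job, clock in runs}


for (src, eng), extra in overheads(RUNS).items():
    print(f"{src:12} {eng:4} overhead: {extra} sec")
```

That works out to 39 and 38 seconds of overhead for the Ambari View runs against 2 seconds each from the CLI.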