About Daan

Daan · ‎10-15-2014

I noticed! Very cool development!

Daan · ‎09-09-2014

So, I managed to fix my problem. The first hint was the GC overhead limit exceeded message. I quickly found out that this can be cause by lack of heapspace for the JVM. After digging a bit into the YARN configuration in Cloudera Manager, and comparing it to the setting in an Amazon Elastic Mapreduce cluster (where my Pig scripts did work), I found out that, even though each node had 30GB of memory, most YARN components had very low heapspace settings. I updated the heapspace for the NodeManagers, ResourceManager and Containers and I also set the max heapspace for mappers and reducers somewhat higher, keeping in mind the total amount of memory available on each node (and the other services running there, like Impala) and now my Pig scripts work again! Two issues I want to mention in case a Cloudera engineer reads this: I find it a bit strange that Cloudera Manager doesn't set saner heapspace amounts, based on the total amount of RAM available The fact that not everything runs under YARN yet, makes it harder to manage memory. You actually have to manage memory manually. If Impala would run under YARN, there would be less memory management I think 🙂

Daan · ‎09-04-2014

When running a Pig script on a 3-node CM managed CDH cluster, I get the following error: 2014-09-04 11:56:02,411 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1409824425178_0011_r_000001_3 Info:Error: GC overhead limit exceeded All 3 nodes (running on Amazon EC2) have 30GB of memory. The datasize is trivial: I'm using 3 CSVs of which the largest is 1GB in size. The data is fetched directly from Amazon S3. This happens both when running the script in Hue and when running it on the command line. Three questions: Why is this happening? How can I fix this? Isn't the whole purpose of Cloudera Manager to provide a sane configuration, based on the hardware used? Some background for the last question: the cluster is running on Amazon EC2. Before setting up a CDH cluster using Cloudera Manager, I ran an Amazon EMR cluster with the same hardware configuration. The same pig script worked perfectly fine then. I switched to CDH so I could use Hue, and be on the cutting edge of Hadoop related technologies. It's a shame I'm running into these kind of problems so quickly...

Daan · ‎09-02-2014

Thank you for your help!

Daan · ‎09-02-2014

@GautamG wrote: I have to check on that. Meanwhile another option is to create the cluster manually and save the master and worker node images as custom AMIs. Use those AMIs every morning to create a new cluster, then tear it down. When you want to update CDH, just do it once manually and save new AMIs Hmmm, that is actually a great idea! It certainly is the least-effort solution so far 🙂 I will look into that this afternoon. Meanwhile, with regards to Whirr, it seems the documentation I pointed to, is outdated. If you look at the sample whirr config in the whirr-cm repo, it supports YARN roles: https://raw.githubusercontent.com/cloudera/whirr-cm/master/cm-ec2.properties

Daan · ‎09-02-2014

That's interesting, because this page in the CDH5 documentation states: Note: At present you can launch and run only an MapReduce cluster; YARN is not supported. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_launch_cm_with_whirr.html

Daan · ‎09-02-2014

@GautamG wrote: Yes the rpm/deb packages have to be installed already. Alternatively you could use a mixture of the AWS API (to provision the hosts), then use the Cloudera Manager API to provision the cluster (using the parcel deployment) Does the CM API support distributing parcels? Or how would I go about that? I know how to provision EC2 instances using the Amazon AWS API, but now I'm kind off in the dark on how to install CM and CDH on those 🙂 Regarding the Whirr option: it doesn't support YARN on EC2 with CM yet, right?

Daan · ‎09-02-2014

Thanks for the swift answer! I have looked at the API and it seems you can't actually install packages through the API, right? Does that mean that all the packages for all the services I'd want to enable, should be installed beforehand on all nodes, before I add hosts, services and roles through the API?

Daan · ‎09-01-2014

Hi all! At the company I work, we're currently using a 4 node Amazon EMR cluster together with S3 for all our data warehousing and analysis needs. The cluster gets spin-up each morning and torn down each evening automatically through a cron job running on another server, to save costs. We're using Impala exstensively. Our data is copied each morning from S3 to HDFS after the cluster has been spun up. I was looking at installing Hue to provide a nice interface for querying Impala. Then it occurred to me that it would probably be easier to move from EMR to EC2 and install CDH5 on there. Ideally we would use Cloudera Manager for monitoring the cluster while it's running. The problem: is there a way to install CDH5, including Cloudera Manager, automatically on an EC2 cluster, without human interaction?

Daan · ‎08-12-2014

Thanks Marcel! That seems to work indeed, at least with Tableau and Impyla. Apparently the instructions on the Amazon website regarding setting up a tunnel, don't work that well. I'm gonna try out tomorrow if this tunnel also works with Squirrel and other generic JDBC DB tools.

Online	Offline
Last Visited	‎06-08-2015 12:29 PM

Member Since	‎06-23-2014 04:19 AM
Last Visited	‎06-08-2015 12:29 PM
Posts	17

Cloudera Community

Re: Pig memory

Re: Install CDH5 on EC2 without human interaction

Re: Pig memory

Pig memory

Re: Install CDH5 on EC2 without human interaction

Re: Install CDH5 on EC2 without human interaction

Re: Install CDH5 on EC2 without human interaction

Re: Install CDH5 on EC2 without human interaction

Re: Install CDH5 on EC2 without human interaction

Install CDH5 on EC2 without human interaction

Re: Can't connect to Impala through JDBC on Amazon...