Member since: 06-23-2014
Posts: 17
Kudos Received: 0
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2356 | 09-09-2014 04:36 AM |
09-09-2014
04:36 AM
So, I managed to fix my problem. The first hint was the "GC overhead limit exceeded" message. I quickly found out that this can be caused by a lack of heap space for the JVM. After digging a bit into the YARN configuration in Cloudera Manager and comparing it to the settings of an Amazon Elastic MapReduce cluster (where my Pig scripts did work), I found that, even though each node has 30 GB of memory, most YARN components had very low heap space settings. I raised the heap space for the NodeManagers, the ResourceManager and the containers, and I also set the maximum heap space for mappers and reducers somewhat higher, keeping in mind the total amount of memory available on each node (and the other services running there, like Impala). Now my Pig scripts work again! Two issues I want to mention in case a Cloudera engineer reads this. First, I find it a bit strange that Cloudera Manager doesn't pick saner heap space defaults based on the total amount of RAM available. Second, the fact that not everything runs under YARN yet makes memory harder to manage; you actually have to manage it manually. If Impala ran under YARN, there would be less memory management, I think 🙂
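For reference, the settings involved boil down to the standard Hadoop 2 memory properties shown below. This is only a sketch: the values are illustrative assumptions rather than the exact numbers I ended up with, and on a CM-managed cluster you would change them through the Cloudera Manager configuration pages instead of editing the XML files by hand.

```xml
<!-- yarn-site.xml: memory YARN may use per node and the largest container it may grant.
     Values are illustrative; leave headroom for Impala and the OS on a 30 GB node. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

<!-- mapred-site.xml: per-task container sizes and task JVM heaps
     (heap roughly 80% of the container size). -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```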
09-04-2014
06:40 AM
I suggest looking at Oozie. Oozie is the de facto workflow manager/scheduler and is supported on CDH. You can schedule jobs to run at certain time intervals, at certain dates/times, etc., and Oozie supports Sqoop jobs. Here is some documentation, although I have no idea how up-to-date it is 🙂 Hope that helps!
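To give an idea of what that looks like in practice, here is a minimal sketch of an Oozie coordinator that triggers a workflow containing a Sqoop action once a day. The application paths, dates, connection string and table name are placeholders, so treat this as an outline rather than a ready-to-run configuration.

```xml
<!-- coordinator.xml: run the workflow once a day -->
<coordinator-app name="daily-sqoop-import" frequency="${coord:days(1)}"
                 start="2014-09-05T06:00Z" end="2015-09-05T06:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/user/someuser/oozie/sqoop-import</app-path>
    </workflow>
  </action>
</coordinator-app>

<!-- workflow.xml: the Sqoop action itself -->
<workflow-app name="sqoop-import-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://db-host/mydb --table my_table --target-dir /data/my_table</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Sqoop import failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```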
09-04-2014
05:09 AM
When running a Pig script on a 3-node, CM-managed CDH cluster, I get the following error:

2014-09-04 11:56:02,411 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1409824425178_0011_r_000001_3 Info:Error: GC overhead limit exceeded

All 3 nodes (running on Amazon EC2) have 30 GB of memory. The data size is trivial: I'm using 3 CSVs, of which the largest is 1 GB in size. The data is fetched directly from Amazon S3. This happens both when running the script in Hue and when running it on the command line. Three questions:

1. Why is this happening?
2. How can I fix this?
3. Isn't the whole purpose of Cloudera Manager to provide a sane configuration, based on the hardware used?

Some background for the last question: the cluster is running on Amazon EC2. Before setting up a CDH cluster using Cloudera Manager, I ran an Amazon EMR cluster with the same hardware configuration, and the same Pig script worked perfectly fine there. I switched to CDH so I could use Hue and be on the cutting edge of Hadoop-related technologies. It's a shame I'm running into this kind of problem so quickly...
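(Note for anyone hitting the same error: besides fixing the cluster configuration, as described in my later reply, the task memory can also be overridden from inside the Pig script itself. A minimal sketch, assuming the MR2 property names used by CDH 5; the values are illustrative only.)

```pig
-- Per-script workaround: ask YARN for larger task containers and give each task JVM
-- more heap. Property names assume MR2 (CDH 5); values are illustrative.
SET mapreduce.map.memory.mb '4096';
SET mapreduce.map.java.opts '-Xmx3276m';
SET mapreduce.reduce.memory.mb '4096';
SET mapreduce.reduce.java.opts '-Xmx3276m';

-- ... rest of the script unchanged ...
```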
09-02-2014
03:22 AM
@GautamG wrote: I have to check on that. Meanwhile another option is to create the cluster manually and save the master and worker node images as custom AMIs. Use those AMIs every morning to create a new cluster, then tear it down. When you want to update CDH, just do it once manually and save new AMIs.

Hmmm, that is actually a great idea! It certainly is the least-effort solution so far 🙂 I will look into that this afternoon. Meanwhile, with regards to Whirr: it seems the documentation I pointed to is outdated. If you look at the sample Whirr config in the whirr-cm repo, it supports YARN roles: https://raw.githubusercontent.com/cloudera/whirr-cm/master/cm-ec2.properties
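The morning spin-up from those saved images could then be scripted with the AWS CLI, roughly as sketched below; the AMI IDs, key pair, security group, instance type and instance IDs are all placeholders.

```sh
#!/bin/sh
# Morning: launch one master and three workers from the saved custom AMIs (placeholder IDs).
aws ec2 run-instances --image-id ami-11111111 --count 1 \
    --instance-type m3.2xlarge --key-name my-key --security-group-ids sg-22222222

aws ec2 run-instances --image-id ami-33333333 --count 3 \
    --instance-type m3.2xlarge --key-name my-key --security-group-ids sg-22222222

# Evening: tear the cluster down again (instance IDs taken from the run-instances output).
aws ec2 terminate-instances --instance-ids i-aaaa1111 i-bbbb2222 i-cccc3333 i-dddd4444
```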
09-02-2014
03:01 AM
That's interesting, because this page in the CDH5 documentation states: Note: At present you can launch and run only an MapReduce cluster; YARN is not supported. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_launch_cm_with_whirr.html
09-02-2014
02:49 AM
@GautamG wrote: Yes the rpm/deb packages have to be installed already. Alternatively you could use a mixture of the AWS API (to provision the hosts), then use the Cloudera Manager API to provision the cluster (using the parcel deployment).

Does the CM API support distributing parcels? Or how would I go about that? I know how to provision EC2 instances using the Amazon AWS API, but now I'm kind of in the dark on how to install CM and CDH on those 🙂 Regarding the Whirr option: it doesn't support YARN on EC2 with CM yet, right?
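For later readers: the Cloudera Manager REST API does expose parcel operations. A rough sketch of what the calls could look like with curl is below; the API version, host, credentials, cluster name and parcel version string are assumptions for illustration, so check the API documentation of your CM release for the exact endpoints.

```sh
# Assumed endpoints, modelled on the CM 5 REST API; adjust the API version, cluster name
# and parcel version (all placeholders here) to your setup.
CM="http://cm-host:7180/api/v6"
CLUSTER="cluster1"
PARCEL="parcels/products/CDH/versions/5.0.2-1.cdh5.0.2.p0.99"

# 1. Download the parcel to the CM server's local repository
curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/$PARCEL/commands/startDownload"

# 2. Distribute it to all hosts in the cluster
curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/$PARCEL/commands/startDistribution"

# 3. Activate it so the services pick it up
curl -u admin:admin -X POST "$CM/clusters/$CLUSTER/$PARCEL/commands/activate"

# Poll progress/state
curl -u admin:admin "$CM/clusters/$CLUSTER/$PARCEL"
```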
09-02-2014
02:18 AM
Thanks for the swift answer! I have looked at the API and it seems you can't actually install packages through the API, right? Does that mean that all the packages for all the services I'd want to enable should be installed beforehand on all nodes, before I add hosts, services and roles through the API?
09-01-2014
02:40 AM
Hi all! At the company where I work, we're currently using a 4-node Amazon EMR cluster together with S3 for all our data warehousing and analysis needs. To save costs, the cluster gets spun up each morning and torn down each evening, automatically, through a cron job running on another server. We're using Impala extensively. Our data is copied from S3 to HDFS each morning after the cluster has been spun up. I was looking at installing Hue to provide a nice interface for querying Impala. Then it occurred to me that it would probably be easier to move from EMR to EC2 and install CDH5 there. Ideally we would use Cloudera Manager for monitoring the cluster while it's running. The problem: is there a way to install CDH5, including Cloudera Manager, automatically on an EC2 cluster, without human interaction?
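To make "without human interaction" concrete: what I'm hoping for is essentially a bootstrap script the CM host can run unattended, something along the lines of the sketch below for a RHEL/CentOS 6 image. The repo URL and package names reflect my understanding of the CM 5 packaging and should be double-checked against the installation guide.

```sh
#!/bin/sh
# Unattended install of the Cloudera Manager 5 server with its embedded database.
# Repo URL and package names are assumptions based on CM 5 packaging; verify them
# against the installation guide for your exact version.

# 1. Add the Cloudera Manager yum repository
wget -O /etc/yum.repos.d/cloudera-manager.repo \
    http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/cloudera-manager.repo

# 2. Install the CM server and the embedded PostgreSQL database, without prompts
yum install -y cloudera-manager-daemons cloudera-manager-server cloudera-manager-server-db-2

# 3. Start the database and the CM server
service cloudera-scm-server-db start
service cloudera-scm-server start

# From here on, hosts, parcels, services and roles can be driven via the CM REST API,
# so no clicking through the setup wizard should be needed.
```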
08-12-2014
01:04 PM
Thanks Marcel! That seems to work indeed, at least with Tableau and Impyla. Apparently the instructions on the Amazon website regarding setting up a tunnel don't work that well. Tomorrow I'm going to try out whether this tunnel also works with Squirrel and other generic JDBC DB tools.
07-10-2014
03:12 AM
I asked it over at the Amazon EMR forums as well. No answers so far 😞 I thought that maybe this was a general Impala JDBC issue that people have seen before.
07-09-2014
07:13 AM
Hi all, I'm trying to connect to Impala on a cluster set up through Amazon EMR, but it doesn't work. It's a three-node cluster, with Impala installed and working. I've done the following things:

1. Set up an SSH tunnel to the master node like this: ssh -ND 21050 hadoop@master-node-external-dns-hostname
2. Downloaded the correct JDBC drivers from here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/impala-jdbc.html
3. Tried to set up a connection using SquirrelSQL and SQLWorkbenchJ with the downloaded drivers and the following connection string: jdbc:hive2://localhost:21050/;auth=noSasl

Result: Could not establish connection to jdbc:hive2://localhost:21050/;auth=noSasl: null

I checked whether Impala works by running impala-shell on the master node: I can show tables, query, etc. I checked whether the port is forwarded through the tunnel by telnetting to localhost 21050. I also checked with beeline on the master node whether it's possible at all to connect to Impala through JDBC on that port; that works just fine. Am I missing something? Can someone shine their light on this? Thanks!
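One detail that may matter when debugging this kind of setup: ssh -ND opens a dynamic SOCKS proxy rather than a plain port forward, and most JDBC clients won't go through a SOCKS proxy unless they are explicitly configured to. A local port forward with -L exposes the Impala port directly on localhost and can be tested with beeline from the local machine; a sketch, with the host name and key file as placeholders:

```sh
# Forward local port 21050 straight to the Impala daemon on the EMR master node
# (host name and key file are placeholders).
ssh -i my-key.pem -N -L 21050:localhost:21050 hadoop@master-node-external-dns-hostname &

# Quick test from the local machine, using the same JDBC URL the GUI tools use.
beeline -d org.apache.hive.jdbc.HiveDriver \
        -u "jdbc:hive2://localhost:21050/;auth=noSasl" \
        -e "show tables;"
```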
06-25-2014
12:37 AM
I'm using CDH 5.0.2 together with Cloudera Manager 5.0.2. I think the Sqoop issue you linked is exactly the problem I'm having. I shouldn't have to add --append, because I'm already using lastmodified, which is the other incremental mode. As long as SQOOP-1138 isn't fixed, Sqoop will be rather useless to me 🙂 The only alternative seems to be to export the whole database each time and replace the old data with the new export.
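That alternative would boil down to wiping the target directory and re-importing the whole table on every run, roughly as sketched below, reusing the connection details from my original command; hardly incremental, of course.

```sh
#!/bin/sh
# Full re-import workaround: throw away the previous export and pull the whole table again.
hdfs dfs -rm -r -skipTrash /user/vagrant/sqooptest

sqoop import \
    --connect jdbc:mysql://10.211.55.1/test_sqoop \
    --username root -P \
    --table test_incr_update \
    --target-dir /user/vagrant/sqooptest
```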
06-24-2014
12:39 AM
@abe wrote: Sqoop2 does not support incremental imports just yet (https://issues.apache.org/jira/browse/SQOOP-1168). It looks like the command you're running is creating a saved job. Have you tried just executing the saved job (http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_saved_jobs)? Seems like this is achievable via: sqoop job --exec import-test. You don't need to create a job to perform incremental imports (http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_incremental_imports). -Abe

sqoop job --exec import-test is actually the way I ran the saved job. As said, the first time it runs fine; the second time it complains about the output dir already existing. The reason I used a saved job for this is the promise that it will keep track of, and autofill, the last value.
06-23-2014
04:27 AM
Hi all! I have a seemingly simple use case for Sqoop: incrementally import data from a MySQL db into HDFS. At first I tried Sqoop2, but it seems Sqoop2 doesn't support incremental imports yet. Am I correct in this? (Sqoop2 did imports fine btw.) Then I tried to use Sqoop (1) and figured out I need to create a job so Sqoop can automatically update stuff like the last value for me. This is the command I used to create a job:

sqoop job --create import-test -- import --connect jdbc:mysql://10.211.55.1/test_sqoop --username root -P --table test_incr_update --target-dir /user/vagrant/sqooptest --check-column updated --incremental lastmodified

When I run it for the first time, it works great. When I run it for the second time, I get:

ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://vm-cluster-node1:8020/user/vagrant/sqooptest already exists

I could of course remove the target dir before running the import again, but that would defeat the whole purpose of only getting the newer data and merging it with the old data (for which the old data needs to be present, I assume)! Is this a (known) bug? Or am I doing something wrong? Any help would be appreciated 🙂
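For completeness: some Sqoop 1 releases accept a --merge-key option on lastmodified incremental imports, which is meant to merge the new rows into the existing target directory instead of failing on it. I haven't verified which CDH versions ship that option, and the merge column name below is a guess, so treat this variant of my command as a sketch only.

```sh
# Assumed variant: --merge-key names the column that identifies a row, so a lastmodified
# incremental import can merge new data into the existing target dir instead of aborting.
# Availability depends on the Sqoop 1.4.x version; "id" is a placeholder key column.
sqoop job --create import-test -- import \
    --connect jdbc:mysql://10.211.55.1/test_sqoop \
    --username root -P \
    --table test_incr_update \
    --target-dir /user/vagrant/sqooptest \
    --check-column updated \
    --incremental lastmodified \
    --merge-key id
```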