Member since: 07-12-2013
Posts: 435
Kudos Received: 117
Solutions: 82
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1950 | 11-02-2016 11:02 AM |
| | 3008 | 10-05-2016 01:58 PM |
| | 7628 | 09-07-2016 08:32 AM |
| | 8050 | 09-07-2016 08:27 AM |
| | 1999 | 08-23-2016 08:35 AM |
06-02-2016
12:31 PM
Also, note that there's a script that tries to detect a public IP and set up the hosts file for you on boot. If you're going to edit the hosts file manually, you probably want to comment out the line in /etc/init.d/cloudera-quickstart-init that calls /usr/bin/cloudera-quickstart-ip. I don't remember which version that script was added in - it might have been 5.5 - so if your VM doesn't have /usr/bin/cloudera-quickstart-ip, you can ignore this post and safely edit the hosts file anyway.
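A hedged sketch of that edit, demonstrated on a temporary copy of the init script (on the VM you'd run the same sed, with sudo, against /etc/init.d/cloudera-quickstart-init itself):

```shell
# Demo on a throwaway copy; the real target is
# /etc/init.d/cloudera-quickstart-init on the VM.
init_copy=$(mktemp)
printf '%s\n' '#!/bin/sh' '/usr/bin/cloudera-quickstart-ip' > "$init_copy"

# Prefix the cloudera-quickstart-ip invocation with '#' to disable it
sed -i 's|^\([^#]*cloudera-quickstart-ip\)|#\1|' "$init_copy"

cat "$init_copy"
```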
06-01-2016
09:56 AM
intermediate_access_logs was created as part of the ETL process in the tutorial. That process is done via Hive because it uses Hive SerDes and other Hive-only features. The final table created in that process (tokenized_access_logs, if I remember correctly) is the one you should be able to query in Impala. Also, don't forget to run 'invalidate metadata' when the ETL process is finished: Impala caches table metadata, so it won't see tables created through Hive until you do.
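As a command fragment for the VM (not runnable elsewhere), the metadata refresh and a sanity-check query might look like this - the table name is the one from the tutorial:

```shell
impala-shell -q 'INVALIDATE METADATA;'
impala-shell -q 'SELECT COUNT(*) FROM tokenized_access_logs;'
```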
06-01-2016
09:53 AM
I don't know enough about Spark internals to give much intelligent advice here, but it's possible it's a matter of resources. You still have the problem in your hosts file that I described above. The hosts file you posted maps 127.0.0.1 AND your public IP to quickstart.cloudera. You should remove quickstart and quickstart.cloudera from the 127.0.0.1 line, so it reads only '127.0.0.1 localhost localhost.localdomain', and have only your public IP map to those hostnames. You'll need to restart all services after you make this change.
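A hedged sketch of that edit, demonstrated on a temporary copy (203.0.113.10 is a placeholder for your public IP; on the VM you'd apply the same sed, with sudo, to /etc/hosts itself and then restart services):

```shell
# Build a copy that reproduces the problem described above
hosts=$(mktemp)
printf '%s\n' \
  '127.0.0.1 localhost localhost.localdomain quickstart.cloudera quickstart' \
  '203.0.113.10 quickstart.cloudera quickstart' > "$hosts"

# Drop the quickstart aliases from the loopback line only
sed -i '/^127\.0\.0\.1/s/ quickstart\.cloudera quickstart//' "$hosts"

cat "$hosts"
```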
05-20-2016
01:51 PM
The VirtualBox Guest Additions are installed in the VM, which should enable drag & drop of files, but perhaps it's having issues with the size of the files. SSH should also be running, so scp is another option, as is a Shared Folder. You'll need to get the files to be visible from the VM's filesystem, perhaps unzip them at that point, and then you can use 'hadoop fs -copyFromLocal' to put them in HDFS.
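The scp route might look like the following command fragment (run against the VM, not here - the filenames and host path are hypothetical placeholders):

```shell
# From your host machine: copy the archive into the VM over SSH
scp /path/on/host/data.zip cloudera@quickstart.cloudera:/home/cloudera/

# Then, inside the VM: unpack and load into HDFS
unzip /home/cloudera/data.zip -d /home/cloudera/data
hadoop fs -mkdir -p /user/cloudera/data
hadoop fs -copyFromLocal /home/cloudera/data/* /user/cloudera/data/
```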
05-02-2016
02:43 PM
When you try to stop a service, it will warn you which services depend on it if they are running. If you try to start a service, it will warn you which services it depends on if they are not running. I believe Zookeeper, HDFS, and YARN are the only other services you need to run for Spark, HBase, and Hive.
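A plausible start order on the QuickStart VM that respects those dependencies (ZooKeeper first, then HDFS, then YARN, then the dependent services) - exact service names can vary by CDH version, so treat this as a sketch:

```shell
sudo service zookeeper-server start
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-datanode start
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hbase-master start
sudo service hbase-regionserver start
```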
04-29-2016
07:04 AM
I don't have a ton of experience with Llama, but I think the misunderstanding here is that Impala manages the execution of its own queries, while the MapReduce framework manages the execution of Hive queries. YARN manages resources for individual MapReduce jobs, and it can manage the Impala daemons via Llama. The YARN application for Llama will run as long as Impala does - that's by design, to keep the latency of Impala queries very low. In the case of Hive, YARN will manage the job's resources only until that job (a single query) is finished. I'm not sure why your Hive queries would not be running. If this is in the QuickStart VM, my first guess would be that Llama is still running and there aren't enough executors / slots left for your Hive queries - YARN in the QuickStart VM is not configured with a lot of capacity, and it's not tested with Llama. I know of no other way to manage Impala resources via YARN, though.
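One way to check whether a long-lived Llama application is holding YARN capacity (and starving Hive's MapReduce jobs) - a command fragment to run on the VM:

```shell
# List running YARN applications; look for a Llama app that never finishes
yarn application -list -appStates RUNNING
```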
04-13-2016
07:40 AM
1 Kudo
If you're in the QuickStart VM, it sounds like the browser you're talking about is looking at the native Linux filesystem. You can find the file in that filesystem at /opt/examples/log_files/access.log.2 (or something like that). The Hive warehouse directory is in HDFS, which is a separate filesystem.
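You can see the two filesystems side by side with a command fragment like this on the VM (the local path is approximate, as above):

```shell
ls -l /opt/examples/log_files/        # native Linux filesystem
hadoop fs -ls /user/hive/warehouse    # HDFS, where the Hive warehouse lives
```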
04-13-2016
07:21 AM
1 Kudo
The 2 tables that are created are called 'intermediate_access_logs' and 'tokenized_access_logs' when shown in Hive or Impala. The intermediate_access_logs table is backed by the raw 'original_access_logs' file which is copied into HDFS. If you want to view it as a table, it should still be queryable in Hive at the end of the tutorial. The underlying data should still be in /user/hive/warehouse/original_access_logs in HDFS or /opt/examples/log_files/ on your local filesystem.
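To confirm the table and its backing data are still there at the end of the tutorial, a command fragment for the VM might look like:

```shell
hive -e 'SELECT * FROM intermediate_access_logs LIMIT 5;'
hadoop fs -ls /user/hive/warehouse/original_access_logs
```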
04-11-2016
07:51 AM
1 Kudo
Looks like the YARN Resource Manager process is not running. I would restart it with: 'sudo service hadoop-yarn-resourcemanager restart'. If you continue to have issues, other services may have failed to come up as a result of this or as a result of the same root cause. The easiest way to restart everything in order on the VM is to simply reboot. If you have sufficient memory for the VM, running one of the Cloudera Manager options on the desktop makes it a lot easier to see the health of all the services, etc. You might also want to look at the log files in /var/log/hadoop-yarn to see what kinds of exceptions are being thrown as the service dies.
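Putting those steps together as a command fragment for the VM (log file names under /var/log/hadoop-yarn vary, hence the glob):

```shell
sudo service hadoop-yarn-resourcemanager restart
sudo service hadoop-yarn-resourcemanager status

# If it dies again, look for exceptions in the most recent logs
tail -n 100 /var/log/hadoop-yarn/*.log
```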
04-11-2016
07:09 AM
I apologize for the confusion - the service got a bit backed up over the weekend because too many people improperly abandoned clusters mid-deployment. I've cleared out everything that looks abandoned, so it should work better now. Note that access codes can't be reused, however, so if you deleted your previous stack, you'll need to register for a new access code to try again.