Member since
10-07-2016
50
Posts
31
Kudos Received
6
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2145 | 08-30-2017 07:03 AM
 | 1413 | 06-02-2017 01:03 PM
 | 4216 | 02-06-2017 03:43 PM
 | 44131 | 10-24-2016 01:08 PM
 | 3465 | 10-13-2016 03:48 PM
07-24-2019
02:48 PM
Hi There, Thank you for reaching out to the community. Initially, the only requirements we have for extracting metadata from S3 are the ones outlined in the documentation: https://www.cloudera.com/documentation/enterprise/latest/topics/navigator_s3.html Please review that and make sure everything is in order regarding those requirements. If there are still issues, we can assist further if you provide the CM version you're using. Cheers
09-18-2018
02:08 PM
Thanks for reaching out to the Cloudera community. I noticed your query includes an ellipsis ("FROM census INSERT OVERWRITE LOCAL DI...2012(Stage"). Is this intended, or the result of truncation? If it's truncated, we'll need the full query that you're sending to the API. Also, can we get an example of the output you're getting?
09-18-2018
02:06 PM
Howdy, Thanks for reaching out to the Cloudera community. Can you clarify where you're seeing this info in Navigator? If you could share a screenshot of what you're seeing, that would also be helpful.
09-08-2017
08:59 AM
I can't speak for other vendors, but I don't think we offer certifications for single components. Working with Hadoop effectively generally requires familiarity with multiple components, so it's worth learning a few pieces of the stack. The closest certifications I can think of are the Data Engineer and Data Analyst certifications. There's a link below to our full list of certifications for reference. I hope this helps. https://www.cloudera.com/more/training/certification.html
09-07-2017
10:44 AM
I'm glad to be of service. Let me know if there's anything else namenode related that you're curious about. 🙂
09-06-2017
07:06 AM
Howdy, Thanks for reaching out. Initially, it would seem there are some permission issues. I would start by checking the permissions for the directory the error mentions: ls -lh /home/cloudera/Python-3.6.2 (the -h is just a personal preference of mine). Let me know what you get, and I can help from there if you'd like. Cheers, Josh
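P.S. If permissions do turn out to be the problem, here's a minimal sketch of how I'd check and fix them, assuming the "cloudera" user is the one doing the build (adjust the user and path to your setup):

    # Check current ownership and permissions on the directory from the error
    ls -lh /home/cloudera/Python-3.6.2

    # If the directory isn't owned by the user running the build, reclaim it
    sudo chown -R cloudera:cloudera /home/cloudera/Python-3.6.2
    chmod -R u+rwX /home/cloudera/Python-3.6.2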
09-05-2017
03:40 PM
Howdy, Thanks for coming to the community with your question. I'm Josh. We have a pretty extensive engineering blog post outlining NameNode recovery tools, one of which is NameNode recovery mode: http://blog.cloudera.com/blog/2012/05/namenode-recovery-tools-for-the-hadoop-distributed-file-system/

A few snippets stand out. One is the command used to start a NameNode in recovery mode: "./bin/hadoop namenode -recover" The other is the text you're greeted with when running that command: "You have selected Metadata Recovery mode. This mode is intended to recover lost metadata on a corrupt filesystem. Metadata recovery mode often permanently deletes data from your HDFS filesystem. Please back up your edit log and fsimage before trying this!"

In short, NameNode recovery mode checks the edits log for errors and asks you what you'd like to do about them.

There is one thing I'm curious about: are you asking out of curiosity, or do you have an HDFS problem you're trying to solve? Please let me know if you have any other questions or if you'd like further assistance. Cheers, Josh
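P.S. To make the backup warning concrete, here's a rough sketch of the order I'd do things in. The /dfs/nn path is only an example; use whatever dfs.namenode.name.dir points to on your NameNode.

    # Back up the fsimage and edit log before attempting recovery
    cp -r /dfs/nn /dfs/nn.backup-$(date +%F)

    # With the NameNode stopped, start it in recovery mode and answer the prompts
    ./bin/hadoop namenode -recover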
08-30-2017
07:03 AM
Hi There, Thanks for reaching out on the community. I'm Josh, and I'll help address this for you.

log4j.properties: CM is the central point of configuration for services, so the short answer is that you should adjust log4j settings using safety valves. Below is an engineering blog post with a good description of how CM works: http://blog.cloudera.com/blog/2013/07/how-does-cloudera-manager-work/ When a host's CM agent heartbeats to Cloudera Manager, Cloudera Manager sends back the processes that should be running and the related config files for that service and role, one of which is log4j.properties. From there, the CM agent creates a runtime directory for these config files and references those. For instance, the agent will make a directory like the one below for a NameNode role: /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/ This is why editing config files directly on the OS has no effect and is not recommended.

Enabling audit logging: To enable audit logging for a service without Navigator, you would set the appropriate log4j settings in the appropriate safety valve for that service. Let's use HDFS as an example. Cloudera Manager has a configuration property for HDFS labeled "NameNode Logging Advanced Configuration Snippet (Safety Valve)". This is the one to put your log4j settings in. Once you've added your settings, CM will insert them into the log4j.properties it sends over to the agent in heartbeats. The specifics for enabling vanilla Hadoop HDFS audit logging can be found below: http://apprize.info/security/hadoop/7.html

Considering all of this, bear in mind that Navigator takes care of this for you and adds additional features. For instance, HDFS audit logs can be very bulky and cumbersome on their own and include many operations that aren't very helpful from an auditing standpoint. Navigator can apply event filters to an audit log, store the relevant audits, and index them for searching. I therefore highly recommend enabling Navigator when doing so becomes feasible.

Please let me know if you have any other questions. Cheers
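P.S. For reference, here's a quick way to confirm which log4j.properties a running role is actually using, and whether your safety-valve entries made it in after a restart. The process directory number (879 here) will differ on your host.

    # List the per-process config directories the CM agent has generated for the NameNode
    ls -d /var/run/cloudera-scm-agent/process/*NAMENODE*

    # Check the generated log4j.properties for your audit-related settings
    grep -i audit /var/run/cloudera-scm-agent/process/879-hdfs-NAMENODE/log4j.properties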
06-02-2017
01:03 PM
Hello,
In CDH 5.8, queries can be exported from a user's home directory and imported into another user's in JSON format. Steps below:
1. Go to "My Documents" for the old user (the house icon).
2. Select the queries you'd like to export. If they're in a folder, you can export the whole folder, or use Ctrl+click to select multiple queries. Another option is to drag-select them.
3. Click the "download" icon on the action bar at the top of the file browser. This downloads the queries as one or more JSON files.
4. Log into the new account and go to "My Documents" (house icon).
5. Click the "upload" icon in the action bar.
6. Select a JSON file in the file browser. JSON files can only be uploaded one at a time.
7. Click import to finish the process.
05-11-2017
10:02 AM
Howdy, Thanks for reaching out on this. Currently, we're on Hadoop 2.6, and 2.8 is slated as "TBD". We don't rebase very often on minor versions because of how many changes a rebase brings in; we opt instead to backport features into CDH, which leads me to my next question: is there a particular feature in Hadoop 2.8 you're looking to use? Let me know when you can. Cheers, Josh
05-11-2017
09:49 AM
1 Kudo
Howdy, Thanks for reaching out on this. I'll start by dropping a description of the test below: https://www.cloudera.com/more/training/certification/cca-admin.html A lot of us are currently preparing to take this new exam, myself included. Much of it is explained on the website, but here's a basic rundown.

The test: The format has changed from multiple choice to hands-on. In the new version, you're given a preconfigured cluster and tasks to complete on it. Basically, you're proving that you know your way around a cluster and that you know how to troubleshoot.

Preparation: The nice thing about this new format is that preparing for it is more straightforward. Essentially, everything you need to be prepared for the test is in the administrator on-demand course. Also, this link. It seems like you may have already taken this course; if so, you're golden. Another way to prepare is simply to spin up a cluster and mess around with it. Play with it, break it, fix it. Just make sure you know your way around.

I hope the info I've provided was helpful. Let me know if you have any other questions. I would also love to hear from someone about their experience taking the test, and how they like it compared to the old one. Cheers, Josh
05-11-2017
09:29 AM
1 Kudo
Howdy, Thanks for reaching out on this. Unfortunately, we only provide quickstart VMs for the latest version of CDH. What is your particular goal for using a 5.8 quickstart? Are you doing a POC, making test changes, etc?
03-29-2017
10:19 AM
2 Kudos
Howdy, Thanks for reaching out. In my experience, the QuickStart VM tends to be pretty slow, it being a single-node Hadoop cluster and all. If you want to increase responsiveness, try allocating more CPU cores to your VM. How many CPU cores do you currently have allocated?
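If you happen to be running it in VirtualBox, you can check and bump the allocation from the host's command line while the VM is powered off. The VM name below is just a placeholder; use whatever yours is called.

    # Show the current CPU and memory allocation for the VM
    VBoxManage showvminfo "cloudera-quickstart-vm" | grep -Ei 'cpus|memory'

    # Give it more cores and RAM, then start it again
    VBoxManage modifyvm "cloudera-quickstart-vm" --cpus 4 --memory 8192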
02-06-2017
03:43 PM
1 Kudo
Howdy, Josh here, from Cloudera. Thanks for reaching out on this. As far as verifying whether or not your outlined configuration would work, the short answer is: perhaps. You might have already seen it, but I'll point to this blog post as a reference. It's a good read and includes a matrix for deciding the specs for your cluster's nodes. If you look, you'll see the configuration you're proposing is in the neighborhood of a "Light Processing Configuration", but for every other configuration listed, it starts to fall short. As long as you don't build a fully stacked cluster with every service imaginable (it seems like you don't intend to), the "Light Processing" config could suffice. You can also check out this other community post to get a better idea of how speccing your cluster could pan out in terms of how many nodes you would want. So, in short: perhaps. Let me know if this helps or if you have any other questions. Cheers, Josh
11-17-2016
11:35 AM
Interesting. It seems like your VM already had the packages installed for the services, even though Cloudera Manager didn't recognize them. First, try running the "Migrate to Parcels" script on the desktop. That should stop the services, uninstall them, and re-deploy the client configuration. I just ran the script myself, and ideally the output should look like this. Feel free to share a screenshot of your console output for this step. If that doesn't work, try reinstalling CM with packages instead of parcels. To do this, I would recommend deleting the current VM and re-extracting the zip you downloaded, just so you can start fresh. Let me know if it throws any warnings this time around, or if you have any questions along the way (screenshots are welcome, as always). Thank you for being so patient while we figure this out. Also, here's some fun reading to get a better understanding of parcels, if you're so inclined. 🙂
11-16-2016
08:22 AM
1 Kudo
I'm still researching the issue, but in the meantime, I can definitely help you with the cluster setup. The selections on that first screen look fine. Feel free to post screenshots of anything else you have questions about.
11-14-2016
09:59 AM
You shouldn't have to use the Kerberos or Parcels links. I'm currently working on some additional troubleshooting steps. Thank you for your patience.
11-08-2016
12:05 PM
Okay, from this point, it would seem that the script could be having trouble setting up and starting the services. Could you try re-running the script with the method you described (sudo /home/cloudera/cloudera-manager --express --force) and post the output from it? We'll comb through it for any possible errors. Meanwhile, I'm working on my end to replicate this.
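To make the output easier to share, you could capture it to a file while the script runs, for example:

    # Re-run the setup script and keep a copy of everything it prints
    sudo /home/cloudera/cloudera-manager --express --force 2>&1 | tee ~/cloudera-manager-express.log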
11-07-2016
03:48 PM
All of the files are there. Have you used the launch script for this VM? These guys right here. If so, did you use the shortcut for Express, or the Enterprise trial?
11-03-2016
11:10 AM
Glad that login worked. As far as the services go, they should indeed already be installed. My thought is it could be an issue with the particular VM you received. Either it was received from a different source, or it didn't download completely. Could you perhaps provide the link you originally got the VM from?
Here’s what I recommend. Try re-downloading* the VM, verify it downloaded completely, and try loading it, still using the "admin" username and password to log in. I downloaded the QuickStart VM from the link below*, and I've verified that it works (all services installed).
Let me know if this works.
*some links for your convenience
Installing quickstart VM
Community Knowledge article with quickstart install tips (outdated CDH version, but steps are good)
Quickstart VM download
11-01-2016
03:25 PM
Howdy, Thank you for reaching out on this. I looked at the documentation for the QuickStart VMs, and it states the username and password for CM in QuickStart are "cloudera", as you said. Intrigued, I decided to fire up the VM on my end, and sure enough, those credentials don't work. I instead tried the default CM login specified in the instructions for a standard Path B install, and voila, it worked! Try using the following login for CM in QuickStart: username: admin password: admin Let me know if that works, and in the meantime, I'll look into getting that documentation updated. Thank you for bringing this to my attention. 🙂 Cheers
10-24-2016
01:08 PM
17 Kudos
Howdy, Thanks for your question. It can be quite jarring to see two columns in your output when du normally only has one, but fear not, there is an explanation for everything. 🙂 I found a similar post discussing this topic. From looking at it, it's clear that it took some digging to get to the bottom of this, but if you look towards the bottom, you'll see a link to the source code explaining the output format. Eureka!

Anywho, the source states that the first two columns are formatted like this: [size] [disk space consumed]. But what does this mean? The "size" field is the base size of the file or directory before replication. As we know, HDFS replicates files, so the second field (disk space consumed) is included to show how much total disk space that file or directory takes up after it's been replicated. Under the default replication factor of three, the first two columns for a 1 M file would theoretically look like this: 1 M 3 M.

The fun part is that we can actually use this info to infer the replication factor HDFS is using for these particular files, or at least the amount of replication the file is currently at. If you look at the first line of your output, you'll see the size as 816 and the disk space consumed as 1.6 K. Divide 1.6 K by 816 bytes and you get roughly 2, which would indicate a replication factor of two, and you'll notice this math is consistent with the other entries in the output. Good times.

Armed with this knowledge, you can now use the du tool to its full potential, both for informative and troubleshooting purposes. Let me know if this info was helpful or if you have any other questions. 🙂 Cheers
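P.S. As a quick illustration of the arithmetic (the path and filename here are made up for the example):

    # First column = file size, second column = total space consumed across replicas
    hdfs dfs -du -h /user/examples
    # 816  1.6 K  /user/examples/part-00000   -> 1.6 K / 816 bytes is roughly 2, i.e. replication factor 2

    # You can also read the replication factor directly; %r prints it
    hdfs dfs -stat %r /user/examples/part-00000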
10-14-2016
03:54 PM
Awesome. Glad I was able to help. 🙂
10-13-2016
03:48 PM
Howdy,
Thanks for your question. This similar post may be useful for your situation. It indicates that it's important to spin up the VM via the "import appliance" option rather than making a blank VM and pointing it to the image.
If you haven't already, try starting the VM with the "import appliance" option. Here's a nifty guide for doing it.
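If you prefer the command line, VirtualBox can do the same import directly. The filename below is illustrative; point it at the .ovf (or .ova) you extracted from the QuickStart download.

    # Import the appliance instead of creating a blank VM and attaching the disk image
    VBoxManage import cloudera-quickstart-vm.ovf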
If that doesn't work, our community knowledge area also has a useful article outlining some common issues to consider. It also wouldn't hurt to try updating your installation of VirtualBox, as it seems to be a couple of versions behind.
Let me know how it goes and if you need further assistance.
Cheers
10-12-2016
11:17 AM
5 Kudos
Howdy, I'm just going to jump in and give you as much info as possible, so strap in. There's going to be a lot of (hopefully helpful) info. Before I get started, and I state this toward the end too, it's important to know that all of this is general "big picture" stuff, and there are a ton of factors that go into speccing your cluster (use cases, future scaling considerations, etc.). I cannot stress this enough. That being said, let's dig in. I'm going to answer your questions in order.

1. In short, yes. We generally recommend bare metal ("node" = physical box) for production clusters. You can get away with using VMs on a hypervisor for development clusters or POCs, but that's not recommended for production clusters. If you don't have the resources for a bare metal cluster, it's generally a better idea to deploy in the cloud. For cloud-based clusters, I recommend Cloudera Director, which lets you deploy cloud-based clusters that are configured with Hadoop performance in mind.

2. It's not simply a question of how many nodes, but what the specs of each node are. We have some good documentation here that explains best practices for speccing your cluster hardware. The number of nodes depends on what your workload will be like: how much data you'll be ingesting, how often you'll be ingesting it, and how much you plan on processing that data. That being said, Cloudera Manager makes it super easy to scale out as your workload grows. I would say the bare minimum is 5 nodes (2 masters, 3 workers). You can always scale out from there by adding additional worker and master nodes.

3 and 4. These can be answered with this nifty diagram (memory recommendations are RAM). It comes from our article on how to deploy clusters like a boss, which covers quite a bit; additional info on the graphic can be found toward the bottom of the article. If you look at the diagram, you'll notice a few things:

- The concept of master nodes, worker nodes, and edge nodes. Master nodes host master services like the NameNode, Resource Manager, ZooKeeper, JournalNodes, etc. If the service keeps track of tasks, marks changes, or has the term "manager" in it, you usually want it on a master node. You can put a good amount on single nodes because they don't do too much heavy lifting.

- The placement of DB-dependent services. Note that Cloudera Manager, Hue, and all servers that reference a metastore are installed on the master node with an RDBMS installed. You don't have to set it up this way, but it does make logical sense and is a little tidier. You will have to consider adding a dedicated RDBMS server eventually, because having it installed on a master node alongside other servers can easily become a bottleneck once you've scaled enough.

- The worker node(s). This diagram only has one worker node, but it's important to know that you should have at least three worker nodes for your cluster to function properly, as the default replication factor for HDFS is three. From there, you can add as many worker nodes as your workload dictates. At its base, you don't need many services on a worker node, but what you do need is a lot more memory, because these nodes are where data is stored in HDFS and where the heavy processing is done.

- The edge node. It's specced similarly to, or even lower than, the master nodes, and is really only home to gateways and other services that communicate with the outside world. You could add these services to another master node, but it's nice to have a dedicated one, especially if you plan on having folks access the cluster externally. The article also has some good info on where to go with these services as you scale your cluster out further.

One more note: if this is a proof-of-concept cluster, I recommend saving Sentry for when you put the cluster into production. When you do add it, note that it's a service that uses an RDBMS.

Some parting thoughts: when you're planning a cluster, it's important to stop and evaluate exactly what your goal is for said cluster. My recommendation is to start only with the services you need to get the job done. You can always add and activate services later through Cloudera Manager. If you need any info on a particular service and whether or not you really need it, check out the links below to our official documentation: Hive, Pig (Apache documentation), ZooKeeper, HDFS, Hue, Oozie, Sqoop and Sqoop2, YARN, Sentry. And for that matter, you can search through our documentation here.

While this info helps with general ideas and "big picture" topics, you need to consider a lot more about your planned usage and vision to come up with an optimal setup. Use cases are vitally important to consider when speccing a cluster, especially for production. That being said, you're more than welcome to get in touch with one of our solutions architects to figure out the best configuration for your cluster. Here's a link to some more info on that.

This is a lot of info, so feel free to take your time digesting it all. Let me know if you have any questions. 🙂 Cheers