Member since: 10-07-2016
Posts: 50
Kudos Received: 32
Solutions: 6

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3322 | 08-30-2017 07:03 AM |
| | 1914 | 06-02-2017 01:03 PM |
| | 4999 | 02-06-2017 03:43 PM |
| | 52180 | 10-24-2016 01:08 PM |
| | 4498 | 10-13-2016 03:48 PM |
03-19-2023
01:05 AM
I just want to say thanks a lot. It's 2023 and it still works.
12-06-2019
02:28 PM
1 Kudo
Hi @uv,
You may want to check this public doc about Navigator Audit Filters:
https://docs.cloudera.com/documentation/enterprise/latest/topics/cn_admcfg_audit_filters.html
Thanks,
Li
07-30-2018
06:25 AM
@Shashank and @shamik,
As each person's setup is typically different, I was suggesting you start a new thread or conversation. In this new thread you can explain your unique situation in terms of overall system specifications (RAM, number of processors, etc.) as well as how much of those resources you have allocated to the Quickstart VM.
As @jmartin1938 has pointed out, the Quickstart VM is inherently slow because it is a single-node cluster. The issue can be made worse if the VM is not allocated appropriate resources or is used for large-scale projects.
To start a new thread go to the appropriate board and use the "new message" button.
This will open a page for you to post the wording and title for the new thread.
02-23-2018
01:34 AM
Hello @jmartin1938, I was auditing an online course for self-learning as a beginner. Is it possible to get a link to the Quickstart VM for the Cloudera distribution for Hadoop 5.8? I can share my e-mail if the link needs to be shared one-on-one. Regards, Surojit
06-02-2017
01:03 PM
Hello,
In CDH 5.8, queries can be exported from one user's home directory and imported into another user's, in JSON format. Steps below (a command-line sketch follows the steps):
1. Go to "My Documents" for the old user (the house icon).
2. Select the queries you would like to export. If they're in a folder, you can export the whole folder, use Ctrl+click to select multiple queries, or drag-select them.
3. Click the "download" icon on the action bar at the top of the file browser. This downloads the queries as one or more JSON files.
4. Log into the new account and go to "My Documents" (the house icon).
5. Click the "upload" icon in the action bar.
6. Select a JSON file in the file browser. JSON files can only be uploaded one at a time.
7. Click "Import" to finish the process.
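For admins who prefer the command line, here is a hedged sketch of an alternative route. It assumes shell access to the Hue server; since Hue is a Django application, Django's standard dumpdata/loaddata management commands are available. The model label beeswax.SavedQuery and the path to the hue binary are assumptions that may vary by version and install layout; the UI flow above remains the supported path.

```
# Hedged sketch — run from the Hue install directory (on CDH parcel installs
# the binary usually lives under the Hue lib directory).
# beeswax.SavedQuery is an ASSUMED model label; check your Hue version.

# Export saved queries to JSON:
./build/env/bin/hue dumpdata beeswax.SavedQuery --indent 2 > queries.json

# Re-import them (note: loaddata preserves the original owner IDs, so moving
# queries to a different user still means editing the owner field in the JSON first):
./build/env/bin/hue loaddata queries.json
```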
02-09-2017
03:45 AM
Thanks, Josh! Yep, I did read that excellent blog post. I was mostly interested to hear if anyone else has been in a similar situation, and how they resolved it. We'll proceed as best we can and I'll report back here in the future to let everyone know how it goes.
10-24-2016
01:08 PM
17 Kudos
Howdy, Thanks for your question. It can be quite jarring to see two columns in your output when du normally only has one column, but fear not, there is an explanation for everything. 🙂

I found a similar post discussing this topic. From looking at it, it's clear that it took some digging to get to the bottom of this, but if you look towards the bottom, you'll see a link to the source code explaining the output format. Eureka! Anywho, the output states that the first two columns are formatted like this: [size] [disk space consumed]

But what does this mean? The "size" field is the base size of the file or directory before replication. As we know, HDFS replicates files, so the second field (disk space consumed) is included to show you how much total disk space that file or directory takes up after it's been replicated. Under the default replication factor of three, the first two columns for a 1 MB file would theoretically look like this: 1 M    3 M

The fun part is that we can actually use this info to infer the replication factor HDFS is using for these particular files, or at least the amount of replication the file is currently at. If you look at the first line of your output, you'll see the initial size as 816 and the disk space usage as 1.6 K. Divide 1.6 K by 816 bytes and you get 2 (roughly), which would indicate a replication factor of two, and you'll notice this math is consistent with the other entries in the output. Good times.

Armed with this knowledge, you can now use the du tool to its full potential, both for informative and troubleshooting purposes; a short worked example follows this post. Let me know if this info was helpful or if you have any other questions. 🙂 Cheers
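To make the arithmetic concrete, here is a hedged sketch of what such output can look like. The paths and the second file's sizes are hypothetical, not taken from the original question; the first line mirrors the 816 / 1.6 K figures discussed above.

```
$ hdfs dfs -du -h /user/example
816    1.6 K  /user/example/app.log
1 M    3 M    /user/example/data.csv

# 1.6 K / 816 B ≈ 2  -> this file is replicated roughly twice
# 3 M / 1 M     = 3  -> this file sits at the default replication factor of three
```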
10-14-2016
03:54 PM
Awesome. Glad I was able to help. 🙂
10-12-2016
11:17 AM
5 Kudos
Howdy, I'm just going to jump in and give you as much info as possible, so strap in. There's going to be a lot of (hopefully helpful) info. Before I get started, and I state this toward the end too, it's important to know that all of this info is general "big picture" stuff, and there are a ton of factors that go into speccing your cluster (use cases, future scaling considerations, etc.). I cannot stress this enough. That being said, let's dig in. I'm going to answer your questions in order.

1. In short, yes: we generally recommend bare metal ("node" = physical box) for production clusters. You can get away with using VMs on a hypervisor for development clusters or POCs, but that's not recommended for production. If you don't have the resources for a bare metal cluster, it's generally a better idea to deploy in the cloud. For cloud-based clusters, I recommend Cloudera Director, which lets you deploy cloud-based clusters that are configured with Hadoop performance in mind.

2. It's not simply a question of how many nodes, but what the specs of each node are. We have some good documentation here that explains best practices for speccing your cluster hardware. The number of nodes depends on what your workload will be like: how much data you'll be ingesting, how often you'll be ingesting it, and how much you plan on processing said data. That being said, Cloudera Manager makes it super easy to scale out as your workload grows. I would say the bare minimum is 5 nodes (2 masters, 3 workers); you can always scale out from there by adding additional worker and master nodes.

3 and 4. These can be answered with this nifty diagram (memory recommendations are RAM). It comes from our article on how to deploy clusters like a boss, which covers quite a bit; additional info on the graphic can be found toward the bottom of the article. If you look at the diagram, you'll notice a few things:

- The concept of master nodes, worker nodes, and edge nodes. Master nodes run master services like the NameNode, ResourceManager, ZooKeeper, JournalNodes, etc. If the service keeps track of tasks, marks changes, or has the term "manager" in it, you usually want it on a master node. You can put a good amount on single nodes because they don't do too much heavy lifting.

- The placement of DB-dependent services. Note that Cloudera Manager, Hue, and all servers that reference a metastore are installed on the master node with an RDBMS installed. You don't have to set it up this way, but it does make logical sense and is a little more tidy. You will have to consider adding a dedicated RDBMS server eventually, because having it installed on a master node alongside other servers can easily cause a bottleneck once you've scaled enough.

- The worker node(s). This diagram only has one worker node, but it's important to know that you should have at least three worker nodes for your cluster to function properly, as the default replication factor for HDFS is three. From there, you can add as many worker nodes as your workload dictates. At its base, you don't need many services on a worker node, but what you do need is a lot more memory, because these nodes are where data is stored in HDFS and where the heavy processing will be done.

- The edge node. It's specced similarly to master nodes, or even lower, and is really only home to gateways and other services that communicate with the outside world. You could add these services to another master node, but it's nice to have one dedicated, especially if you plan on having folks access the cluster externally. The article also has some good info on where to go with these services as you scale your cluster out further.

One more note: if this is a proof-of-concept cluster, I recommend saving Sentry for when you put the cluster into production. When you do add it, note that it's a service that uses an RDBMS.

Some parting thoughts. When you're planning a cluster, it's important to stop and evaluate exactly what your goal is for said cluster. My recommendation is to start only with the services you need to get the job done; you can always add and activate services later through Cloudera Manager. If you need info on each particular service and whether or not you really need it, check out these links to our official documentation:

- Hive
- Pig (Apache documentation)
- ZooKeeper
- HDFS
- Hue
- Oozie
- Sqoop and Sqoop2
- YARN
- Sentry

And for that matter, you can search through our documentation here. While this info helps with general ideas and "big picture" topics, you need to consider a lot more about your planned usage and vision to come up with an optimal setup. Use cases are vitally important to consider when speccing a cluster, especially for production. That being said, you're more than welcome to get in touch with one of our solutions architects to figure out the best configuration for your cluster. Here's a link to some more info on that.

This is a lot of info, so feel free to take your time digesting it all (a couple of quick sanity-check commands follow this post). Let me know if you have any questions. 🙂 Cheers
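As a quick follow-up to the worker-node and replication points above, here are two hedged sanity checks you can run once a cluster is up, using standard Hadoop CLI commands. The grep pattern assumes the usual dfsadmin -report layout, where each DataNode appears as a "Name: host:port" line.

```
# Confirm the configured default replication factor (3 out of the box):
hdfs getconf -confKey dfs.replication

# Rough count of DataNodes reporting in, i.e. your worker nodes:
hdfs dfsadmin -report | grep -c '^Name:'
```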