Member since: 09-25-2015
Posts: 82
Kudos Received: 93
Solutions: 17

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 3755 | 06-06-2017 09:57 AM |
| | 1022 | 03-01-2017 10:26 AM |
| | 1052 | 11-22-2016 10:32 AM |
| | 867 | 08-09-2016 12:05 PM |
| | 1515 | 08-08-2016 03:57 PM |
11-29-2019
10:36 AM
2 Kudos
What is Cloudera Data Warehouse?

Cloudera Data Warehouse is an auto-scaling, highly concurrent and cost-effective analytics service that ingests data at scale from anywhere: structured, unstructured and edge sources. It supports hybrid and multi-cloud infrastructure models by seamlessly moving workloads between on-premises and any cloud for reports, dashboards, ad-hoc and advanced analytics, including AI, with consistent security and governance. Cloudera Data Warehouse offers zero query wait times, reduced IT costs and agile delivery. See more information here.

Key Concepts:

In the Cloudera Data Warehouse service, your data is stored in an object store in a data lake that resides in your specific cloud environment. The service is composed of:

- Database Catalogs: a metadata service associated with a CDP Data Lake, which provides the data context for your defined tables and databases within the CDP Enterprise Data Cloud.
- Virtual Warehouses: compute resources running Hive or Impala on Kubernetes, which allow you to query data stored in the cloud object store via the Database Catalog.

Please see the Cloudera documentation for further information.

How do I monitor Virtual Warehouse usage?

Cloudera Data Warehouse environments come with a pre-built Grafana dashboard that lets you monitor usage of all Virtual Warehouses within that environment. To access the Grafana dashboard, you will need to access the Kubernetes pods and extract the password.

Pre-requisites:

This article assumes you have already configured Cloudera Data Platform and the Cloudera Data Warehouse service with at least one Environment, at least one Database Catalog and at least one Virtual Warehouse. Please see Cloudera's Getting Started instructions. Install the kubectl command-line interface or your favourite Kubernetes UI or CLI.

How To:

1. On the Cloudera Data Platform home page, open the Data Warehouse service.
2. Expand the Environments menu.
3. Click the hamburger menu on your desired Environment.
4. Click Show Kubeconfig and copy the text to your clipboard.
5. Paste the kubeconfig into a file and run the following commands to access the Kubernetes cluster for that Environment. This retrieves the Grafana password, which is stored base64-encoded, decodes it and copies it to your clipboard (see the Linux note after these steps if you are not on macOS):

```
vi dwx.config
kubectl --kubeconfig ~/dwx.config get secret grafana -n istio-system -o json | jq -r .data.passphrase | base64 -D | pbcopy
```

6. Go back to your Cloudera Data Warehouse environment and click Open Grafana.
7. You should see the Grafana login screen open in a new tab. Username = admin; Password = the password that is now on your clipboard.
8. Once logged in, expand the istio menu and choose the Compute Autoscaling dashboard.

The Compute Autoscaling dashboard will show you total node usage for your environment, as well as individual node counts for each of your Virtual Warehouses.
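One side note: base64 -D and pbcopy in step 5 are macOS-specific. On a Linux client, something along these lines should work instead (xclip is an assumption and may need to be installed separately):

```
# Linux variant: GNU coreutils uses lowercase -d to decode, and xclip stands in for pbcopy
kubectl --kubeconfig ~/dwx.config get secret grafana -n istio-system -o json \
  | jq -r .data.passphrase | base64 -d | xclip -selection clipboard
```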
12-07-2018
11:35 AM
Hi @Owez Mujawar - can you attach some screenshots or something to explain what you mean? Do you see anything inside the lineage box at all in Atlas? You definitely won't see the lineage of anything that happened before you ran the import-hive.sh script, but anything that happens to those tables afterwards should be visible. It sounds to me like the Hive hook isn't running somehow. Can you verify that the ATLAS_HOOK Kafka topic is created, that its permissions are right (as per this document) and that messages are going through it (Ranger audit logs are one way to check)? It's also worth checking the properties in this article in case you spot any differences.
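For the ATLAS_HOOK check, a quick way to confirm the topic exists on an HDP cluster is something along these lines (the kafka-topics.sh path and the ZooKeeper address are assumptions based on HDP defaults - adjust for your cluster):

```
# Path and ZooKeeper address assume HDP defaults - adjust for your environment
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $(hostname -f):2181 --list | grep ATLAS_HOOK
```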
08-07-2017
03:56 PM
Ah yes - I was getting the same behaviour, but then I realised that if you are uploading a file with the -d option, you have to put @ before the file name. This command worked for me on Ranger 0.6 on HDP 2.5.0:

curl -iv -X POST -H "Content-type:application/json" -u 'admin:admin' -d @test.json http://`hostname`:6080/service/public/api/policy
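For reference, test.json needs to contain a valid policy body for this legacy public API. The sketch below is illustrative only - the field names are recalled from the v1 public policy API and the repository/resource values are pure placeholders, so please verify them against the Ranger REST API documentation for your version:

```
# Illustrative sketch only: field names follow the legacy v1 public policy API as I recall it,
# and the repository/resource values are placeholders - check the Ranger docs for your release.
cat > test.json <<'EOF'
{
  "policyName": "test-policy",
  "repositoryName": "mycluster_hadoop",
  "repositoryType": "hdfs",
  "resourceName": "/demo/data",
  "description": "Test policy created via the public REST API",
  "isEnabled": true,
  "isRecursive": true,
  "isAuditEnabled": true,
  "permMapList": [ { "userList": ["hdfs"], "permList": ["Read", "Execute"] } ]
}
EOF
```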
08-07-2017
10:29 AM
Hi @Pragya Raj - the URL you pasted there doesn't seem quite right based on this documentation. Which version of HDP/Ranger are you using? As you can see, that doc indicates the URL is service/public/v2/api/policy. However, this is for Ranger 0.5, which ships in HDP 2.3 and later. If you are running Ranger 0.5 as per the above, can you try that URL and let us know if it works?

EDIT: Just rereading, I realise you are probably trying to import a set of policies? Or are you just creating one? If you are just creating one new policy, use the statement above; but if you want to import a set of policies that you exported from another cluster, do this instead, according to this documentation. Note the "servicesMap" distinction and the "multipart" content type.

Import policies through curl

To import policies from a JSON file without servicesMap:

curl -i -X POST -H "Content-Type: multipart/form-data" -F 'file=@/path/file.json' -u admin:admin http://<hostname>:<ranger-port>/service/plugins/policies/importPoliciesFromFile?isOverride=true

To import policies from a JSON file with servicesMap:

curl -i -X POST -H "Content-Type: multipart/form-data" -F 'file=@/path/file.json' -F 'servicesMapJson=@/path/servicesMapping.json' -u admin:admin http://<hostname>:<ranger-port>/service/plugins/policies/importPoliciesFromFile?isOverride=true

EDIT 2: When passing a file as a parameter with the -d flag, you must write @/path/file.json (note the @ symbol).
06-06-2017
02:10 PM
You're very welcome! Yes, the setrep command can change the replication factor for existing files, and there is no need to change the value globally. For others' reference, the command is:

hdfs dfs -setrep [-R] [-w] <numReplicas> <path>
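For example, to set everything under a directory to a replication factor of 2 and wait for the re-replication to complete (the path is just an illustration):

```
# -R applies the change recursively, -w waits until replication finishes (can take a while)
hdfs dfs -setrep -R -w 2 /data/archive
```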
06-06-2017
09:57 AM
Hi @Xiong Duan - for most cases, in order to refresh the configurations in the running service, you have to restart the service and there isn't really a way around it. However, if you have High Availability enabled (e.g. NameNode HA), there is a way you can refresh configs without downtime, but you must be very careful. If using Ambari:

1. Set up a Config Group with the relevant properties enabled, adding and removing hosts as described in the steps below.
2. Configure the Config Group for the Standby NameNode.
3. Restart the Standby NameNode.
4. Set up the config for the Active NameNode.
5. Restart the Active NameNode. Failover will occur.
6. Set up the config for a group of two DataNodes at a time and restart the DataNodes in batches of two.

(A quick way to check which NameNode is currently active is sketched below.) This may not be safe for all configuration properties - which properties do you want to refresh?
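As a sanity check at each restart, you can confirm which NameNode is currently active with hdfs haadmin (nn1/nn2 below are placeholders - use the NameNode IDs defined in dfs.ha.namenodes.<nameservice> on your cluster):

```
# nn1 and nn2 are placeholder service IDs - substitute the values from dfs.ha.namenodes.<nameservice>
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```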
06-06-2017
09:52 AM
1 Kudo
Hi @Jayanthi R - all you need to do is use the --hive-overwrite option in your sqoop import statement. Check the sqoop documentation for more advice: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_importing_data_into_hive You will see a list of options in Table 8 that you can use when you import data into Hive. If you've already tried this, can you let us know what problems you're having? If so, please paste the full error message.
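As a rough sketch of what that might look like (the connection string, credentials and table names below are placeholders, not anything from your environment):

```
# Every connection detail and table name here is a placeholder - substitute your own
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sourcedb \
  --username myuser -P \
  --table customers \
  --hive-import \
  --hive-overwrite \
  --hive-table default.customers
```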
03-15-2017
04:51 PM
At this time, column-level security is only possible when accessing data through Hive.
03-07-2017
04:44 PM
8 Kudos
Picture the scene... You've run the HDFS Balancer on your cluster and have your data balanced nicely across your DataNodes on HDFS. Your cluster is humming along nicely, but your system administrator runs across the office to you with alerts about full disks on one of your DataNode machines! What now?

The Low-Down

Uneven data distribution amongst disks isn't dangerous as such, though in some rare cases you may start to notice the fuller disks becoming bottlenecks for I/O. As of Apache Hadoop 2.7.3, it is not possible to balance disks within a single node (aka intra-node balancing) - the HDFS Balancer only balances across DataNodes, not within them. HDFS-1312 is tracking work to introduce this functionality into Apache HDFS, but it will not be available before Hadoop 3.0.

The conservative approach:

Set the following property in your HDFS configuration, or add it if it isn't already there: dfs.datanode.du.reserved (reserved space in bytes per volume). This will always leave that much space free on every DataNode disk. Set it to a value that will make your sysadmin happy and continue to use the HDFS Balancer as before until HDFS-1312 is complete.

The brute force method (careful!):

Run fsck and MAKE SURE there are no under-replicated blocks (IMPORTANT!!). Then just wipe the contents of the offending disk. HDFS will re-replicate those blocks elsewhere automatically! NOTE: Do not wipe more than one disk across the cluster at a time!! A quick check for under-replicated blocks is sketched below.
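For the brute-force route, a quick sanity check before touching any disk might look like this (the grep simply pulls the under-replicated line out of the fsck summary):

```
# The under-replicated count in the fsck summary should read 0 before you wipe a disk
hdfs fsck / | grep -i 'under-replicated'

# For the conservative approach, dfs.datanode.du.reserved is specified in bytes per volume;
# for example, 10737418240 reserves roughly 10 GB on each DataNode disk.
```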
03-01-2017
10:26 AM
OK, turns out it's because the "availability flag" property is now mandatory and the old ingest script didn't generate "_success" to trigger the feed. I modified ingest.sh to generate the flag:

```
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz \
  && hadoop fs -mkdir -p $1 \
  && hadoop fs -chmod 777 $1 \
  && hadoop fs -put enron_with_categories/*/*.txt $1 \
  && hadoop fs -touchz $1/_success
```