Member since: 04-24-2017
Posts: 82
Kudos Received: 11
Solutions: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1152 | 01-20-2020 03:17 AM |
02-27-2022
11:00 PM
1 Kudo
This solution works for me on HDP 3.1.4 / Ambari 2.7. Thanks for sharing.
01-20-2020
03:17 AM
@ChineduLB What is your exact query? You can write count queries in SQL against the Hive table; a small sketch follows below. In general, you can refer to these articles:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/performance-tuning/content/hive_prepare_to_tune_performance.html
https://www.qubole.com/blog/5-tips-for-efficient-hive-queries/
Thanks, Tamil Selvan K
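A minimal sketch of running a count query through Beeline; the database and table names (sales_db.orders) are hypothetical, and the JDBC URL should point at your own HiveServer2 host:

# hypothetical example: count the rows of sales_db.orders via Beeline
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" \
        -e "SELECT COUNT(*) FROM sales_db.orders;"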
01-08-2020
07:52 AM
1 Kudo
1. Use Ranger auditing for Hive to check the details of queries run by a user; Hive does not store this detail in the metastore. https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/audit-ref/content/managing_auditing_in_ranger_access.html
2. You can use the ResourceManager REST API below to get all apps in the FINISHED or KILLED states for a specific user and time period (a curl sketch follows this list): GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin={time in epoch}&startedTimeEnd={time in epoch}"
3. Simply use the Tez View if your execution engine is Tez.
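A sketch of the ResourceManager call from step 2 using curl; the host, user id, and epoch-millisecond timestamps are placeholders to fill in for your cluster:

# hypothetical example: list FINISHED/KILLED apps for one user in a time window
curl "http://<resource-manager-host>:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin=<epoch-ms>&startedTimeEnd=<epoch-ms>"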
12-26-2019
12:30 PM
@Prakashcit To ensure that data from multiple sources is ingested so that business insights can be discovered at a later stage, we usually dump everything. We then compare the source data with the ingested data to validate that all of it has been pushed, and verify that the correct data files are generated and loaded into HDFS in the desired location. A smart data-lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.

/landing_Zone/Raw_data/ [corresponds to stage 1]
/landing_Zone/Raw_data/refined [corresponds to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponds to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponds to stage 4]

The data lake can also be used to feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics.

Data quality is often seen as the unglamorous part of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality might very well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analysis generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data-engineering task as more data sources are incorporated into the pipeline.

Typically Hive is plugged in at stage 3, and tables are created after the data validation of stage 2. This ensures that data scientists have cleansed data to run their models on and that analysts can work with BI tools. At least, these have been my tasks across many projects. HTH
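A minimal sketch of laying out the zones above in HDFS; the paths are taken from the stages listed and should be adapted to your own naming standards and permissions:

# hypothetical example: create the four zone directories (quotes needed because of the space in "Trusted Data")
hdfs dfs -mkdir -p /landing_Zone/Raw_data                                  # stage 1: raw
hdfs dfs -mkdir -p /landing_Zone/Raw_data/refined                          # stage 2: refined
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data"           # stage 3: trusted
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data/sandbox"   # stage 4: sandbox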
05-28-2018
02:17 PM
@Tamil Selvan K If the above answer helped address your question, please take a moment to log in and click the "Accept" link on the answer.
02-07-2018
06:45 PM
@SMACH H You can follow the below:
1. Lock down the location in HDFS: set permission 700 on /apps/hive/warehouse (a small sketch follows this list).
2. Add a policy to Ranger/Hive for database: *, allowing users to create databases. (Note that the ambari-qa user also needs access to database: * to complete the service check.)
3. Allow access to individual databases via Ranger/Hive policies.
This blog post may be of interest: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/
Also, you may explore the options with "hive.server2.enable.doAs".
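A minimal sketch of step 1, run as the HDFS superuser; whether you also apply it recursively depends on your existing layout:

# hypothetical example: restrict the Hive warehouse directory to its owner only
hdfs dfs -chmod 700 /apps/hive/warehouse
hdfs dfs -ls /apps/hive          # verify the new permissions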
12-28-2018
03:49 AM
I believe the Beeline JDBC client can also be used to connect to Spark SQL through the Spark Thrift Server.
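A sketch of such a connection; the host is a placeholder, and the port 10016 is only a common HDP default for the Spark2 Thrift Server, so check hive.server2.thrift.port in your Spark thrift-server configuration:

# hypothetical example: connect Beeline to the Spark Thrift Server
beeline -u "jdbc:hive2://<spark-thrift-server-host>:10016/default" -n <user>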
10-01-2017
08:33 PM
1 Kudo
@Tamil Selvan K If you want to access the db and fetch the details, this can be done by executing the following commands (the Postgres db is the default):
docker exec -it cbreak_commondb_1 su postgres -c 'psql'
postgres=# \l
postgres=# \c cbdb
You are now connected to database "cbdb" as user "postgres".
cbdb=# select id, name, stack_id, status from cluster;   -- queries the cluster id, name, and status of the clusters
06-01-2017
03:52 PM
@Jay SenSharma Thanks for that. And is there a way to go the other way round as well? For a particular rpmlib, can we find the list of HDP packages as well?
05-31-2017
01:45 PM
@Satish Sarapuri Thanks, but when I tried to check its behavior (expecting it to return only the duplicate records), it returned every record in that table. Hence, I wanted to know a simple implementation of it.
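For reference, the behaviour I was expecting is roughly what this sketch produces; the table and column names (my_table, col1, col2) are hypothetical:

# hypothetical example: return only the duplicated combinations of col1 and col2 via Beeline
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" \
        -e "SELECT col1, col2, COUNT(*) AS cnt FROM my_table GROUP BY col1, col2 HAVING COUNT(*) > 1;"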