Member since: 01-19-2017
Posts: 3679
Kudos Received: 632
Solutions: 372
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 753 | 06-04-2025 11:36 PM |
| | 1332 | 03-23-2025 05:23 AM |
| | 660 | 03-17-2025 10:18 AM |
| | 2391 | 03-05-2025 01:34 PM |
| | 1554 | 03-03-2025 01:09 PM |
12-26-2019
12:30 PM
@Prakashcit To ensure that data from multiple sources can be ingested and later mined for business insights, we usually land everything first. Comparing the source data with the ingested data simply validates that all the data has been pushed, and verifying the generated files confirms they were loaded into HDFS in the desired location. A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.

/landing_Zone/Raw_data/ [corresponding to stage 1]
/landing_Zone/Raw_data/refined [corresponding to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponding to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponding to stage 4]

The data lake can also feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics.

Data quality is often seen as the unglamorous part of working with data. Ironically, it usually takes up the majority of a data engineer's time. It might very well be the single most important component of a data pipeline, because without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data engineering task as more data sources are incorporated into the pipeline.

Typically Hive is plugged in at stage 3, and tables are created after the data validation of stage 2. This ensures that data scientists have cleansed data to run their models, as do analysts using BI tools. At least these have been my tasks across many projects; a rough sketch of the layout and a basic completeness check is below.

HTH
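As a rough illustration (not from the original answer), here is a shell sketch that creates the landing-zone layout above and does a very basic completeness check after a load; the source_manifest.txt file and the count comparison are assumptions for the example:

```
# Minimal sketch: create the landing-zone layout described above
hdfs dfs -mkdir -p /landing_Zone/Raw_data                                 # stage 1: raw
hdfs dfs -mkdir -p /landing_Zone/Raw_data/refined                         # stage 2: refined
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data"          # stage 3: trusted
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data/sandbox"  # stage 4: sandbox

# Compare what the source system says it shipped (one file per manifest line,
# an assumed convention) with what actually landed in HDFS
expected=$(wc -l < source_manifest.txt)
landed=$(hdfs dfs -count /landing_Zone/Raw_data | awk '{print $2}')
echo "expected files: ${expected}, files in HDFS: ${landed}"
```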
12-26-2019
02:10 AM
@hud When you run NiFi as a microservice, you can configure a PVC (Persistent Volume Claim) using Helm on AKS or any Kubernetes cluster, which ensures that even if the NiFi pod restarts it will always have the same volume mounted. Under the persistence configuration, the parameter persistence.enabled should be set to true; see the Helm Chart for Apache NiFi and the sketch below. HTH
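A minimal sketch, assuming the community cetic/nifi chart; the repository URL, release name, namespace and storage size here are illustrative assumptions, not taken from the original post:

```
# Add the (assumed) chart repository and install NiFi with a persistent volume claim
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update

# persistence.enabled=true makes the pod reattach the same PVC after a restart;
# the release name "nifi", the namespace and the size are placeholders
helm install nifi cetic/nifi \
  --namespace nifi --create-namespace \
  --set persistence.enabled=true \
  --set persistence.size=8Gi
```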
12-25-2019
09:44 AM
@kiranpune DistCp (distributed copy) is a tool for large inter-/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery and reporting. It expands a list of files and directories into the input to map tasks, each of which copies a partition of the files specified in the source list. That is the basic description, but you can use different command-line options when running DistCp; see the official DistCp documentation. Below are a few options for your different use cases:

OPTIONS
-append: incremental copy of a file with the same name but a different length
-update: overwrite if source and destination differ in size, block size, or checksum
-overwrite: overwrite the destination
-delete: delete files that exist in the destination but not in the source

I think you can schedule or script a daily copy; see the example after this list.
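As a hedged example (the NameNode hosts and paths are placeholders, not from the original question), a daily job could combine these options roughly like this:

```
# Keep the target in sync with the source: copy only files whose size, block size
# or checksum changed, and remove files that were deleted at the source
hadoop distcp -update -delete \
  hdfs://source-nn:8020/data/ingest \
  hdfs://backup-nn:8020/data/ingest

# For append-only sources (e.g. growing log files), -append with -update avoids
# re-copying the unchanged prefix of each file
hadoop distcp -update -append \
  hdfs://source-nn:8020/logs \
  hdfs://backup-nn:8020/logs
```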
12-19-2019
02:12 PM
@saivenkatg55 This "Exiting with status 1: java.io.IOException: Problem starting http server" error should be linked to your other question, which I have just responded to: https://community.cloudera.com/t5/Support-Questions/Unable-to-start-the-node-manager/td-p/286013 If that issue is resolved, the java.io.IOException shouldn't occur. HTH
12-19-2019
11:22 AM
@Shelton thank you, actually I did. I thought this was my first time logging in, but I was able to recover the password. Thanks for the reply.
12-18-2019
06:26 AM
For some reason I thought that it wasn't necessary to have LDAP in addition to Kerberos. I went ahead and set up an LDAP environment and was able to sync users to Ranger. Thanks!
12-17-2019
05:48 AM
1 Kudo
@Bindal Do the following steps:

sandbox-hdp login: root
root@sandbox-hdp.hortonworks.com's password: .....
[root@sandbox-hdp ~]# mkdir -p /tmp/data
[root@sandbox-hdp ~]# cd /tmp/data

Now you should be in /tmp/data; to validate that, run:

[root@sandbox-hdp ~]# pwd

Copy your riskfactor1.csv to this directory using a tool like WinSCP or MobaXterm; see my screenshot using WinSCP. My question is: where is the riskfactor1.csv file located? If that's not clear, you can upload it using the Ambari Files view instead: first navigate to /bindal/data, then select Upload; please see the attached screenshot to upload the file from your laptop. After a successful upload, you can run your Zeppelin job. A command-line alternative is sketched below. Keep me posted.
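If you prefer the command line over WinSCP/MobaXterm, here is a rough sketch; SSH port 2222 matches the sandbox default, but the local file location and the HDFS target directory are assumptions based on your notebook paths:

```
# From your laptop: copy the CSV into the sandbox's /tmp/data directory
scp -P 2222 riskfactor1.csv root@sandbox-hdp.hortonworks.com:/tmp/data/

# On the sandbox: push the file into HDFS where the Zeppelin notebook expects it
hdfs dfs -mkdir -p /bindal/data
hdfs dfs -put /tmp/data/riskfactor1.csv /bindal/data/
hdfs dfs -ls /bindal/data
```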
12-13-2019
09:50 AM
While trying to use the Hue Web UI I had the same problem. In my case the cause was that the Impala service was not running. I was trying to run a Hive query, but Impala was set as the default application, so as soon as I connected to the Hue Web UI it tried to connect to the Impala server and threw the error message; any query you run then gets executed by Impala and fails the same way. To solve this, check that the Impala service is running properly; a quick check is sketched below.
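A hedged way to verify this from the command line (the host name is a placeholder; 25000 is the default impalad debug web UI port):

```
# Is an Impala daemon process running on this node?
ps -ef | grep [i]mpalad

# Does the Impala daemon's debug web UI answer? (expect HTTP 200 when healthy)
curl -s -o /dev/null -w "%{http_code}\n" http://impalad-host:25000
```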
12-10-2019
03:45 AM
@jsensharma All the configs are set properly in our cluster. I am trying to access external Hive tables using a HiveWarehouseSession. Could the error be caused by that, given that the documentation says HiveWarehouseSession is not needed for external tables?
12-09-2019
10:31 PM
Hi, if you are looking for Spark 2.4 on HDP alone then, as mentioned in previous emails, don't expect any new HDP version before the release of the combined new offering, Cloudera Data Platform (CDP), sometime in 2020 or thereafter. As an alternative, CDH 6.2+ supports Spark 2.4. Thanks, AK