Member since: 01-19-2017
Posts: 3679
Kudos Received: 632
Solutions: 372
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 753 | 06-04-2025 11:36 PM |
| | 1332 | 03-23-2025 05:23 AM |
| | 660 | 03-17-2025 10:18 AM |
| | 2391 | 03-05-2025 01:34 PM |
| | 1554 | 03-03-2025 01:09 PM |
12-26-2019
12:30 PM
@Prakashcit To ensure that data from multiple sources can be ingested and later mined for business insights, we usually land everything first. Comparing the source data with the ingested data simply validates that all the data has been pushed, and verifying the generated files confirms they were loaded into HDFS in the desired location. A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.

/landing_Zone/Raw_data/ [corresponding to stage 1]
/landing_Zone/Raw_data/refined [corresponding to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponding to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponding to stage 4]

The data lake can also feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics.

Data quality is often seen as the unglamorous part of working with data. Ironically, it usually takes up the majority of a data engineer's time. It might very well be the single most important component of a data pipeline, because without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data engineering task as more data sources are incorporated into the pipeline.

Typically Hive is plugged in at stage 3, and tables are created after the data validation of stage 2. This ensures that data scientists have cleansed data to run their models, as do analysts using BI tools. At least these have been my tasks across many projects; a rough sketch of the layout and a basic completeness check is below.

HTH
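As a rough illustration (not from the original answer), here is a shell sketch that creates the landing-zone layout above and does a very basic completeness check after a load; the source_manifest.txt file and the count comparison are assumptions for the example:

```
# Minimal sketch: create the landing-zone layout described above
hdfs dfs -mkdir -p /landing_Zone/Raw_data                                 # stage 1: raw
hdfs dfs -mkdir -p /landing_Zone/Raw_data/refined                         # stage 2: refined
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data"          # stage 3: trusted
hdfs dfs -mkdir -p "/landing_Zone/Raw_data/refined/Trusted Data/sandbox"  # stage 4: sandbox

# Compare what the source system says it shipped (one file per manifest line,
# an assumed convention) with what actually landed in HDFS
expected=$(wc -l < source_manifest.txt)
landed=$(hdfs dfs -count /landing_Zone/Raw_data | awk '{print $2}')
echo "expected files: ${expected}, files in HDFS: ${landed}"
```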
12-26-2019
02:10 AM
@hud When you run NiFi as a microservice, you can configure a PVC (Persistent Volume Claim) using Helm on AKS or any Kubernetes cluster, which ensures that even if the NiFi pod restarts it will always have the same volume mounted. Under the persistence configuration, the parameter persistence.enabled should be set to true; see the Helm Chart for Apache NiFi and the sketch below. HTH
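A minimal sketch, assuming the community cetic/nifi chart; the repository URL, release name, namespace and storage size here are illustrative assumptions, not taken from the original post:

```
# Add the (assumed) chart repository and install NiFi with a persistent volume claim
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update

# persistence.enabled=true makes the pod reattach the same PVC after a restart;
# the release name "nifi", the namespace and the size are placeholders
helm install nifi cetic/nifi \
  --namespace nifi --create-namespace \
  --set persistence.enabled=true \
  --set persistence.size=8Gi
```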
12-25-2019
09:44 AM
@kiranpune DistCp (distributed copy) is a tool for large inter-/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery and reporting. It expands a list of files and directories into the input to map tasks, each of which copies a partition of the files specified in the source list. That is the basic description, but you can use different command-line options when running DistCp; see the official DistCp documentation. Below are a few options for your different use cases:

OPTIONS
-append: incremental copy of a file with the same name but a different length
-update: overwrite if source and destination differ in size, block size, or checksum
-overwrite: overwrite the destination
-delete: delete files that exist in the destination but not in the source

I think you can schedule or script a daily copy; see the example after this list.
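As a hedged example (the NameNode hosts and paths are placeholders, not from the original question), a daily job could combine these options roughly like this:

```
# Keep the target in sync with the source: copy only files whose size, block size
# or checksum changed, and remove files that were deleted at the source
hadoop distcp -update -delete \
  hdfs://source-nn:8020/data/ingest \
  hdfs://backup-nn:8020/data/ingest

# For append-only sources (e.g. growing log files), -append with -update avoids
# re-copying the unchanged prefix of each file
hadoop distcp -update -append \
  hdfs://source-nn:8020/logs \
  hdfs://backup-nn:8020/logs
```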
12-19-2019
02:12 PM
@saivenkatg55 This "Exiting with status 1: java.io.IOException: Problem starting http server" error should be linked to your other question, which I have just responded to: https://community.cloudera.com/t5/Support-Questions/Unable-to-start-the-node-manager/td-p/286013 If that issue is resolved, the java.io.IOException shouldn't occur. HTH
12-19-2019
11:22 AM
@Shelton thank you, actually I did. I thought this was my first time logging in, but I was able to recover the password. Thanks for the reply.
12-18-2019
06:26 AM
For some reason I thought that it wasn't necessary to have LDAP in addition to Kerberos. I went ahead and set up an LDAP environment and was able to sync users to Ranger. Thanks!
12-17-2019
05:48 AM
1 Kudo
@Bindal Do the following steps:

sandbox-hdp login: root
root@sandbox-hdp.hortonworks.com's password: .....
[root@sandbox-hdp ~]# mkdir -p /tmp/data
[root@sandbox-hdp ~]# cd /tmp/data

Now you should be in /tmp/data; to validate that, run:

[root@sandbox-hdp ~]# pwd

Copy your riskfactor1.csv to this directory using a tool like WinSCP or MobaXterm; see my screenshot using WinSCP. My question is: where is the riskfactor1.csv file located? If that's not clear, you can upload it using the Ambari Files view instead: first navigate to /bindal/data, then select Upload; please see the attached screenshot to upload the file from your laptop. After a successful upload, you can run your Zeppelin job. A command-line alternative is sketched below. Keep me posted.
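If you prefer the command line over WinSCP/MobaXterm, here is a rough sketch; SSH port 2222 matches the sandbox default, but the local file location and the HDFS target directory are assumptions based on your notebook paths:

```
# From your laptop: copy the CSV into the sandbox's /tmp/data directory
scp -P 2222 riskfactor1.csv root@sandbox-hdp.hortonworks.com:/tmp/data/

# On the sandbox: push the file into HDFS where the Zeppelin notebook expects it
hdfs dfs -mkdir -p /bindal/data
hdfs dfs -put /tmp/data/riskfactor1.csv /bindal/data/
hdfs dfs -ls /bindal/data
```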
12-13-2019
09:50 AM
While trying to use the Hue Web UI I had the same problem. In my case the cause was that the Impala service was not running. I was trying to run a Hive query, but Impala was set as the default application, so as soon as I connected to the Hue Web UI it tried to connect to the Impala server and threw the error message; any query you run then gets executed by Impala and fails the same way. To solve this, check that the Impala service is running properly; a quick check is sketched below.
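A hedged way to verify this from the command line (the host name is a placeholder; 25000 is the default impalad debug web UI port):

```
# Is an Impala daemon process running on this node?
ps -ef | grep [i]mpalad

# Does the Impala daemon's debug web UI answer? (expect HTTP 200 when healthy)
curl -s -o /dev/null -w "%{http_code}\n" http://impalad-host:25000
```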
12-10-2019
03:45 AM
@jsensharma All the configs are set properly in our cluster. I am trying to access external Hive tables using a HiveWarehouseSession. Could the error be caused by that, given that the documentation says HiveWarehouseSession is not needed for external tables?
12-09-2019
10:31 PM
Hi, if you are looking for Spark 2.4 on HDP alone then, as mentioned in previous emails, don't expect any new HDP version before the release of the combined new offering, Cloudera Data Platform (CDP), sometime in 2020 or thereafter. As an alternative, CDH 6.2+ supports Spark 2.4. Thanks, AK