Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3541 | 08-25-2017 03:09 PM |
| | 2423 | 08-22-2017 06:52 PM |
| | 4107 | 08-09-2017 01:10 PM |
| | 8870 | 08-04-2017 02:34 PM |
| | 8869 | 08-01-2017 11:35 AM |
11-18-2016
11:22 PM
3 Kudos
HDF is best thought of as working with data in motion, while HDP is Hadoop, the popular Big Data platform, which in contrast can be seen as data at rest. Both are independent platforms but are often integrated. When integrated, they are deployed as separate clusters or platforms. Both are open source, and Hortonworks provides paid support for each separately.

HDF
HDF has NiFi, Storm, and Kafka (as well as the Ambari admin console). These components are used to get data from diverse sources (social media sites, log files, IoT devices, databases, and so on) and send the data to an equally diverse range of target systems. In between, they can transform moving content, make decisions based on moving content, and run analytics on moving content. The actual movement of data is difficult to engineer; these components move data and handle the many challenges in doing so, all under the covers, with no low-level development needed. See: https://hortonworks.com/products/data-center/hdf/

HDP
HDP is more commonly known as the Hadoop or Big Data platform. It has HDFS and YARN, the MapReduce and Tez processing engines, the Hive database, the HBase NoSQL database, and many other tools to work with Big Data (data in large volumes, in a wide variety of formats, and arriving on the platform at high real-time velocity ... the 3 Vs). It stores this data cheaply and flexibly, and uses horizontal scaling of servers to parallel-process these 3 Vs of data in a short amount of time (compared to traditional databases, which face limits in working with the 3 Vs). The type of processing depends on the out-of-the-box or 3rd-party tools used and the use case / business case involved. See: https://hortonworks.com/products/data-center/hdp/

HDF + HDP
HDF and HDP are often integrated because HDF is an effective way to get diverse sources of data into HDP, to be stored and processed all in one place, to be used by data scientists for example.

If this is what you were looking for, let me know by accepting the answer; otherwise, please respond to this answer with further questions and I will follow up.
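As a concrete (and purely hypothetical) illustration of the integration pattern, a minimal NiFi flow on HDF that lands data into HDP might look like this; all three are standard NiFi processors, and the log path is made up:

    TailFile              # continuously read /var/log/app/app.log as it grows (data in motion)
      -> RouteOnContent   # route or filter records while they move
      -> PutHDFS          # write the records into HDFS on the HDP cluster (data at rest)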
11-18-2016
06:45 PM
I was hoping to be granular with encryption of sensitive vs. non-sensitive data flowing into HDFS, for performance reasons. If the performance differences are not that large, then it is no big deal.
11-03-2017
07:42 PM
Hi @Greg Keys, thanks for the post. Row filtering works for me based on column values that are not at the end, but I am not sure how to filter rows based on the last column's value. Can you please let me know? Thanks.
11-17-2016
11:52 AM
Every flow is working perfectly, thanks a lot! Can we also check for a blank space and replace it with some text value? I used a Search Value of " " but it didn't work. 😞 Could you please check this one too? Many thanks! -Garima
11-11-2016
02:17 PM
1 Kudo
For tez "tasks" represent map operations or reduce operations. A DAG is a full workflow (job) of vertices (processing of tasks) and edges (data movement between vertices). See these links for a more detailed discussion: http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
https://community.hortonworks.com/questions/32164/question-on-tez-dag-task-and-pig-on-tez.html https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works You can see number of tasks on the console output: You can also see this in Ambari Tez view (and drill down for greater details) See this for understanding Ambari Tez view: https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_ambari_views_guide/content/section_using_tez_view.html
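As a small (hypothetical) illustration: in Hive on the Tez engine, a simple aggregation typically compiles to a two-vertex DAG, a map vertex that scans and partially aggregates, and a reduce vertex that finishes the aggregation, with each vertex running some number of parallel tasks. The table name below is made up:

    -- run in Hive with Tez as the execution engine
    SET hive.execution.engine=tez;
    -- usually compiles to two vertices (e.g. "Map 1" and "Reducer 2")
    SELECT dept, COUNT(*) AS cnt
    FROM employees
    GROUP BY dept;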
12-16-2016
09:12 PM
@Gurpreet Singh @Greg Keys has provided the link you requested (ref REL_SUCCESS): http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
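For quick reference, the linked "hello world" boils down to something like this minimal ExecuteScript body (Jython flavor; the session and REL_SUCCESS variables are bound by the processor itself):

    # minimal ExecuteScript sketch: pass each incoming flow file straight through
    flowFile = session.get()          # grab the next flow file, if any
    if flowFile is not None:
        # route it to the processor's "success" relationship
        session.transfer(flowFile, REL_SUCCESS)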
11-23-2016
07:00 PM
2 Kudos
@bala krishnan It works for me when I set Replacement Strategy to "Literal Replace": my input file has a control-A (but no \001) and my output file has the control-A followed by "test". When I use the default Replacement Strategy ("Regex Replace"), my output file has \001test.
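For reference, the working configuration was roughly the following (property names as they appear on the ReplaceText processor; note the Search Value is the actual control-A character typed into the field, not the four-character text \001):

    Replacement Strategy : Literal Replace
    Search Value         : <actual control-A (0x01) character>
    Replacement Value    : <control-A>test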
11-08-2016
01:23 PM
1 Kudo
HDP 2.3+ packages Sqoop 1.4.6, which allows importing directly to HDFS as Parquet files by using: --as-parquetfile
See: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
If you import directly to a Hive table (vs. HDFS) you may need to do this as a 2-step process (https://community.hortonworks.com/questions/56847/parquet-files-sqoop-import.html).
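A minimal (hypothetical) command along those lines; the JDBC URL, table, and target directory are made up, but the flags are standard Sqoop 1.4.6:

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --table customers \
      --target-dir /data/customers_parquet \
      --as-parquetfile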
02-26-2019
04:34 PM
Hi Jim, use the log4j library; it has a configuration to use an appender that defines how the logs rotate. Log4j is pretty standard in the Java world. Here is a good tutorial: https://www.journaldev.com/10689/log4j-tutorial
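A minimal (hypothetical) log4j 1.x properties sketch for size-based rotation; the file name and limits below are just examples:

    # keep up to 5 rotated files of 10MB each
    log4j.rootLogger=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=logs/app.log
    log4j.appender.rolling.MaxFileSize=10MB
    log4j.appender.rolling.MaxBackupIndex=5
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n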
11-09-2016
05:01 AM
@swathi thukkaraju Try using the option --password-file to avoid typing / exposing the password. Below is the link for creating the password file (link).
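A sketch of the usual steps (the paths, username, and password are made up; echo -n matters because Sqoop reads the file contents verbatim, including any trailing newline):

    # write the password to a file in HDFS without a trailing newline,
    # then lock it down so only you can read it
    echo -n 'MyDbPassword' | hdfs dfs -put - /user/me/.db.password
    hdfs dfs -chmod 400 /user/me/.db.password

    # reference the file on the sqoop command line
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username me \
      --password-file /user/me/.db.password \
      --table customers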