Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3119 | 08-25-2017 03:09 PM |
 | 1974 | 08-22-2017 06:52 PM |
 | 3424 | 08-09-2017 01:10 PM |
 | 8090 | 08-04-2017 02:34 PM |
 | 8129 | 08-01-2017 11:35 AM |
12-09-2016
09:14 PM
[UPDATE DEC 9] Here is an Atlas user's guide that provides good detail (it is a work in progress): http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
12-09-2016
08:38 PM
1 Kudo
Atlas's HBase backend is made HA by configuring it as distributed HBase (vs. the default standalone): http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-governance/content/ch_hdp_data_governance_install_atlas_ambari.html
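As a rough sketch (the property names come from the Atlas docs, but the hostnames, the file path, and the idea of appending to the file directly rather than changing it through Ambari are assumptions for illustration), pointing Atlas at a distributed HBase comes down to the graph storage properties in atlas-application.properties:

```bash
# Hypothetical sketch: point Atlas's graph store at the cluster's distributed HBase
# via its ZooKeeper quorum (on an Ambari-managed cluster, set these in Atlas > Configs instead)
cat >> /etc/atlas/conf/atlas-application.properties <<'EOF'
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=zk1.example.com,zk2.example.com,zk3.example.com
EOF
```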
12-09-2016
08:34 PM
Note that Atlas uses Titan (a graph database) as its metadata store, and Titan in turn uses HBase for the metadata itself and Solr for indexing: http://atlas.apache.org/HighAvailability.html
12-09-2016
03:25 PM
1 Kudo
You cannot do this directly from the export command; you must do some separate processing. I feel the best way is to run a Pig script like this on the export result:

raw = LOAD 'data.csv' USING PigStorage(',');
nonull = FOREACH raw GENERATE
    REPLACE($0, '\\\\N', ''),
    REPLACE($1, '\\\\N', ''),
    REPLACE($2, '\\\\N', ''),
    REPLACE($3, '\\\\N', '');
STORE nonull INTO 'nonull/data.csv' USING PigStorage(',');

Keep in mind this will produce output in the usual MapReduce part-file layout in HDFS:

nonull/data.csv/_SUCCESS
nonull/data.csv/part-m-00000
nonull/data.csv/part-m-00001
...

If you want to process the result in Hadoop, just point to nonull/data.csv. If you want to pull it to an edge node from the command line, use hdfs dfs -getmerge nonull/data.csv <localpath> (HDFS source first, local destination second). If you want to download it using the Ambari Files View, double-click nonull/data.csv, click Select All, then Concatenate, and it will download as a single file.
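For example, a minimal sketch of running the script and pulling the merged result to the edge node (the script name and local path are hypothetical):

```bash
# Run the cleanup script shown above (saved locally as remove_nulls.pig - name is illustrative)
pig -f remove_nulls.pig

# Merge the part files and copy them to the local filesystem (HDFS source first, local destination second)
hdfs dfs -getmerge nonull/data.csv /tmp/data_nonull.csv
```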
12-09-2016
01:42 PM
The latest tutorial is here: http://hortonworks.com/apache/falcon/#tutorials
It does not yet reflect the new capabilities of Falcon 0.10 in HDP 2.5 (an update to the tutorial is in progress). Falcon uses Hive/HDFS as its backend stores; see:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-movement-and-integration/content/ch_hdp_data_mgmt_falcon_overview.html http://falcon.apache.org/FalconDocumentation.html#
12-07-2016
11:56 AM
1 Kudo
This is very straightforward with NiFi; it is a very common use case.

If the new data arrives as entire files, use the GetFTP (or GetSFTP) processor and configure the FTP host and port, path, filename regex, polling frequency, whether to delete the original (you can always archive it by forking to another processor), etc. It is very easy to configure, implement, and monitor. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.GetSFTP/

If the new data arrives as new lines appended to existing files (like log files), do the same as above but use TailFile, which picks up the lines added since the last poll. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.TailFile/

On the put side, use the PutHDFS processor. Download core-site.xml and hdfs-site.xml from your cluster, put them at a file path on your NiFi cluster, and reference that path in the processor config (see the sketch below). With that in place, you then configure the HDFS path to write to (the XMLs hold all the connection details); you may want to append a unique timestamp or UUID to the filename to distinguish repeated ingests of identically named files. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.PutHDFS/
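As a rough sketch of the config-staging step (the hostnames and directories are examples, not prescribed locations), on the NiFi node:

```bash
# Hypothetical sketch: copy the Hadoop client configs from the cluster to the NiFi node,
# then point the PutHDFS "Hadoop Configuration Resources" property at these two files
mkdir -p /etc/nifi/hadoop-conf
scp hdp-master.example.com:/etc/hadoop/conf/core-site.xml /etc/nifi/hadoop-conf/
scp hdp-master.example.com:/etc/hadoop/conf/hdfs-site.xml /etc/nifi/hadoop-conf/
```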
12-06-2016
09:18 PM
You access attributes using the Expression Language, typically to use attributes (and values derived from them) as values for other processor properties:
https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.1/bk_HDF_GettingStarted/content/ExpressionLanguage.html
http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_ExpressionLanguageGuide/content/ch_expression_language_guide.html

See this for an excellent overview of attributes, how they change over the lifetime of a flow, and how they give you programmatic power in your flows:
https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.1/bk_HDF_GettingStarted/content/working-with-attributes.html

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
12-06-2016
09:02 PM
Temp tables like these are created when Hive needs to manage intermediate data during an operation. This is normal. They are automatically deleted when the operation is over. See these links for more:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/temp-tables.html
http://www.javachain.com/hive-create-temporary-table

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
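If it helps to see the user-facing equivalent, here is a minimal sketch of a session-scoped temporary table (the connection URL and table/column names are made up for illustration); it is dropped automatically when the session ends, much as Hive cleans up its own intermediate data when the operation finishes:

```bash
# Hypothetical example: create a temporary table that lives only for this Hive session
beeline -u "jdbc:hive2://hiveserver2-host:10000" \
  -e "CREATE TEMPORARY TABLE tmp_orders AS SELECT * FROM orders WHERE order_date = '2016-12-01';"
```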
12-06-2016
08:16 PM
2 Kudos
Ambari

Open Background Operations (the "Ops" counter in the upper left of Ambari) and you will see the rebalance progress. Double-clicking on the progress bar as it runs gives you greater detail on how many blocks are being rebalanced. If you go to HDFS > Quick Links > NameNode UI > Live Nodes (link in the body of the page), you will see the HDFS capacity used on each node (and thus the imbalance); you can use this to estimate how long the rebalance will take. If you used the default Balance Threshold of 10, it will stop rebalancing when all nodes are within 10% of each other in terms of HDFS capacity used.

If you think the rebalance will take too long, you can kill the job (on the NameNode command line, find the process with ps -aef | grep balancer and kill -9 its PID), set the threshold higher, e.g. 25 (%), and run it again; it will rebalance faster. Next, you can rebalance again at a 20% threshold, then at 15%, etc. This gives you greater control over the timing and duration of the rebalance.

CLI

If running from the command line, you will see the progress of each iteration on stdout. To estimate the time to balance, use the same technique as above (go to the NameNode UI and estimate remaining time from the imbalance and the amount of blocks moving). To kill and restart at a higher balancer threshold, just Ctrl+C and run again.
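A hedged sketch of the CLI side (the threshold values and the <balancer-pid> placeholder are examples):

```bash
# Re-run the balancer with a looser threshold, then tighten it in later passes (e.g. 25 -> 20 -> 15)
hdfs balancer -threshold 25

# If a balancer run is already in progress and taking too long, find and kill it first
ps -aef | grep balancer     # note the PID of the Balancer process
kill -9 <balancer-pid>
```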
12-06-2016
11:52 AM
There are both push (reporting API) and pull (REST API) ways to automate metrics collection in NiFi. See this post for an overview: https://community.hortonworks.com/questions/69004/nifi-monitoring-processor-and-nifi-service.html
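For the pull side, a minimal sketch against the NiFi REST API (host and port are examples; add authentication if your instance is secured):

```bash
# Poll controller/flow status and JVM-level system diagnostics from the NiFi REST API
curl -s http://nifi-host:8080/nifi-api/flow/status
curl -s http://nifi-host:8080/nifi-api/system-diagnostics
```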