Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3119 | 08-25-2017 03:09 PM |
 | 1974 | 08-22-2017 06:52 PM |
 | 3424 | 08-09-2017 01:10 PM |
 | 8090 | 08-04-2017 02:34 PM |
 | 8129 | 08-01-2017 11:35 AM |
12-09-2016
09:14 PM
[UPDATE DEC 9] Here is an Atlas user's guide that provides good detail (it is a work in progress): http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
12-09-2016
08:38 PM
1 Kudo
Atlas's HBase backend is made HA by configuring it as distributed HBase (vs. the default standalone): http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-governance/content/ch_hdp_data_governance_install_atlas_ambari.html
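As a rough sketch (the property names come from the Atlas docs, but the hostnames, the file path, and the idea of appending to the file directly rather than changing it through Ambari are assumptions for illustration), pointing Atlas at a distributed HBase comes down to the graph storage properties in atlas-application.properties:

```bash
# Hypothetical sketch: point Atlas's graph store at the cluster's distributed HBase
# via its ZooKeeper quorum (on an Ambari-managed cluster, set these in Atlas > Configs instead)
cat >> /etc/atlas/conf/atlas-application.properties <<'EOF'
atlas.graph.storage.backend=hbase
atlas.graph.storage.hostname=zk1.example.com,zk2.example.com,zk3.example.com
EOF
```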
12-09-2016
08:34 PM
Note that Atlas uses Titan (a graph database) as its metadata store, and Titan in turn uses HBase for the metadata itself and Solr for indexing: http://atlas.apache.org/HighAvailability.html
12-09-2016
03:25 PM
1 Kudo
You cannot do this directly from the export command; you must do some separate processing. I feel the best way is to run a Pig script like this on the export result:

raw = LOAD 'data.csv' USING PigStorage(',');
nonull = FOREACH raw GENERATE
    REPLACE($0, '\\\\N', ''),
    REPLACE($1, '\\\\N', ''),
    REPLACE($2, '\\\\N', ''),
    REPLACE($3, '\\\\N', '');
STORE nonull INTO 'nonull/data.csv' USING PigStorage(',');

Keep in mind this will produce output in the usual MapReduce part-file layout in HDFS:

nonull/data.csv/_SUCCESS
nonull/data.csv/part-m-00000
nonull/data.csv/part-m-00001
...

If you want to process the result in Hadoop, just point to nonull/data.csv. If you want to pull it to an edge node from the command line, use hdfs dfs -getmerge nonull/data.csv <localpath> (HDFS source first, local destination second). If you want to download it using the Ambari Files View, double-click nonull/data.csv, click Select All, then Concatenate, and it will download as a single file.
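For example, a minimal sketch of running the script and pulling the merged result to the edge node (the script name and local path are hypothetical):

```bash
# Run the cleanup script shown above (saved locally as remove_nulls.pig - name is illustrative)
pig -f remove_nulls.pig

# Merge the part files and copy them to the local filesystem (HDFS source first, local destination second)
hdfs dfs -getmerge nonull/data.csv /tmp/data_nonull.csv
```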
12-09-2016
01:42 PM
The latest tutorial is here: http://hortonworks.com/apache/falcon/#tutorials
It does not yet reflect the new capabilities of Falcon 0.10 in HDP 2.5 (an update to the tutorial is in progress). Falcon uses Hive/HDFS as its backend stores; see:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-movement-and-integration/content/ch_hdp_data_mgmt_falcon_overview.html http://falcon.apache.org/FalconDocumentation.html#
12-07-2016
11:56 AM
1 Kudo
This is very straightforward with NiFi; it is a very common use case.

If the new data arrives as entire files, use the GetFTP (or GetSFTP) processor and configure the FTP host and port, path, filename regex, polling frequency, whether to delete the original (you can always archive it by forking to another processor), etc. It is very easy to configure, implement, and monitor. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.GetSFTP/

If the new data arrives as new lines appended to existing files (like log files), do the same as above but use TailFile, which picks up the lines added since the last poll. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.TailFile/

On the put side, use the PutHDFS processor. Download core-site.xml and hdfs-site.xml from your cluster, put them at a file path on your NiFi cluster, and reference that path in the processor config (see the sketch below). With that in place, you then configure the HDFS path to write to (the XMLs hold all the connection details); you may want to append a unique timestamp or UUID to the filename to distinguish repeated ingests of identically named files. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.PutHDFS/
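As a rough sketch of the config-staging step (the hostnames and directories are examples, not prescribed locations), on the NiFi node:

```bash
# Hypothetical sketch: copy the Hadoop client configs from the cluster to the NiFi node,
# then point the PutHDFS "Hadoop Configuration Resources" property at these two files
mkdir -p /etc/nifi/hadoop-conf
scp hdp-master.example.com:/etc/hadoop/conf/core-site.xml /etc/nifi/hadoop-conf/
scp hdp-master.example.com:/etc/hadoop/conf/hdfs-site.xml /etc/nifi/hadoop-conf/
```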
12-06-2016
09:18 PM
You access attributes using the Expression Language, typically to use attributes (and values derived from them) as values for other processor properties:
https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.1/bk_HDF_GettingStarted/content/ExpressionLanguage.html
http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_ExpressionLanguageGuide/content/ch_expression_language_guide.html

See this for an excellent overview of attributes, how they change over the lifetime of a flow, and how they give you programmatic power in your flows:
https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.1/bk_HDF_GettingStarted/content/working-with-attributes.html

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
12-06-2016
09:02 PM
Temp tables like these are created when Hive needs to manage intermediate data during an operation. This is normal. They are automatically deleted when the operation is over. See these links for more:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/temp-tables.html
http://www.javachain.com/hive-create-temporary-table

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
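If it helps to see the user-facing equivalent, here is a minimal sketch of a session-scoped temporary table (the connection URL and table/column names are made up for illustration); it is dropped automatically when the session ends, much as Hive cleans up its own intermediate data when the operation finishes:

```bash
# Hypothetical example: create a temporary table that lives only for this Hive session
beeline -u "jdbc:hive2://hiveserver2-host:10000" \
  -e "CREATE TEMPORARY TABLE tmp_orders AS SELECT * FROM orders WHERE order_date = '2016-12-01';"
```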
12-06-2016
08:16 PM
2 Kudos
Ambari

Open Background Operations (the "Ops" counter in the upper left of Ambari) and you will see the rebalance progress. Double-clicking on the progress bar as it runs gives you greater detail on how many blocks are being rebalanced. If you go to HDFS > Quick Links > NameNode UI > Live Nodes (link in the body of the page), you will see the HDFS capacity used on each node (and thus the imbalance); you can use this to estimate how long the rebalance will take. If you used the default Balance Threshold of 10, it will stop rebalancing when all nodes are within 10% of each other in terms of HDFS capacity used.

If you think the rebalance will take too long, you can kill the job (on the NameNode command line, find the process with ps -aef | grep balancer and kill -9 its PID), set the threshold higher, e.g. 25 (%), and run it again; it will rebalance faster. Next, you can rebalance again at a 20% threshold, then at 15%, etc. This gives you greater control over the timing and duration of the rebalance.

CLI

If running from the command line, you will see the progress of each iteration on stdout. To estimate the time to balance, use the same technique as above (go to the NameNode UI and estimate remaining time from the imbalance and the amount of blocks moving). To kill and restart at a higher balancer threshold, just Ctrl+C and run again.
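A hedged sketch of the CLI side (the threshold values and the <balancer-pid> placeholder are examples):

```bash
# Re-run the balancer with a looser threshold, then tighten it in later passes (e.g. 25 -> 20 -> 15)
hdfs balancer -threshold 25

# If a balancer run is already in progress and taking too long, find and kill it first
ps -aef | grep balancer     # note the PID of the Balancer process
kill -9 <balancer-pid>
```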
12-06-2016
11:52 AM
There are both push (reporting API) and pull (REST API) ways to automate metrics collection in NiFi. See this post for an overview: https://community.hortonworks.com/questions/69004/nifi-monitoring-processor-and-nifi-service.html
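For the pull side, a minimal sketch against the NiFi REST API (host and port are examples; add authentication if your instance is secured):

```bash
# Poll controller/flow status and JVM-level system diagnostics from the NiFi REST API
curl -s http://nifi-host:8080/nifi-api/flow/status
curl -s http://nifi-host:8080/nifi-api/system-diagnostics
```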