Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3133 | 08-25-2017 03:09 PM
| 1996 | 08-22-2017 06:52 PM
| 3453 | 08-09-2017 01:10 PM
| 8131 | 08-04-2017 02:34 PM
| 8174 | 08-01-2017 11:35 AM
12-01-2016
02:24 PM
1 Kudo
First you need to convert your bags into tuples, then flatten and distinct. This is done using Pig's built-in function BagToTuple(). See this post for an explanation and example: https://community.hortonworks.com/questions/58271/using-pig-latin-to-replace-multiple-strings-from-s.html
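A minimal Pig Latin sketch of the pattern, assuming a relation named data with a single bag field called my_bag (adjust the names to your schema):

```pig
-- Assumed schema: data: {my_bag: {(value:chararray)}}
-- BagToTuple() turns the bag into a single tuple; FLATTEN() expands that tuple
-- into individual fields on the same row.
flattened = FOREACH data GENERATE FLATTEN(BagToTuple(my_bag));

-- DISTINCT then removes duplicate rows from the flattened relation.
deduped = DISTINCT flattened;

DUMP deduped;
```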
11-30-2016
03:26 PM
1 Kudo
See the comment on the answer above for how to get the configs onto your local machine.
11-30-2016
03:24 PM
@Dagmawi Mengistu
To get the configs:
Log in to your cluster via Ambari and click the HDFS service on the left. In the upper right, open the Service Actions dropdown and select Download Client Configs. This downloads an archive to your local machine; when you unpack it you will find core-site.xml. Place core-site.xml anywhere locally and use that path in your PutHDFS config.
11-30-2016
02:18 PM
2 Kudos
These references should cover your needs as expressed above:
http://hortonworks.com/apache/atlas/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-governance/content/ch_hdp_data_governance_overview.html
http://hortonworks.com/hadoop-tutorial/tag-based-policies-atlas-ranger/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_release-notes/content/new_features.html
http://www.slideshare.net/hortonworks/data-governance-atlas-7122015
http://www.slideshare.net/HadoopSummit/top-three-big-data-governance-issues-and-how-apache-atlas-resolves-it-for-the-enterprise
If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
11-30-2016
12:33 PM
2 Kudos
Your hdfs-site.xml will have the connection info to HDFS. I believe the problem is the Directory value in your first screenshot: you only need an HDFS path, not the connection info (hdfs://server). A sketch of the relevant settings is below. But as @Avijeet Dash suggests, looking at the exact error (either by clicking the processor error icon, or in nifi-app.log) is useful.
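A minimal sketch of the PutHDFS properties, with hypothetical paths that you would adjust to your environment:

```
# Hypothetical PutHDFS settings -- paths are placeholders
Hadoop Configuration Resources: /path/to/local/core-site.xml,/path/to/local/hdfs-site.xml
Directory: /user/nifi/landing
# not: hdfs://namenode-host:8020/user/nifi/landing  (the connection info comes from the config files)
```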
11-30-2016
12:27 PM
2 Kudos
My suggestion is to use UpdateAttribute to append a timestamp to the filename; a sketch of how to do this is below. If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
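One way to do this (a sketch, assuming the incoming flowfile already has a filename attribute) is to add a dynamic property named filename in UpdateAttribute with a NiFi Expression Language value such as:

```
${filename}_${now():format('yyyyMMddHHmmss')}
```

Every flowfile passing through the processor then has the current timestamp appended to its filename.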
11-29-2016
12:46 PM
1 Kudo
How was the ListSFTP processor's scheduling set? See this doc for the Scheduling tab: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_UserGuide/content/scheduling-tab.html My suggestion is to run it every 5 seconds or so (see the sketch below). Let me know how it goes.
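For example, on the processor's Scheduling tab the settings might look like this (values are illustrative):

```
Scheduling Strategy: Timer driven
Run Schedule: 5 sec
```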
11-28-2016
09:40 PM
1 Kudo
There are a couple of optimizations you can try (below), but they almost certainly will not reduce a job duration from over 24 hours to a few hours. It is likely that your cluster is too small for the amount of processing you are doing. In that case, your best bet is to break your 200 GB data set into smaller chunks and bulk load each sequentially (or, preferably, add more nodes to your cluster). Also, be sure that you are not bulk loading while the scheduled major compaction is occurring.

Optimizations: in addition to looking at your log, go to Ambari and see what is maxing out ... memory? CPU? This link gives a good overview of optimizing HBase loads: https://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/bigsql_loadhints.html It is not focused on bulk loading specifically, but it still comes into play. Note: for each property mentioned, set it in your importtsv script as -D<property>=<value>. One thing that usually helps MapReduce jobs is compressing the map output so it travels across the wire to the reducers faster:

-Dmapred.compress.map.output=true \
-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \

As mentioned, though, it is likely that your cluster is not scaled properly for your workload.
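For context, a hypothetical importtsv invocation with the compression properties added might look like the following (table name, column mapping, and paths are placeholders to adjust):

```bash
# Hypothetical bulk-load invocation -- adjust table, columns, and paths
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=/tmp/bulkload-output \
  -Dmapred.compress.map.output=true \
  -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  my_table /data/input
```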
11-26-2016
02:50 PM
Big picture

You can use regular expressions (also called regex) to do this, not expression language. One of the core use cases of NiFi is in fact to filter files on content, route files on content, or make decisions based on content. TailFile is used to generate content, and the next downstream processor filters/routes/decides based on that content. Commonly used processors here are ExtractText and RouteText (or ReplaceText). All of these use regular expressions to match the contents of the file. Typically you want to work on a line-by-line basis, so you put a SplitText processor before these.

Solution to meet your needs

This article shows how to route log data based on file entries (using regular expressions); a minimal sketch follows the links below. It should be very close or identical to what you want to do: https://community.hortonworks.com/articles/65027/nifi-easy-custom-logging-of-diverse-sources-in-mer.html

Regular Expressions vs Expression Language

Please note that regular expressions are not to be confused with NiFi Expression Language (which is very powerful in NiFi flows and worth learning).
regex: http://www.regular-expressions.info/
NiFi Expression Language: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
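A minimal RouteText sketch under assumed settings (the relationship name and the regex are placeholders to adapt to your log format):

```
# Hypothetical RouteText configuration: each dynamic property defines a relationship
Routing Strategy: Route to each matching Property Name
Matching Strategy: Matches Regular Expression
errors: .*(ERROR|FATAL).*
```

Lines matching the errors regex are routed to the errors relationship; everything else goes to unmatched.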
If this is what you were looking for, let me know by accepting the answer; else, let me know of any gaps or remaining questions.
11-26-2016
02:34 PM
2 Kudos
If you use InvokeHTTP (with the HTTP Method property set to PUT), the Attributes to Send property will send matching attributes as headers in the HTTP request. (You may have to use an UpdateAttribute processor to set the attributes first.) See the processor documentation: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.InvokeHTTP/ A sketch is below. If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps or follow-up questions.
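For illustration, the relevant InvokeHTTP properties might be configured like this (the URL and attribute-name pattern are hypothetical):

```
# Hypothetical InvokeHTTP settings
HTTP Method: PUT
Remote URL: https://example.com/api/resource
Attributes to Send: myheader.*
```

Attributes to Send takes a regular expression; any flowfile attribute whose name matches it is added to the request as an HTTP header.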