Member since 09-15-2015 | 116 Posts | 141 Kudos Received | 40 Solutions
07-26-2016
03:55 PM
1 Kudo
To solve a problem like this you need to distribute the workload, so running the difference script on a single node is not going to work, as you found out. The best approach is to run the job on one or both of the clusters, although there are some challenges. The main problem is that comparing the data means moving it from one cluster to the other. You could first copy the data set, say from staging to prod, and then run the comparison as a job on the prod cluster. However, this may not be bandwidth efficient.

If you expect relatively few differences between the data sets, a more efficient method may be to emulate what rsync does: write a job on both clusters that produces a hash of each row, or of some set of blocks in the data set. You can then use the hashes as an index for the rows and move only the hashes from one cluster to the other for comparison. Once you've done the hash comparison, you can use the result to filter the heavier data transfer. Note that this technique is not guaranteed to be perfect because of the risk of hash collisions, so you may want to choose a wide hash function; with one, the probability of failure is very low.

The best way to do this would probably be to produce the hashes with either Spark or Hive, transfer them to one of the clusters, and then again use Spark or Hive to work out which rows are worth transferring. Both are good tools for this problem.
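To make the hash comparison concrete, here is a minimal single-machine sketch in plain Python. It is not a Spark or Hive job, just the logic such a job would implement. It assumes every row has a key column (here the first column), that SHA-256 is wide enough for your purposes, and that the separator character never appears in the data; the function names and sample rows are made up for illustration.

```python
import hashlib

def row_hash(row):
    """Return a SHA-256 hex digest of a row (a tuple of column values)."""
    # Assumption: the unit separator character never appears in the data.
    encoded = "\x1f".join(str(col) for col in row).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def hash_index(rows, key_of):
    """Map each row's key to its hash. Only this index crosses the network."""
    return {key_of(row): row_hash(row) for row in rows}

def keys_to_transfer(staging_index, prod_index):
    """Keys whose rows differ between the sides or exist on only one side."""
    changed = {k for k, h in staging_index.items() if prod_index.get(k) != h}
    missing = set(prod_index) - set(staging_index)
    return changed | missing

# Hypothetical sample data: (id, name, value)
staging = [(1, "alice", 10), (2, "bob", 20), (3, "carol", 30)]
prod = [(1, "alice", 10), (2, "bob", 25), (4, "dave", 40)]

staging_index = hash_index(staging, key_of=lambda r: r[0])
prod_index = hash_index(prod, key_of=lambda r: r[0])
diff_keys = keys_to_transfer(staging_index, prod_index)
print(sorted(diff_keys))  # [2, 3, 4]
```

Only rows 2, 3 and 4 would then need the heavier transfer; row 1 hashes identically on both sides and is skipped.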
07-26-2016
01:16 PM
1 Kudo
Sqoop is really a command-line utility, so at the moment the best way to kick off Sqoop jobs is from the console. You might also consider using Apache NiFi (HDF) for getting data out of relational sources if your use cases are reasonably simple (GBs rather than TBs).
07-26-2016
01:01 PM
1 Kudo
You can use the embedded ZooKeeper instances in the NiFi processes to form a ZooKeeper cluster, or you can use an external ZooKeeper cluster (either the one you're using for Kafka, or a different one). ZooKeeper synchronises the state within its cluster.
07-26-2016
12:17 PM
1 Kudo
The GetKafka processor currently uses the 0.8 (old style) High Level consumer API for Kafka. This means it uses ZooKeeper to discover the Kafka brokers. Discovery is based on the host names advertised by your Kafka brokers (advertised.host.name), so these should be DNS based. It also means that the offsets read are stored in the Kafka ZooKeeper per consumer group id, as specified in the NiFi processor. So from a Kafka point of view, NiFi is just like any other Kafka consumer, and will connect, balance, and retry in the normal way.

From a Kafka point of view the NiFi ZooKeeper is not used at all. However, NiFi leader (primary) elections will use the internal ZooKeeper, or the external one if configured that way. If you have an existing ZooKeeper cluster, for example for your Kafka install, you will find it simpler to just use that, since the NiFi load on ZooKeeper is usually fairly light (unless you make extensive use of the State API, for example through QueryDatabaseTable). Never use a single ZooKeeper node in production; run at least three.

For monitoring the NiFi cluster, it's worth considering the AmbariReportingTask to integrate your monitoring into Ambari. Note that there are also multiple sources of monitoring stats within NiFi which can be monitored for your particular data flows.
07-22-2016
11:11 AM
5 Kudos
If you select a section of your flow, you can turn it into a template with the New Template button. You can then use the template manager, which lists all your templates and allows you to download them as XML files. If you want to save the entire flow you have in the system, you can also find that in nifi/conf/flow.xml.gz on your NiFi box. This is not a template, but it can be dropped into a clean NiFi instance.
07-22-2016
09:50 AM
2 Kudos
To make this work you need to ensure that the line attribute is populated. In this scenario it looks like you are going to want to use SplitText to create FlowFiles a line at a time. You can then use EvaluateJsonPath to pull out the telnum property as an attribute for each line. Use that attribute to either route or, better, use UpdateAttribute to reduce it to just the prefix part you want, and then use MergeContent with "Correlation Attribute Name" set to the attribute you're using to group. This will produce a number of bins of combined files, a bit like the GROUP BY clause in SQL, and gives you FlowFiles containing all the entries for each given prefix. I would suggest setting a low maximum time on that merge (the Max Bin Age property) to avoid introducing additional latency.
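If it helps to see what that flow computes, here is a rough equivalent in plain Python rather than NiFi configuration. It assumes one JSON document per line, a string field called telnum, and a prefix defined as the first three characters; the prefix length and sample records are invented for illustration.

```python
import json
from collections import defaultdict

PREFIX_LENGTH = 3  # hypothetical: how many leading characters define a group

def group_by_prefix(content):
    """Group newline-delimited JSON records by the prefix of their telnum."""
    bins = defaultdict(list)
    for line in content.splitlines():          # like SplitText: one record per line
        if not line.strip():
            continue
        telnum = json.loads(line)["telnum"]     # like EvaluateJsonPath on $.telnum
        prefix = str(telnum)[:PREFIX_LENGTH]    # like UpdateAttribute: keep the prefix only
        bins[prefix].append(line)               # like MergeContent: bin by correlation attribute
    # One merged "FlowFile" body per prefix, newline as the demarcator
    return {prefix: "\n".join(lines) for prefix, lines in bins.items()}

sample = "\n".join([
    '{"telnum": "0201234567", "name": "a"}',
    '{"telnum": "0207654321", "name": "b"}',
    '{"telnum": "0161555000", "name": "c"}',
])
merged = group_by_prefix(sample)
print(sorted(merged))  # ['016', '020']
```

The difference in NiFi is that records arrive continuously, which is why the merge needs a time limit to decide when a bin is complete.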
07-21-2016
08:29 PM
1 Kudo
The correct link for how to secure Jetty is http://www.eclipse.org/jetty/documentation/current/configuring-security-authentication.html. There are multiple ways of doing this documented there. My personal favorite is to use SSL client security. Another option, which might provide better integration with a secured Hortonworks cluster, would be to use Knox to proxy the Solr interface, and use Knox for authentication or single sign-on. The Knox Developers' Guide has more information on how to configure a proxy for non-default services such as the Banana and Solr web endpoints.
07-21-2016
04:00 PM
1 Kudo
You don't really have a session store in NiFi. However, it would be perfectly possible to create one using cookies and the distributed map cache (the DistributedMapCacheServer and DistributedMapCacheClientService controller services, plus the associated processors). You would do this with a flow that creates a session id and adds the Set-Cookie header to the response using dynamic properties on the HandleHttpResponse processor. On later requests, use the session id from the cookie as the key to look up and update the session contents.
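The logic of such a flow can be sketched in plain Python. This is not NiFi code: the dict stands in for the distributed map cache, and the cookie name SESSIONID, the cookie attributes, and the hits counter are all assumptions made for illustration.

```python
import uuid
from http.cookies import SimpleCookie

session_cache = {}  # stand-in for the distributed map cache

def handle_request(cookie_header):
    """Return (session_id, response_headers, session) for a Cookie header value."""
    cookie = SimpleCookie(cookie_header or "")
    morsel = cookie.get("SESSIONID")
    session_id = morsel.value if morsel is not None else None
    headers = {}
    if session_id not in session_cache:
        # No known session: create an id and tell the client to store it
        session_id = uuid.uuid4().hex
        session_cache[session_id] = {}
        headers["Set-Cookie"] = f"SESSIONID={session_id}; Path=/; HttpOnly"
    session = session_cache[session_id]     # look up session contents by key
    session["hits"] = session.get("hits", 0) + 1  # change session contents
    return session_id, headers, session

# First request has no cookie, so a session is created
sid, first_headers, _ = handle_request(None)
# Second request presents the cookie, so the same session is reused
sid_again, second_headers, session = handle_request(f"SESSIONID={sid}")
print(sid == sid_again, session["hits"])  # True 2
```

In the NiFi version, the lookup and update steps would map onto the cache processors, and the Set-Cookie line onto a dynamic property of the response processor.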
07-17-2016
10:37 AM
1 Kudo
There is a property on the GetKafka processor which specifies the offset to start reading from. "Auto Offset Reset" should be set to "smallest" to provide the equivalent of --from-beginning in the console consumer.
07-05-2016
04:19 PM
3 Kudos
You could use PutCassandraQL with the insert JSON syntax in CQL. This way all you need to do is GetKafka -> ReplaceText to prepend:
INSERT INTO namespace.table JSON '
and append: '
This will wrap your JSON document in an INSERT statement if you have a single document per FlowFile. If that is not the case, you can use SplitJson before this step. You can then pipe the result directly into PutCassandraQL. If you want to batch up the entries inserted into Cassandra, you can use the MergeContent processor and its Header, Footer and Demarcator properties to achieve a similar transformation.
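For reference, the text transformation that the ReplaceText step performs looks like this in plain Python. The table name is the placeholder from above, and the sample document is made up. The quote doubling is an addition not mentioned above: CQL string literals escape a single quote by doubling it, so a document containing an apostrophe would otherwise break the statement.

```python
import json

def wrap_in_insert(json_document, table="namespace.table"):
    """Wrap one JSON document in a CQL INSERT ... JSON statement."""
    # Addition to the approach above: double any single quotes so the
    # document stays a valid CQL string literal.
    escaped = json_document.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}'"

doc = json.dumps({"id": 1, "name": "O'Brien"})  # hypothetical record
statement = wrap_in_insert(doc)
print(statement)
# INSERT INTO namespace.table JSON '{"id": 1, "name": "O''Brien"}'
```

If your documents can contain apostrophes, the flow would need an equivalent replacement before the prepend and append steps.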