Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1809 | 02-05-2018 04:53 PM |
|  | 2336 | 10-16-2017 09:46 AM |
|  | 2031 | 07-04-2017 05:52 PM |
|  | 3041 | 04-17-2017 06:44 PM |
|  | 2225 | 12-30-2016 11:32 AM |
07-26-2016
03:55 PM
1 Kudo
To solve a problem like this you will likely need to distribute the workload, so running the difference script on a single node is, as you found, not going to work. The best way to do this is to run the job on one, or both, of the clusters; there are however some challenges here. The big problem is that to compare the data you need to move it from one cluster to the other. You could do this by first copying the data set, perhaps from staging to prod, and then running the comparison as a job on the prod cluster. However, this may not be bandwidth efficient.

If you expect relatively few differences between the data sets, a more efficient method is to emulate what rsync does, in other words to write a job on both clusters that produces a hash of each row, or of some set of blocks in the data set. You can then use the hashes as an index for the rows and move only the hashes from one cluster to the other for comparison. Once you've done the hash comparison, you can use the result to filter the heavier data transfer. Note that this technique is not guaranteed to be perfect due to the risk of hash collisions, so you may want to choose a wide hash function; that said, the probability of failure is very low.

The best way to do this would probably be to produce the hashes with either Spark or Hive, transfer them to one of the clusters, and then, again with Spark or Hive, figure out which rows are worth transferring. Both Spark and Hive are good tools for solving this problem.
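As a rough illustration, here is a minimal PySpark sketch of the hashing step, assuming a delimited data set at a hypothetical path with the row key in the first column; run the same job on each cluster, ship only the (small) hash output across, and compare the two hash sets to find the rows that actually need to move:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("row-hashes").getOrCreate()

# Hypothetical input: a CSV data set whose first column is the row key.
df = spark.read.option("header", "true").csv("hdfs:///data/my_table")

# Hash the full row with a wide hash (SHA-256) so collisions are extremely unlikely.
hashes = df.select(
    df.columns[0],
    F.sha2(F.concat_ws("\u001f", *df.columns), 256).alias("row_hash"),
)

# This small output is what gets transferred between clusters; keys whose hashes
# differ (or are missing on one side) are the candidates for the heavier copy.
hashes.write.mode("overwrite").parquet("hdfs:///tmp/my_table_hashes")
```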
07-26-2016
01:16 PM
1 Kudo
Sqoop is really a command-line utility, so at the moment the best way to kick off Sqoop jobs is very much from the console. You might also consider using Apache NiFi (HDF) for getting data out of relational sources if your use cases are reasonably simple (GBs rather than TBs).
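For example, a typical console invocation looks something like this (the JDBC URL, credentials, table, and target directory below are placeholders):

```
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```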
07-26-2016
01:01 PM
1 Kudo
You can use the embedded ZooKeeper in each NiFi process to create a ZooKeeper cluster, or you can use an external ZooKeeper cluster (either the one you're using for Kafka, or a different one). ZooKeeper synchronises the state within its cluster.
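As a sketch, the relevant nifi.properties entries look something like the following (host names are placeholders, and the exact property names should be checked against your NiFi version):

```
# Start the embedded ZooKeeper in this NiFi node (false if you point at an external ensemble)
nifi.state.management.embedded.zookeeper.start=true

# ZooKeeper connect string NiFi should use (embedded or external)
nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181
```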
07-26-2016
12:17 PM
1 Kudo
The GetKafka processor currently uses the 0.8 (old style) high-level consumer API for Kafka. This means it will use ZooKeeper to discover the Kafka brokers. These will be found via the host names advertised by your brokers (advertised.host.name), so they should be DNS based. It also means that the offsets read are stored in the Kafka ZooKeeper per consumer group id, as specified in the NiFi processor. So from a Kafka point of view, NiFi is just like any other Kafka consumer, and will connect, balance, and retry in the normal way; the NiFi ZooKeeper is not used at all for this.

However, for NiFi leader (primary node) elections the internal ZooKeeper, or the external one if configured that way, will be used. In a scenario where you have an existing ZooKeeper cluster, for example for your Kafka install, you will find it simpler to just use that, since the NiFi load on ZooKeeper is usually fairly light (unless you make extensive use of the state API, for example with QueryDatabaseTable). Never use a single ZooKeeper node in production; use at least three.

For monitoring of the NiFi cluster, it's worth considering the AmbariReportingService to integrate your monitoring into Ambari. Note that there are also multiple sources of monitoring stats within NiFi which can be monitored for your particular data flows.
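For reference, a GetKafka configuration along these lines might look like the sketch below (the ZooKeeper hosts, topic, and group id are placeholders):

```
# GetKafka processor properties, pointing at the Kafka ZooKeeper, not NiFi's own
ZooKeeper Connection String = kafka-zk1:2181,kafka-zk2:2181,kafka-zk3:2181
Topic Name                  = events
Group ID                    = nifi-events-consumer
```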
07-22-2016
11:11 AM
5 Kudos
If you select a section of your flow, you can turn it into a template with the New Template button. You can then use the template manager, which lists all your templates and allows you to download them as XML files. If you want to save the entire flow you have in the system, you can also find it in nifi/conf/flow.xml.gz on your NiFi box. This is not a template, but it can be dropped into a clean NiFi instance.
07-22-2016
09:50 AM
2 Kudos
To make this work you need to ensure that the line attribute is populated. In this scenario it looks like you will want to use SplitText to create flow files one line at a time. You can then use EvaluateJsonPath to pull out the telnum property as an attribute for each line. Use that attribute to route, or better, use UpdateAttribute to reduce it to just the prefix part you want, and then use MergeContent with "Correlation Attribute Name" set to the attribute you're grouping on, as in the sketch below. This will produce a number of bins of combined files; essentially it's a bit like the GROUP BY clause in SQL. That will give you FlowFiles containing all the entries for a given prefix. I would suggest setting a low Max Bin Age on that MergeContent to avoid introducing additional latency.
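Roughly, the flow might be configured like this (the telnum field name and the three-digit prefix are assumptions for illustration):

```
SplitText         Line Split Count = 1
EvaluateJsonPath  Destination = flowfile-attribute, dynamic property: telnum = $.telnum
UpdateAttribute   dynamic property: prefix = ${telnum:substring(0, 3)}
MergeContent      Correlation Attribute Name = prefix, Max Bin Age = 30 sec
```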
07-21-2016
08:29 PM
1 Kudo
The correct link is http://www.eclipse.org/jetty/documentation/current/configuring-security-authentication.html for how to secure Jetty. There are multiple means of doing this documented there; my personal favorite is to use SSL client authentication. Another option, which might provide better integration with a secured Hortonworks cluster, would be to use Knox to proxy the Solr interface and use Knox for authentication or single sign-on. The Knox Developers' Guide has more information on how to configure a proxy for non-default services such as the Banana and Solr web endpoints.
07-21-2016
04:00 PM
1 Kudo
You don't really have a session store in NiFi. However, it would be perfectly possible to create one using cookies and the DistributedMapCacheService with its associated processors. You would need a flow that creates a session id and adds a Set-Cookie header to the response using a dynamic property on the HandleHttpResponse processor. Use the cookie value as the key to look up and update the session contents.
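A minimal sketch of such a flow (the attribute name and cookie name are hypothetical):

```
UpdateAttribute           dynamic property: session.id = ${UUID()}
HandleHttpResponse        dynamic property: Set-Cookie = session=${session.id}
PutDistributedMapCache    Cache Entry Identifier = ${session.id}   # store the session contents
FetchDistributedMapCache  Cache Entry Identifier = ${session.id}   # retrieve them on later requests
```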
07-17-2016
10:37 AM
1 Kudo
There is a property on the GetKafka processor which specifies the offset to start reading from: "Auto Offset Reset" should be set to "smallest" to provide the equivalent of --from-beginning in the console consumer.
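For comparison (the ZooKeeper host and topic are placeholders):

```
# GetKafka processor property in NiFi
Auto Offset Reset = smallest

# Equivalent with the old (0.8) console consumer
kafka-console-consumer.sh --zookeeper zk-host:2181 --topic my-topic --from-beginning
```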
07-05-2016
04:19 PM
3 Kudos
You could use PutCassandraQL with the INSERT ... JSON syntax in CQL. This way all you need to do is GetKafka -> ReplaceText to prepend:
INSERT INTO keyspace.table JSON '
and append: '
This wraps your JSON document in an INSERT statement, provided you have a single document per FlowFile (you can use SplitJson before this if that is not the case). You can then pipe that directly into PutCassandraQL. Note that if you want to batch up the entries inserted into Cassandra, you can use the MergeContent processor with the Header, Footer and Demarcator properties to achieve a similar transformation.
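For illustration, with a hypothetical keyspace, table, and payload, the statement handed to PutCassandraQL ends up looking like:

```
INSERT INTO sensor_data.readings JSON '{"device_id": "abc-123", "ts": "2016-07-05T16:19:00Z", "value": 42.7}'
```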