Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1749 | 02-05-2018 04:53 PM |
 | 2262 | 10-16-2017 09:46 AM |
 | 1981 | 07-04-2017 05:52 PM |
 | 2957 | 04-17-2017 06:44 PM |
 | 2169 | 12-30-2016 11:32 AM |
07-29-2016
12:19 PM
Not sure what is in final_table_text, but you almost certainly don't mean to parallelise it from the driver. You should really be using the DataFrame read API to distribute the load in a case like this.
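As a rough sketch of that suggestion only, assuming the source data sits in a file on HDFS rather than in a driver-side Python list; the path, format, and the Spark 2.x SparkSession entry point are assumptions, not taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rather than sc.parallelize(<driver-side list>), let the executors read the
# source directly so the data never has to pass through the driver.
df = spark.read.csv("/data/final_table", header=True, inferSchema=True)  # hypothetical path
df.show(5)
```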
07-27-2016
10:46 AM
4 Kudos
If you already have data in ORC format, you can just create a Hive table on top of it, probably as an external table. As you don't currently have the schema with which to create the table, you will need to get hold of it. Fortunately, the schema is embedded in the ORC file, and you can read it with the ORC file dump utility. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility for how to extract the schema from your ORC file so you can create the table.
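As an alternative sketch to the file dump utility, assuming Spark with Hive support is available; the path, column list, and table name below are placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Spark recovers the schema from the ORC file footers.
df = spark.read.orc("/data/existing_orc")  # hypothetical HDFS location
df.printSchema()

# With the schema known, declare an external Hive table over the same directory
# (column definitions here are placeholders for whatever printSchema reports).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (id INT, name STRING)
    STORED AS ORC
    LOCATION '/data/existing_orc'
""")
```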
07-27-2016
10:11 AM
If you write a producer of your own, you will also have to write a state mechanism for incremental processing; Kafka will not in any way help you there. If you use NiFi's ListFile processor, it will only pick up files that are new or updated since its last run, so it naturally gives you incremental processing. Use ListFile -> FetchFile -> PutHDFS to get what you're after.
07-27-2016
10:09 AM
Your JSON there is just going to be stored as a string by the looks of it, unless your RDBMS is using a JSON type. To import it, you can just have it put into a string field and then use Hive to convert that to a map type, for instance, or use Hive UDFs to query it as JSON from the string field (try get_json_object in Hive). Once you've got the bits you want, you can transform them into a table with proper complex types such as maps. Another option worth considering is to have Sqoop dump out just the JSON field and store it as a text file rather than going to a Hive table. You can then use the Spark DataFrame JSON reader to read it in and infer the complex JSON schema for you, and then build a Hive table from it with Spark's saveAsTable function.
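A minimal sketch of that last option, assuming Sqoop has written the JSON column as line-delimited text on HDFS; the path and table name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each line of the Sqoop dump is one JSON document; the reader infers the nested schema.
json_df = spark.read.json("/user/sqoop/json_dump")  # hypothetical path
json_df.printSchema()

# Persist it as a Hive table with the inferred complex schema.
json_df.write.saveAsTable("json_as_hive_table")
```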
07-27-2016
10:02 AM
1 Kudo
registerAsTempTable is a Spark function which allows you to write SQL against your DataFrame. Essentially it's a way to give the DataFrame variable a name in the context of SQL. If what you're looking to do is display the data from a programmatic DataFrame in a %pyspark paragraph in the same way a %sql paragraph does, you're on the right track. However, you need to make sure that the columns are separated by tabs and each row ends with a new line, and you also need a new line between the header and the content. So, joining the columns of each row with tabs before joining the rows with newlines:

rows = ["\t".join(str(col) for col in row) for row in table_text_list]
final_table_text = "\n".join(rows)
print "%table\n%s" % (final_table_text)

Note that the tab insertion and joining can also be done within Spark, before collecting your results, if you want it to happen across the whole cluster; that may or may not be quicker depending on the size of the data. Note also that this won't automatically limit the number of results in the way the %sql paragraph does, so you may want to be a little careful about blowing out the browser RAM unless you put your own limits on the DataFrame, or use take(n) instead of collect to retrieve the data for display.
07-27-2016
09:52 AM
Unfortunately this is not yet in Ambari in HDP 2.5. I've added this Jira to capture the request; feel free to follow or vote on the Jira if this is something that interests you, and of course all contributions are always welcome!
07-27-2016
09:34 AM
1 Kudo
You really don't want to use Kafka for this. Kafka is limited to very small objects, and you'd be writing a lot of boilerplate code. NiFi would provide a much simpler and better solution here: no file size limitations, and all the work has already been done for you. You would just need to use GetFile on the Unix folder (or ListFile -> FetchFile if you want to keep the files in the folder) and then PutHDFS. That will do everything you need. You can also use NiFi's MergeContent to batch up files if required, which can help with NameNode memory pressure and the efficiency of downstream processing.

If you really feel you must use Kafka, you're going to have to jump through a lot of hoops and write a custom Producer and file handlers. You could then use something like a Storm topology with the storm-hdfs bolt to write out to HDFS, or write a manual Consumer with the Hadoop APIs to write the file, but honestly, that's going to take you a lot of time versus the simple NiFi solution.
07-27-2016
09:10 AM
1 Kudo
The SQL Server Enterprise CDC feature can be directed to write change data to a table in the SQL Server, effectively keeping an audit log in a SQL table. That table can be read with the NiFi QueryDatabaseTable processor; it works a bit like a trigger writing an audit table which you can then read incrementally with NiFi. So the easy answer is yes: you would just create a flow of QueryDatabaseTable (with the timestamp as the maximum-value column) -> PutKafka. You may need a SplitAvro between the two depending on the batch size you need to get good throughput, but that's another question 🙂

That said, there are other forms of CDC which may be more efficient but currently require third-party products. Passive CDC, which reads the transaction logs of the SQL Server and therefore does not put additional query load on the database, is an option here. Tools such as Attunity Replicate use this approach and write directly to Kafka. See this webinar for an example of how these tools work and how you can use NiFi with them.
07-27-2016
09:02 AM
1 Kudo
What you are looking for is primary node scheduling, not Remote Process Groups. If you have a NiFi cluster, one of the nodes is designated as the primary, and you can then schedule certain processors to run only on the primary node in the Scheduling tab.

To answer your second question, you will need the NCM on a port, the nodes on a port, and the remote input socket port (nifi.remote.input.socket.port) set to something else. The way the Site-to-Site protocol works is that the control channel is established with the NCM over the same port as the web UI and API calls. The data channel is then established directly to the nodes over the port specified in the remote input socket setting (note you may also want to specify the listening host name here as well). You then point the connecting RPG at the address of the NCM and the port of the UI, i.e. the exact address you would go to in order to change the flow on the server end of the RPG.
07-27-2016
08:58 AM
Really, if you have two questions, you should ask them as two separate questions on this site (it's a Q&A style we're going for, not a forum or mailing-list style).