1973 Posts
1225 Kudos Received
124 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2485 | 04-03-2024 06:39 AM |
| | 3829 | 01-12-2024 08:19 AM |
| | 2076 | 12-07-2023 01:49 PM |
| | 3061 | 08-02-2023 07:30 AM |
| | 4195 | 03-29-2023 01:22 PM |
03-16-2018
03:15 PM
Now you can use the record processors instead.
03-16-2018
12:19 AM
Other Options: https://github.com/enahwe/Csv2Hive
03-16-2018
12:16 AM
2 Kudos
NiFi JSON to DDL Custom Processor (Java class, JUnit)

This is a further enhanced version of the idea started here: https://community.hortonworks.com/articles/154957/converting-json-to-sql-ddl.html?es_p=6294995 There was some discussion on LinkedIn about the previous article making a good processor, so I decided to build one. It is pretty basic, but it handles most types okay. Date and number processing is a bit hacky, but it guesses some types.

To install, copy the NAR file that you build (or download from GitHub) to your NiFi lib directories and restart those servers. Then:

- Add the new processor to your flow.
- Configure the processor with a table type (ignored in this version).
- Configure the processor with a table name (this is important). I configured my table name to be the filename without an extension.

Screenshots: JsonToDDLProcessor generated docs, JSON output in NiFi, example flow.

Enhancements in consideration:

- Apache OpenNLP
- Apache Tika
- Attribute cleaner enhancement
- Deep learning for determining types
- Machine learning for type inference
- MITIE
- Apache MXNet
- TensorFlow
- Stanford CoreNLP
- Kite SDK
- Hive tools
- Spark tools
- Make fields even sized, or learn what sizes are common
- Profiling data

Call to the community: if this is interesting, please join. If you don't want to code, please suggest enhancements, open tickets on bugs, and spread the word. Thanks.

Source code: https://github.com/tspannhw/nifi-convertjsontoddl-processor

mvn archetype:generate

Install the pre-built NAR: https://github.com/tspannhw/nifi-convertjsontoddl-processor/releases/tag/v1.0

Test JSON files: https://github.com/tspannhw/nifi-convertjsontoddl-processor/tree/master/nifi-convertjsontoddl-processors/src/test/resources

Generated table-create DDL:

CREATE TABLE simple ( EMPID INT, GENDER CHAR(1), DEPTID INT, FIRSTNAME VARCHAR(17), LASTNAME VARCHAR(15), TOTALSPENT INT )

CREATE TABLE complex ( EMPID INT, GENDER CHAR(1), DEPTID INT, FIRSTNAME VARCHAR(17), LASTNAME VARCHAR(15), TOTALSPENT INT, ALONGFIELDNAME VARCHAR(33), MYFIELDISALARGESTRINGGUESSWHATTYPE VARCHAR(141), day9 INT, day0 INT, day1 INT, day2 INT, day3 INT, day4 INT, day5 INT, day6 INT, day7 INT, day8 INT, day9 INT, day0 INT, day1 INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day0 INT, day1 INT, day2 INT, day3 INT, day4 INT, day5 INT, day6 INT, day7 INT, day8 INT, swver VARCHAR(41), hwver VARCHAR(15), mac VARCHAR(29), type VARCHAR(31), hwId VARCHAR(44), fwId VARCHAR(44), oemId VARCHAR(44), devname VARCHAR(51), model VARCHAR(21), deviceId VARCHAR(52), alias VARCHAR(59), iconhash CHAR(1), relaystate INT, ontime INT, activemode VARCHAR(20), feature VARCHAR(19), updating INT, rssi INT, ledoff INT, latitude INT, longitude INT, index INT, zonestr VARCHAR(59), tzstr VARCHAR(34), dstoffset INT, month INT, month INT, month INT, current INT, voltage INT, power INT, total INT, time DATETIME, ledon BOOLEAN, systemtime DATETIME )

CREATE TABLE inception ( uuid VARCHAR(41), toppct VARCHAR(25), top VARCHAR(29), toppct VARCHAR(25), top VARCHAR(32), toppct VARCHAR(25), top VARCHAR(47), toppct VARCHAR(25), top VARCHAR(28), toppct VARCHAR(25), top VARCHAR(25), imagefilename VARCHAR(51), runtime CHAR(1) )

CREATE TABLE weather ( version VARCHAR(15), xsinoNamespaceSchemaLocation VARCHAR(63), credit VARCHAR(43), creditURL VARCHAR(31), url VARCHAR(50), title VARCHAR(43), link VARCHAR(30), suggestedpickup VARCHAR(37), suggestedpickup_period VARCHAR(14), location VARCHAR(58), stationid VARCHAR(16), latitude VARCHAR(19), longitude VARCHAR(20), observationtime VARCHAR(52), observationtime_rfc822 DATETIME, weather VARCHAR(20), windstring VARCHAR(74), winddir VARCHAR(16), winddegrees VARCHAR(15), windmph VARCHAR(16), windgust_mph VARCHAR(16), windkt CHAR(1), windgust_kt VARCHAR(14), pressurein VARCHAR(17), visibilitymi VARCHAR(17), iconurl_base VARCHAR(57), twoday_history_url VARCHAR(59), iconurl_name VARCHAR(19), oburl VARCHAR(56), disclaimerurl VARCHAR(46), copyrighturl VARCHAR(46), privacypolicy_url VARCHAR(42) )

Example flow: jsontotable.xml

Resources:
- https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
- https://github.com/manalishah/mitie-resources
- https://github.com/scrapinghub/skinfer/blob/master/README.rst
- https://github.com/quux00/hive-json-schema
- https://github.com/mit-nlp/MITIE
- https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.namefind.recognition
- http://opennlp.sourceforge.net/models-1.5/
- https://github.com/scrapinghub/dateparser/blob/master/README.rst
- http://kitesdk.org/docs/1.0.0/Inferring-a-Schema-from-an-Avro-Data-File.html
- https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
- http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
- https://github.com/catherinedevlin/ddl-generator
- https://github.com/edenzik/JSON-to-PostgreSQL/blob/master/src/main/Parser.java
- https://nifi.apache.org/developer-guide.html
- https://community.hortonworks.com/articles/116803/building-a-custom-processor-in-apache-nifi-12-for.html
- https://community.hortonworks.com/articles/4318/build-custom-nifi-processor.html
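The core type-guessing idea can be sketched in a few lines. This is a minimal standalone sketch of the approach, not the processor's actual source; the class and method names here are hypothetical. It looks at a sample value for each field and guesses BOOLEAN, INT, CHAR(1), or a padded VARCHAR, roughly matching the generated DDL shown above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of JSON-to-DDL type guessing (not the processor's code):
// inspect each field's sample value and guess a SQL type for it.
public class DdlGuesser {

    static String guessType(String value) {
        if (value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false")) {
            return "BOOLEAN";
        }
        try {
            Long.parseLong(value);
            return "INT";
        } catch (NumberFormatException ignored) {
            // not numeric, fall through to string types
        }
        if (value.length() == 1) {
            return "CHAR(1)";
        }
        // Pad the observed length a little, as the generated DDL above does.
        return "VARCHAR(" + (value.length() + 2) + ")";
    }

    static String buildDdl(String table, Map<String, String> sample) {
        StringBuilder sb = new StringBuilder("CREATE TABLE " + table + " ( ");
        boolean first = true;
        for (Map.Entry<String, String> e : sample.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(' ').append(guessType(e.getValue()));
            first = false;
        }
        return sb.append(" )").toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("EMPID", "4242");
        row.put("GENDER", "M");
        row.put("FIRSTNAME", "Somebody");
        System.out.println(buildDdl("simple", row));
    }
}
```

The real processor also has to merge types across many records and handle dates; this sketch only covers the single-record case.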
03-15-2018
01:30 PM
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/spark-dataframe-api.html
http://spark.apache.org/docs/2.2.0/
http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
http://spark.apache.org/docs/2.2.0/quick-start.html
https://community.hortonworks.com/articles/151164/how-to-submit-spark-application-through-livy-rest.html

Submit the batch through Livy with a local jar path:

curl -H "Content-Type: application/json" -H "X-Requested-By: admin" -X POST -d '{"file": "/apps/example.jar","className": "com.dataflowdeveloper.example.Links"}' http://server:8999/batches

Or with the jar on HDFS:

curl -H "Content-Type: application/json" -H "X-Requested-By: admin" -X POST -d '{"file": "hdfs://server:8020/apps/example_2.11-1.0.jar","className": "com.dataflowdeveloper.example.Links"}' http://server:8999/batches

FYI:

18/03/14 11:54:54 INFO LineBufferedStream: stdout: 18/03/14 11:54:54 INFO Client: Source and destination file systems are the same. Not copying hdfs://server:8020/opt/demo/example.jar
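The same POST can be made from code instead of curl. Here is a small sketch in Java using only the JDK's HttpURLConnection; the host, port, jar path, and class name are the placeholders from the post, and the class itself is hypothetical:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: submit a Spark jar to Livy's /batches endpoint, mirroring the curl call above.
public class LivySubmit {

    // Build the same JSON body the curl command sends.
    public static String buildPayload(String file, String className) {
        return "{\"file\": \"" + file + "\", \"className\": \"" + className + "\"}";
    }

    public static int submit(String livyUrl, String payload) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(livyUrl + "/batches").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("X-Requested-By", "admin");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // Livy answers 201 Created on success
    }

    public static void main(String[] args) throws IOException {
        String payload = buildPayload("hdfs://server:8020/apps/example_2.11-1.0.jar",
                "com.dataflowdeveloper.example.Links");
        System.out.println(payload);
        // submit("http://server:8999", payload);  // uncomment against a real Livy host
    }
}
```

You can then poll GET /batches/{id} on the same Livy server to watch the batch state.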
03-15-2018
01:07 AM
It is easy to run this code from the Spark shell as well, without a connection to NiFi. runshell.sh:
/usr/hdp/current/spark2-client/bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 --jars /opt/demo/example.jar
03-14-2018
09:20 PM
3 Kudos
What: Executing Scala Apache Spark code in JARs from Apache NiFi

Why: You don't want all of your Scala code in one continuous block, as in Apache Zeppelin.

Tools: Apache NiFi, Apache Livy, Apache Spark, Scala

Flows: Option 1: Inline Scala code

Screenshots: Apache Zeppelin running the same Scala job (you have to add the jar to the Spark interpreter and restart), Grafana charts of the Apache NiFi run, log search (helps you find errors), run code for your Spark class, setting up your ExecuteSparkInteractive processor, setting up your Spark service for Scala, tracking the job in the Livy UI, tracking the job in the Spark UI.

I was looking at doing this: pull code from Git, put it into a NiFi attribute, and run it directly. But for bigger projects you will have many classes and dependencies that may require a full IDE and an SBT build cycle. Once I build a Scala jar, I want to run against that.

Example code:

```scala
package com.dataflowdeveloper.example

import org.apache.spark.sql.SparkSession

class Example() {
  def run(spark: SparkSession) {
    try {
      println("Started")
      val shdf = spark.read.json("hdfs://princeton0.field.hortonworks.com:8020/smartPlugRaw")
      shdf.printSchema()
      shdf.createOrReplaceTempView("smartplug")
      val stuffdf = spark.sql("SELECT * FROM smartplug")
      stuffdf.count()
      println("Complete.")
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
  }
}
```
Run that with:

```scala
import com.dataflowdeveloper.example.Example

println("Before run")
val job = new Example()
job.run(spark)
println("After run")
```

After the run, Livy returns:

{"text/plain":"After run"}

Important tip: you need to put your jar in Session.jars on the session control, and in the same directory on HDFS. So I did /opt/demo/example.jar in Linux and hdfs:// /opt/demo/example.jar. Make sure Livy and NiFi have read permissions on those.

GitHub: https://github.com/tspannhw/livysparkjob

GitHub release: https://github.com/tspannhw/livysparkjob/releases/tag/v1.1

Apache NiFi example flow: spark-it-up-scala.xml
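The jar tip corresponds to the Livy session-creation payload: when the session behind ExecuteSparkInteractive is created, the jar has to be listed under "jars". A minimal sketch of that request body (the HDFS path here is an assumption based on the /opt/demo location above):

```json
{
  "kind": "spark",
  "jars": ["hdfs:///opt/demo/example.jar"]
}
```

POSTing this to /sessions on the Livy server creates a Scala session that can import classes from the jar.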
03-14-2018
11:52 AM
Yes, much safer to have another instance you can use for reporting and such. Even if it's just one node.
03-12-2018
01:37 PM
For now, you can use this NiFi flow to do schema registry stuff: https://community.hortonworks.com/articles/177349/big-data-devops-apache-nifi-hwx-schema-registry-sc.html
03-12-2018
02:19 AM
The update: https://community.hortonworks.com/articles/177370/extracting-html-from-pdf-excel-and-word-documents.html