Member since: 04-05-2016
Posts: 139
Kudos Received: 143
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 33235 | 02-14-2019 02:53 PM
 | 2534 | 01-04-2019 08:39 PM
 | 10807 | 11-05-2018 03:38 PM
 | 5079 | 09-27-2018 04:21 PM
 | 2758 | 07-05-2018 02:56 PM
09-22-2017
03:40 PM
I changed the extensions on both .xlsx files to .zip. Unzipping them reveals the folder structure of those files. Going through the included XML files, I saw some differences, but nothing that stood out as a likely cause of these errors. Do you know how the vendor generates the Excel files? Is it possible these files are really .xls files but just have the .xlsx file extension? Do you know what version of Excel they use?
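If it helps to narrow this down, a valid .xlsx is just a ZIP archive and must contain a [Content_Types].xml entry at the package root, which is what the "Package should contain a content type part" error in the stack trace is complaining about. You can list the package contents without renaming the file; this is just an illustrative check, with report.xlsx standing in for your actual file name:
# list the OOXML package contents (an .xlsx is a ZIP archive)
unzip -l report.xlsx
# confirm the [Content_Types].xml part exists at the package root
unzip -l report.xlsx | grep -i content_types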
09-21-2017
06:04 PM
Looking more closely at nifi-app.log, I see the following errors:
2017-09-21 13:58:36,314 ERROR [Timer-Driven Process Thread-9] o.a.n.p.poi.ConvertExcelToCSVProcessor ConvertExcelToCSVProcessor[id=a4cfc1b5-015e-1000-b59d-535f6969973d] Failed to process incoming Excel document: java.lang.UnsupportedOperationException: Only .xlsx Excel 2007 OOXML files are supported
java.lang.UnsupportedOperationException: Only .xlsx Excel 2007 OOXML files are supported
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor$1.process(ConvertExcelToCSVProcessor.java:195)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2136)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2106)
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.onTrigger(ConvertExcelToCSVProcessor.java:151)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1120)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:147)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:197)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:696)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:280)
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor$1.process(ConvertExcelToCSVProcessor.java:159)
... 15 common frames omitted
2017-09-21 13:58:36,430 ERROR [Timer-Driven Process Thread-9] o.a.n.p.poi.ConvertExcelToCSVProcessor ConvertExcelToCSVProcessor[id=a4cfc1b5-015e-1000-b59d-535f6969973d] Failed to process incoming Excel document: java.lang.NullPointerException
java.lang.NullPointerException: null
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.handleExcelSheet(ConvertExcelToCSVProcessor.java:249)
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.access$000(ConvertExcelToCSVProcessor.java:72)
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor$1.process(ConvertExcelToCSVProcessor.java:190)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2136)
at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2106)
at org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.onTrigger(ConvertExcelToCSVProcessor.java:151)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1120)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:147)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
09-21-2017
05:36 PM
I was able to reproduce the issue using the sample file you provided. If I save that .xlsx file (without making any modifications) using my Excel (Microsoft Excel for Mac Version 15.18) and use that file instead, the ConvertExcelToCSV processor has no errors. Please see attached file: jax-shipment-profile-report-monday-18-september-20.zip I am trying to determine what difference is causing the error.
09-21-2017
04:11 PM
Hi @Lovelesh Chawla, It looks like someone has encountered a similar issue (https://stackoverflow.com/questions/45792912/nifi-convertexceltocsvprocessor-error). In /logs/nifi-app.log, can you provide the full stacktrace of the error? Have you confirmed that the data going into the ConvertExcelToCSV processor is in the proper format (.xlsx (XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents)?
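One quick way to check the actual format from the command line (an illustrative example; replace report.xlsx with the real file name): .xlsx files are ZIP-based and start with the bytes "PK", while legacy .xls files start with the OLE2 compound-document signature d0 cf 11 e0.
# report the detected file type
file report.xlsx
# or inspect the first four bytes directly
xxd -l 4 report.xlsx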
09-20-2017
02:36 PM
Hi @Sanaz Janbakhsh, PublishMQTT is a processor that requires an incoming relationship. Basically, something has to be coming into it for it to publish to a broker. That is why you are getting the "upstream connections is invalid..." message. It sounds like you need to use the ConsumeMQTT processor, which receives messages from an MQTT broker. If I am misunderstanding your use case, please post a screenshot of your flow if possible.
09-15-2017
06:41 PM
3 Kudos
Objective
This tutorial demonstrates how to use the PutElasticsearchHttpRecord processor to easily put data from a CSV file into Elasticsearch.
Note: The record-oriented processors and controller services were introduced in NiFi 1.2.0. As such, the tutorial needs to be done running Version 1.2.0 or later.
Environment
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
Apache NiFi 1.3.0
Elasticsearch 2.3.3
PutElasticsearchHttpRecord (CSVReader)
Demo Configuration
Elasticsearch
For my environment, I had Elasticsearch 2.3.3 installed.
Start Elasticsearch and assign cluster and node names:
./elasticsearch --cluster.name elasticsearch --node.name hcc
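Optionally, you can confirm the node is up and the cluster/node names took effect before wiring up the flow (this assumes the default HTTP port 9200):
curl http://127.0.0.1:9200/
curl http://127.0.0.1:9200/_cluster/health?pretty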
I like to use Elastic HQ to manage/monitor my cluster:
Initial Flow
One of the great things about the record-oriented functionality in NiFi is the ability to re-use Record Readers and Writers. In conjunction with the Record processors, it is quick and easy to change data formats and data destinations.
For example, let's assume you have the flow working from the article "Using PublishKafkaRecord_0_10 (CSVReader/JSONWriter)".
Note: The template for that flow can be found in that article as well as step-by-step instructions on how to configure it.
As currently configured, the flow:
1. Pulls a .zip file of movie data (titles, tags, ratings, etc.) from a website.
2. Unzips the file.
3. Sends only the movie title information on in the flow.
4. Adds Schema Name "movies" as an attribute to the flowfile.
5. Uses PublishKafkaRecord_0_10 to convert the flowfile contents from CSV to JSON and publish to a Kafka topic.
Say instead of publishing that movie data to Kafka, you now want to put it in Elasticsearch. The following steps demonstrate how to do that quickly and simply by replacing the PublishKafkaRecord_0_10 processor with a PutElasticsearchHttpRecord processor and re-using a CSVReader.
Elasticsearch Flow Setup
1. Delete the connection between the UpdateAttribute and PublishKafkaRecord_0_10 processors. Now delete the PublishKafkaRecord_0_10 processor or set it off to the side.
2. Add a PutElasticsearchHttpRecord to the canvas.
3. Connect the UpdateAttribute processor to the PutElasticsearchHttpRecord processor:
4. Open the Configure dialog for the PutElasticsearchHttpRecord processor. On the Settings tab, auto-terminate the "success" relationship and, for the purposes of this demo, auto-terminate the "failure" relationship also.
5. On the canvas, make a "retry" relationship connection from the PutElasticsearchHttpRecord to itself.
6. On the Properties tab:
Add "http://127.0.0.1:9200" for the Elasticsearch URL property Add "movies" for the Index property Add "default" for the Type property Since it and its schema was already defined for the original PublishKafka flow, simply select "CSVReader" for the Record Reader property.
The flow is ready to run.
Flow Results
Start the flow.
(Note: If you had run the original PublishKafka flow previously, don't forget to clear the state of the GetHTTP processor so that the movie data zip is retrieved again.)
The movie data is now in Elasticsearch:
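If you want to double-check outside of NiFi, a simple search against the index works too (an illustrative query using the Elasticsearch URL and "movies" index configured above):
curl "http://127.0.0.1:9200/movies/_search?pretty&size=2"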
Helpful Links
Here are some links to check out if you are interested in other flows which utilize the record-oriented processors and controller services in NiFi:
Change Data Capture (CDC) with Apache NiFi
Convert CSV to JSON, Avro, XML using ConvertRecord
Installing a local Hortonworks Registry to use with Apache NiFi
Running SQL on FlowFiles using QueryRecord Processor
Using PartitionRecord (GrokReader/JSONWriter) to Parse and Group Log Files
09-11-2017
07:09 PM
2 Kudos
Objective
This tutorial walks you through a NiFi flow that utilizes the PublishKafkaRecord_0_10 processor to easily convert a CSV file into JSON and then publish it to Kafka. The tutorial is based on the blog "Integrating Apache Nifi with Apache Kafka", updated with the more recent record-oriented processors and controller services available in NiFi.
Note: The record-oriented processors and controller services were introduced in NiFi 1.2.0. As such, the tutorial needs to be done running Version 1.2.0 or later.
Environment
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
Apache NiFi 1.3.0
Apache Kafka 0.10.2.1
PublishKafkaRecord_0_10 (CSV to JSON)
Support Files
Here is a template of the flow discussed in this tutorial:
publishkafkarecord.xml
Demo Configuration
Kafka Download & Install
The flow in this demo utilizes the PublishKafkaRecord_0_10 processor, which, as the name implies, uses the Kafka 0.10.x Producer API. As a result, a 0.10.x version of Kafka is required for this tutorial. For my environment, I downloaded and installed Kafka 0.10.2.1 (Scala 2.11).
Kafka Configuration & Startup
In the bin directory of your Kafka install:
Start ZooKeeper:
./zookeeper-server-start.sh ../config/zookeeper.properties
Start Kafka:
./kafka-server-start.sh ../config/server.properties
Create Kafka Topic:
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Movies
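Optionally, verify that the "Movies" topic was created before starting the consumer:
./kafka-topics.sh --list --zookeeper localhost:2181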
Start Kafka Consumer:
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic Movies --from-beginning
Import Template
Start NiFi. Import the provided template and add it to the canvas. You should see the following flow on your NiFi canvas:
Enable Controller Services
Select the gear icon from the Operate Palette:
This opens the NiFi Flow Configuration window. Select the Controller Services tab:
Enable AvroSchemaRegistry by selecting the lightning bolt icon/button. This will then allow you to enable the CSVReader and JSONRecordSetWriter controller services. Select the lightning bolt icons for both of these services. All the controller services should be enabled at this point:
The flow is now ready to run.
Flow Overview
Here is a quick overview of the flow:
1. GetHTTP pulls a .zip file of movie data (titles, tags, ratings, etc.) from a website.
2. UnpackContent unzips the file.
3. RouteOnAttribute sends just the movie title information on in the flow.
4. UpdateAttribute adds Schema Name "movies" as an attribute to the flowfile.
5. PublishKafkaRecord_0_10:
Converts the flowfile contents from CSV to JSON
Publishes the JSON data to the Kafka topic "Movies"
Flow Details
Let's look at each of the processors in the flow in detail:
Get Movie Data (GetHTTP Processor)
This processor pulls a zip file from the website MovieLens, a movie recommendation service. The dataset (ml-20m.zip) contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies.
Looking at the processor's configuration:
Start the processor to retrieve the file:
Note: On the Scheduling tab, the Run Schedule is set to 10 minutes instead of the default 0 sec, so the processor only periodically checks to see if the data file has been updated instead of constantly checking:
Unzip (UnpackContent Processor)
The next processor is UnpackContent which unzips the "ml-20m.zip" file:
Running the processor unzips the file into 7 separate csv files (movies.csv, ratings.csv, tags.csv, links.csv, genome-scores.csv, genome-tags.csv, and README.txt):
RouteOnAttribute Processor
RouteOnAttribute is next. Looking at its configuration:
the processor routes the flowfiles to different connections depending on the file name (movies.csv, ratings.csv, tags.csv).
For the purposes of this demo, we are only interested in publishing the movie title data to Kafka. As such, we make the connection to the next processor (UpdateAttribute) using the "movies" relationship and auto-terminate the others:
Run the RouteOnAttribute processor to send only the movie title data:
Add Schema Name Attribute (UpdateAttribute Processor)
The next step in the flow is an UpdateAttribute processor which adds the schema.name attribute with the value of "movies" to the flowfile:
Start the processor, and view the attributes of the flowfile to confirm this:
You can also confirm that the contents of the flowfile are in CSV format at this point in the flow:
Publish to "Movies" Topic (PublishKafkaRecord_0_10 Processor)
The final processor is PublishKafkaRecord_0_10. Looking at its configuration:
Kafka Brokers property is set to "localhost:9092" and Topic Name property is set to "Movies". Record Reader is set to "CSVReader" and Record Writer is set to "JsonRecordSetWriter". The "CSVReader" controller service parses the incoming CSV data and determines the data's schema. The "JsonRecordSetWriter" controller service determines the data's schema and writes that data into JSON.
CSVReader Controller Service
Select the arrow icon next to the "CSV Reader" which opens the Controller Services list in the NiFi Flow Configuration. "CSVReader" should be highlighted in the list. Select the View Details button ("i" icon) to see the properties:
With the Schema Access Strategy property set to "Use 'Schema Name' Property", the reader expects the name of the schema to be provided in a flowfile attribute, which in this flow is schema.name. The Schema Registry property is set to the AvroSchemaRegistry controller service, which defines the "movies" schema. Select the arrow icon next to "AvroSchemaRegistry" and select the View Details button ("i" icon) to see its properties:
The schema is defined as:
{
  "type": "record",
  "name": "MoviesRecord",
  "fields": [
    {"name": "movieId", "type": "long"},
    {"name": "title", "type": ["null", "string"]},
    {"name": "genres", "type": ["null", "string"]}
  ]
}
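To make the reader/writer pairing concrete, here is an illustrative example (the row values are made up, but follow the movieId,title,genres column layout of the MovieLens movies.csv file). An incoming CSV record such as:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
would be written out by the JsonRecordSetWriter as a JSON record along these lines:
[{"movieId":1,"title":"Toy Story (1995)","genres":"Adventure|Animation|Children|Comedy|Fantasy"}]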
JsonRecordSetWriter Controller Service
Close the window for the AvroSchemaRegistry. Select the View Details button ("i" icon) next to the "JsonRecordSetWriter" controller service to see its properties:
Schema Write Strategy is set to "Set 'schema.name' Attribute", Schema Access Strategy property is set to "Use 'Schema Name' Property" and Schema Registry is set to AvroSchemaRegistry.
See JSON in Kafka
Start the PublishKafkaRecord_0_10 processor and you will see the JSON movie data in your Kafka Consumer window:
Helpful Links
Here are some links to check out if you are interested in other flows which utilize the record-oriented processors and controller services in NiFi:
Change Data Capture (CDC) with Apache NiFi
Convert CSV to JSON, Avro, XML using ConvertRecord
Installing a local Hortonworks Registry to use with Apache NiFi
Running SQL on FlowFiles using QueryRecord Processor
Using PartitionRecord (GrokReader/JSONWriter) to Parse and Group Log Files
09-07-2017
06:27 PM
What version of NiFi are you using? If you are on 1.2.0 or later, you can make use of the new Record Reader/Writer capabilities. Check out the HCC article I wrote showing CSV to JSON to learn more:
https://community.hortonworks.com/content/kbentry/115311/convert-csv-to-json-avro-xml-using-convertrecord-p.html There are some URLs at the end for other articles about the Record Readers/Writers. In your scenario, you would use the AvroReader and the CSVRecordSetWriter in your ConvertRecord processor.
09-06-2017
07:40 PM
Hi @Anishkumar Valsalam I set up a Schema Registry using MySQL as part of another HCC article (https://community.hortonworks.com/content/kbentry/119766/installing-a-local-hortonworks-registry-to-use-wit.html). I believe you need to run the bootstrap script, which will create the necessary tables: ./bootstrap/bootstrap-storage.sh
09-05-2017
06:12 PM
3 Kudos
Objective
Apache NiFi provides the option of starting an embedded ZooKeeper server. However, NiFi can also be configured to run with an external ZooKeeper server. This article describes how to install and configure a 3-host ZooKeeper ensemble to work with a 2-node NiFi cluster.
Environment
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
Apache ZooKeeper 3.4.6
Apache NiFi 1.3.0
ZooKeeper
ZooKeeper Version
The version of ZooKeeper chosen for this tutorial is Release 3.4.6.
Note: ZooKeeper 3.4.6 is the version supported by the latest and previous versions of Hortonworks HDF as shown in the "Component Availability In HDF" table of the HDF 3.0.1.1 Release Notes.
ZooKeeper Download
Go to http://www.apache.org/dyn/closer.cgi/zookeeper/ to determine the best Apache mirror site to download a stable ZooKeeper distribution. From that mirror site, select the zookeeper-3.4.6 directory and download the zookeeper-3.4.6.tar.gz file.
Unzip the tar.gz file and create 3 copies of the distribution directory, one for each host in the ZooKeeper ensemble. For example:
/zookeeper-1
/zookeeper-2
/zookeeper-3
Note: In this tutorial, we are running multiple servers on the same machine.
ZooKeeper Configuration
"zoo.cfg" file
Next we need to create three config files. In the conf directory of zookeeper-1, create a zoo.cfg file with the following contents:
tickTime=2000
dataDir=/usr/local/zookeeper1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Because we are running multiple ZooKeeper servers on a single machine, we specified the server name as localhost with unique quorum and leader-election ports (i.e., 2888:3888, 2889:3889, 2890:3890) for each server.X entry.
Create similar zoo.cfg files in the conf directories of zookeeper-2 and zookeeper-3 with modified values for the dataDir and clientPort properties, as separate dataDirs and distinct clientPorts are necessary.
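For example, the zoo.cfg for zookeeper-2 could look like the following (a sketch that assumes the /usr/local/zookeeper2 data directory and client port 2182 used elsewhere in this article):
tickTime=2000
dataDir=/usr/local/zookeeper2
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890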
"myid" file
Every machine that is part of the ZooKeeper ensemble needs to know about every other machine in the ensemble. As such, we need to assign a server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.
For example, create a myid file in /usr/local/zookeeper1 that consists of a single line with the text "1" and nothing else. Create the other myid files in the /usr/local/zookeeper2 and /usr/local/zookeeper3 directories with the contents of "2" and "3" respectively.
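A quick way to create all three files from the command line (assuming the data directories above):
echo "1" > /usr/local/zookeeper1/myid
echo "2" > /usr/local/zookeeper2/myid
echo "3" > /usr/local/zookeeper3/myid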
Note: More information about ZooKeeper configuration settings can be found in the ZooKeeper Getting Started Guide.
ZooKeeper Startup
Start up each ZooKeeper host by navigating to the bin directory of each and running the following command:
./zkServer.sh start
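Optionally, confirm that the ensemble has formed and elected a leader by checking each server's status; one server should report "Mode: leader" and the other two "Mode: follower":
./zkServer.sh status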
NiFi
NiFi Configuration
For a two-node NiFi cluster, modify the following properties in the nifi.properties file in each conf directory:
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=localhost:2181,localhost:2182,localhost:2183
The first property configures NiFi to not use its embedded ZooKeeper. As a result, the zookeeper.properties and state-management.xml files in the conf directory are ignored. The second property must be specified to join the cluster as it lists all the ZooKeeper instances in the ensemble.
NiFi Startup
You can now start up each NiFi node. When the UI is available, create or upload a flow that has processors that capture state information. For example, import and set up the flow from the Change Data Capture (CDC) with Apache NiFi series:
In addition to the other setup steps from the CDC article, since this environment is a cluster, for the CaptureChangeMySQL processor, go to the Scheduling tab on the Configure Processor dialog. Change the Execution setting to "Primary node" from "All nodes":
Run the flow and select "View State" from the CaptureChangeMySQL and/or EnforceOrder processors to verify that state information is managed properly by the external ZooKeeper ensemble:
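As an additional, optional check from the ZooKeeper side, you can open a ZooKeeper CLI session and browse the znodes NiFi creates for cluster and component state (a sketch; it assumes the default nifi.zookeeper.root.node value of /nifi):
./zkCli.sh -server localhost:2181
ls /nifi
The second command is run at the zkCli prompt once the session is connected.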