Member since
09-23-2015
42
Posts
91
Kudos Received
8
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1195 | 02-01-2016 08:56 PM |
 | 2791 | 01-16-2016 12:40 PM |
 | 6440 | 01-15-2016 01:14 PM |
 | 5406 | 01-14-2016 09:37 PM |
 | 6920 | 12-14-2015 01:02 PM |
03-15-2017
01:02 PM
Just to clarify, NIFI-2613 (https://issues.apache.org/jira/browse/NIFI-2613) currently supports only XSSF (.xlsx) based Excel documents. I do plan to add support for HSSF (.xls) in the near future.
09-14-2016
05:07 PM
6 Kudos
If I approached you on the street and challenged you to name a state capitol building from a flashcard image, how many of the 50 capitol buildings do you think you could recognize? That would be pretty difficult, right? We humans tend to have a narrow scope of knowledge focused around our intimate interactions with day-to-day life. I would have no problem identifying Georgia’s capitol building in Atlanta, for example, because I live in Atlanta and am exposed to it almost daily. If presented with an image of Kansas's capitol in Topeka, however, I would be stumped, since I have never been there or even seen a picture of that building. The point is that experience drives our ability to recognize images.

Most millennials also tend to share these “experiences” on social media, blogs, text, etc. Google is dominant at crunching these digital “experiences” from its users and artfully marrying those impressions against validated datasets. Google Vision is a REST API hosted and managed by Google that allows users to upload arbitrary images and perform services like landmark detection, label annotation, OCR, image properties, explicit content detection, face detection along with sentiment, and corporate logo detection with amazing accuracy. Obviously this opens up a wide range of next-generation platform possibilities, but how do we use it? Google Vision can be accessed via a myriad of language SDKs, but the focus of this article is Google Vision’s integration with Apache NiFi.

I quizzed myself and shamefully was only able to recognize 4 of the 50 state capitols, but how did Google and Apache NiFi do? Using Apache NiFi and the Google Vision API, I was able to successfully detect 35 of 50 state capitols from those same images! Don’t believe me? Let's take a look at how I did it.

First up was the Google Vision API integration with Apache NiFi. Apache NiFi already has a robust set of tools for invoking REST APIs and handling JSON data. However, I prefer my workflows to remain clean and concise. Although the discrete components are there, I opted to create a custom GoogleVisionProcessor to condense those messy workflows into a single processor. The source code and instructions for using this processor can be found at https://github.com/jdye64/nifi-addons/tree/master/Processors/nifi-google-cloud. I also plan to contribute it to Apache in the coming weeks after I iron out some more advanced features.

Let's take a look at the NiFi workflow and results from the experiment. As you can see, the GoogleVisionProcessor properly detected 35 out of 50 state capitols! The processor takes the JSON detection definition returned by Google Vision and creates a handy FlowFile attribute that allows us to access that information using other Apache NiFi processors and do with it as we will. And just for reference, here is a more graphical representation of the same image landmark detection from Google.

I’m really excited about the new opportunities that the combination of Google’s advanced analytics and Apache NiFi’s agility will bring to end users. Much more information about Google Vision can be found at https://cloud.google.com/vision/ and https://cloud.google.com/blog/big-data/2016/09/around-the-world-landmark-detection-with-the-cloud-vision-api
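For readers who want to see the shape of the landmark detection call outside of NiFi, here is a minimal Python sketch. It is not the GoogleVisionProcessor's implementation (that is Java and lives in the repository linked above). It only builds the JSON body for the Vision `images:annotate` REST endpoint and pulls landmark names out of a response. The helper names and the sample response values are made up for illustration, and no network call is made.

```python
import base64

# Public REST endpoint for Cloud Vision annotate requests (an API key or
# OAuth token is required to actually call it; none is used in this sketch).
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_landmark_request(image_bytes, max_results=1):
    """Build the JSON body for a LANDMARK_DETECTION annotate request."""
    return {
        "requests": [
            {
                # Image content is sent as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LANDMARK_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def extract_landmarks(response_body):
    """Return (description, score) pairs from an annotate response."""
    landmarks = []
    for response in response_body.get("responses", []):
        for annotation in response.get("landmarkAnnotations", []):
            landmarks.append((annotation.get("description"), annotation.get("score")))
    return landmarks


# Hypothetical response for illustration only; the description and score
# are invented and are not results from the experiment in this article.
sample_response = {
    "responses": [
        {"landmarkAnnotations": [{"description": "Georgia State Capitol", "score": 0.87}]}
    ]
}

request_body = build_landmark_request(b"fake image bytes")
print(request_body["requests"][0]["features"])
print(extract_landmarks(sample_response))
```

In a NiFi flow the equivalent of `extract_landmarks` is what ends up in the FlowFile attribute, so downstream processors can route on the detected landmark.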
07-21-2016
01:41 PM
3 Kudos
@Brandon Wilson I believe the best way to do this would be to install NiFi on your system using $NIFI_HOME/bin/nifi.sh install
and then use an OS-level process management tool (supervisord on Ubuntu, for example) to monitor that process and restart it based on the configuration that you provide to the process management tool.
05-24-2016
02:56 AM
12 Kudos
Recently I decided it was time to give my lawn a makeover. Years of brutally hot Atlanta summers have taken their toll on my grass and it's, well … dead. I chose the do-it-yourself route and, as usual, went over budget and invested far more time than I had planned. Nonetheless, I now have a decent-looking backyard. Given my investment and how much I travel, I decided it would be worth the extra money to install an automated sprinkler system. There are several of these on the market, so I set out to do my research. I chose Rachio (http://rachio.com) for its ability to control watering based on weather and other conditions, which is awesome! While Rachio is a great product with several features, it had a few shortcomings that I was hoping to supplement with my existing home setup (powered by Apache NiFi, of course). The idea was to use all of the existing features that Rachio offered and then use data from my local home automation setup to further supplement the watering system. There were two main features that Rachio didn’t offer that I wanted to add.

Dog (Zeke) Location - My wife and I have the world’s coolest dog (Zeke). He does have a few weaknesses, however, and water happens to be one of them. Since he spends much of his time in the fenced-in backyard by himself, I can’t risk the sprinklers turning on while he is back there, or he will enter hyper puppy play mode and dig them all up. Worse yet, he will bring that mud back into the house with him once he is done. It is a must that my system understand when he is outside and not allow watering to occur.

Outdoor Gatherings - We use our backyard a lot and don’t want unexpected waterings while we have guests in the backyard. While Rachio allows you to manually control this with an app, I wanted a more automated approach that understood when we were in the backyard without any manual process.

After settling on the features that I wanted to add to my system, I set out to solve the technical implementation and landed on the approaches listed below.

Dog (Zeke) Location - This problem was a little tough to solve. I finally landed on installing an iBeacon (Gimbal Series 10) on Zeke’s collar and setting up a custom Raspberry Pi BLE scanner that I had made for another project. This is out of the scope of this blog, but at a high level the scanner sits at his only entrance/exit to the yard and toggles between him being either inside or outside. This is a C++ and Python application that uses Linux BlueZ. An instance of Apache NiFi is also running on this Raspberry Pi and forwarding the JSON iBeacon payload to my NiFi master cluster for further analysis.

Outdoor Gatherings - Similar to tracking Zeke with his iBeacon collar, I have a separate wireless network in my backyard and use MikroTik RouterOS software to monitor for the MAC addresses of friends' and family’s mobile phones. The logic is that if a known MAC address is connected to that network in the backyard, then someone is back there and we should delay the sprinklers being turned on. Another instance of Apache NiFi is gathering output from RouterOS and sending that information to the main NiFi instance for further analysis.

To recap, I have three instances of Apache NiFi running. Two instances are gathering data at its point of inception and passing that data along to the third instance, where the data is analyzed. This third instance also sends requests to the Rachio API to turn off the watering system if a dog or human being is detected in the backyard. Let's take a look at the NiFi workflow of the third instance that ultimately controls the watering system. The workflow was created with out-of-the-box features and simple steps to follow. Clearly Apache NiFi is the Cadillac of integrating with other awesome 3rd party systems!
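The decision made by the third NiFi instance boils down to a small rule: hold off watering if Zeke is outside or if any known phone is on the backyard network. The Python sketch below captures only that rule. The function name, the MAC addresses, and the input shapes are hypothetical, and the actual flow is built from NiFi processors rather than code like this. It deliberately does not call the Rachio API.

```python
def should_delay_watering(dog_outside, connected_macs, known_macs):
    """Return True when watering should be held off.

    dog_outside    -- latest inside/outside state from the iBeacon scanner
    connected_macs -- MAC addresses currently on the backyard network
    known_macs     -- MAC addresses of friends' and family's phones
    """
    # Compare case-insensitively, since tools format MAC addresses differently.
    connected = {mac.lower() for mac in connected_macs}
    guest_present = any(mac.lower() in connected for mac in known_macs)
    return dog_outside or guest_present


# Hypothetical addresses, for illustration only.
known = ["AA:BB:CC:00:11:22"]
print(should_delay_watering(False, ["aa:bb:cc:00:11:22"], known))  # a known phone is connected
print(should_delay_watering(False, ["de:ad:be:ef:00:01"], known))  # nobody we know, dog inside
```

Keeping the rule as a single boolean makes it easy to add further signals later without touching the part of the flow that talks to the sprinkler controller.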
04-19-2016
11:41 PM
13 Kudos
I’m constantly amazed by what powerful things I can do with Apache NiFi in so few steps. I often challenge myself by saying “self, I bet you couldn’t do X with NiFi”. My confidence was challenged yesterday on a long flight back from Peru to Atlanta when I realized I couldn’t perform OCR-type tasks with NiFi as it stands today. Perturbed by this fact, I set out to come up with a solution. Ultimately this led me to create a NiFi Tesseract processor for performing OCR tasks natively from within Apache NiFi. It wasn’t really until I was finished that I realized how useful this processor could be. The NiFi Tesseract processor would give me the ability to read anything from handwritten doctors' notes from healthcare systems to scanned children’s book images.
In fact, I chose to demonstrate the latter by showing how to use Apache NiFi to perform OCR on an excerpt from Dr. Seuss's “Cat in the Hat” and then feeding the resulting text from the NiFi Tesseract processor to the Mac OS X “say” command to read the output. I have included a screen recording session that shows Apache NiFi reading in a page from Cat in the Hat and then reading the results aloud.
Screen Recording - Using Apache NiFi to read children's books
Only 5 simple drag and drop processors for a computer to read a child’s book! Thanks Apache NiFi!
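To make the two steps concrete outside of NiFi, here is a rough Python sketch of the same pipeline: OCR an image, then hand the text to `say`. It assumes the `tesseract` command line tool is installed and on the PATH and that it is run on a Mac. It shells out to the CLI rather than using the NiFi Tesseract processor, so treat it as an approximation of the idea, not of the processor. The function name is hypothetical.

```python
import subprocess


def ocr_and_speak(image_path, run=subprocess.run):
    """OCR an image with the tesseract CLI, then read the text aloud with `say`.

    `run` defaults to subprocess.run and can be swapped out for testing.
    """
    # `tesseract <image> stdout` prints the recognized text to standard output.
    ocr = run(["tesseract", image_path, "stdout"],
              capture_output=True, text=True, check=True)
    text = ocr.stdout.strip()
    if text:
        # `say` is the Mac OS X text-to-speech command mentioned above.
        run(["say", text], check=True)
    return text
```

Calling `ocr_and_speak("page.png")`, where `page.png` stands in for a scanned page, would print nothing itself but return the recognized text after it has been spoken.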
04-19-2016
09:40 PM
10 Kudos
Avro is a popular file format within the Big Data and streaming space. Avro has 3 important characteristics that make it a great fit for both Big Data and streaming applications.
Avro files are self-describing. This means Avro files can be opened and the schema definition viewed as standard JSON or inspected programmatically by numerous applications. This makes your application code much less brittle, as the schema information can be obtained from the incoming Avro file itself rather than being manually defined in your code.
Avro can handle a wide range of data types natively. This includes things like complex types, maps, arrays, and even raw bytes.
Avro supports schema evolution, which can come in very handy in streaming systems where the data flowing through the system can change without notice.
Now that all of the pros of Avro have been called out, there is a problem. In reality we rarely encounter Avro files while ingesting/streaming data. Why is this? Mostly because storing data in the Avro format requires defining the Avro schema up front. Defining the Avro schema up front requires some planning and a knowledge of the Avro file format itself that not everyone has. This is a tragedy given the many benefits Avro provides to us. This was the driving factor for me creating the “InferAvroSchema” processor within Apache NiFi. InferAvroSchema exists to help end users who don’t have either the time or the knowledge to create Avro files. It overcomes the initial creation complexity of Avro and allows Apache NiFi users to quickly take more common flat data files, like CSV, and transform them into Avro.
So now that we have a little background, let's get into the details of how we make this happen using Apache NiFi and InferAvroSchema. Before we start, let's take a look at the end result to get the big picture. Our end result is a workflow that loads a CSV file holding weather-specific data and converts it to an Avro file. The Weather.csv file is loaded using GetFile and then examined by the InferAvroSchema processor to determine the appropriate Avro schema. After the Avro schema is generated, the ConvertCSVToAvro processor uses that schema to convert the CSV weather data to an Avro file. The resulting Avro file is ultimately written back to a local file on the NiFi instance machine.

While the CSV data certainly doesn’t have to come from a local file (it could have been FTP, S3, HDFS, etc.), that was the easiest to demonstrate here. There is a local CSV file on my Mac called “Weather.csv”. “Weather.csv” is loaded by the GetFile processor, which places the complete contents of “Weather.csv” into a new NiFi FlowFile. Below is a snippet of the contents of “Weather.csv” to provide context. As you can see, the CSV data contains a couple of different weather data points for a certain zip code. We want to take this data from its original CSV format and convert it to an Avro file.

So where do we start? First we want to use the “InferAvroSchema” processor to help us make our Avro schema definition without having to manually define it ourselves, since we are pretending we have no idea how to make Avro schemas by hand. InferAvroSchema can examine the contents of CSV or JSON data and provide a recommended Avro schema definition based on the data that it encounters in the incoming FlowFile content. InferAvroSchema provides a lot of flexibility via configuration, so let's go through those settings and what they mean now. The above screenshot shows the available properties for InferAvroSchema. At first look it can be a little daunting, so let's break down what is happening here by stepping through each property.
Schema Output Destination - This property controls where the Avro schema will be written once it has been generated. Remember, we are not converting the CSV to Avro with this processor; we are only creating the Avro schema. The actual work of converting the CSV to Avro is done by another processor (ConvertCSVToAvro). ConvertCSVToAvro requires an Avro schema to perform its duties, so it makes sense to place the resulting schema in a location that can be easily accessed by ConvertCSVToAvro when it is needed. That is why I have chosen to output the schema as an attribute on the FlowFile, so that I can use the NiFi expression language from within the ConvertCSVToAvro processor, as you will see later.
Input Content Type - Lets the processor know what type of data is in the FlowFile content that it should try to infer the Avro schema from. In our example here that is CSV, but JSON is also valid.
CSV Header Definition - Since an Avro schema needs to know the name of each field it contains, this gives us a mechanism to provide those names. This value can also be loaded from the CSV header line in the data, but I placed it here just to demonstrate. You will notice it is also present in the Weather.csv screenshot above, and we handle that in the next property.
Get CSV Header Definition From Data - Since we manually specified the CSV Header Definition, we don’t want to get the header definition from the Weather.csv file itself. If we had chosen not to manually specify it, we could have set this to true to pull that value from the Weather.csv file.
CSV Header Line Skip Count - Since the Weather.csv file does in fact contain a header line, but we chose not to use it and specified the header manually, we need to make sure we skip that line so that it is not fed into the Avro schema logic itself. This is why the value is set to 1, to skip that first line.
CSV Escape String - Character used to escape strings.
CSV Quote String - Character used to quote CSV data.
Pretty Avro Output - Controls whether the resulting Avro output is pretty-formatted or not. Strictly for aesthetic purposes.
Avro Record Name - This value will be the name of the Avro record in the resulting Avro schema. You can set this value to whatever you desire. Of course it makes sense to name it something relevant to the data, so here I have called it “Weather”.
Number of Records To Analyze - This is how many records the processor should analyze to determine the type (String, Long, etc.) of the data present in the CSV data. 10 is the default and seems to be the sweet spot for accuracy and performance.
Charset - The character encoding of the incoming FlowFile content. In this case Weather.csv is UTF-8 encoded, so that is what I have specified.

At this point you will have an Avro schema that was automatically generated based on the raw incoming CSV data, congratulations! This doesn’t do us much good on its own, however. We still need to put that Avro schema to work and convert the original Weather.csv data into an Avro file. We do that in our next step with the ConvertCSVToAvro processor. The configuration for that processor is described below. The properties for ConvertCSVToAvro are a little more straightforward, so we aren’t going to go through them one by one. I do want to point out the value for Record Schema, however. You will notice it has a value of ${inferred.avro.schema}. If you recall, in the InferAvroSchema processor above we told it to write the resulting Avro schema to a FlowFile attribute, so now we are able to access that value using the NiFi expression language here. The name of the FlowFile attribute will always be inferred.avro.schema.

At this point our Weather.csv data has successfully been converted to an Avro file and we can do whatever we desire with it. I chose to simply write the data back to another local file, which will be named “Weather.csv.avro”. Here is a screenshot of that output.
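To give a feel for what "inferring" a schema involves, here is a small Python sketch of the general idea: sample a handful of records, pick a type per column, and emit an Avro record schema as JSON. It is a simplification written for illustration and is not the InferAvroSchema implementation; the real processor's output may differ (for example in how it treats nullable fields). The function names and the sample columns are hypothetical, since the actual Weather.csv contents are only shown in a screenshot. The parameters loosely mirror the CSV Header Definition, CSV Header Line Skip Count, Number of Records To Analyze, and Avro Record Name properties described above.

```python
import csv
import io
import json


def infer_type(values):
    """Pick the narrowest Avro primitive that fits every sampled value."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    if all_parse(int):
        return "long"
    if all_parse(float):
        return "double"
    return "string"


def infer_avro_schema(csv_text, record_name, header=None, skip_lines=0, sample_size=10):
    """Build an Avro record schema (as a dict) from the first few CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if header is None:
        # No manual header: take field names from the first line of the data.
        header, rows = rows[0], rows[1:]
    else:
        # Header supplied manually: skip any header lines present in the data.
        rows = rows[skip_lines:]
    sample = rows[:sample_size]
    fields = []
    for index, name in enumerate(header):
        column = [row[index] for row in sample if index < len(row)]
        fields.append({"name": name, "type": infer_type(column)})
    return {"type": "record", "name": record_name, "fields": fields}


# Hypothetical data; these column names and values are invented.
weather_csv = "zip,temp_f,humidity\n30301,88.5,61\n30302,90.1,58\n"
schema = infer_avro_schema(
    weather_csv, "Weather", header=["zip", "temp_f", "humidity"], skip_lines=1
)

# Mimics writing the schema to the FlowFile attribute read by ${inferred.avro.schema}.
attributes = {"inferred.avro.schema": json.dumps(schema)}
print(json.dumps(schema, indent=2))
```

Sampling only the first few records keeps inference cheap, at the cost of guessing wrong when later rows contain values the sample did not, which is the accuracy and performance trade-off behind the default of 10.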
04-19-2016
09:01 PM
Can you please post the JSON coming off of AttributesToJSON? Changing "Include Core Attributes" alone will not solve your problem.
04-19-2016
08:02 PM
Ahh, I think I see what the problem is. I think it is because you have "Include Core Attributes" set to true in AttributesToJSON, so some extra fields that are not present in the database table are getting introduced into the JSON. Please paste that content I mentioned earlier, however, so I can validate.
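A quick way to check this theory once you have the FlowFile content is to compare the JSON field names against the table's columns. The Python sketch below does just that. The payload, the column names "id" and "name", and the function name are hypothetical; "filename", "path", and "uuid" are NiFi core attributes of the kind "Include Core Attributes" adds.

```python
import json


def unexpected_fields(flowfile_json, table_columns):
    """Return JSON field names that have no matching table column."""
    fields = json.loads(flowfile_json)
    columns = {column.lower() for column in table_columns}
    return sorted(name for name in fields if name.lower() not in columns)


# Hypothetical FlowFile content and table definition.
payload = '{"id": "1", "name": "zeke", "filename": "a.json", "path": "./", "uuid": "abc"}'
print(unexpected_fields(payload, ["id", "name"]))
```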
04-19-2016
07:59 PM
Ok, so the only way you should be seeing this is if the JSON isn't in the format that ConvertJSONToSQL is expecting. The processor does a final Iterator<String> fieldNames = rootNode.getFieldNames(); and then performs a while loop on that Iterator, incrementing a "fieldCount" variable each time. The only way you could see this is if the JSON isn't really what you think it is. I see the connection between "AttributesToJSON" and "ConvertJSONToSQL" has some FlowFiles in there. Can you right click that connection, list the contents, and paste the exact contents of one of them here? I'm wondering if "AttributesToJSON" is doing something squirrely. I wrote it, so it's certainly possible ...
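To see why the shape of the JSON matters, here is a rough Python analogue of that field-counting loop. It is only an approximation of the Java logic described above, with a hypothetical function name, and it assumes that a root node that is not a JSON object yields no field names.

```python
import json


def count_top_level_fields(flowfile_content):
    """Count the field names on the root JSON node, like the fieldCount loop."""
    root = json.loads(flowfile_content)
    if not isinstance(root, dict):
        # Arrays, strings, and numbers have no field names to iterate over.
        return 0
    field_count = 0
    for _name in root:
        field_count += 1
    return field_count


print(count_top_level_fields('{"id": 1, "name": "zeke"}'))   # flat object: two fields
print(count_top_level_fields('[{"id": 1, "name": "zeke"}]'))  # array root: no fields
```

A payload that is wrapped in an array, or that was serialized twice into a JSON string, would produce a count of zero even though it looks fine at a glance.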
04-19-2016
07:46 PM
Your configuration looks valid to me. Can you post a screenshot showing your configuration for what is being written to the FlowFile contents and fed to the ConvertJSONToSQL processor? It also might help to validate that the JSON payload you expect is actually in the FlowFile's content by using a LogAttribute processor, with the "Log Payload" property set to true, right before the ConvertJSONToSQL processor.