Member since: 09-21-2015
Posts: 133
Kudos Received: 130
Solutions: 24

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4871 | 12-17-2016 09:21 PM |
|  | 2922 | 11-01-2016 02:28 PM |
|  | 1191 | 09-23-2016 09:50 PM |
|  | 2068 | 09-21-2016 03:08 AM |
|  | 1136 | 09-19-2016 06:41 PM |
04-11-2016
01:22 AM
@jfrazee If I match on multiple entries in the dictionary, will this processor emit one FlowFile for every matched entry, or a single FlowFile with attributes for all of the matched entries?
04-10-2016
10:55 PM
I have flat files of metadata (updated every few minutes throughout the day) and another stream that I need to join to this metadata in real time. I know I can accomplish this in Storm or Spark Streaming with some code. Can NiFi help me do this without writing code? For example, I have a list of malicious websites (the metadata) and I'm streaming in HTTP requests. I need to join the domains on those requests against the list of malicious websites and emit an alert if there is a match. A slightly more complex version of the same requirement: how would I incorporate regular updates to the metadata?
Labels:
- Apache NiFi
04-07-2016
07:39 PM
2 Kudos
Can I add a secondary index to a dynamic column defined as part of a view?
Labels:
- Apache HBase
- Apache Phoenix
04-06-2016
08:47 PM
@Jeremy Dyer I updated the question with additional items. Any comments on those?
04-06-2016
08:15 PM
2 Kudos
How do upserts of new records impact the number of pre-split regions? How do updates of existing records impact the number of pre-split regions?
Labels:
- Apache HBase
- Apache Phoenix
04-04-2016
03:05 AM
Thanks, manually specifying the full path worked. Is there a way to set a default directory? I've tried "set hive.metastore.warehouse.dir=my_dir" but it had no effect.
04-02-2016
09:26 PM
2 Kudos
I'm using SparkSQL (local mode) in Zeppelin for development work. As I am not running on a cluster, I do not have /user/hive/warehouse directories. If I'm using strictly SQL, is there a way I can specify the directory of my newly created tables? How about setting the default output directories?

Failing example:

%sql
create table pings as
select
split(time, " ")[0] as month,
split(time, " ")[2] as year,
split(split(time, " ")[3], ":")[0] as hour,
split(split(time, " ")[3], ":")[1] as minute,
split(split(split(time, " ")[3], ":")[2], "\\.")[0] as second,
substr(split(split(split(time, " ")[3], ":")[2], "\\.")[1],0, 3) as ms,
*
from pings_raw
The error:

MetaException(message:file:/user/hive/warehouse/pings is not a directory or unable to create one)
set zeppelin.spark.sql.stacktrace = true to see full stacktrace
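As noted in the 04-04-2016 reply above, manually specifying the full path worked. A minimal sketch of what that could look like, assuming Zeppelin's %pyspark interpreter with the HiveContext exposed as sqlContext; the local path is a placeholder and the select is simplified:

```python
# Illustrative only: the path below is a placeholder and the column list from
# the failing example is simplified to select *.
# Giving the table an explicit LOCATION avoids the missing
# /user/hive/warehouse default when running SparkSQL in local mode.
sqlContext.sql("""
    create table pings
    location 'file:///tmp/warehouse/pings'
    as select * from pings_raw
""")
```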
Labels:
- Apache Spark
03-30-2016
07:12 PM
5 Kudos
What factors should inform whether I use NiFi or Sqoop for ingesting my data?
Labels:
- Apache NiFi
- Apache Sqoop
03-21-2016
04:20 AM
2 Kudos
In case you haven't already created your Hive table, this will help you do so. Assuming your cluster is running on Linux VMs, Python is already installed. I can't comment on tf-idf, but the below should help you understand a generic approach to integrating Python functions with Hive queries.

Once you have a Hive table, running Python scripts against records is straightforward. You'll need to ssh into your "edge" node, one that has the Hive CLI installed. To start it, type "hive" at the command prompt, then:

hive> add file my_script.py;
hive> select transform(col1, col2) using 'my_script.py' as result1, result2 from my_table;

See the transform docs, but essentially this will run the equivalent of a MapReduce streaming job against every record in my_table. Records are passed to your script delimited by newlines, and fields are delimited by tabs. Anything you print to standard out will be interpreted the same way (print one line per output record, fields separated by tabs). Here's an example that uses a Python script to validate datatypes. Feel free to ping back if you need additional help.
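For reference, a minimal sketch of what such a script could look like; the script name matches the example above, but the column handling is purely illustrative. The only contract that matters is the one described above: read tab-delimited fields from each line on standard input and print tab-delimited fields, one output record per line.

```python
#!/usr/bin/env python
# my_script.py -- illustrative transform script for the example above.
# Hive streams one tab-delimited record per line on stdin; every line printed
# to stdout becomes one output record (fields separated by tabs).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    col1, col2 = fields[0], fields[1]

    # Illustrative "datatype validation": flag whether col2 parses as an int.
    try:
        int(col2)
        col2_is_int = "true"
    except ValueError:
        col2_is_int = "false"

    # Two output fields map to result1, result2 in the query above.
    print("\t".join([col1, col2_is_int]))
```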
03-17-2016
10:08 PM
2 Kudos
Hi @Amar ch, I didn't time it, but it takes somewhere between 15 and 30 minutes to fully start the NiFi process. One of the first things NiFi does on startup is unpack all the NAR files in the lib directory to make them available as processors. If you want the service to start faster, you can remove NARs for processors you don't intend to use.

Even once the service is started, I've found the NCM will be very slow. For this reason, I would plan on developing a template on your laptop and importing it into your Raspberry Pi NiFi instance rather than trying to build out the flow directly on the Pi.

The stumbling block above aside, once the flow is defined and started, I have had zero problems with it. I've been monitoring WiFi traffic for several months without a blip, even after a few power outages; NiFi started back up and resumed working just fine.

Edit: The number of standard NARs has grown since my comment in December, hence the increase in startup time from 10 to 30 minutes 😃
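A minimal sketch of the NAR pruning mentioned above; the paths and keep-list are placeholders, and NiFi needs its framework and other core bundles to boot, so only move NARs you are sure your flow does not use, and keep them somewhere you can restore from:

```python
# Illustrative only: moves NAR bundles you don't plan to use out of NiFi's lib
# directory into a sibling backup folder so startup has fewer NARs to unpack.
# NIFI_LIB, BACKUP and KEEP are assumptions -- adjust them for your install.
import glob
import os
import shutil

NIFI_LIB = "/opt/nifi/lib"                      # assumed install location
BACKUP = "/opt/nifi/lib-unused"                 # NARs are moved here, not deleted
KEEP = ("nifi-framework", "nifi-standard", "nifi-jetty")  # illustrative prefixes

if not os.path.isdir(BACKUP):
    os.makedirs(BACKUP)

for nar in glob.glob(os.path.join(NIFI_LIB, "*.nar")):
    name = os.path.basename(nar)
    if not name.startswith(KEEP):               # keep anything matching KEEP
        shutil.move(nar, os.path.join(BACKUP, name))
        print("moved " + name)
```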