Member since: 09-21-2015
Posts: 133
Kudos Received: 130
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 348 | 12-17-2016 09:21 PM
 | 320 | 11-01-2016 02:28 PM
 | 68 | 09-23-2016 09:50 PM
 | 116 | 09-21-2016 03:08 AM
 | 85 | 09-19-2016 06:41 PM
01-04-2017
02:05 PM
1 Kudo
Hi @Sankaraiah Narayanasamy, for full Phoenix SQL syntax, you can use Spark's native JDBC features.
... View more
12-23-2016
07:01 AM
1 Kudo
Hi @Boris Demerov, please also see @Wes Floyd's Storm & Kafka guide here. There's some overlap between it and @Constantin Stanca's recommendations, but you may find it useful anyway.
... View more
12-21-2016
10:26 PM
1 Kudo
I need to route specific "streams" of flowfiles to specific cluster nodes by FlowFile attribute value. Is there a way to take an attribute and ensure that all incoming FlowFiles with the same value for an attribute get distributed to the same cluster node?
... View more
12-18-2016
08:17 PM
3 Kudos
Hi, @rudra prasad biswas, "select * from myTable" in Phoenix will only return columns defined in the table DDL. To include dynamic columns in query results, you must explicitly include them in the select clause. See the docs for an example. This is where views come in handy. Instead of manually defining dynamic columns in each query, you can define them once in a view, then select * from myView.
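A minimal sketch of both approaches, reusing the myTable/myView names above; COL1 and EXTRA_COL are hypothetical columns:
-- Declare the dynamic column inline in each query that needs it...
SELECT COL1, EXTRA_COL FROM myTable (EXTRA_COL VARCHAR);
-- ...or declare it once in a view, after which select * exposes it
CREATE VIEW myView (EXTRA_COL VARCHAR) AS SELECT * FROM myTable;
SELECT * FROM myView;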
... View more
12-18-2016
07:49 PM
Hi @rudra prasad biswas, for this use case, consider a few different approaches:
1. Use a single array-type "stoppage" column which stores each stop as a JSON element (see the table sketch below). To query for records where the last stoppage was New York, use something like "where stoppage[array_length(stoppage)-1] = 'New York'".
2. Continue using dynamic columns, but include an additional column "stoppage_count" which you increment for each additional stoppage. You can use stoppage_count to tell your application which dynamic columns to include in the query.
3. Use a relational model where you have records of trips, and another table with "stoppages" linked to the trip ID.
However, the above approaches assume your queries have an initial access pattern limiting the size of the scan. Assuming your table has a primary key of (customer_id, trip_id), you DON'T want to run a query like:
select * from trips where stoppages[array_length(stoppages)-1] = 'New York'
because it would be a full scan. Instead, you want something like:
select * from (
  select * from trips where customer_id = '1234'
) a
where stoppages[array_length(stoppages)-1] = 'New York'
which would be a much smaller range scan.
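A sketch of the table shape for option 1 (names are illustrative only, using Phoenix's ARRAY type and a composite primary key):
CREATE TABLE trips (
    customer_id VARCHAR NOT NULL,
    trip_id VARCHAR NOT NULL,
    stoppages VARCHAR ARRAY,
    CONSTRAINT pk PRIMARY KEY (customer_id, trip_id)
);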
... View more
12-17-2016
09:21 PM
3 Kudos
Hi @rudra prasad biswas, column families are used to optimize RegionServer memory by separating frequently and infrequently used columns. For example, when using HBase for document storage, document metadata is queried 5-10x for every one read of document content. In this type of application, you would store document content in one column family and document metadata in another, allowing HBase to avoid loading document content into memory until it is explicitly accessed. As @Ryan Cicak hinted at, Phoenix tables assign columns to the first column family by default. However, you can specify which family to assign columns to:
CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)
In this example, COL3 is stored in column family B, and the other columns belong to family A. Assuming there is a subset of column qualifiers you know you will use, they should be defined in the table DDL; however, this is not required. See the Phoenix docs on dynamic columns and views for how to write arbitrary column qualifiers to a table without pre-defining those columns. Views let you keep track of and expose dynamic columns when needed.
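As a quick illustration (hypothetical values) of writing and reading a column qualifier that is not in the DDL above:
-- EXTRA_COL is not declared in the table; give it a type inline
UPSERT INTO TEST (MYKEY, COL1, EXTRA_COL VARCHAR) VALUES ('row1', 'a-value', 'dynamic-value');
-- Dynamic columns must be declared again at read time
SELECT MYKEY, COL1, EXTRA_COL FROM TEST (EXTRA_COL VARCHAR) WHERE MYKEY = 'row1';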
... View more
12-12-2016
06:51 AM
1 Kudo
I have a very high incoming event rate and a parser written in Python. I'm able to run it with ExecuteStreamCommand, but the latency of starting/stopping a process for every incoming FlowFile is too high and I can't keep up with my source. Is it possible for ExecuteStreamCommand to keep the external process alive and pass new FlowFiles to it after some signal?
... View more
11-01-2016
02:28 PM
2 Kudos
Hi @Zack Riesland, the Hive shell behaves like any other bash program: you can check the status of the previous command with $? -- unsuccessful queries/jobs return a non-zero exit code. For example:
[root@dn0 /]# hive -e "show databas;"
Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user/root":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1827)
....
[root@dn0 /]# echo $?
1
... View more
10-13-2016
08:03 PM
Correct me if you know otherwise, but SQuirreL isn't a SQL engine itself. It needs a backend to connect to before it can do much of anything. When doing data exploration outside of a Hadoop environment, I've had success using standalone Zeppelin/Spark as a way to run SQL against static files (including files stored in S3).
... View more
10-13-2016
05:35 PM
S3a would be better, but depending on your cluster version, s3n may just work, whereas S3a still has kinks yet to be worked out.
... View more
10-13-2016
05:33 PM
@Zack Riesland, something like the following works pretty well:
hive -e "create table my_export stored as textfile location 's3n://my_bucket/my_export' as select * from my_table;"
This will be a parallel write into S3, as if it were just another directory on HDFS. You will need to use the AWS APIs to configure the needed security policies on the bucket and/or "subfolders".
... View more
10-11-2016
08:24 PM
@Sunile Manjee, Depending on what data your Spark application attempts to access, it uses the relevant JVM HBase client APIs: filters, scans, gets, range-gets, etc. See the code here.
... View more
10-07-2016
04:18 PM
Worth noting: the above works for 's3n://' URLs, but not 's3a://' or plain 's3://'.
... View more
10-07-2016
03:20 PM
I am trying to create a table in S3 using HiveQL. I have added the following access key configs to HDFS core-site.xml and hive-site.xml:
fs.s3.awsAccessKeyId, fs.s3n.awsAccessKeyId, fs.s3a.awsAccessKeyId
fs.s3.awsSecretAccessKey, fs.s3n.awsSecretAccessKey, fs.s3a.awsSecretAccessKey
I have added what I believe are the relevant AWS jars to Hive's classpath:
hive> add jar /usr/hdp/current/hadoop-client/hadoop-aws.jar;
hive> add jar /usr/hdp/current/hadoop-client/lib/aws-java-sdk-s3-1.10.6.jar;
hive> add jar /usr/hdp/current/hadoop-client/lib/aws-java-sdk-core-1.10.6.jar;
Unfortunately, creating a table stored in s3a fails:
hive> create table test_backup_a stored as orc location 's3a://hwx-randy/test_backup_a' as select * from test;
...
Moving data to directory s3a://hwx-randy/test_backup_a
Failed with exception Unable to move source hdfs://dn0.dev:8020/apps/hive/warehouse/.hive-staging_hive_2016-10-07_15-14-34_668_7427003017223534386-1/-ext-10001 to destination s3a://hwx-randy/test_backup_a
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
... View more
10-05-2016
08:11 PM
1 Kudo
I've seen the nifi.variable.registry.properties config, but those values seem to be loaded statically and only at NiFi start. I have a use case where connection strings and other environment settings need to be read periodically from a database and used to control, for example, the hostname and directory for GetSFTP. How can I achieve this?
... View more
10-05-2016
05:33 PM
1 Kudo
Thank you, @Michael Young!
... View more
09-27-2016
05:22 PM
He filters all cells by timestamp here. If I understand correctly, without the filter, the DataFrame would expose all cell versions that exist in the snapshot.
... View more
09-27-2016
05:07 PM
2 Kudos
@Artem Ervits See @Dan Zaratsian's examples reading cell versions and timestamps from a snapshot here.
... View more
09-23-2016
09:50 PM
1 Kudo
@Sunile Manjee I don't have stats, but you need to use the Phoenix bulk load tool regardless, as a plain HBase bulk load will not ensure consistent secondary indices, nor will it use the sign and byte-ordering conventions that Phoenix expects.
... View more
09-23-2016
08:34 PM
Livy is not "hidden". If you have started the Livy server, you can interact with its REST API from any application.
... View more
09-22-2016
07:33 PM
*Edit*
Realistically, questions about shared SparkContexts are often about one of two things:
1. Making shared use of cached DataFrames/Datasets. Livy and the Spark Thrift JDBC/ODBC server are decent initial solutions; keep an eye on Spark-LLAP integration, which will be better all around (security, efficiency, etc.).
2. Spark applications consuming all of a cluster's resources. Spark's ability to spin executor instances up and down dynamically based on utilization is probably a better solution to this problem than sharing a single SparkContext.
... View more
09-22-2016
06:27 PM
1 Kudo
@Sunile Manjee Consider how Spark applications run: a driver runs either on the client or in a YARN container. If multiple users will ask the same Spark application instance to do multiple things, they need an interface for communicating that to the driver. Livy is the out-of-the-box REST interface that shares a single Spark application by presenting a control interface to external users. If you do not want to use Livy but still want to share a Spark context, you need to build an external means of communicating with the shared driver. One solution might be to have the driver periodically pull new queries from a database or from files on disk. This functionality is not built into Spark, but could be implemented with a while loop and a sleep statement.
... View more
09-21-2016
03:08 AM
3 Kudos
@hduraiswamy - in order of preference:
1. SyncSort
2. Use the mainframe's native JDBC services – often unacceptable, as the mainframe must consume additional MIPS to convert into JDBC types before sending over the net
3. Use this open serde, which unfortunately skips reading everything except fixed-length fields, severely limiting its usefulness
4. I've heard about LegStar being used for similar projects, but am not sure how.
... View more
09-19-2016
06:41 PM
2 Kudos
@Andrew Watson You can set cell-level ACLs via the HBase shell or via HBase's Java API. This type of policy is not exposed or controlled via Ranger. If possible, I would implement row-level policies in a client-side application, as HBase's cell ACLs are expensive (additional metadata must be stored and read with every cell). My favorite solution is to create a Phoenix view that exposes only specific rows. As noted above, your client-side app would have to decide whether to allow access to a given view.
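For example, a row-filtering view might look like this (table and column names are hypothetical), with the client-side app deciding which users may query it:
CREATE VIEW ORDERS_EAST AS SELECT * FROM ORDERS WHERE REGION = 'EAST';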
... View more
09-14-2016
06:29 PM
3 Kudos
@Carlos Barichello Livy isn't hidden. If you've started Livy, you can use its REST API to launch Spark jobs from Zeppelin or from elsewhere.
... View more
09-14-2016
05:24 PM
1 Kudo
Hi @Mukesh Kumar. Storing the metadata in HBase is a great design. Whether the content itself should go in HBase or directly on HDFS depends on content size. HBase now has medium object (MOB) support, which means content up to a few MB is fine, particularly if you store the metadata and the actual content in separate column families. On the UI front, if you have files stored in HDFS, you can use string concatenation to embed the filename in a WebHDFS URL: <a href="http://<HOST>:<PORT>/webhdfs/v1/user/dev/images/img1.gif?op=OPEN">Link</a>, which will download as a file when clicked. Note, I've done this in Zeppelin, but haven't tried it in the Hive View or in Hue. If you're accessing content from HBase, you'll need a service to front HTTP calls. The Phoenix Query Server may make this possible out of the box, but I haven't tried.
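A sketch of that layout in Phoenix DDL (all names are illustrative), keeping metadata and content in separate column families so metadata queries never load the content bytes:
CREATE TABLE DOCS (
    DOC_ID VARCHAR NOT NULL PRIMARY KEY,
    META.FILENAME VARCHAR,
    META.CREATED_AT TIMESTAMP,
    CONTENT.BODY VARBINARY
);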
... View more
09-12-2016
10:07 PM
2 Kudos
Hi @gkeys, following the Redshift example here, the properties in your JDBC interpreter settings would instead be phoenix.user, phoenix.driver, and phoenix.url. Since you have modified the "default" prefix, does plain "%jdbc" not work for the interpreter setting?
... View more
09-08-2016
07:49 PM
@Jasper I see your point. For a further (external) example of using the Spark Data Source API to pass a predicate, see this comment. As a predicate, can you try "where keyCol1 > X and keyCol2 < Y"?
What does the explain plan say if you issue that filter in a normal JDBC query?
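For reference, a quick way to check that from sqlline (placeholder table and values):
EXPLAIN SELECT * FROM MY_TABLE WHERE KEYCOL1 > 100 AND KEYCOL2 < 50;
-- a predicate on the leading primary key columns should show a RANGE SCAN rather than a FULL SCAN in the plan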
... View more