About KFredrickson

KFredrickson · ‎01-09-2024

We're currently looking at upgrading NiFi from 1.11 to 1.22.0. During the past NiFi upgrade we encountered a bunch of breaking changes in processor behavior and found that there was no way to safely upgrade without going through and checking the behavior of every single flow in the new version. We have dozens of flows in production, so this was pretty cumbersome.

To possibly make this upgrade a bit easier, I was considering including the old NAR files along with the new NAR files so that we would have the option of using the old 1.11 processor versions for our existing flows, avoiding any problems coming from new behavior in the 1.22.0 versions of the processors. NiFi doesn't complain when you do this and even seems to have some built-in features to support it.

But I eventually encountered errors that looked like they were coming from incompatible Java classes. One error came from the ExecuteSQL processor complaining about SNAPPY compression for its Avro file output, something about the class org.xerial.snappy. We also saw that the UpdateAttribute processor was losing the configurations in the "Advanced" UI.

I certainly don't understand much about how class loaders are supposed to work, but I thought that NAR files were supposed to be isolated in some way so that you wouldn't have problems with different NARs using incompatible versions of the same Java class. I did read that you are supposed to put extra NAR files in a directory other than the default lib directory, but that didn't seem to help. Does anyone have experience doing this successfully?
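For anyone trying to narrow down which libraries might be involved, here is a minimal sketch that lists jars bundled at different versions across a set of NAR files. It assumes the bundled jars live under META-INF/bundled-dependencies/ inside each NAR and that jar names follow the usual artifact-version.jar pattern; the function names are hypothetical. Different versions in different NARs are expected when NARs are isolated, so the output is only a list of candidates to look at (for example, whether snappy-java shows up), not a diagnosis.

```python
import re
import zipfile
from collections import defaultdict

# Assumption: NARs keep their bundled jars in this directory.
BUNDLED_DIR = "META-INF/bundled-dependencies/"
# Assumption: jar names look like <artifact>-<version>.jar, version starting with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def bundled_versions(nar_paths):
    """Map artifact name -> {version: [NAR paths that bundle that version]}."""
    found = defaultdict(lambda: defaultdict(list))
    for nar in nar_paths:
        with zipfile.ZipFile(nar) as archive:
            for name in archive.namelist():
                if not name.startswith(BUNDLED_DIR):
                    continue
                match = JAR_NAME.match(name[len(BUNDLED_DIR):])
                if match:
                    found[match["artifact"]][match["version"]].append(str(nar))
    return found


def version_conflicts(nar_paths):
    """Return only the artifacts that appear with more than one version."""
    return {
        artifact: dict(versions)
        for artifact, versions in bundled_versions(nar_paths).items()
        if len(versions) > 1
    }
```

Calling version_conflicts() with the paths of the old and new NAR files returns a dictionary keyed by artifact name, which can then be checked against the classes named in the error messages.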

KFredrickson · ‎02-06-2022

This was a result of a bug in my code and not anything to do with Hive itself - please ignore.

KFredrickson · ‎02-03-2022

Hi, I am seeing some situations where I have two Hive SQL commands running concurrently and I'm getting a lost update. I am running Hive 2.3.6 on EMR with hive.support.concurrency = true, and I believe this shouldn't be happening based on what I understand about Hive table locking. (I am not using ACID transactions, but the table locking should still prevent a lost update as far as I know.)

Specifically, I have a "load data" statement loading data into table T from an S3 location. I have an "insert overwrite T select * from T" statement running concurrently from another Hive connection that deletes some rows from T but should not be affecting rows from the load data statement. I am seeing that the data from the load data statement disappears after the insert overwrite finishes.

My understanding is that the load data and insert overwrite should each create an exclusive table lock on T, so they should allow each other to finish before reading or writing data from T. (I checked this using "show locks" and they do definitely create an exclusive lock.) Has anyone seen this issue before, and are there any Hive settings I can try changing to prevent this behavior?

KFredrickson · ‎02-12-2021

Looks like the files are available here: https://repo.hortonworks.com/content/repositories/releases/org/apache/nifi/nifi-hive-nar/

KFredrickson · ‎01-01-2020

I found that it's possible to fix this problem (as well as a different problem we were having with accessing Hive via a ZooKeeper connection string) by doing the following: use a custom NiFi Hive NAR file that has the Hortonworks versions of the hive, hadoop and zookeeper jars. This will get rid of the problem with backticks and the problem with the ZooKeeper connection string.

To create the NAR file I just unzipped nifi-hive-nar-1.10.0.nar that comes with the Apache NiFi distro, then replaced all the hive-*, hadoop-*, and zookeeper-* jars with the ones in http://repo.spring.io/hortonworks/org/apache/nifi/nifi-hive-nar/1.9.0.3.4.1.9-2/

You can just treat the NAR files as regular ZIP files. There is no need to compile anything or use Maven. We have been using this custom NAR for a few weeks and the NiFi Hive processors seem to be working without any problems.
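The unzip-and-replace steps above can be sketched as a small script, in case someone wants to repeat them without doing it by hand. This is only a sketch under assumptions: it assumes the jars sit under META-INF/bundled-dependencies/ inside both NARs, that the replacement jars are taken from a second (vendor) NAR file downloaded locally, and the function and file names are hypothetical.

```python
import fnmatch
import zipfile

# Assumption: bundled jars live in this directory inside a NAR.
BUNDLED_DIR = "META-INF/bundled-dependencies/"
# Jar name patterns to swap, matching the hive-*, hadoop-* and zookeeper-* jars.
JAR_PATTERNS = ("hive-*.jar", "hadoop-*.jar", "zookeeper-*.jar")


def rebuild_nar(original_nar, vendor_nar, output_nar, patterns=JAR_PATTERNS):
    """Write output_nar as a copy of original_nar, except that jars matching
    the patterns are dropped and the matching jars from vendor_nar are added."""

    def is_swapped(entry_name):
        base = entry_name.rsplit("/", 1)[-1]
        return entry_name.startswith(BUNDLED_DIR) and any(
            fnmatch.fnmatch(base, pattern) for pattern in patterns
        )

    with zipfile.ZipFile(original_nar) as original, \
            zipfile.ZipFile(vendor_nar) as vendor, \
            zipfile.ZipFile(output_nar, "w", zipfile.ZIP_DEFLATED) as output:
        # Keep everything from the original NAR except the jars being replaced.
        for info in original.infolist():
            if not is_swapped(info.filename):
                output.writestr(info, original.read(info.filename))
        # Add the matching jars from the vendor NAR.
        for info in vendor.infolist():
            if is_swapped(info.filename):
                output.writestr(info, vendor.read(info.filename))
```

A call such as rebuild_nar("nifi-hive-nar-1.10.0.nar", "vendor-hive.nar", "nifi-hive-nar-custom.nar") (the last two file names are placeholders) would produce the custom NAR. It is worth listing the result with zipfile or unzip -l before deploying it, to confirm that only the intended jars changed.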

KFredrickson · ‎12-23-2019

When running Hive queries from NiFi 1.10 that contain backticks, I get the following error:

2019-12-23 15:17:00,191 WARN [Timer-Driven Process Thread-2] o.a.nifi.processors.hive.SelectHiveQL SelectHiveQL[id=075b3a1e-7632-1647-68d9-338231b5921b] Failed to parse query: select 1 as `asdf` due to java.lang.NullPointerException:

I thought Hive queries allowed backticks to escape column names, so I'm not sure why NiFi can't parse this. The actual query runs fine on the Hive server, and I get a valid flow file with the query results, but it still raises a red NiFi bulletin (which we would prefer not to have if there is not a real problem). The Hive server running the query is the one that comes with HDP 2.5.

KFredrickson · ‎08-08-2018

The id values in the extra rows are null even though the source data does not contain any rows with a null id. By all appearances it's returning rows of corrupt data that do not really exist in the table.

KFredrickson · ‎08-03-2018

AFAIK Hive should be able to handle text with newlines if the table is stored in a binary format such as ORC, Avro, Parquet etc. We could strip out the newlines from the data before putting it in Hive, but we'd rather not do that since the original source data (coming from SQL Server) contains newlines.

KFredrickson · ‎08-03-2018

Not 100% sure what you mean but I do think the issue is related to certain columns in the table. If I do a "select id order by id" that comes through without any issues. It's only when certain columns are included that we get corruption, and these seem to be the columns that are of datatype string with newlines in them.

KFredrickson · ‎08-03-2018

We have an external Hive table (let's call it example_table) created on top of an ORC file in Hadoop. Doing a simple select query such as:

select * from example_table

works fine, but

select * from example_table order by id

returns lots of extra rows that look like corrupt data. We have seen cases where this returns 10x the number of rows as the query without the order by clause. If it's relevant, the ORC files were created by Spark and we are using Hive 1.2.1 on HDP 2.5. I suspect this may have something to do with the fact that the ORC data has fields of string datatype that contain newlines. Is this a known bug and/or are there any Hive settings we could try changing to fix this?

UPDATE: Here is a Hive script we came up with to reproduce the problem:

create table default.test_errors(c1 string) stored as orc;
--Make sure to include newlines here
with CTE as (select 'a
b
c' as c1)
insert into default.test_errors select c1 from CTE;
--Returns 1 row
select * from default.test_errors;
--Returns 3 rows
select * from default.test_errors order by c1;
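To illustrate why the newlines look suspicious, here is a toy sketch (not Hive code) of how one row can turn into three. It assumes, without confirmation, that somewhere in the order by path the rows pass through a newline-delimited text representation; the function names are hypothetical.

```python
def write_rows_as_text(rows):
    """Serialize rows as text, one row per line (the assumed intermediate format)."""
    return "\n".join(rows)


def read_rows_from_text(text):
    """Read rows back by splitting on newlines, as a line-oriented reader would."""
    return text.split("\n")


# One row whose single string column contains two embedded newlines,
# mirroring the value inserted by the reproduction script.
original_rows = ["a\nb\nc"]
round_tripped = read_rows_from_text(write_rows_as_text(original_rows))

print(len(original_rows), "row written")
print(len(round_tripped), "rows read back")
```

If something like this is what happens, it would fit the observation that only queries including the newline-containing string columns return extra rows.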

Member Since	10-24-2017 04:49 PM
Last Visited	07-28-2024 09:03 PM
Posts	17
Kudos received	2

Cloudera Community

Re: Hive concurrency - lost update

NiFi support for multiple versions of same NAR

Re: Hive concurrency - lost update

Hive concurrency - lost update

Re: NiFi 1.10 Hive processors and backticks

Re: NiFi 1.10 Hive processors and backticks

NiFi 1.10 Hive processors and backticks

Re: Hive returns extra rows with "order by"

Re: Hive returns extra rows with "order by"

Re: Hive returns extra rows with "order by"

Hive returns extra rows with "order by"