Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to move more than a week old data into different HDFS location using NIFI

Highlighted

How to move more than a week old data into different HDFS location using NIFI

Explorer

My hive queries are running slow due to the enormous amount of data. The client is only interested in one week of data in the table. Currently, it holds one year of data. Partitions /bucketing didn't improve a lot. So, I want to only keep one week of data in the target table( external table). So, every day at some point of time, I want to move the data which is more than one week old to a different hdfs location where an archive external table points too.This is a backup table to look at any old data. We want to do all the data flow through NIFI. So, Is it possible to run a processor that move > week old data?
One option i thought is to write a shell script and NIFI calling the shell script? Is there anyother way? My File names have a datetime stamp as well

I highly appreciate some advice on this one. Thanks!

2 REPLIES 2
Highlighted

Re: How to move more than a week old data into different HDFS location using NIFI

Super Guru

Other folks more familiar with Hive can hopefully give you better pointers, but did you partition on timestamp, day, week, etc? Seems like if you had a column with a "week since epoch" or something, partitioning should work fairly well.

Alternatively, if you have a timestamp field (let's say "myTime") you should be able to use a separate NiFi flow consisting of a GenerateFlowFile -> PutHiveQL with the GenerateFlowFile running every week and adding an attribute "hiveql.args.1.value" with the value "${now()}", an attribute "hiveql.args.1.type" set to 93 (or the correct Timestamp type) and PutHiveQL with something like:

INSERT INTO TABLE myTable SELECT * FROM myOriginalTable WHERE myTime >= ?;
DELETE from myOriginalTable where myTime >= ?

IIRC you'll need ACID transactions enabled for that table in order to do the delete. I also didn't try this so I'm not sure of the exact syntax, just thinking it's worth a try :)

Highlighted

Re: How to move more than a week old data into different HDFS location using NIFI

Super Guru

Your queries should improve drastically with the correct partitioning and bucketing. Can you share your DDL? Use internal tables (vs external). use ORC. Run stats.

Don't have an account?
Coming from Hortonworks? Activate your account here