Member since: 01-16-2018
Posts: 16
Kudos Received: 1
Solutions: 0
12-10-2018
03:11 AM
1 Kudo
You do not need to create a new table. You can use the existing table if you alter it to add the new column family. "rewrit[ing] the data" means that you must read all data and write it again using the new column family. Whether you read it from HBase or from its original form is of no consequence.
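For reference, a minimal HBase shell sketch of that approach (the table, column family, row, and column names here are only placeholders):

alter 'my_table', {NAME => 'new_cf'}
# then re-read the original data and write it again into the new column family, e.g.
put 'my_table', 'row1', 'new_cf:col1', 'value1'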
10-08-2018
06:33 AM
Hi @Sai Krishna Makineni There isn't really an option for Oozie to run against a specific host; the only way to execute on a specific host in Oozie is the SSH Action Extension. The ssh action starts a shell command on a remote machine as a remote secure shell in the background, and the workflow job waits until the remote shell command completes before continuing to the next action.
The shell command must be present on the remote machine and must be available for execution via the command path.
Examples and more info: https://oozie.apache.org/docs/4.2.0/DG_SshActionExtension.html
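A minimal ssh action sketch based on the documentation linked above (host, command path, and transition names are placeholders):

<action name="ssh-example">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>user@target-host.example.com</host>
        <command>/home/user/scripts/my_job.sh</command>
        <args>arg1</args>
        <capture-output/>
    </ssh>
    <ok to="next-action"/>
    <error to="fail"/>
</action>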
08-23-2018
04:39 PM
https://www.welookups.com
08-24-2018
06:48 AM
@Sai Krishna Makineni Somehow the query does not work on Hive 1.2.1. Your query looks good; can you check the data format in the ts column? Mine was in the format yyyy-MM-dd HH:mm:ss (the default Hive timestamp format).
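A quick way to check the format (table and column names are assumptions; adjust the pattern to your data):

SELECT ts,
       unix_timestamp(ts, 'yyyy-MM-dd HH:mm:ss') AS parsed_ts
FROM   my_table
LIMIT  10;
-- rows where parsed_ts is NULL contain a ts value that does not match the expected format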
08-15-2018
09:45 PM
3 Kudos
@Sai Krishna Makineni You can use either the ListHDFS or the GetHDFSFileInfo processor and schedule it to run nightly. Once you list the files from HDFS, use the hdfs.lastModified attribute (or extract the timestamp from the filename with the substringAfter function) and check the timestamp value in a RouteOnAttribute processor. Once you filter out the files older than your cutoff, feed them to a DeleteHDFS processor to delete them.
Keep in mind that the ListHDFS processor stores state and only lists incrementally, so if you want it to list everything again you have to clear its state via the REST API at /processors/{id}/state/clear-requests and then run the processor again; GetHDFSFileInfo does not store state.
Flow 1: ListHDFS -> RouteOnAttribute (check the filename or lastModified time) -> DeleteHDFS (delete the files in HDFS)
Flow 2: GenerateFlowFile -> GetHDFSFileInfo -> RouteOnAttribute -> DeleteHDFS
Alternatively, you can use the GetHDFS processor (with Keep Source File set to true), which does not store state, but that processor fetches the file content from HDFS, so with big files you put a lot of load on NiFi.
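As an illustration (the routing property name, the 7-day cutoff, and the host/port are assumptions), a RouteOnAttribute property that matches files older than 7 days based on hdfs.lastModified (epoch milliseconds), and the REST call to clear ListHDFS state (the processor must be stopped first), could look like:

older_than_7_days: ${hdfs.lastModified:lt(${now():toNumber():minus(604800000)})}

curl -X POST http://nifi-host:8080/nifi-api/processors/<processor-id>/state/clear-requests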
06-19-2018
10:33 PM
You will want to set whatever column has your "LAST MODIFIED" values as the Maximum Value Column in GenerateTableFetch. The first time, it will still generate SQL to pull the complete data, but it will also keep track of the maximum observed value from your Maximum Value Column. The next time GenerateTableFetch runs, it will only generate SQL to fetch the rows whose value for LAST MODIFIED is greater than the last observed maximum. If you want the first generation to start at a particular value (for the Maximum Value Column), you can add a user-defined property called "initial.maxvalue.<maxvaluecol>", where "<maxvaluecol>" is the name of the column you specified as the Maximum Value Column. This allows you to "skip ahead", and from then on GenerateTableFetch will continue in normal operation, keeping track of the current maximum and only generating SQL to fetch rows whose values are larger than the current max. If you need a custom query (or, more correctly, you want to add a custom WHERE clause), you can do that by setting the Custom WHERE Clause property of GenerateTableFetch. If you need completely arbitrary queries, then as of NiFi 1.7.0 (via NIFI-1706) you can use QueryDatabaseTable to provide arbitrary queries. This capability does not exist for GenerateTableFetch, but we can investigate adding it as an improvement; please feel free to file a Jira for this.
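For illustration only (table name, column name, and values are assumptions, and the exact statement depends on the configured Database Type and page size), the user-defined property and the kind of SQL GenerateTableFetch emits on a later run might look roughly like:

initial.maxvalue.LAST_MODIFIED = 2018-06-01 00:00:00.0

SELECT * FROM orders
WHERE LAST_MODIFIED > '2018-06-01 00:00:00.0'
ORDER BY LAST_MODIFIED
LIMIT 10000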
03-18-2018
04:47 AM
Hello @Sai Krishna Makineni Updating existing records on HDFS/HBase has never been a trivial use case, and the implementation varies greatly from one use case to another. Multiple options have been mentioned in the previous answers, each with its pros and cons. Here are my inputs.
1. Storing the data in HBase and exposing it using Hive is a very bad idea! Firing queries that don't use the RowKey column is a sin, and joins are a strict no-no. You need to understand that NoSQL databases are not meant for your usual RDBMS-like operations. In the RDBMS world, you first think of your design and the queries come into the picture later. In NoSQL, you first think of your queries and then design your schema accordingly. So the only good thing about that solution is having your data in HBase and relying on HBase to deduplicate it through "upserts"; your "random" SQL operations and join requirements will be marginalized.
Is there still hope using HBase? Maybe, maybe not. It really depends on your data size, and even more on your requirements. For example, are your tables a few gigs in size? Is your business OK with a certain level of stale data? If the answers to the above questions are YES, you may still be able to go ahead with HBase for deduplication. You don't need to do anything; HBase will automatically take care of it through its upsert logic. You can then dump the contents of the HBase table(s) onto HDFS and create a Hive table on top of that, say, once a day. Your data will be a day stale, but you will get far better performance than with HBase+Hive or even HBase+Phoenix solutions.
2. Deduplication using Hive has its own consequences. Beware! Things can really get out of hand very easily here. Say that every time you dump a new file into HDFS you use a window function like rank() over (...), pick the "latest" record by "Last Modified Date", keep it, and delete the rest. The problem is that your data only grows with every passing day, so the time and resources taken by this job keep increasing, and at a certain point you may not want the operation to trigger at all, since it will either eat a substantial amount of your cluster resources or take so long that it no longer makes business sense. And if you already have a large table, I don't think this is even an option.
Can you use Hive for deduplication at all? As with HBase, it totally depends on your use case. If you have a few columns that are good candidates for partitioning or bucketing columns, you should definitely use them. Let me give you an example. You identify the latest records using LastModifiedTS, and your table also has a column called CreatedTS. You should use the CreatedTS column as a partitioning column if it fits the use case. What's the benefit? The next time you have a data dump on HDFS from NiFi, simply identify the unique CreatedTS values in that new data, and pick up only the partitions of the existing table that correspond to those CreatedTS values. You will realize that you are using only a fraction of the existing data for the "upsert" windowing operations, compared with ranking over the entire table. The operation is sustainable over a longer period of time and takes far less time/resources to get the job done. This is the best option you can use, if it is applicable to your ingestion pattern.
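A minimal Hive sketch of that partition-pruned dedup, using row_number() as the window function and assuming a hypothetical target table events partitioned by created_dt, a staging table new_data holding the newly landed rows, a key column id, and a last_modified_ts column (all names are assumptions):

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events PARTITION (created_dt)
SELECT id, payload, last_modified_ts, created_dt
FROM (
  SELECT id, payload, last_modified_ts, created_dt,
         row_number() OVER (PARTITION BY id ORDER BY last_modified_ts DESC) AS rn
  FROM (
    SELECT id, payload, last_modified_ts, created_dt FROM new_data
    UNION ALL
    SELECT id, payload, last_modified_ts, created_dt
    FROM events
    WHERE created_dt IN (SELECT DISTINCT created_dt FROM new_data)   -- only the touched partitions
  ) merged
) ranked
WHERE rn = 1;

-- Only the partitions that appear in the new data are read and overwritten,
-- instead of ranking over the entire table.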