Let's assume we have 60 days of data in a Hive table. How can we automatically move data older than a given period (30 days) to S3 and keep only the latest 30 days in HDFS? How would we write a Hive query that reads the entire 60 days of data? How can a single Hive table point to multiple storage locations, S3 and HDFS?
Also, is it possible to configure S3 as archival storage?
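To use both S3 and HDFS for your Hive table, you could use an external table with partitions pointing to different locations. Hive records a location for each partition, so recent partitions can stay on HDFS while older ones point to S3, and an ordinary query over the table reads from both.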
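As a rough illustration of what "partitions pointing to different locations" can look like, here is a small Python sketch that only renders the HiveQL statements. The table name `events`, the partition column `dt`, the columns, and both storage paths are made up for the example, so adjust them to your setup.

```python
def add_partition_stmt(table, dt, location):
    """Render HiveQL that registers one partition at an explicit location."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{location}';"
    )


# All names and paths below are hypothetical placeholders.
statements = [
    # External table partitioned by day; Hive does not own the files.
    "CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING) "
    "PARTITIONED BY (dt STRING) STORED AS ORC "
    "LOCATION 'hdfs:///data/events';",
    # A recent partition that stays on HDFS.
    add_partition_stmt("events", "2024-03-01",
                       "hdfs:///data/events/dt=2024-03-01"),
    # An older partition whose files live on S3.
    add_partition_stmt("events", "2024-01-15",
                       "s3a://example-archive/events/dt=2024-01-15"),
    # One query spans both partitions, wherever their files are stored.
    "SELECT COUNT(*) FROM events WHERE dt >= '2024-01-01';",
]

if __name__ == "__main__":
    print("\n".join(statements))
```

The printed statements could be pasted into a Hive session or passed to `hive -e`. The cluster needs S3 credentials and an S3 filesystem connector configured for the `s3a://` partition to be readable.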
Look for the process that starts at "An interesting benefit of this flexibility is that we can archive old data on inexpensive storage" in this link:
To automate this process, you could use cron, but I guess Apache Falcon should also work.
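For the cron route, a scheduled script might look something like the sketch below. It only builds the commands for partitions older than the retention window (copy to S3, repoint the partition, delete the HDFS copy) and leaves running them to you. The table name, the `dt` partition column, the paths, and the one-directory-per-day layout are all assumptions, not something taken from your cluster.

```python
from datetime import date, timedelta

# Hypothetical names; replace with your own table and storage roots.
TABLE = "events"
HDFS_ROOT = "hdfs:///data/events"
S3_ROOT = "s3a://example-archive/events"
RETENTION_DAYS = 30


def archive_plan(partition_dates, today, retention_days=RETENTION_DAYS):
    """Return one list of shell commands per partition that should move to S3.

    A partition is kept on HDFS while it is fewer than `retention_days`
    days old, counting today as day 0.
    """
    plan = []
    for d in sorted(partition_dates):
        if (today - d).days < retention_days:
            continue  # still inside the retention window, leave on HDFS
        dt = d.isoformat()
        src = f"{HDFS_ROOT}/dt={dt}"
        dst = f"{S3_ROOT}/dt={dt}"
        plan.append([
            # 1. Copy the partition directory to S3.
            f"hadoop distcp {src} {dst}",
            # 2. Point the Hive partition at the new location.
            f"hive -e \"ALTER TABLE {TABLE} PARTITION (dt='{dt}') "
            f"SET LOCATION '{dst}'\"",
            # 3. Remove the HDFS copy once the steps above succeeded.
            f"hdfs dfs -rm -r {src}",
        ])
    return plan


if __name__ == "__main__":
    today = date.today()
    partitions = [today - timedelta(days=i) for i in range(60)]
    for steps in archive_plan(partitions, today):
        print("\n".join(steps))
```

A daily cron entry could run this and execute each partition's commands in order, stopping at the first failure so the HDFS copy is never deleted before the S3 copy and the repointing have succeeded. In a real job the partition list would come from `SHOW PARTITIONS` rather than being generated from dates.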