I am testing ephemeral Hadoop clusters that use S3 as permanent storage and use Hadoop only for computation and intermediate-result storage. I found that when I start a new Hadoop cluster, loading S3 data into HDFS is fast, but creating partitions and generating Hive table-level and column-level statistics is very slow. Is there a way to save Hive table information, such as partition information and the table-level and column-level statistics, so that a new Hadoop cluster can reuse this information?
I am not sure which version of Hive you are using, but in recent versions loading partitions and computing column statistics have been improved a lot.
1. Keep the data in S3. Spin up a separate RDS instance to act as the metastore database.
2. Create your Hive tables as external tables. Dropping an external table does not delete the underlying data. Use MSCK REPAIR TABLE to discover the data in S3 and populate the partition metadata in the metastore.
3. When you tear down the cluster, the metadata is still available in RDS. When another ephemeral cluster is spun up, point its metastore to the RDS instance created in step 1. In that case you do not need to re-import anything; all tables are visible automatically.
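For step 1, pointing the metastore at the external RDS is done in hive-site.xml. A minimal sketch, assuming a MySQL-compatible RDS instance (the endpoint, database name, and credentials below are hypothetical placeholders):

```xml
<!-- hive-site.xml: point the Hive metastore at an external RDS database -->
<!-- Endpoint, database name, and credentials are placeholders; use your own -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://my-metastore.example.rds.amazonaws.com:3306/hivemetastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.mariadb.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>your-password-here</value>
</property>
```

Every new cluster that ships this same configuration will see the same databases, tables, partitions, and statistics.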
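Step 2 in HiveQL might look like this (table name, columns, and bucket path are made up for illustration):

```sql
-- External table: only metadata is stored in the metastore; data stays in S3
CREATE EXTERNAL TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/sales/';

-- Scan S3 for partition directories (e.g. dt=2024-01-01/) and
-- register them as partitions in the metastore
MSCK REPAIR TABLE sales;
```

Because the table is external, dropping it on one cluster only removes metadata, and MSCK only reads directory names, so re-registering partitions is much cheaper than reloading data.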
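Regarding the statistics in the question: table-level and column-level statistics computed by ANALYZE are also persisted in the metastore, so with the shared RDS they survive cluster teardown. A sketch, reusing the hypothetical `sales` table:

```sql
-- Compute per-partition table-level statistics (row counts, sizes);
-- results are written to the metastore, not the cluster
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS;

-- Compute column-level statistics (min/max, NDV, nulls) for all partitions
ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
```

Run these once; any later cluster pointed at the same metastore gets the statistics for free, so the query optimizer benefits without recomputing them on every spin-up.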