Created on 03-07-2016 02:20 PM - edited 09-16-2022 03:07 AM
We have a use case to reload all transaction data every month for a defined set of years. We are going to use Spark to create the required reporting tables, and Impala for analytical workloads with a BI tool.
How do we separate the data processing tables from the reporting tables and then swap the tables in Impala? We want to minimise the impact on users in terms of availability of the BI system and to ensure read consistency.
Any ideas?
I've come across a couple of options. One is partitioning, but we would need to copy or move the files to the reporting directory location so that the data processing tables can be reused for the next run.
The other option is to lock the table, remove the files, and move the files from the working directory to the reporting directory (again, this will impact users while the files are being removed and moved).
Ideally, it would work if we could have two databases and swap them once the data load completes.
thanks.
Suresh
Created 03-07-2016 11:18 PM
Hi Suresh,
even if your use case may be slightly different, I'd recommend you take a look at this blog post, which presents best practices and may give you a few ideas:
http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
Alex
Created 03-08-2016 09:26 AM
Thanks Alex. I can see some references to swapping tables/views and so on, but I think it looks very complex to maintain.
How about using two different HDFS locations, say Location1 and Location2, and swapping them so that the reporting tables use the right location once the data load has completed on the other one? It looks like we can use the ALTER TABLE command to change the HDFS location of a table, so the swap would be just a metadata operation. What do you think?
1st run:
Location1 - use for Data Processing
Location2 - use for Reporting
2nd run:
Location1 - use for Reporting
Location2 - use for Data Processing
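The alternation above could be sketched roughly as follows. This is only an illustration under stated assumptions: the HDFS paths, the table name, and the helper functions are hypothetical, and `execute` stands for whatever callable sends one SQL string to Impala (for example a DB-API cursor's `execute` method). The `REFRESH` after the swap is also an assumption; it may not be strictly required.

```python
from typing import Callable, List

# Hypothetical paths: the thread only calls them Location1 and Location2.
LOCATIONS = (
    "hdfs:///data/transactions/location1",
    "hdfs:///data/transactions/location2",
)


def other_location(current: str) -> str:
    """Return the location that is NOT currently serving reports."""
    if current not in LOCATIONS:
        raise ValueError(f"unknown location: {current}")
    return LOCATIONS[1] if current == LOCATIONS[0] else LOCATIONS[0]


def swap_statements(table: str, new_location: str) -> List[str]:
    """Build the Impala statements that repoint `table` at `new_location`."""
    return [
        f"ALTER TABLE {table} SET LOCATION '{new_location}'",
        # Assumption: a REFRESH so Impala sees files written by Spark.
        f"REFRESH {table}",
    ]


def run_monthly_swap(
    table: str, reporting_location: str, execute: Callable[[str], object]
) -> str:
    """Repoint the reporting table at the location used for data processing."""
    target = other_location(reporting_location)
    # ... the Spark job would write the reloaded data into `target` here ...
    for stmt in swap_statements(table, target):
        execute(stmt)
    # `target` now serves reports; the old location is free for the next run.
    return target
```

Starting from the 1st run (Location2 serving reports), `run_monthly_swap("reporting.transactions", LOCATIONS[1], cursor.execute)` would repoint the table at Location1 and return it, which matches the 2nd run above.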
thanks
Suresh
Created 03-08-2016 11:06 PM
Hi Suresh,
that solution seems fine to me. Changing the location of a single table with ALTER is atomic, but you won't be able to atomically change the locations of two tables simultaneously. Just something to be aware of.
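To make the caveat concrete, here is a rough sketch of swapping several tables one after another. The function name and the rollback behaviour are assumptions for illustration, not something Impala provides; `execute` is again any callable that sends one SQL string to Impala.

```python
from typing import Callable, Dict, List


def swap_tables(
    new_locations: Dict[str, str],
    old_locations: Dict[str, str],
    execute: Callable[[str], object],
) -> List[str]:
    """Repoint each table in turn; on failure, point swapped tables back.

    Each ALTER is a separate statement, so a query that reads two of these
    tables while the loop runs can see one old and one new location.
    """
    swapped: List[str] = []
    try:
        for table, location in new_locations.items():
            execute(f"ALTER TABLE {table} SET LOCATION '{location}'")
            swapped.append(table)
    except Exception:
        # Best-effort rollback so the tables do not stay in a mixed state.
        for table in reversed(swapped):
            execute(
                f"ALTER TABLE {table} SET LOCATION '{old_locations[table]}'"
            )
        raise
    return swapped
```

The rollback only shortens the time the tables spend in a mixed state; it does not make the multi-table swap atomic.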
Alex