
Read consistency - Impala

Explorer

We have a use case to reload all transaction data every month for a defined set of years. We will use Spark to build the required reporting tables, and Impala for analytical workloads through a BI tool.

How do we separate the data-processing tables from the reporting tables and then swap them in Impala? We want to minimise the impact on users in terms of BI system availability and to ensure read consistency.

 

Any ideas?

 

I've come across a couple of options. One is partitioning, but we would need to copy or move the files into the reporting directory location so that the data-processing tables can be reused for the next run.

The other option is to lock the table, remove its files, and move the files from the working directory to the reporting directory (which again impacts users for the duration of the file removal and movement).

 

Ideally, it would work if we could have two databases and swap them once the data load completes.

 

 

Thanks.

Suresh

 

1 ACCEPTED SOLUTION


Hi Suresh,

 

That solution seems fine to me. Changing the location of a single table with ALTER TABLE is atomic, but you won't be able to change the locations of two tables atomically in one operation. Just something to be aware of.
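For example, with two reporting tables the two swaps could be issued back-to-back, but a query that lands between them could see the old location of one table and the new location of the other. A sketch (table and path names here are illustrative, not from the original thread):

```sql
-- Each statement is atomic on its own, but there is a brief window
-- between the two in which the tables point at mismatched locations.
ALTER TABLE reporting.transactions SET LOCATION '/data/transactions/location2';
ALTER TABLE reporting.summary SET LOCATION '/data/summary/location2';

-- Have Impala pick up the data files at the new locations:
REFRESH reporting.transactions;
REFRESH reporting.summary;
```

If cross-table consistency during the swap matters, the window can be kept very short by issuing the statements back-to-back, or avoided entirely with the two-database swap idea from the original question.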

 

Alex


3 REPLIES


Hi Suresh,

 

Even if your use case is slightly different, I'd recommend you take a look at this blog post; it presents best practices and may give you a few ideas:

 

http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/

 

Alex

Explorer

Thanks Alex. I can see some references to swapping tables/views and so on, but it looks quite complex to maintain.

How about if we use two different HDFS locations, say Location1 and Location2, and swap them so the reporting tables point at the right location once the data load completes in the other? It looks like we can use the ALTER TABLE command to change a table's HDFS location, so the swap would be just a metadata operation. What do you think?

 

1st run:

Location1 - use for Data Processing

Location2 - use for Reporting

 

2nd run:

Location1 - use for Reporting

Location2 - use for Data Processing
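The swap between runs can be sketched as a single metadata operation per reporting table (the table name and paths below are illustrative, not from the original thread):

```sql
-- The 1st run loads fresh data into Location1 while the reporting
-- table reads from Location2. Once the load completes, re-point the
-- reporting table; this is metadata-only and atomic for a single table:
ALTER TABLE reporting.transactions SET LOCATION '/data/transactions/location1';

-- Have Impala re-scan the data files at the new location:
REFRESH reporting.transactions;

-- The 2nd run then processes into Location2, and the swap reverses.
```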

 

Thanks

Suresh
