Support Questions
Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Innovation Accelerator group hub.

Impala external table partition files added over time - How can I broadcast a REFRESH to all nodes?

New Contributor

I have a scenario where parquet files are being added over time to a partition of an Impala external table. The catalog service broadcasts changed metadata as a result of ALTER TABLE, INSERT and LOAD DATA to all nodes but it is not aware obviously when I create a parquet file and add it to a partition. Is there a way to force the catalog service to broadcast to all nodes and cause each of them to refresh their metadata? I'm trying to avoid having to connect to every impala daemon (any of which can be connected to and used as a coordinator) to issue a REFRESH.

 

 

1 ACCEPTED SOLUTION

Cloudera Employee
Hi,
The issuing a REFRESH will broadcast the updated table to all nodes. There
may be a short delay after the command completes before the changes are
visible on remote nodes. This is because we must wait for a statestore
update to propagate the changes out.

If you need stronger consistency guarantees (the change is visible on all
nodes at the time it completes) you can use the query option:
SET SYNC_DDL=true

This will impact DDL performance, so it is disabled by default. For optimal
performance, you can batch your DDL statements and only enable this query
option for the final statement in the batch. For example:

> connect to node 1
CREATE TABLE Foo
ALTER TABLE ADD PARTITION 1
..
SET SYNC_DDL=true
ALTER TABLE ADD PARTITION N

> connect to node2
SELECT * FROM TABLE

Thanks,
Lenni

View solution in original post

1 REPLY 1

Cloudera Employee
Hi,
The issuing a REFRESH will broadcast the updated table to all nodes. There
may be a short delay after the command completes before the changes are
visible on remote nodes. This is because we must wait for a statestore
update to propagate the changes out.

If you need stronger consistency guarantees (the change is visible on all
nodes at the time it completes) you can use the query option:
SET SYNC_DDL=true

This will impact DDL performance, so it is disabled by default. For optimal
performance, you can batch your DDL statements and only enable this query
option for the final statement in the batch. For example:

> connect to node 1
CREATE TABLE Foo
ALTER TABLE ADD PARTITION 1
..
SET SYNC_DDL=true
ALTER TABLE ADD PARTITION N

> connect to node2
SELECT * FROM TABLE

Thanks,
Lenni