How to add a new column to an existing parquet table and how to update it ?
@nhakhoaparis. I often seen questions like this and although there are many solutions, the one i prefer is to leave the source table alone, create another table with the modified schema. In one form of this concept, you leave the old table behind, and INSERT INTO new_table SELECT * FROM old_table.
This is immutable table is also common concept for parquet. You do not modify columns. You read, change, then re-write. One way to complete the above concept in hive query language: select parquet into a non parquet table, do your work to modify the new table, update the new column, etc, then select back into a new parquet table with the new schema.
You can also do some of the above with spark or other programming languages. Many options, but in summary: leave the source table alone and create a new tables. During the course, i like to call these staging tables, and sometimes keep the new names, or drop original table, and rename new table to the old table.
If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create separate topic and feel free to tag me in your post.
We have some background on schema evolution in Parquet in the docs - https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html. See "Schema Evolution for Parquet Tables". Some of the details are specific to Impala but the concepts are the same across engines including Hive and Spark that use parquet tables.
At a high level, you can think of the data files being immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the parquet files unchanged. When the table is queried, the table schema and parquet file schema are reconciled and the new column's values will be all NULL.
If you want to modify the existing rows and include new non-NULL values, that would require rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE .. AS SELECT to create an entirely new table.
Keep in mind that traditional Parquet tables are not optimized for workloads with updates - Apache Kudu in particular and also transactional tables in Hive3+ have support for row-level updates that is more convenient/efficient.
We definitely don't require rewriting the whole table every time you want to add a column, that would be impractical for large tables!