
Adding partition column

Rising Star

Hello,

I have a very large table (50B+ rows) that is partitioned by (year, month, day).

Now I want to add a new "server" partition column, so that the resulting partition schema is (year, month, day, server).

Looking at the ALTER TABLE command, there seems to be no way to add an additional partition column.

An alternative would be to create a new table with the (year, month, day, server) partition schema and then do an INSERT ... SELECT * from the old table into the new table. This is an expensive operation; am I missing something, or is there an easier way to do this?
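For concreteness, something like this is what I mean (Hive syntax; the column names are just placeholders for my real schema):

-- New table with the extra 'server' partition column
-- (ts/payload are placeholder columns, not the real schema).
CREATE TABLE mytbl_new (
  ts BIGINT,
  payload STRING
)
PARTITIONED BY (year INT, month INT, day INT, server STRING)
STORED AS PARQUET;

-- Rewrite all 50B+ rows via a dynamic-partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mytbl_new PARTITION (year, month, day, server)
SELECT ts, payload, year, month, day, server
FROM mytbl;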

 

 

Thx,

 

Maarten


4 REPLIES


Hi Maarten,

 

I'm afraid you are correct: you will have to create a new table and do an INSERT ... SELECT * FROM oldtable.

Keep in mind that partition column values are only stored in the corresponding HDFS directory structure, and not in the data files themselves. So adding or removing a partition column will require a re-write of the data (plus/minus the partition column to be added/removed). Also, the HDFS directory structure must be created for the new set of partition keys.
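To illustrate with a made-up path: for a table partitioned by (year, month, day), a data file lives somewhere like /mytbl/year=2015/month=1/day=1/part-0.parq, and the values (2015, 1, 1) exist only in those directory names, which is also all the metastore tracks:

-- Partition values come from the directory structure, not the files:
SHOW PARTITIONS mytbl;
-- e.g. year=2015/month=1/day=1
--      year=2015/month=1/day=2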

 

It would certainly be more convenient to have a single command for this operation, but re-writing the data will be necessary for the reasons outlined above.

 

Alex

Rising Star

I tested moving the data with INSERT INTO ... SELECT * FROM, but this is extremely slow.

What if I drop all the partitions from the external table (the data should remain on HDFS), then create the new partition sub-directories with a script and move all the data files into the new sub-directories? Then I can add the partitions again (with ALTER TABLE ... ADD PARTITION) and the data should be available again under the new partition schema.
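Roughly like this (table names and partition values are placeholders, and the actual file moves would be done with hdfs dfs -mv outside of SQL):

-- 1. Detach an old partition from the external table's metadata;
--    with an EXTERNAL table the data files stay on HDFS.
ALTER TABLE mytbl DROP PARTITION (year=2015, month=1, day=1);

-- 2. Outside SQL, move each file under a new 'server' sub-directory:
--    hdfs dfs -mv /mytbl/year=2015/month=1/day=1/file0.parq \
--                 /mytbl/year=2015/month=1/day=1/server=foo/file0.parq

-- 3. Re-attach the data under the new schema, on a table that was
--    re-created with (year, month, day, server) as partition columns:
ALTER TABLE mytbl_new ADD PARTITION (year=2015, month=1, day=1, server='foo');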

 

I did a test with a single partition directory (year, month, day): after adding the "server" sub-directory, moving the Parquet data files into it, and running ALTER TABLE ... ADD PARTITION, the data could be queried again with the (year, month, day, server) partition schema.

I am not sure if this method of adding the new partition has some drawbacks I don't know about?

 

Maarten

 

 


Hi Maarten,

 

In general, that approach will not work, but in your specific case it might, depending on the contents of your data files.

 

Let's take an example of a single HDFS file:

 

You have an HDFS file /mytbl/year=2015/month=1/day=1/example_file

 

Now, your example_file may contain rows with many different 'server' values.

 

My understanding is that you are proposing to move that file into something like:

/mytbl/year=2015/month=1/day=1/server='foo'/example_file

 

Yes, you will be able to create a table over such a structure (with the 'server' value effectively duplicated: once in the partition directory name as the partition column, and once inside the data files as a non-partition column), but that doesn't change the fact that the partitioning is wrong if example_file contains data for servers other than 'foo'. If you can guarantee that all your files only contain data for a single server, then your approach may work.
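For illustration, a table over that layout could look something like this (names are hypothetical, and since two columns cannot share a name, the copy inside the files would have to be exposed under a different one):

-- External table over the re-organized directory layout.
-- 'server' (the partition column) comes from the directory names;
-- 'server_in_file' stands for the copy still stored in the data files.
-- NOTE: whether a renamed column picks up the file's 'server' field
-- depends on the engine's Parquet column resolution (position vs. name).
CREATE EXTERNAL TABLE mytbl_new (
  ts BIGINT,
  payload STRING,
  server_in_file STRING
)
PARTITIONED BY (year INT, month INT, day INT, server STRING)
STORED AS PARQUET
LOCATION '/mytbl';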

 

Alex

Rising Star

Yes, that's exactly my situation right now: I've only stored data for one server so far.

I will continue with my workaround. Thx for confirming that this will work for this situation.