05-17-2015 09:25 AM
I have a very large table (50B+ rows) and it's partitioned by (year, month, day).
Now I want to add a new "server" partition column, so the resulting partition schema should be (year, month, day, server).
Looking at the ALTER TABLE command, there seems to be no way to add an additional partition column.
An alternative would be to create a new table with the (year, month, day, server) partition schema and then do a SELECT * from the old table into the new table. This is an expensive operation. Am I missing something, or is there an easier way to do this?
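For reference, the rewrite I'm describing would look roughly like this (table and column names are placeholders; the dynamic-partition settings are needed for Hive, while Impala handles dynamic partitions without them):

```sql
-- Hypothetical names; the real table has many more columns.
CREATE TABLE newtbl (id BIGINT, payload STRING)
PARTITIONED BY (year INT, month INT, day INT, server STRING)
STORED AS PARQUET;

-- Required for dynamic partitioning in Hive; not needed in Impala.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Re-writes all 50B+ rows, routing each row into its
-- (year, month, day, server) partition dynamically.
INSERT INTO TABLE newtbl PARTITION (year, month, day, server)
SELECT id, payload, year, month, day, server FROM oldtbl;
```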
05-17-2015 03:23 PM
I'm afraid you are correct: you will have to create a new table and INSERT ... SELECT * FROM the old table.
Keep in mind that partition columns are only stored in the corresponding HDFS directory structure, and not in the data files themselves. So adding or removing a partition column requires a re-write of the data (plus/minus the partition column to be added/removed). In addition, the HDFS directory structure must be created for the new set of partition keys.
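For example (paths are illustrative), a table partitioned by (year, month, day) looks like this on HDFS, with the partition values encoded only in the directory names:

```
/mytbl/year=2015/month=1/day=1/data_file_1.parquet
/mytbl/year=2015/month=1/day=2/data_file_2.parquet
```

The Parquet files themselves contain no year/month/day columns.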
It would certainly be more convenient to have a single command for this operation, but re-writing the data is necessary for the reasons outlined above.
05-17-2015 11:53 PM
I tested moving the data with "INSERT INTO ... SELECT * FROM", but this is extremely slow.
What if I drop all the partitions from the external table (the data should remain on HDFS), then create the new partition sub-directories with a script, move all the data files into the new sub-directories, and finally add the partitions again (with ALTER TABLE ... ADD PARTITION)? The data should then be available again under the new partition schema.
I did a test with a single (year, month, day) partition directory: I added a "server" sub-directory and moved the Parquet data files into it. Then, after an ALTER TABLE ... ADD PARTITION, the data could be queried again with the (year, month, day, server) partition schema.
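Sketched as a script for a single partition directory (paths and the server value are illustrative; on the real cluster the mkdir/mv would be hdfs dfs -mkdir -p and hdfs dfs -mv):

```shell
#!/bin/sh
# Illustrative local-filesystem sketch of the HDFS restructuring.
# On HDFS, replace mkdir/mv with: hdfs dfs -mkdir -p / hdfs dfs -mv
BASE=/tmp/mytbl_demo
PART=$BASE/year=2015/month=1/day=1
SERVER=foo

# Set up a fake existing partition with one data file.
mkdir -p "$PART"
touch "$PART/example_file.parquet"

# 1. Create the new server sub-directory.
mkdir -p "$PART/server=$SERVER"

# 2. Move the data files down into it.
for f in "$PART"/*.parquet; do
  mv "$f" "$PART/server=$SERVER/"
done

# 3. Re-register the partition in the new table, e.g.:
#    ALTER TABLE mytbl ADD PARTITION (year=2015, month=1, day=1, server='foo');
```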
I am not sure whether this method of adding the new partition has some drawbacks I don't know about?
05-18-2015 09:29 AM
In general, that approach will not work, but in your specific case it might, depending on the contents of your data files.
Let's take an example of a single HDFS file:
You have an HDFS file /mytbl/year=2015/month=1/day=1/example_file
Now your example_file may contain rows with many 'server' values.
My understanding is that you are proposing to move that file into something like:
/mytbl/year=2015/month=1/day=1/server=foo/example_file
Yes, you will be able to create a table over such a structure (with a duplicated 'server' column, one for the partition column and one for the non-partition column), but it doesn't change the fact that the partitioning is wrong if 'example_file' contains data for servers other than 'foo'. If you guarantee that all your files only contain data for a single server, then your approach may work.
05-18-2015 11:55 AM
Yes, that's exactly my situation right now; I've only stored data for one server so far.
I will continue with my workaround. Thanks for confirming that it will work in this situation.