
What happens to data when I replace columns in Hive?

Explorer

I have an external table emp with columns id, name, sal, and comm, and data is already loaded. I removed the comm column using REPLACE COLUMNS in Hive. Is the data for that column still available? If it is, won't it cause an error when querying the table? Do we need to reload the data with the new set of columns?

1 ACCEPTED SOLUTION

Master Guru

@Surya RS

The data for the comm column will still be available in the table's HDFS location, and it won't cause any error. Hive is "schema on read": when you query the table, it returns data only for the columns currently defined in the schema.

If you reload the emp table (with INSERT OVERWRITE) using the new set of columns (id, name, sal), then the data for the comm column will be lost, because the overwrite replaces all of the existing data (which holds the comm values) with the new columns only.

If you then add the comm column back after such a reload, it will show NULL values, because no data for comm exists anymore in the HDFS location.
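As a sketch of that scenario (the staging table emp_stage is hypothetical; substitute whatever source you reload from):

```sql
-- Overwrite emp with only the remaining columns. The files that held
-- the comm values are replaced, so that data is permanently gone.
INSERT OVERWRITE TABLE emp SELECT id, name, sal FROM emp_stage;

-- Adding comm back now has no data behind it, so every row shows
-- NULL for comm from this point on.
ALTER TABLE emp ADD COLUMNS(comm string);
SELECT * FROM emp;
```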

Example:

I have an external table emp with the same schema you mentioned in the question:

hive# select * from emp;
+-----+-------+-------+------------+--+
| id  | name  |  sal  |    comm    |
+-----+-------+-------+------------+--+
| 1   | hcc   | 1000  | empsalary  |
| 1   | hdp   | 2000  | hdpsalary  |
| 3   | hdf   | 3000  | hdfsalary  |
+-----+-------+-------+------------+--+

Now I'll remove the comm column using REPLACE COLUMNS:

hive# ALTER TABLE emp REPLACE COLUMNS(id int,name string,sal int);

Selecting from the emp table again:

hive# select * from emp;
+-----+-------+-------+--+
| id  | name  |  sal  |
+-----+-------+-------+--+
| 1   | hcc   | 1000  |
| 1   | hdp   | 2000  |
| 3   | hdf   | 3000  |
+-----+-------+-------+--+

The comm column has been removed from the table, and there are no issues when you query it.

If you reference the comm column explicitly in a SELECT statement, Hive returns an error:

hive# select comm from emp;
Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'comm': (possible column names are: id, name, sal) (state=42000,code=10004)

Hive returns the error because the specified column is no longer part of the table schema.

If you check the HDFS directory, the data for the comm column still exists; when you add the comm column back to the table, Hive returns that data again.
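You can verify this from the Hive shell with a dfs command. A sketch, assuming a default warehouse path and file name (both hypothetical; check DESCRIBE FORMATTED emp for the table's real location, and note the field delimiter depends on the table's row format):

```sql
-- Find the table's actual HDFS location.
DESCRIBE FORMATTED emp;

-- Inspect a raw data file (path and file name are examples only).
-- The comm values are still physically present in the file even
-- though the column was dropped from the table schema.
dfs -cat /apps/hive/warehouse/emp/000000_0;
```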

Adding the comm column back to the table:

hive# ALTER TABLE emp ADD COLUMNS(comm string);
hive# select * from emp;
+-----+-------+-------+------------+--+
| id  | name  |  sal  |    comm    |
+-----+-------+-------+------------+--+
| 1   | hcc   | 1000  | empsalary  |
| 1   | hdp   | 2000  | hdpsalary  |
| 3   | hdf   | 3000  | hdfsalary  |
+-----+-------+-------+------------+--+

After adding the comm column back, we are able to see the data again. Hive works schema on read: it displays the data that matches the schema and data types (if a data type doesn't match, it shows NULLs).
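The data-type side of schema on read can be demonstrated the same way. A sketch, assuming the comm column currently holds string values such as 'empsalary':

```sql
-- Change comm's declared type to int; the file data is untouched.
ALTER TABLE emp CHANGE comm comm int;

-- Strings like 'empsalary' cannot be parsed as int, so Hive
-- returns NULL for comm at read time.
SELECT comm FROM emp;
```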

If you access this table from Spark, Spark gets the table metadata from the HCatalog server (the Hive metastore) and loads the data according to the current schema.

Loading the data into Spark after the comm column is removed:

>>> hc.table("default.emp").show(10,False)
+---+----+----+
|id |name|sal |
+---+----+----+
|1  |hcc |1000|
|1  |hdp |2000|
|3  |hdf |3000|
+---+----+----+

Adding the comm column back and loading into Spark:

>>> hc.table("default.emp").show(10,False)
+---+----+----+---------+
|id |name|sal |comm     |
+---+----+----+---------+
|1  |hcc |1000|empsalary|
|1  |hdp |2000|hdpsalary|
|3  |hdf |3000|hdfsalary|
+---+----+----+---------+


If the answer helped resolve your issue, click the Accept button below to accept it; that helps community users find solutions to these kinds of issues quickly.


2 REPLIES 2


Explorer

Thank you @Shu