Support Questions

Find answers, ask questions, and share your expertise

Hive on Tez- Adding a row column messes up table's order.

avatar

Hi, here's a sample of my table (source_table1)- all nicely ordered.

select * from source_table1 limit 10;

1510764333	"000003825fc3200b8d541e8ee3844522"	"St. George"	"Upper Platform"	
1510764342	"000003825fc3200b8d541e8ee3844522"	"St. George"	"Upper Platform"	
1510764344	"000003825fc3200b8d541e8ee3844522"	"St. George"	"Upper Platform"	
1510764360	"000003825fc3200b8d541e8ee3844522"	"St. George"	"Upper Platform"	
1510764420	"000003825fc3200b8d541e8ee3844522"	"Museum"	"Museum Platform"	
1510764448	"000003825fc3200b8d541e8ee3844522"	"Museum"	"Museum Platform"	
1510764509	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	"Platform"	
1510764511	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	"Platform"	
1510764541	"000003825fc3200b8d541e8ee3844522"	"Queens Park"	"Platform"	
1510764779	"000003825fc3200b8d541e8ee3844522"	"St. Andrew"	"St. Andrew Platform"

All I simply want to do is add a row column to the entire table (about 3 mil rows) but it seems to completely mess up the table's ordering, see below:

select row_number() over () as rowid,* from source_table1;

1	1510753659	"c647a9116db03135d0b419d4a6b8a141"	"Osgoode"	"Osgoode Platform"	
2	1510768565	"bbab071f0f1a16e40b02a88fb99fc9f8"	"Don Mills"	"Don Mills Platfo
3	1510753683	"c647a9116db03135d0b419d4a6b8a141"	"St. Andrew"	"St. Andrew Platform"	
4	1510768570	"bbab071f0f1a16e40b02a88fb99fc9f8"	"Don Mills"	"Don Mills Platform"	
5	1510753685	"c647a9116db03135d0b419d4a6b8a141"	"St. Andrew"	"St. Andrew Platform"	
6	1510768574	"bbab071f0f1a16e40b02a88fb99fc9f8"	"Don Mills"	"Don Mills Platform"	
7	1510753687	"c647a9116db03135d0b419d4a6b8a141"	"St. Andrew"	"St. Andrew Platform"	
8	1510768576	"bbab071f0f1a16e40b02a88fb99fc9f8"	"Don Mills"	"Don Mills Platform"	
9	1510753691	"c647a9116db03135d0b419d4a6b8a141"	"St. Andrew"	"St. Andrew Platform"	
10	1510768578	"bbab071f0f1a16e40b02a88fb99fc9f8"	"Don Mills"	"Don Mills Platform"	

I've also tried things like:

select row_number() over () as rowid,* from source_table1 cluster by baic_id, point;

select row_number() over () as rowid,* from source_table1 sort by baic_id,point;

etc

... although none of it makes a difference. Is there any way to take an existing ordered table and simple add a row column for the whole thing? This job is running in Tez which is using multiple reducers for this one task.

Thanks!

1 ACCEPTED SOLUTION

avatar
Master Guru
@Alex Witte

To add row column use order by clause, by using this clause we are going to order the whole data set with in one reducer and assigns row_number value. Now you will get the correct ordered data set with row column added.

select row_number() over (order by col1) as rowid,* from source_table1;

View solution in original post

3 REPLIES 3

avatar
Master Guru
@Alex Witte

To add row column use order by clause, by using this clause we are going to order the whole data set with in one reducer and assigns row_number value. Now you will get the correct ordered data set with row column added.

select row_number() over (order by col1) as rowid,* from source_table1;

avatar
Master Guru

@Alex Witte

To add more details why cluster by/Sort by are not working is there is no order by clause specified in your window function,

if we won't mention then hive initializes more than one reducer and distributes/sorts the data based on baic_id,point values.

Example:

i have a table and the order of id is not consistent

hive> select * from test_table;
+-----+-------+-------+--+
| id  |  dt   | name  |
+-----+-------+-------+--+
| 2   | fajf  | sda   |
| 4   | fajf  | sda   |
| 1   | fajf  | sda   |
| 3   | asd   | $h    |
| 3   | asd   | $h    |
+-----+-------+-------+--+

2.then i ran row_number without specifying any order by clause and clustered by id

hive> select row_number() over () as rowid,* from test_table cluster by id;

Output:

+--------+-----+-------+-------+--+
| rowid  | id  |  dt   | name  |
+--------+-----+-------+-------+--+
| 3      | 1   | fajf  | sda   |
| 1      | 2   | fajf  | sda   |
| 4      | 3   | asd   | $h    |
| 5      | 3   | asd   | $h    |
| 2      | 4   | fajf  | sda   |
+--------+-----+-------+-------+--+

The id field is ordered but rowid is not in the correct order.

rowid 1 is assigned to id=2 because it's first value in the table and rowid 2 assigned to 4 ..

3.row_number with order by clause:

hive> select row_number() over (order by id) as rowid,* from test_table cluster by id;
+--------+-----+-------+-------+--+
| rowid  | id  |  dt   | name  |
+--------+-----+-------+-------+--+
| 1      | 1   | fajf  | sda   |
| 2      | 2   | fajf  | sda   |
| 3      | 3   | asd   | $h    |
| 4      | 3   | asd   | $h    |
| 5      | 4   | fajf  | sda   |
+--------+-----+-------+-------+--+

(or)

without cluster by

hive> select row_number() over (order by id) as rowid,* from test_table;

+--------+-----+-------+-------+--+
| rowid  | id  |  dt   | name  |
+--------+-----+-------+-------+--+
| 1      | 1   | fajf  | sda   |
| 2      | 2   | fajf  | sda   |
| 3      | 3   | asd   | $h    |
| 4      | 3   | asd   | $h    |
| 5      | 4   | fajf  | sda   |
+--------+-----+-------+-------+--+

Both results the same output and now the ordering is based on order by clause.

If you need to assign row column based on some column value then use the column in window function order by clause then the results will be always consistent.

avatar

Shu thanks so much this did it! Appreciate your detailed answer!