Created on 07-23-2019 02:40 AM - edited 09-16-2022 07:31 AM
CDH version is 6.1.0.
1. I tried to merge small files.
insert overwrite table tdb.tb_activity partition(ymd) select * from tdb.tb_activity where ymd = '2019-07-22';
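For context, this kind of compaction query usually assumes dynamic partitioning and Hive's merge settings are enabled in the session. A sketch of the settings involved (an assumption — my actual session settings are not shown here):

```sql
-- Allow writing into partition(ymd) without a static value (assumed session settings)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Merge the small output files produced by Hive-on-Spark tasks
SET hive.merge.sparkfiles=true;
```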
2. But an exception was raised:
"UnknownReason Blacklisting behavior can be configured..." in Hue.
Below is the Spark container error log.
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Number of input columns was different than output columns (in = 9 vs out = 8
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:805)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:882)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:882)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:146)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:484)
        ... 19 more
Caused by: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Number of input columns was different than output columns (in = 9 vs out = 8
        at org.apache.hadoop.hive.serde2.avro.AvroSerializer.serialize(AvroSerializer.java:75)
        at org.apache.hadoop.hive.serde2.avro.AvroSerDe.serialize(AvroSerDe.java:212)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:725)
        ... 25 more
19/07/23 17:25:32 ERROR executor.Executor: Exception in task 4.0 in stage 0.0 (TID 2)
3. I found something strange in the EXPLAIN output.
line 24: expressions: ..... extras (type: string), ....
line 41: avro.schema.literal {"type":"record", ....... ,{"name":"extras","type":["null","string"],"default":null}, .....
line 44: columns actiontype,contentid,contenttype,device,serviceid,timestamp,userip,userid
There is no "extras" field on line 44.
10	
11	STAGE PLANS:
12	  Stage: Stage-1
13	    Spark
14	      DagName: hive_20190723175708_8b93a3ff-d533-48cc-865e-6af87f576858:29
15	      Vertices:
16	        Map 1 
17	            Map Operator Tree:
18	                TableScan
19	                  alias: tb_activity
20	                  filterExpr: (ymd = '2019-07-22') (type: boolean)
21	                  Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
22	                  GatherStats: false
23	                  Select Operator
24	                    expressions: actiontype (type: string), contentid (type: string), contenttype (type: string), device (type: string), extras (type: string), serviceid (type: string), timestamp (type: bigint), userip (type: string), userid (type: string), '2019-07-22' (type: string)
25	                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
26	                    Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
27	                    File Output Operator
28	                      compressed: true
29	                      GlobalTableId: 1
30	                      directory: hdfs://nameservice1/etl/flume/tb_activity/.hive-staging_hive_2019-07-23_17-57-08_154_1733664706886256905-5/-ext-10002
31	                      NumFilesPerFileSink: 1
32	                      Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
33	                      Stats Publishing Key Prefix: hdfs://nameservice1/etl/flume/tb_activity/.hive-staging_hive_2019-07-23_17-57-08_154_1733664706886256905-5/-ext-10000/
34	                      table:
35	                          input format: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
36	                          output format: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
37	                          properties:
38	                            DO_NOT_UPDATE_STATS true
39	                            EXTERNAL TRUE
40	                            STATS_GENERATED TASK
41	                            avro.schema.literal {"type":"record","name":"Activity","namespace":"com.bigdata.avro","doc":"Schema for com.bigdata.avro.Activity","fields":[{"name":"actionType","type":["null","string"]},{"name":"contentId","type":["null","string"]},{"name":"contentType","type":["null","string"]},{"name":"device","type":["null","string"]},{"name":"extras","type":["null","string"],"default":null},{"name":"serviceId","type":["null","string"]},{"name":"timestamp","type":["null","long"]},{"name":"userIp","type":["null","string"]},{"name":"userid","type":["null","string"]}]}
42	                            avro.schema.url hdfs:///metadata/avro/tb_activity.avsc
43	                            bucket_count -1
44	                            columns actiontype,contentid,contenttype,device,serviceid,timestamp,userip,userid
45	                            columns.comments 
46	                            columns.types string:string:string:string:string:bigint:string:string
47	                            file.inputformat org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
48	                            file.outputformat org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
49	                            impala.lastComputeStatsTime 1559636056
50	                            location hdfs://nameservice1/etl/flume/tb_activity
51	                            name tdb.tb_activity
52	                            numRows 83020631
53	                            partition_columns ymd
54	                            partition_columns.types string
55	                            serialization.ddl struct tb_activity { string actiontype, string contentid, string contenttype, string device, string serviceid, i64 timestamp, string userip, string userid}
56	                            serialization.format 1
57	                            serialization.lib org.apache.hadoop.hive.serde2.avro.AvroSerDe
58	                            totalSize 6334562388
59	                            transient_lastDdlTime 1556875047
60	                          serde: org.apache.hadoop.hive.serde2.avro.AvroSerDe
61	                          name: tdb.tb_activity
62	                      TotalFiles: 1
63	                      GatherStats: true
64	                      MultiFileSpray: false
"extras" field on avro schema have "default" property.
and other fields has no "default" property.
I have been doing avro schema changes in the past. The "extras" field was then added.
What is wrong?
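For reference, the two schema sources that disagree (the 8-column metastore list versus the 9-field Avro record) can be compared like this. The table name and the schema path are taken from the EXPLAIN output above:

```sql
-- Metastore view of the table: shows the registered column list
-- (the 8-column "columns"/serialization.ddl list without "extras")
DESCRIBE FORMATTED tdb.tb_activity;

-- The Avro schema the SerDe actually reads, via avro.schema.url
-- (the 9-field record that includes "extras"); from the Hive CLI:
dfs -cat /metadata/avro/tb_activity.avsc;
```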