CDH version is 6.1.0.
1. I tried to merge small files with the following INSERT OVERWRITE:
insert overwrite table tdb.tb_activity partition(ymd) select * from tdb.tb_activity where ymd = '2019-07-22';
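For reference, a dynamic-partition INSERT OVERWRITE like this is normally run with session settings along these lines (just a sketch of the usual Hive properties for dynamic partitioning and small-file merging, not a dump of my exact session):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- optional, since the goal is fewer, larger files (Hive on Spark merge step):
set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=134217728;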
2. But an exception was raised.
Hue showed "UnknownReason Blacklisting behavior can be configured...".
Below is the Spark container error log:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Number of input columns was different than output columns (in = 9 vs out = 8
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:805)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:882)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:882)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:146)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:484)
    ... 19 more
Caused by: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Number of input columns was different than output columns (in = 9 vs out = 8
    at org.apache.hadoop.hive.serde2.avro.AvroSerializer.serialize(AvroSerializer.java:75)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.serialize(AvroSerDe.java:212)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:725)
    ... 25 more
19/07/23 17:25:32 ERROR executor.Executor: Exception in task 4.0 in stage 0.0 (TID 2)
3. I found something strange in the EXPLAIN output.
Line 24: expressions: ... extras (type: string), ...
Line 41: avro.schema.literal {"type":"record", ... ,{"name":"extras","type":["null","string"],"default":null}, ...
Line 44: columns actiontype,contentid,contenttype,device,serviceid,timestamp,userip,userid
There is no "extras" field on line 44.
10
11 STAGE PLANS:
12   Stage: Stage-1
13     Spark
14       DagName: hive_20190723175708_8b93a3ff-d533-48cc-865e-6af87f576858:29
15       Vertices:
16         Map 1
17             Map Operator Tree:
18                 TableScan
19                   alias: tb_activity
20                   filterExpr: (ymd = '2019-07-22') (type: boolean)
21                   Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
22                   GatherStats: false
23                   Select Operator
24                     expressions: actiontype (type: string), contentid (type: string), contenttype (type: string), device (type: string), extras (type: string), serviceid (type: string), timestamp (type: bigint), userip (type: string), userid (type: string), '2019-07-22' (type: string)
25                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
26                     Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
27                     File Output Operator
28                       compressed: true
29                       GlobalTableId: 1
30                       directory: hdfs://nameservice1/etl/flume/tb_activity/.hive-staging_hive_2019-07-23_17-57-08_154_1733664706886256905-5/-ext-10002
31                       NumFilesPerFileSink: 1
32                       Statistics: Num rows: 16429 Data size: 13275304 Basic stats: COMPLETE Column stats: NONE
33                       Stats Publishing Key Prefix: hdfs://nameservice1/etl/flume/tb_activity/.hive-staging_hive_2019-07-23_17-57-08_154_1733664706886256905-5/-ext-10000/
34                       table:
35                           input format: org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
36                           output format: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
37                           properties:
38                             DO_NOT_UPDATE_STATS true
39                             EXTERNAL TRUE
40                             STATS_GENERATED TASK
41                             avro.schema.literal {"type":"record","name":"Activity","namespace":"com.bigdata.avro","doc":"Schema for com.bigdata.avro.Activity","fields":[{"name":"actionType","type":["null","string"]},{"name":"contentId","type":["null","string"]},{"name":"contentType","type":["null","string"]},{"name":"device","type":["null","string"]},{"name":"extras","type":["null","string"],"default":null},{"name":"serviceId","type":["null","string"]},{"name":"timestamp","type":["null","long"]},{"name":"userIp","type":["null","string"]},{"name":"userid","type":["null","string"]}]}
42                             avro.schema.url hdfs:///metadata/avro/tb_activity.avsc
43                             bucket_count -1
44                             columns actiontype,contentid,contenttype,device,serviceid,timestamp,userip,userid
45                             columns.comments
46                             columns.types string:string:string:string:string:bigint:string:string
47                             file.inputformat org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
48                             file.outputformat org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
49                             impala.lastComputeStatsTime 1559636056
50                             location hdfs://nameservice1/etl/flume/tb_activity
51                             name tdb.tb_activity
52                             numRows 83020631
53                             partition_columns ymd
54                             partition_columns.types string
55                             serialization.ddl struct tb_activity { string actiontype, string contentid, string contenttype, string device, string serviceid, i64 timestamp, string userip, string userid}
56                             serialization.format 1
57                             serialization.lib org.apache.hadoop.hive.serde2.avro.AvroSerDe
58                             totalSize 6334562388
59                             transient_lastDdlTime 1556875047
60                           serde: org.apache.hadoop.hive.serde2.avro.AvroSerDe
61                           name: tdb.tb_activity
62                 TotalFiles: 1
63                 GatherStats: true
64                 MultiFileSpray: false
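The plan above is the output of EXPLAIN on the insert (probably EXPLAIN EXTENDED, given the property dump). The same table-level vs. partition-level metadata can also be inspected directly; the commands below are only a sketch using standard Hive statements against the table from the plan, not output I have pasted:

-- plan for the insert, roughly how the listing above was produced
explain extended
insert overwrite table tdb.tb_activity partition(ymd)
select * from tdb.tb_activity where ymd = '2019-07-22';

-- table-level schema and properties (reflects the current .avsc, including "extras")
describe formatted tdb.tb_activity;

-- metadata stored for the partition being rewritten
describe formatted tdb.tb_activity partition (ymd='2019-07-22');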
"extras" field on avro schema have "default" property.
and other fields has no "default" property.
I changed the Avro schema in the past; that is when the "extras" field was added.
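That change was applied roughly as below; this is a sketch under the assumption that the .avsc at avro.schema.url was replaced with a version containing the new field and the table kept pointing at that URL, not the exact commands I ran at the time:

-- the field appended to tb_activity.avsc (as it appears in avro.schema.literal above):
--   {"name":"extras","type":["null","string"],"default":null}
-- then the table property was set so the table reads the updated schema file:
alter table tdb.tb_activity
  set tblproperties ('avro.schema.url'='hdfs:///metadata/avro/tb_activity.avsc');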
What is wrong?