I have an HDFS path containing two different file formats: 1) Avro and 2) JSON. I want to build a Hive external table on that path, but only over the Avro files. I do not have the option to keep each file format in a separate location, and I could not find a way to pick files by extension for a Hive external table. The suggested solution was to build a custom class that filters the files by the targeted extension, extending the main Avro input format class, AvroContainerInputFormat.

package com.company.hadoop.customavroinputformat;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
public class CustomAvroContainerInputFormat extends AvroContainerInputFormat {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Get the original splits from the parent Avro input format
        InputSplit[] originalSplits = super.getSplits(job, numSplits);

        // Collect only the splits whose file matches the target extension
        List<InputSplit> filteredSplits = new ArrayList<>();

        // The specific file extension to keep
        String fileExtension = ".avroSnappy";

        for (InputSplit split : originalSplits) {
            // Check if the file path ends with the specified extension
            String filePath = ((FileSplit) split).getPath().toString();
            if (filePath.endsWith(fileExtension)) {
                System.out.println("File Path considered: " + filePath);
                filteredSplits.add(split);
            }
        }

        // Convert the list to an array
        return filteredSplits.toArray(new InputSplit[0]);
    }

    // You can override other methods as needed
    // ...
}

Above is the customized class. The class is added to Hive using the command below:

ADD JAR /path/CustomDSAvroInputFormat.jar;

and the Hive external table is created as below:

CREATE EXTERNAL TABLE table_name
PARTITIONED BY (
event_day STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'com.company.hadoop.customavroinputformat.CustomAvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '.....'
TBLPROPERTIES ('avro.schema.url'='/path/AVRO_SCHEMA_FILE');

Now when I do a select * from the table, I get the result as expected and the data is shown normally. But when I use select count(*) or a where clause, Hive launches a MapReduce job, and somehow that job does not use the customized class, so it scans the binary (non-Avro) files as well and fails. Both outputs are shown below.
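My suspicion (an assumption on my side, not confirmed) is that for MapReduce queries Hive defaults to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which computes the combined splits itself and never calls the table InputFormat's getSplits(); the CombineHiveInputFormat frames in the stack trace below point the same way. A minimal thing that might be worth trying, assuming plain HiveInputFormat delegates split computation to the table's own InputFormat:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- then re-run the failing query; if the custom getSplits() is used,
-- the "File Path considered:" lines should appear in the task logs
select count(*) from <table_name>;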
select * result:

> select event_day from <table_name> limit 2;
OK
File Path considered: hdfs://<URL><path>/c5f085e1-7a81-47e2-981e-702982c0ac00.avroSnappy
20240204000000
20240204000000
Time taken: 2.041 seconds, Fetched: 2 row(s)
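For context on why the query above works: as far as I understand (not verified in the source), a simple select like this can run as a local fetch task instead of a MapReduce job, and the fetch path instantiates the table's InputFormat directly, which would explain why the "File Path considered:" line from the custom getSplits() is printed on the console. A quick probe, assuming hive.fetch.task.conversion behaves as documented:

-- Untested probe: with fetch conversion disabled, even this simple
-- select should go through MapReduce; if it then fails like count(*),
-- that confirms the problem is in the MapReduce input path.
SET hive.fetch.task.conversion=none;
select event_day from <table_name> limit 2;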
select count result:

> select count(*) from <table_name> limit 2;
Query ID = user_20240214072828_73a1c989-a71f-4153-93e0-5e719b949014
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1707415269562_0043, Tracking URL = <url>/application_1707415269562_0043/
Kill Command = /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop/bin/hadoop job -kill job_1707415269562_0043
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2024-02-14 07:28:34,044 Stage-1 map = 0%, reduce = 0%
2024-02-14 07:28:52,709 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1707415269562_0043 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1707415269562_0043_m_000001 (and more) from job job_1707415269562_0043
Task with the most failures(4):
-----
Task ID:
task_1707415269562_0043_m_000001
URL:
<url>
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:267)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:213)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:334)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:695)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:253)
... 11 more
Caused by: java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:89)
at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

So, does anyone have ideas about why this happens and how I can fix it, or an alternative solution? Thanks in advance. I also tried creating a view on top of the main table and using nested queries, but that did not work either.
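One direction I am considering but have not tested: some Hive versions expose a CombineHiveInputFormat.AvoidSplitCombination interface that an InputFormat can implement, so that CombineHiveInputFormat leaves its files out of split combination and falls back to the per-table getSplits(). Assuming that interface (and its shouldSkipCombine signature) exists in the Hive build shipped with CDH 5.16, which I have not verified, the custom class could look like this:

package com.company.hadoop.customavroinputformat;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
import org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class CustomAvroContainerInputFormat extends AvroContainerInputFormat
        implements CombineHiveInputFormat.AvoidSplitCombination {

    private static final String FILE_EXTENSION = ".avroSnappy";

    // Same filtering as before: keep only the *.avroSnappy splits.
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        InputSplit[] originalSplits = super.getSplits(job, numSplits);
        List<InputSplit> filteredSplits = new ArrayList<>();
        for (InputSplit split : originalSplits) {
            String filePath = ((FileSplit) split).getPath().toString();
            if (filePath.endsWith(FILE_EXTENSION)) {
                filteredSplits.add(split);
            }
        }
        return filteredSplits.toArray(new InputSplit[0]);
    }

    // Untested assumption: if this Hive build honors AvoidSplitCombination,
    // returning true keeps CombineHiveInputFormat from combining this
    // table's files, so the getSplits() filter above also applies to
    // MapReduce queries such as count(*).
    @Override
    public boolean shouldSkipCombine(Path path, Configuration conf) throws IOException {
        return true;
    }
}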
Labels:
- Apache Hive
- MapReduce