I found my problem. When the ORC input is read in the mapper, a space is prepended to each field. When the mapper then tried to parse a datetime it threw an exception, and the exception text was written to stdout in place of my expected output, which in turn caused the reducers to fail when parsing their input.
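For anyone hitting the same thing, the fix on my side was just to strip each field before parsing. A minimal sketch of the idea (the field layout and datetime format here are illustrative, not my actual schema):

```python
from datetime import datetime

def parse_line(line):
    # ORC rows arrived with a leading space on every field, so strip
    # each field before parsing. Tab-separated layout with the
    # timestamp in the first column is assumed for illustration.
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    ts = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
    return ts, fields[1:]
```

Without the strip(), strptime raises a ValueError on the leading space, and that traceback ends up in the mapper's output stream.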
I have a YARN streaming MapReduce job written in Python. When I run it with text input and output it completes correctly, although it's really slow. When I run it with ORC input and output, it runs through the mappers lightning fast (at least compared to the text version) and then fails in the reducers with ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.NullWritable. I can't figure out why.

Here are the two YARN streaming calls:

Text:

yarn jar /usr/hdp/188.8.131.52-2800/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper gas_station_information_history_time_bucketed_mapper.py \
  -reducer gas_station_information_history_time_bucketed_reducer.py \
  -input /apps/hive/warehouse/germanypricing2.db/gas_station_information_history_enhanced_text \
  -output /data/USM/GermanyPricing/gas_station_information_history_time_bucketed_text \
  -numReduceTasks 1000 \
  -file gas_station_information_history_time_bucketed_mapper.py \
  -file gas_station_information_history_time_bucketed_reducer.py \
  -file time_buckets.tsv

ORC:

yarn jar /usr/hdp/184.108.40.206-2800/hadoop-mapreduce/hadoop-streaming.jar \
  -libjars /usr/hdp/220.127.116.11-2800/hive/lib/hive-exec.jar \
  -mapper gas_station_information_history_time_bucketed_mapper.py \
  -reducer gas_station_information_history_time_bucketed_reducer.py \
  -input /apps/hive/warehouse/germanypricing2.db/gas_station_information_history_enhanced \
  -output /data/USM/GermanyPricing/gas_station_information_history_time_bucketed \
  -numReduceTasks 1000 \
  -file gas_station_information_history_time_bucketed_mapper.py \
  -file gas_station_information_history_time_bucketed_reducer.py \
  -file time_buckets.tsv \
  -inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
  -outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

The time_buckets.tsv file is only used by the mappers, so I'm sure it isn't causing the problem. I can share the map and reduce code if needed.
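In case it helps with debugging, this is the general shape of the streaming mapper, with parse failures routed to stderr rather than stdout, since in Hadoop streaming anything a mapper prints to stdout becomes reducer input. The field layout here is a placeholder, not my actual code:

```python
import sys

def map_record(line):
    # Placeholder layout: key in column 0, value in column 1.
    fields = line.rstrip("\n").split("\t")
    return "%s\t%s" % (fields[0], fields[1])

def main():
    for line in sys.stdin:
        try:
            print(map_record(line))
        except Exception as e:
            # Report failures on stderr; anything written to stdout
            # is fed to the reducers as map output.
            sys.stderr.write("bad record: %r (%s)\n" % (line, e))

if __name__ == "__main__":
    main()
```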