Yarn Streaming with Python and ORC

New Contributor

I have a YARN streaming MapReduce job written in Python. When I run it with text input and output, it runs fine, although it's really slow. When I run it with ORC input and output, it gets through the mappers lightning fast (at least compared to the text version) and then fails in the reducers with ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.NullWritable. I can't figure out why. Here are the two YARN streaming calls:

Text:

yarn jar /usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar \
    -mapper gas_station_information_history_time_bucketed_mapper.py \
    -reducer gas_station_information_history_time_bucketed_reducer.py \
    -input /apps/hive/warehouse/germanypricing2.db/gas_station_information_history_enhanced_text \
    -output /data/USM/GermanyPricing/gas_station_information_history_time_bucketed_text \
    -numReduceTasks 1000 \
    -file gas_station_information_history_time_bucketed_mapper.py \
    -file gas_station_information_history_time_bucketed_reducer.py \
    -file time_buckets.tsv

ORC:

yarn jar /usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars /usr/hdp/2.2.6.0-2800/hive/lib/hive-exec.jar \
    -mapper gas_station_information_history_time_bucketed_mapper.py \
    -reducer gas_station_information_history_time_bucketed_reducer.py \
    -input /apps/hive/warehouse/germanypricing2.db/gas_station_information_history_enhanced \
    -output /data/USM/GermanyPricing/gas_station_information_history_time_bucketed \
    -numReduceTasks 1000 \
    -file gas_station_information_history_time_bucketed_mapper.py \
    -file gas_station_information_history_time_bucketed_reducer.py \
    -file time_buckets.tsv \
    -inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
    -outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

The time_buckets.tsv file is used by the mappers, so I'm sure that's not causing the problem. I can share the map and reduce code if needed.

4 REPLIES
Re: Yarn Streaming with Python and ORC

New Contributor

I found my problem. When the mapper read the ORC data, each field arrived with a leading space. As a result, the mapper threw an exception when it tried to parse a datetime, and the exception text was written out in place of my expected output. The reducers then failed when they tried to parse that input.
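
For anyone hitting the same thing, here is a minimal sketch of a mapper that guards against it. This is not the original mapper, which was never posted: the comma separator, the timestamp format, the timestamp being in the first column, and the names clean_fields, parse_timestamp and run_mapper are all assumptions made for illustration. The idea is to strip whitespace from every field before parsing, and to send unparseable records to stderr rather than stdout so they never reach the reducers.

```python
import sys
from datetime import datetime

# Assumptions for illustration: comma-separated fields, timestamp in column 0.
FIELD_SEP = ","
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_fields(line, sep=FIELD_SEP):
    """Split one input record and strip the whitespace around each field."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]


def parse_timestamp(raw, fmt=TS_FORMAT):
    """Return a datetime, or None if the field does not match fmt."""
    try:
        return datetime.strptime(raw, fmt)
    except ValueError:
        return None


def run_mapper(stream_in=sys.stdin, stream_out=sys.stdout, stream_err=sys.stderr):
    """Emit 'timestamp<TAB>remaining fields' for every parseable record."""
    for line in stream_in:
        fields = clean_fields(line)
        ts = parse_timestamp(fields[0])
        if ts is None:
            # Report bad records on stderr so they never reach the reducers.
            stream_err.write("skipping unparseable record: %r\n" % line)
            continue
        stream_out.write("%s\t%s\n" % (ts.isoformat(), FIELD_SEP.join(fields[1:])))
```

In a real streaming mapper script, the file would end with a call to run_mapper() so that it reads from standard input and writes to standard output.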

Re: Yarn Streaming with Python and ORC

Explorer

How did you achieve this?

What parameters are required?

Re: Yarn Streaming with Python and ORC

New Contributor

You must have removed the leading space. Can you clue me in as to how you discovered the problem? My simple 'cat' reducer is getting the same NullWritable exception.
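
One way to look for this kind of problem, sketched here as a suggestion rather than as the method the original poster used, is to run a pass-through script that logs repr() of the first few records to stderr. repr() makes leading spaces, tabs and other invisible characters visible. The function name dump_records and the limit of 5 are hypothetical choices; the stderr output should then appear in the task logs, for example via yarn logs -applicationId with the application's ID.

```python
import sys


def dump_records(stream_in, stream_out, stream_err, limit=5):
    """Pass records through unchanged; log repr() of the first few to stderr."""
    for index, line in enumerate(stream_in):
        if index < limit:
            # repr() exposes whitespace that is invisible in plain output.
            stream_err.write("record %d: %r\n" % (index, line))
        stream_out.write(line)
```

Used as a mapper or reducer, the script would call dump_records(sys.stdin, sys.stdout, sys.stderr) at the bottom of the file.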

Re: Yarn Streaming with Python and ORC

New Contributor

I'm trying to create an ORC file from a text file using Python streaming, and I'm facing the same error mentioned here. How did you fix the issue? @Kevin Richardson

hadoop jar $STRMJAR \
    -D mapred.reduce.tasks=0 \
    -D mapred.map.tasks=1 \
    -D stream.map.input.ignoreKey=true \
    -libjars /usr/hdp/2.2.4.8-24/hive/lib/hive-exec-0.14.0.2.2.4.8-40.jar \
    -mapper "cat" \
    -input "/user/hive/warehouse/ewwdev.db/orc_input/data" \
    -output "/user/hive/warehouse/ewwdev.db/orc_output" \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat