Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Importing fixed data in Hive gives some strange results on some of the columns.

Highlighted

Importing fixed data in Hive gives some strange results on some of the columns.

Contributor

I'm learning for the HDPCD:Spark using Python exam and using Hive is one part of the exam objectives. So I decided to have some fun and import data on all known asteroid data in a table in Hive on a HDP 2.6.1 sandbox.

(Probably important: I start Hive like this because other problems I have encountered before, like not getting to the Hive prompt:

export SPARK_MAJOR_VERSION=2
export HIVE_HOME=/usr/hdp/2.6.1.0-129/hive
hive -hiveconf hive.execution.engine=mr

)

The Minor Planet Center has this data on 745,000 asteroids (http://www.minorplanetcenter.net/data) in JSON and fixed length format. I would have chosen JSON, but its JSON format seems a bit weird. Didn't manage to import this JSON in Python, Spark or Hive.

So I decided to use the .dat file in Fortran format no less (I first removed the non-data header). I've read a tutorial to import fixed data in Hive via an in between table, like this:

create table asteroids_raw (line STRING);

LOAD DATA LOCAL INPATH '/data/data-master/mpcorb_extended_noheader.dat' OVERWRITE INTO TABLE asteroids_raw;

Next I created the actual table:

CREATE TABLE asteroids (designation varchar(7),
absolute_magnitude decimal(5,2),
slope_parameter decimal(5,2),
epoch varchar(5),
mean_anomaly decimal(9,5),
argument_perhelion decimal(9,5),
longitude_ascending decimal(9,5),
inclination decimal(9,5),
eccentricity decimal(9,7),
mean_daily_motion decimal(11,8),
semimajor_axis decimal(11,7),
uncertainty_parameter varchar(1),
reference varchar(9),
number_observations int,
number_oppositions int,
observation_data varchar(10),
rms_residual decimal(4,2),
coarse_indicator_pertubers varchar(3),
precise_indicator_pertubers varchar(3),
computer_name varchar(11),
hex_flags varchar(4),
readable_designation varchar(30),
last_observation timestamp
) row format delimited fields terminated by ' '
stored as textfile;

And now I insert the data by reading that from the asteroids_raw table like this:

INSERT INTO TABLE asteroids
(designation, absolute_magnitude, slope_parameter, epoch, mean_anomaly, argument_perhelion,
longitude_ascending, inclination, eccentricity, mean_daily_motion, semimajor_axis, 
uncertainty_parameter, reference, number_observations, number_oppositions, observation_data, 
rms_residual, coarse_indicator_pertubers, precise_indicator_pertubers, computer_name, hex_flags, 
readable_designation, last_observation)
SELECT
trim(substr(line,1,7)) as designation,
cast(trim(substr(line,9,5)) as decimal(5,2)) as absolute_magnitude,
cast(trim(substr(line,15,5)) as decimal(5,2)) as slope_parameter,
substr(line,21,5) as epoch,
cast(trim(substr(line,27,9)) as decimal(9,5)) as mean_anomaly,
cast(trim(substr(line,38, 9)) as decimal(9,5)) as argument_perhelion,
cast(trim(substr(line,49, 9)) as decimal(9,5)) as longitude_ascending,
cast(trim(substr(line,60, 9)) as decimal(9,5)) as inclination,
cast(trim(substr(line,71, 9)) as decimal(9,7)) as eccentricity,
cast(trim(substr(line,81, 13)) as decimal(11,8)) as mean_daily_motion,
cast(trim(substr(line,93, 11)) as decimal(11,7)) as semimajor_axis,
trim(substr(line, 106, 1)) as uncertainty_parameter,
trim(substr(line, 108, 9)) as reference,
cast(trim(substr(line, 118, 5)) as int) as number_observations,
cast(trim(substr(line, 124, 3)) as int) as number_oppositions,
trim(substr(line, 128, 9)) as observation_data,
cast(trim(substr(line, 138, 4)) as decimal(4,2)) as rms_residual,
trim(substr(line, 143, 3)) as coarse_indicator_pertubers,
trim(substr(line, 147, 3)) as precise_indicator_pertubers,
trim(substr(line, 151, 11)) as computer_name,
trim(substr(line, 162, 4)) as hex_flags,
trim(substr(line, 167, 27)) as readable_designation,
cast(regexp_replace(trim(substr(line, 195, 8)), '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3 0:0:0.000') as timestamp) as last_observation
FROM asteroids_raw;

I manage to get just the right data in the columns this way, with the right precisions and everything. But the columns after that are a mess. Here is for example the data from one asteroid from the raw data:

SELECT
trim(substr(line, 128, 10)) as observation_data,
cast(trim(substr(line, 138, 5)) as decimal(4,2)) as rms_residual,
trim(substr(line, 143, 3)) as coarse_indicator_pertubers,
trim(substr(line, 147, 3)) as precise_indicator_pertubers,
trim(substr(line, 151, 11)) as computer_name,
trim(substr(line, 162, 4)) as hex_flags,
substr(line, 167, 27) as readable_designation,
cast(regexp_replace(trim(substr(line,
 195, 8)), '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3 0:0:0.000') as 
timestamp) as last_observation
FROM asteroids_raw
WHERE trim(substr(line,1,7)) = 'K10B02G';

The result:

46 days  0.35  M-v  3Eh  MPCALB  2803  2010 BG2  2010-03-05 00:00:00

This is the right data as read from the asteroids_raw table. Now look what happens after I insert this data in the table:

46      NULL            0.3     M-v     3Eh     MPCA    2803    NULL

Suddenly the rms_residual column doesn't have the right precision. There should be a readable_designation "2010 BG2" somewhere in there, but it isn't. And don't even start about the timestamp in last_observation. I've tried all kinds of combinations of a table with larger precision, larger varchars, larger substr's and I just can't get the right data in.

Just to make sure: the problem can't be in the asteroids_raw table. I'm getting the right results from there. To rule out some kind of bug I've also created an asteroids table with only the columns that go wrong, in case there is a problem with the amount of columns or something.

I just can't seem to get this data in the columns in the right way. What am I doing wrong?

1 REPLY 1

Re: Importing fixed data in Hive gives some strange results on some of the columns.

Contributor

I've written this blogpost on this topic:

http://marcel-jan.eu/datablog/2017/11/14/tutorial-lets-throw-some-asteroids-in-apache-hive/

In it you can find any details which I might have forgotten to tell here. Just in case.