
Single records of a file split into multiple?


Hi,

I am currently facing an issue where a single record in a file is split across multiple lines. See the example below.

filename - test.txt

col1|col2|col3|col4|col5|col6|col7|col8|col9|col10

qwe|345|point(-10.0, -11.0)|exec|1234|mana

ger|124|team|specia|1100

mwe|123|point(-0.9,-1.0)|exec|4563|superman|134|team|special|1101

The above is a sample file with the schema shown and two sample records. The first record is split across two lines, so when I try to load it into Hive I get NULL values for the missing columns. The second record loads correctly. We currently handle this with Pig, but we are not getting accurate values. Can anyone suggest the best way to handle this scenario?
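To illustrate what happens (a minimal sketch, assuming the table is read with the pipe delimiter): splitting the first physical line yields only 6 of the 10 expected fields, so the trailing columns come back as NULL.

```python
# First physical line of the broken record from test.txt in the question
line = "qwe|345|point(-10.0, -11.0)|exec|1234|mana"

fields = line.split("|")
print(len(fields))  # 6 fields instead of the expected 10
```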

Thanks,

Rushikesh

1 ACCEPTED SOLUTION


Re: Single records of a file split into multiple?

@Roman Boyko Is it possible to handle this scenario using PIG or HIVE?

I tried it in Hive with the statements below (note the field delimiter must be '|' to match the pipe-delimited file):

hive> CREATE TABLE test (col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING, col6 STRING, col7 STRING, col8 STRING, col9 STRING, col10 STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/tmp/test.txt' INTO TABLE test;
9 Replies

Re: Single records of a file split into multiple?

Mentor

@Rushikesh Deshmukh, please provide the LOAD statement you are using.

Re: Single records of a file split into multiple?

New Contributor

Hi!

I think the best way is to write a MapReduce job that corrects the data in the file, based on the real versus expected count of delimiters in each record.
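That repair logic can be sketched outside MapReduce as well. Here is a minimal Python sketch (the field count and delimiter are taken from the sample in the question; it assumes no field ever contains the delimiter itself) that buffers physical lines until a logical record has the expected number of pipe-delimited fields:

```python
EXPECTED_FIELDS = 10  # from the 10-column schema in the question
DELIM = "|"

def fix_split_records(lines):
    """Join physical lines until each logical record has EXPECTED_FIELDS fields.

    Assumes DELIM never occurs inside a field value.
    """
    fixed, buffer = [], ""
    for line in lines:
        buffer += line.rstrip("\n")
        # A record is complete once it contains the expected number of delimiters.
        if buffer.count(DELIM) >= EXPECTED_FIELDS - 1:
            fixed.append(buffer)
            buffer = ""
    if buffer:  # trailing partial record, kept as-is for inspection
        fixed.append(buffer)
    return fixed

broken = [
    "qwe|345|point(-10.0, -11.0)|exec|1234|mana",
    "ger|124|team|specia|1100",
    "mwe|123|point(-0.9,-1.0)|exec|4563|superman|134|team|special|1101",
]
print(fix_split_records(broken))  # two complete 10-field records
```

Running this over the file before the LOAD would give Hive clean, one-line-per-record input.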


Re: Single records of a file split into multiple?

New Contributor

@Rushikesh Deshmukh You can use the lead and lag window functions in Hive, but in this case you'll face many constraints (e.g. at most one spurious record delimiter per row, no NULL columns, and so on).

Or you can try Pig, as in this example.

Re: Single records of a file split into multiple?

@Roman Boyko, thanks for sharing this information and link.

Re: Single records of a file split into multiple?

New Contributor

@Rushikesh Deshmukh I see a geometry data type (point) included in your data set. For insights on geospatial calculations, you can start at https://cwiki.apache.org/confluence/display/Hive/Spatial+queries and https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL.

Re: Single records of a file split into multiple?

@Mayank Shekhar, thanks for sharing this information and link.

Re: Single records of a file split into multiple?

New Contributor

@Rushikesh Deshmukh

On the other hand, Hive's regexp_replace can help in cleaning the data. For example, the query below removes nested '\', '\t' and '\r' combinations of unformatted data within a single JSON string:

-- populate clean src table
INSERT OVERWRITE TABLE src_clean PARTITION (ddate='${hiveconf:DATE_VALUE}')
SELECT regexp_replace(
         regexp_replace(
           regexp_replace(
             regexp_replace(full_json_line,
               "\\\\\\\\\\\\\\\\t|\\\\\\\\\\\\\\\\n|\\\\\\\\\\\\\\\\r",
               "\\\\\\\\\\\\\\\\<t or n or r>"),
             "\\\\\\\\\\\\t|\\\\\\\\\\\\n|\\\\\\\\\\\\r",
             "\\\\\\\\ "),
           "\\\\\\\\t|\\\\\\\\n|\\\\\\\\r",
           "\\\\\\\\<t or n or r>"),
         "\\\\t|\\\\n|\\\\r",
         "") AS full_json_line
FROM src_unclean
WHERE ddate='${hiveconf:DATE_VALUE}';
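The core idea of that query, stripping literal backslash-escaped whitespace sequences out of a JSON line, can be illustrated with a small Python sketch (not the author's exact query; the sample string below is hypothetical):

```python
import re

def clean_json_line(line: str) -> str:
    """Remove literal backslash-escaped \\t, \\n, \\r sequences from a JSON line."""
    # In the raw data "\t" is a real backslash followed by 't',
    # so the regex must escape the backslash too.
    return re.sub(r"\\t|\\n|\\r", "", line)

raw = '{"name": "abc\\tdef\\nghi"}'  # hypothetical unformatted JSON line
print(clean_json_line(raw))          # {"name": "abcdefghi"}
```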


Re: Single records of a file split into multiple?

@Mayank Shekhar, thanks for sharing this information.