
Single record of a file split into multiple lines?


Hi,

I am currently facing an issue where a single record in a file is split across multiple lines. See the example below.

filename - test.txt

col1|col2|col3|col4|col5|col6|col7|col8|col9|col10
qwe|345|point(-10.0, -11.0)|exec|1234|mana
ger|124|team|specia|1100
mwe|123|point(-0.9,-1.0)|exec|4563|superman|134|team|special|1101

The above is a sample file with the schema and two sample records shown. The first record is split across two lines, so when I try to load it into Hive I get null values. The second record loads correctly. Currently we are handling this with Pig, but we are not getting accurate values. Can anyone suggest the best way to handle this scenario?

Thanks,

Rushikesh


9 REPLIES

Master Mentor

@Rushikesh Deshmukh, please provide your load statement.

Contributor

Hi!

I think the best way is to write a MapReduce job that corrects the data in the file, based on the actual versus expected count of delimiters in each record.


@Roman Boyko Is it possible to handle this scenario using Pig or Hive?

I tried it in Hive with the statements below (the data is pipe-delimited, so the field terminator is '|'):

hive> create table test(col1 string, col2 string, col3 string, col4 string, col5 string, col6 string, col7 string, col8 string, col9 string, col10 string) row format delimited fields terminated by '|' stored as textfile;
hive> load data local inpath '/tmp/test.txt' into table test;

Contributor

@Rushikesh Deshmukh You can use the lead and lag functions in Hive, but in this case you'll face many constraints (e.g. only one spurious record delimiter per row, no null columns, and so on).
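
For illustration, here is a minimal HiveQL sketch of that lead/lag-based stitching. The staging table raw_lines(line_id, line) is an assumption for this example (Hive does not number input lines, so line_id would have to come from a preprocessing step that preserves file order), as is the rule that a complete 10-column record contains exactly 9 pipes; it also only handles records split across at most two lines:

-- assumed staging table: raw_lines(line_id int, line string), one physical
-- line per row, with line_id preserving the original file order
with counted as (
  select line_id, line,
         -- a complete 10-column record contains exactly 9 pipe delimiters
         length(line) - length(regexp_replace(line, '\\|', '')) as n_delims
  from raw_lines
),
flagged as (
  select line_id, line, n_delims,
         lead(line)    over (order by line_id) as next_line,
         lag(n_delims) over (order by line_id) as prev_delims
  from counted
)
select case when n_delims < 9 then concat(line, next_line) else line end as full_line
from flagged
-- drop continuation lines: any row whose predecessor was an incomplete head
where prev_delims is null or prev_delims >= 9;

The result can then be split into the ten columns with split(full_line, '\\|'), or written back out to a file and loaded with the pipe-delimited table definition above.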

Or you can try to use Pig like this example.


@Roman Boyko, thanks for sharing this information and link.

Contributor

@Rushikesh Deshmukh I see a geometry data type (point) included in your data set. For insights on geospatial calculations, you can start at https://cwiki.apache.org/confluence/display/Hive/Spatial+queries and https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL.


@Mayank Shekhar, thanks for sharing this information and link.

Contributor

@Rushikesh Deshmukh

On the other hand, Hive's regexp_replace can help in cleaning the data. For example, the statement below removes nested '\t', '\n' and '\r' combinations from unformatted data within a single JSON string:

-- populate clean src table
insert overwrite table src_clean partition (ddate='${hiveconf:DATE_VALUE}')
select regexp_replace(
         regexp_replace(
           regexp_replace(
             regexp_replace(full_json_line,
               "\\\\\\\\\\\\\\\\t|\\\\\\\\\\\\\\\\n|\\\\\\\\\\\\\\\\r", "\\\\\\\\\\\\\\\\<t or n or r>"),
             "\\\\\\\\\\\\t|\\\\\\\\\\\\n|\\\\\\\\\\\\r", "\\\\\\\\ "),
           "\\\\\\\\t|\\\\\\\\n|\\\\\\\\r", "\\\\\\\\<t or n or r>"),
         "\\\\t|\\\\n|\\\\r", "") as full_json_line
from src_unclean
where ddate='${hiveconf:DATE_VALUE}';
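
To make the escaping easier to follow, here is a single-level sketch of the same idea, reusing the src_unclean table and full_json_line column from the statement above. In a Hive string literal, '\\\\t' reaches the regex engine as \\t, which matches a literal backslash followed by t, i.e. the two-character text "\t"; each additional level of nesting in the full statement strips one deeper level of escaping:

-- strip one level of literal \t, \n and \r escape sequences
select regexp_replace(full_json_line, '\\\\t|\\\\n|\\\\r', '') as full_json_line
from src_unclean;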


@Mayank Shekhar, thanks for sharing this information.