question Re: Single records of a file split into multiple? in Archives of Support Questions (Read Only)

Single records of a file split into multiple?

rushikeshdeshmu — Sat, 13 Feb 2016 20:48:47 GMT

Hi,

I am currently facing an issue related to single record of a file split into multiple. See the below example.

filename - test.txt

col1|col2|col3|col4|col5|col6|col7|col8|col9|col10

qwe|345|point(-10.0, -11.0)|exec|1234|mana

ger|124|team|specia|1100

mwe|123|point(-0.9,-1.0)|exec|4563|superman|134|team|special|1101

The above one is the sample file I have in mentioned schema and two sample records. first record is splitted into two lines when I tried to load in hive getting the null values. Second record values are loading correctly . Currently we are handling with pig but not getting accurate values. Can anyone suggest me the best way of handling the scenario.

Thanks,

Rushikesh

Re: Single records of a file split into multiple?

aervits — Sat, 13 Feb 2016 21:29:30 GMT

@Rushikesh Deshmukh please provide load statement

Re: Single records of a file split into multiple?

ro_v_boyko — Sun, 14 Feb 2016 02:48:18 GMT

Hi!

I think the best way is to write MapReduce job that corrects data in file, based on real and expected count of delimiters in the record.

Re: Single records of a file split into multiple?

rushikeshdeshmu — Sun, 14 Feb 2016 21:01:26 GMT

@Roman Boyko Is it possible to handle this scenario using PIG or HIVE?

I am currently tried it using HIVE with below LOAD statement:

hive> create table test(col1 string, col2 string, col3 string, col4 string, col5 string, col6 string, col7 string, col8 string, col9 string, col10 string) row format delimited fields terminated by '\t' stored as textfile;

hive> load data local inpath '/tmp/test.txt' into table test;

Re: Single records of a file split into multiple?

ro_v_boyko — Mon, 15 Feb 2016 11:09:05 GMT

@Rushikesh Deshmukh You can use lead and lag functions in Hive, but in this case you'll face many constraints (e.g. only one spurious record delimiter in row, no null columns and so on).

Or you can try to use Pig like this example.

Re: Single records of a file split into multiple?

rushikeshdeshmu — Sat, 20 Feb 2016 21:45:47 GMT

@Roman Boyko, thanks for sharing this information and link.

Re: Single records of a file split into multiple?

shekharonlin — Sun, 13 Mar 2016 13:29:31 GMT

@Rushikesh Deshmukh- I see Geometry data type (point) included in your data set. For insights on geo-spacial calculations, you can start at- https://cwiki.apache.org/confluence/display/Hive/Spatial+queries and https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL.

Re: Single records of a file split into multiple?

rushikeshdeshmu — Sun, 13 Mar 2016 15:53:43 GMT

@Mayank Shekhar, thanks for sharing this information and link.

Re: Single records of a file split into multiple?

shekharonlin — Sun, 13 Mar 2016 16:00:53 GMT

@Rushikesh Deshmukh

On the other hand.. HIVE's regexp_replace can help in cleaning the data.. eg below.. this removes nested '\','\t' and '\r' combination of unformatted data within a single JSON string..

--populate clean src table insert overwrite table src_clean PARTITION (ddate='${hiveconf:DATE_VALUE}') select regexp_replace(regexp_replace(regexp_replace(regexp_replace(full_json_line, "\\\\\\\\\\\\\\\\t|\\\\\\\\\\\\\\\\n|\\\\\\\\\\\\\\\\r", "\\\\\\\\\\\\\\\\<t or n or r>"), "\\\\\\\\\\\\t|\\\\\\\\\\\\n|\\\\\\\\\\\\r", "\\\\\\\\ "), "\\\\\\\\t|\\\\\\\\n|\\\\\\\\r", "\\\\\\\\<t or n or r>"),"\\\\t|\\\\n|\\\\r", "") as full_json_line from src_unclean where ddate='${hiveconf:DATE_VALUE}';

Re: Single records of a file split into multiple?

rushikeshdeshmu — Wed, 16 Mar 2016 01:35:56 GMT

@Mayank Shekhar, thanks for sharing this information.