Created 02-13-2016 12:48 PM
Hi,
I am currently facing an issue where a single record in a file is split across multiple lines. See the example below.
filename - test.txt
col1|col2|col3|col4|col5|col6|col7|col8|col9|col10
qwe|345|point(-10.0, -11.0)|exec|1234|mana
ger|124|team|specia|1100
mwe|123|point(-0.9,-1.0)|exec|4563|superman|134|team|special|1101
The above is a sample file in the mentioned schema with two sample records. The first record is split across two lines, so when I try to load it into Hive I get NULL values. The second record loads correctly. Currently we are handling this with Pig, but we are not getting accurate values. Can anyone suggest the best way to handle this scenario?
Thanks,
Rushikesh
Created 02-13-2016 01:29 PM
@Rushikesh Deshmukh, please provide the load statement.
Created 02-13-2016 06:48 PM
Hi!
I think the best way is to write a MapReduce job that corrects the data in the file, based on the actual versus the expected count of delimiters in each record.
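As a rough illustration of that check (not the job itself), here is a minimal HiveQL sketch assuming a hypothetical one-column staging table raw_lines that holds the raw file as whole lines; with 10 columns, a complete record contains exactly 9 '|' delimiters, so lines with fewer are fragments:
-- Hypothetical staging table: the default field delimiter (\001) does not
-- occur in the data, so each line lands in the single string column.
create table raw_lines (line string);
load data local inpath '/tmp/test.txt' into table raw_lines;
-- Flag fragments: a complete 10-column record has exactly 9 '|' delimiters.
select line,
       length(line) - length(regexp_replace(line, '\\|', '')) as delim_count
from raw_lines
where length(line) - length(regexp_replace(line, '\\|', '')) < 9;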
Created 02-14-2016 01:01 PM
@Roman Boyko Is it possible to handle this scenario using Pig or Hive?
I have tried it in Hive with the LOAD statements below:
hive> create table test(col1 string, col2 string, col3 string, col4 string, col5 string, col6 string, col7 string, col8 string, col9 string, col10 string) row format delimited fields terminated by '|' stored as textfile;
hive> load data local inpath '/tmp/test.txt' into table test;
Created 02-15-2016 03:09 AM
@Rushikesh Deshmukh You can use the lead and lag functions in Hive (sketched below), but in this case you'll face many constraints (e.g. at most one spurious record break per row, no NULL columns, and so on).
Or you can try to use Pig, as in this example.
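For illustration, here is a minimal HiveQL sketch of the lead()/lag() stitching, assuming a hypothetical staging table raw_lines(id int, line string) where id preserves the original file order (Hive windowing needs an explicit ordering), and subject to the constraint above that each record is split at most once:
-- Complete 10-column records have exactly 9 '|' delimiters.
-- A fragment (fewer than 9) is glued to the line that follows it;
-- the second half of each split record is then dropped.
select case when delims < 9 then concat(line, next_line) else line end as fixed_line
from (
    select line,
           length(line) - length(regexp_replace(line, '\\|', '')) as delims,
           lag(length(line) - length(regexp_replace(line, '\\|', '')))
               over (order by id) as prev_delims,
           lead(line) over (order by id) as next_line
    from raw_lines
) t
where delims = 9
   or coalesce(prev_delims, 9) = 9;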
Created 02-20-2016 01:45 PM
@Roman Boyko, thanks for sharing this information and link.
Created 03-13-2016 06:29 AM
@Rushikesh Deshmukh - I see a Geometry data type (point) included in your data set. For insights on geospatial calculations, you can start at https://cwiki.apache.org/confluence/display/Hive/Spatial+queries and https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL.
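For example, once the Esri spatial framework is wired into Hive, point values can be queried directly. A minimal sketch (the jar file names are placeholders; see the wiki pages above for the actual setup):
-- Register the Esri geometry UDFs (jar names/versions are illustrative).
add jar esri-geometry-api.jar;
add jar spatial-sdk-hive.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_X as 'com.esri.hadoop.hive.ST_X';
-- Construct a point like the one in the sample data and read back its x coordinate.
select ST_X(ST_Point(-10.0, -11.0));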
Created 03-13-2016 08:53 AM
@Mayank Shekhar, thanks for sharing this information and link.
Created 03-13-2016 09:00 AM
On the other hand, Hive's regexp_replace can help in cleaning the data. For example, the statement below removes nested '\t', '\n' and '\r' sequences from unformatted data within a single JSON string:
-- populate clean src table
insert overwrite table src_clean PARTITION (ddate='${hiveconf:DATE_VALUE}')
select regexp_replace(
         regexp_replace(
           regexp_replace(
             regexp_replace(full_json_line,
               "\\\\\\\\\\\\\\\\t|\\\\\\\\\\\\\\\\n|\\\\\\\\\\\\\\\\r",
               "\\\\\\\\\\\\\\\\<t or n or r>"),
             "\\\\\\\\\\\\t|\\\\\\\\\\\\n|\\\\\\\\\\\\r",
             "\\\\\\\\ "),
           "\\\\\\\\t|\\\\\\\\n|\\\\\\\\r",
           "\\\\\\\\<t or n or r>"),
         "\\\\t|\\\\n|\\\\r",
         "") as full_json_line
from src_unclean
where ddate='${hiveconf:DATE_VALUE}';
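As a simpler illustration of the same idea, a single pass that strips literal tab, newline and carriage-return characters (a sketch reusing the src_unclean/src_clean names above) would be:
-- One-level cleanup: drop literal tab, newline and carriage-return
-- characters from each JSON line.
insert overwrite table src_clean PARTITION (ddate='${hiveconf:DATE_VALUE}')
select regexp_replace(full_json_line, '\\t|\\n|\\r', '') as full_json_line
from src_unclean
where ddate='${hiveconf:DATE_VALUE}';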
Created 03-15-2016 06:35 PM
@Mayank Shekhar, thanks for sharing this information.