Support Questions
Find answers, ask questions, and share your expertise

Sqoop import to route records with null values in a particular columns into another table

New Contributor

I am trying to move records with null values in a particular column to a particular table and non-null records to another while SQOOP import. tried to explore on goolge but there is not much beyond --null-string and --null-non-String params but that will just replace with the defined characters ...

I can think of following ways to handle it

  1. once importing into hive, run a dedup to filter out the records but this is something to be tried in worst case.
  2. handling at sqoop level itself(no clue on this)

could any expert here can help me with the above ask.

ENV details : its a plain Apache hadoop cluster. sqoop version 1.4.6

1 REPLY 1

Hello @Erfan uddin.

Not sure If I really get it, but you have a source with a bunch of null values for some columns and want to have those records separately in a different table at the destination right? And the same for the non-null.
Ok, at first glance guess you have 3 options that just came to my mind:

1) Use 2 sqoop import + free form query to separate the null values from the non-null using explicit SQL.
2) Use 1 sqoop import without separating nothing, just putting all data in on HDFS path, and at the Hive side, you build 2 external tables with RegexSerde. One of them will read the registers with any NULL value and the reverse logic applied to the other one.
3) Somehow create the following logic (i didn't test it myself..)
- Use 1 sqoop import but with free form query. And for the explicit query put an UNION which the SELECT from the TOP has filter for the ALL NON NULL + Flag telling that part of the query has NON-NULL and the SELECT from the bottom the NULL values. And you will use those flags listed above as partition keys for hive.
E.g. of the query:

SELECT * FROM 
(SELECT	
	COL1	
	,COL2	
	,COL3	
	,'NULL-FLAG' as partitioned_field
FROM	
	BIGTABLE
WHERE
	--YOUR LOGIC TO FILTER NULL's
UNION
SELECT	
	COL1	
	,COL2	
	,COL3	
	,'NOT-NULL-FLAG' as partitioned_field
FROM	
	BIGTABLE
WHERE
	--YOUR LOGIC TO FILTER NON NULL's
) A WHERE $CONDITIONS

Hope this helps