Sqoop import incremental job cannot save incremental import state to the metastore running on PostgreSQL due to serialization failure

General Flow description:

We have a pipeline that runs 10-15 tasks simultaneously to import data into Hive from different database sources (one task per source). Each task loops over the list of tables we want to import into Hive and sequentially runs the corresponding Sqoop saved incremental job (one job per table).
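Schematically, each task is just a sequential loop over its tables; the table list and job-name convention below are illustrative, not our real names:

```shell
#!/usr/bin/env bash
# One import task (one per database source); runs its saved jobs sequentially.
# Table list and job-name convention are illustrative placeholders.
TABLES="users orders payments"

for table in $TABLES; do
    job="import-into-stg_${table}"
    # Each saved job keeps its own incremental.last.value in the metastore.
    echo "sqoop job --exec ${job}"   # echoed here; the real task executes it
done
```

10-15 of these loops run in parallel, so several jobs can hit the shared metastore at the same moment.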

Recently we moved all Sqoop incremental jobs from the local metastore to PostgreSQL due to HA requirements.
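For context, saved jobs can reach a shared metastore either via --meta-connect on the command line or via the standard client autoconnect properties in sqoop-site.xml; a sketch with placeholder host and credentials:

```xml
<!-- sqoop-site.xml: point saved jobs at the shared metastore.
     Host, port and credentials are placeholders, not our real values. -->
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:postgresql://metastore-host:5432/sqoop</value>
</property>
<property>
  <name>sqoop.metastore.client.autoconnect.username</name>
  <value>sqoop</value>
</property>
<property>
  <name>sqoop.metastore.client.autoconnect.password</name>
  <value>changeme</value>
</property>
```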

Executing a single job for a particular task works without any issue.

Problem Description:

The issue happens occasionally on different sources with the error below:

18/09/05 07:54:15 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Error communicating with database
	at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.createInternal(HsqldbJobStorage.java:426)
	at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.update(HsqldbJobStorage.java:445)
	at org.apache.sqoop.tool.ImportTool.saveIncrementalState(ImportTool.java:164)
	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:528)
	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
	at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:243)
	at org.apache.sqoop.tool.JobTool.run(JobTool.java:298)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
Caused by: org.postgresql.util.PSQLException: ERROR: could not serialize access due to read/write dependencies among transactions
  Detail: Reason code: Canceled on identification as a pivot, during conflict out checking.
  Hint: The transaction might succeed if retried.

The matching errors from postgresql.log:

2018-09-01 07:47:36 UTC [19863-1] sqoop@sqoop ERROR:  could not serialize access due to read/write dependencies among transactions
2018-09-01 07:47:36 UTC [19863-2] sqoop@sqoop DETAIL:  Reason code: Canceled on identification as a pivot, during conflict out checking.
2018-09-01 07:47:36 UTC [19863-3] sqoop@sqoop HINT:  The transaction might succeed if retried.
2018-09-01 07:47:36 UTC [19863-4] sqoop@sqoop STATEMENT:  UPDATE SQOOP_SESSIONS SET propval = $1 WHERE job_name = $2 AND propclass = $3 AND propname = $4

Or 

2018-09-05 07:54:15 UTC [22084-1] sqoop@sqoop ERROR:  could not serialize access due to read/write dependencies among transactions
2018-09-05 07:54:15 UTC [22084-2] sqoop@sqoop DETAIL:  Reason code: Canceled on identification as a pivot, during conflict out checking.
2018-09-05 07:54:15 UTC [22084-3] sqoop@sqoop HINT:  The transaction might succeed if retried.
2018-09-05 07:54:15 UTC [22084-4] sqoop@sqoop STATEMENT:  SELECT propval FROM SQOOP_SESSIONS WHERE job_name = $1 AND propclass = $2 AND propname = $3

Investigations:

We tried re-creating the SQOOP_SESSIONS table with fillfactor=50, but it didn't fix the issue.

sqoop=> show default_transaction_isolation ;
 default_transaction_isolation
-------------------------------
 read committed
(1 row)
sqoop=> show block_size ;
 block_size
------------
 8192
(1 row)

-- We re-created SQOOP_SESSIONS with fillfactor=50 
CREATE TABLE SQOOP_SESSIONS(
    JOB_NAME VARCHAR(64) ,
    PROPNAME VARCHAR(128) ,
    PROPVAL VARCHAR(1024) ,
    PROPCLASS VARCHAR(32) ,
    CONSTRAINT SQOOP_SESSIONS_unq UNIQUE
    (JOB_NAME ,PROPNAME ,PROPCLASS) WITH (fillfactor=50)
)
WITH (fillfactor=50);

-- Execution plan validation 
sqoop=> select job_name, propclass,propname , propval from public.sqoop_sessions where job_name='bacon-import-into-stg_aclu4_user' and propname='incremental.last.value' order by 1,2,3;
             job_name             |  propclass   |        propname        |    propval
----------------------------------+--------------+------------------------+---------------
 bacon-import-into-stg_aclu4_user | SqoopOptions | incremental.last.value | 1535702286300
(1 row)

sqoop=> PREPARE stmt(varchar,varchar,varchar,varchar) AS  UPDATE SQOOP_SESSIONS SET propval = $1 WHERE job_name = $2 AND propclass = $3 AND propname = $4;
PREPARE
sqoop=> EXPLAIN EXECUTE stmt('1535702286300','bacon-import-into-stg_aclu4_user','SqoopOptions','incremental.last.value');
 Update on sqoop_sessions  (cost=0.41..8.43 rows=1 width=78)
   ->  Index Scan using sqoop_sessions_unq on sqoop_sessions  (cost=0.41..8.43 rows=1 width=78)
         Index Cond: (((job_name)::text = 'bacon-import-into-stg_aclu4_user'::text) AND ((propname)::text = 'incremental.last.value'::text) AND ((propclass)::text = 'SqoopOptions'::text))
(3 rows)
DEALLOCATE stmt ;

PostgreSQL package version: 9.5.14-0ubuntu0.16.04
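Since the PostgreSQL hint says the transaction might succeed if retried (serialization failures, SQLSTATE 40001, are generally expected to be retried by the client), we are also considering wrapping each `sqoop job --exec` in a retry loop as a stopgap. A sketch; the function name, MAX_RETRIES, the delay and the job name are illustrative:

```shell
#!/usr/bin/env bash
# Retry a command a few times before giving up, sleeping between attempts.
# MAX_RETRIES and the 1-second delay are illustrative values.
MAX_RETRIES=3

retry_on_failure() {
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            echo "command failed after ${MAX_RETRIES} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
}

# Usage (job name is one of our saved incremental jobs):
# retry_on_failure sqoop job --exec bacon-import-into-stg_aclu4_user
```

The downside is that retries can mask real metastore problems, so each retry would need to be logged.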

It is clear the issue is connected with PostgreSQL's transaction isolation handling. The detail line ("Canceled on identification as a pivot") comes from serializable snapshot isolation, which PostgreSQL applies only to transactions running at the SERIALIZABLE level, so even though the server default is read committed, the Sqoop metastore client is evidently opening its connections at SERIALIZABLE (the HsqldbJobStorage class in the stack trace, which also serves the generic JDBC metastore, appears to request serializable isolation). Concurrent jobs reading and updating SQOOP_SESSIONS can then abort each other with SQLSTATE 40001.

Any ideas how to fix it?