Support Questions
Find answers, ask questions, and share your expertise

set num_nodes=0 not working from jdbc connection

Hi,

 

Our cluster 11 node cluster with 9 data nodes . Impala is running on 9 nodes. I'm trying to create text table in impala using CTAS using executeUpdate.

Table is having 10 GB of data. It is taking lot of time because its creating one file. To create multiple files I'm using set num_nodes=0 before my CTAS.

But still its creating one file. Will set option wont work from JDBC? its confusing because  its working fine from beeline too. Or is there any way can apply

parallelism in Impala to create multiple files under the table.

 

or Am I missing something please advice.

 

sample code like this.

 

execute("set num_nodes=0");

executeUpdate("create table abc_txt as select * from test');

 

Regards,

Venkat

 

 

 

5 REPLIES 5

Re: set num_nodes=0 not working from jdbc connection

Master Collaborator

Hi Venkat,

 

would you be able to provide the explain plan for that CTAS?

There are several possible reasons why your CTAS is not behaving as expected.

 

Alex

 

Re: set num_nodes=0 not working from jdbc connection

Alex,

 

As said earlier, I'm able to run it from Impala Shell & Beeline. But from JDBC it is not working as expected. Can we run the explain plan using

executeUpdate? Please advice.

 

Please find the sql.

 

1.We are reading avro table and creating the table in text format.

2.Below table created using executeUpdate.

 

create table SNAP_210210070704 row format delimited fields terminated by '\21' lines terminated by '\6' stored as textfile LOCATION '/staging/tmp/SNAP_210210070704' as select 210210070704, a.feed_srce,'B',a.ln_no,a.fee_tran_cd,a.fee_tran_ty,a.fee_cd,a.fee_tran_dt,a.fee_tran_null,a.fee_tran_am,a.fee_batch_cd,a.fee_sub_cd,a.client_no,'20150731','SNAP_210210070704' from fee a where cob_date='20150730' and feed_type like 'FULL%' and context_key=210283860704 and not exists (select 1 from SNA_210210070704 b where cob_date='20150731' and feed_type like 'FULL%210210070704%' and b.delta_file_byte in ('C', 'D') and a.LN_NO=b.LN_NO)

 

Regards,

Venkat

Re: set num_nodes=0 not working from jdbc connection

Master Collaborator

Hi Venkat,

 

could you clarify what you mean by "able to run it frm Impala Shell & Beeline". Does the query produce the expected number of files when run from the Impala Shell or Beeline?

 

The runtime profile has a lot of helpful information that may help me explain or identify the issue.

First run your CTAS from JDBC, then go to the WebUI of the Impala you are connected to (http://impalad-host:25000).

Go to the "queries" tab, find your CTAS, go to "profile". Attach the output here so I can see it. It may help to append a "&raw" to the URL to get the raw runtime profile output.

 

Thanks,

 

Alex

 

Re: set num_nodes=0 not working from jdbc connection

Alex,

 

Thanks for the reply. I have maksed servernames.  Please find  profile. I'm unable to paste everything.

 

Query (id=364c824a9605c289:10b6869ee8769dbc)
  Summary
    Session ID: 684348b0bdffe689:b743ccb91787838c
    Session Type: HIVESERVER2
    HiveServer2 Protocol Version: V6
    Start Time: 2015-10-05 18:54:17.929771000
    End Time: 2015-10-05 18:58:03.884054000
    Query Type: DDL
    Query State: EXCEPTION
    Query Status: Session closed
    Impala Version: impalad version 2.1.3-cdh5 RELEASE (build 20816e26a150d20d2c92c470aa40b342521b243e)
    User: ********
    Connected User: ********
    Delegated User:
    Network Address: 169.99.12.90:35482
    Default Db: msp
    Sql Statement: create table fee_FULL_SNAP_210310080704 row format delimited fields terminated by '\21' lines terminated by '\6' stored as textfile LOCATION '/data/risk/msp/staging/tmp/fee_FULL_SNAP_210310080704' as select 210310080704, a.feed_srce,'B',a.ln_no,a.fee_tran_cd,a.fee_tran_ty,a.fee_cd,a.fee_tran_dt,a.fee_tran_null,a.fee_tran_am,a.fee_batch_cd,a.fee_sub_cd,a.client_no,'20150725','FULL_SNAP_210310080704' from fee a where cob_date='20150724' and feed_type like 'FULL%' and context_key=210333970704 and not exists (select 1 from fee_210310080704 b where cob_date='20150725' and feed_type like 'FULL%210310080704%' and b.delta_file_byte in ('C', 'D') and a.LN_NO=b.LN_NO)
    Coordinator: XXXXXXXX:22000
    DDL Type: CREATE_TABLE_AS_SELECT
    Plan:
----------------
Estimated Per-Host Requirements: Memory=2.60GB VCores=2
WARNING: The following tables are missing relevant table and/or column statistics.
msp.fee

F02:PLAN FRAGMENT [HASH(b.LN_NO)]
  WRITE TO HDFS [msp.fee_full_snap_210310080704, OVERWRITE=false]
  |  partitions=1
  |  hosts=3 per-host-mem=100.00KB
  |
  02:HASH JOIN [RIGHT ANTI JOIN, PARTITIONED]
  |  hash predicates: b.LN_NO = a.LN_NO
  |  hosts=3 per-host-mem=2.00GB
  |  tuple-ids=0 row-size=159B cardinality=unavailable
  |
  |--04:EXCHANGE [HASH(a.LN_NO)]
  |     hosts=9 per-host-mem=0B
  |     tuple-ids=0 row-size=159B cardinality=unavailable
  |
  03:EXCHANGE [HASH(b.LN_NO)]
     hosts=3 per-host-mem=0B
     tuple-ids=1 row-size=43B cardinality=4486

F01:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, HASH(a.LN_NO)]
  00:SCAN HDFS [msp.fee a, RANDOM]
     partitions=8/76 files=48 size=5.86GB
     predicates: context_key = 210333970704
     table stats: unavailable
     columns missing stats: LN_NO, FEED_SRCE, FEE_TRAN_CD, FEE_TRAN_TY, FEE_CD, FEE_TRAN_DT, FEE_TRAN_NULL, FEE_TRAN_AM, FEE_BATCH_CD, FEE_SUB_CD, CLIENT_NO, CONTEXT_KEY
     hosts=9 per-host-mem=616.00MB
     tuple-ids=0 row-size=159B cardinality=unavailable

F00:PLAN FRAGMENT [RANDOM]
  DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=03, HASH(b.LN_NO)]
  01:SCAN HDFS [msp.fee_210310080704 b, RANDOM]
     partitions=2/2 files=2 size=75.72KB
     predicates: b.delta_file_byte IN ('C', 'D')
     table stats: 4486 rows total
     column stats: all
     hosts=3 per-host-mem=32.00MB
     tuple-ids=1 row-size=43B cardinality=4486
----------------
    Estimated Per-Host Mem: 2793406464
    Estimated Per-Host VCores: 2
    Tables Missing Stats: msp.fee
    Admission result: Admitted immediately
    Request Pool: root.default
    ExecSummary:
Operator          #Hosts  Avg Time  Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                      
-------------------------------------------------------------------------------------------------------------------------
02:HASH JOIN           2     1m40s     1m41s  64.41M          -1    7.91 GB        2.00 GB  RIGHT ANTI JOIN, PARTITIONED
|--04:EXCHANGE         2   9s534ms   9s585ms  64.41M          -1          0              0  HASH(a.LN_NO)               
|  00:SCAN HDFS        9   1s229ms   2s945ms  64.41M          -1    1.33 GB      616.00 MB  msp.fee a                   
03:EXCHANGE            2  67.719us  72.586us       3       4.49K          0              0  HASH(b.LN_NO)               
01:SCAN HDFS           2   1s173ms   1s901ms       3       4.49K  487.34 KB       32.00 MB  msp.fee_210310080704 b      
    Query Timeline
      Start execution: 47409
      Planning finished: 58537357
      Submit for admission: 106684171
      Completed admission: 106772551
      Ready to start remote fragments: 107207737
      Remote fragments started: 2515806710
      DML data written: 225846952360
      DML Metastore update finished: 225950685122
      Rows available: 225950820646
      Cancelled: 225952257049
      Unregister query: 225954306849
  ImpalaServer
    - AsyncTotalTime: 0
    - CatalogOpExecTimer: 47438331
    - ClientFetchWaitTimer: 3402362
    - InactiveTotalTime: 0
    - MetastoreUpdateTimer: 103800218
    - RowMaterializationTimer: 0
    - TotalTime: 0
  Execution Profile 364c824a9605c289:10b6869ee8769dbc
    Fragment start latencies:  count: 13, last: 0.959884, min: 0.197319, max: 1.16632, mean: 0.381671, stddev: 0.295418
    Insert Stats:
Partition: Default
  BytesWritten: 5.95 GB
    Per Node Peak Memory Usage: rkqcrl-odshadn09:22000(1.31 GB) rkqcrl-odshadn02 :22000(930.86 MB) rkqcrl-odshadn08 :22000(424.28 MB) rkqcrl-odshadn06 :22000(1.33 GB) rkqcrl-odshadn03********:22000(1.05 GB) rkqcrl-odshadn01********:22000(7.91 GB) rkqcrl-odshadn07********:22000(404.51 MB) rkqcrl-odshadn04********:22000(474.71 MB) rkqcrl-odshadn05********:22000(7.91 GB)
    - AsyncTotalTime: 0
    - FinalizationTimer: 79872153
      - FileDeletionTimer: 21965240
      - FileMoveTimer: 20832008
      - Overwrite/PartitionCreationTimer: 20704356
    - InactiveTotalTime: 0
    - TotalTime: 225727240905
    Averaged Fragment F02
      split sizes:  min: 0, max: 0, avg: 0, stddev: 0
      completion times: min:3m42s  max:3m45s  mean: 3m44s  stddev:1s318ms
      execution rates: min:0.00 /sec  max:0.00 /sec  mean:0.00 /sec  stddev:0.00 /sec
      num instances: 2
      - AsyncTotalTime: 0
      - AverageThreadTokens: 1.348993993993994
      - InactiveTotalTime: 0
      - PeakMemoryUsage: 8495599616
      - PerHostPeakMemUsage: 8495599616
      - PrepareTime: 254750154
      - RowsProduced: 32205693
      - TotalCpuTime: 247088779662
      - TotalNetworkReceiveTime: 55801618506
      - TotalNetworkSendTime: 0
      - TotalStorageWaitTime: 0
      - TotalTime: 223865616620
      BlockMgr
        - AsyncTotalTime: 0
        - BlockWritesOutstanding: 0
        - BlocksCreated: 1104
        - BlocksRecycled: 0
        - BufferedPins: 0
        - BytesWritten: 0
        - InactiveTotalTime: 0
        - MaxBlockSize: 8388608
        - MemoryLimit: 65529290752
        - PeakMemoryUsage: 8455716864
        - TotalBufferWaitTime: 0
        - TotalEncryptionTime: 0
        - TotalIntegrityCheckTime: 0
        - TotalReadBlockTime: 0
        - TotalTime: 0
      CodeGen
        - AsyncTotalTime: 0
        - CodegenTime: 0
        - CompileTime: 73989298
        - InactiveTotalTime: 0
        - LoadTime: 207238
        - ModuleFileSize: 2047308
        - OptimizationTime: 312771598
        - PrepareTime: 247806980
        - TotalTime: 636470695
      HdfsTableSink
        - AsyncTotalTime: 0
        - BytesWritten: 3192291894
        - CompressTimer: 0
        - EncodeTimer: 139225016987
        - FilesCreated: 1
        - FinalizePartitionFileTimer: 18935582
        - HdfsWriteTimer: 3838403571
        - InactiveTotalTime: 0
        - PartitionsCreated: 1
        - PeakMemoryUsage: 51200
        - RowsInserted: 32205693
        - TmpFileCreateTimer: 12738811
        - TotalTime: 143496036072
      HASH_JOIN_NODE (id=2)
        - AsyncTotalTime: 0
        - BuildPartitionTime: 15943875294
        - BuildRows: 32205693
        - BuildRowsPartitioned: 32205693
        - BuildTime: 2607941533
        - GetNewBlockTime: 93093564
        - HashBuckets: 67108864
        - InactiveTotalTime: 0
        - LargestPartitionPercent: 3
        - MaxPartitionLevel: 0
        - NumRepartitions: 0
        - PartitionsCreated: 32
        - PeakMemoryUsage: 8495587328
        - PinTime: 0
        - ProbeRows: 1
        - ProbeRowsPartitioned: 0
        - ProbeTime: 1477885010
        - RowsReturned: 32205693
        - RowsReturnedRate: 401023
        - SpilledPartitions: 0
        - TotalTime: 80309229560
        - UnpinTime: 2626
        EXCHANGE_NODE (id=4)
          - AsyncTotalTime: 0
          - BytesReceived: 3203511683
          - ConvertRowBatchTime: 1319448768
          - DeserializeRowBatchTimer: 21035545814
          - FirstBatchArrivalWaitTime: 41907820
          - InactiveTotalTime: 0
          - PeakMemoryUsage: 0
          - RowsReturned: 32205693
          - RowsReturnedRate: 568293
          - SendersBlockedTimer: 0
          - SendersBlockedTotalTimer(*): 0
          - TotalTime: 56680958901
        EXCHANGE_NODE (id=3)
          - AsyncTotalTime: 0
          - BytesReceived: 64
          - ConvertRowBatchTime: 10232
          - DeserializeRowBatchTimer: 317463
          - FirstBatchArrivalWaitTime: 1390250417
          - InactiveTotalTime: 0
          - PeakMemoryUsage: 0
          - RowsReturned: 1
          - RowsReturnedRate: 0
          - SendersBlockedTimer: 0
          - SendersBlockedTotalTimer(*): 0
          - TotalTime: 1390318737

Re: set num_nodes=0 not working from jdbc connection

Master Collaborator

Thanks, Venkat.

 

If my understanding is correct, then the plan you posted should produce exactly two output files, correct?

 

Let me explain briefly explain the issue. Because your larger table does not have stats, Impala choses a suboptimal plan with the large table on the build side of the anti join.

 

The degree of paralellism in Impala is determined by the leftmost scan. Since your smaller table only lives on two nodes, the series of joins and the final table sink (producing the rows for the new table), are run on 2 nodes. Each node will produce one file.

 

What you want is to have the large table on the left (probe) side of the join. Solutions:

 

1. Run compute stats on the large table (msp.fee) and try EXPLAIN again. The RIGHT ANTI JOIN should turn into a LEFT ANTI JOIN. The plan should produce 9 files (one for each host executing the sink).

2. Manually prevent the join inversion from LEFT ANTI JOIN to RIGHT ANTI JOIN. You can do this by adding the "straight_join" hint to the select portion of your CTAS like this:

CREATE TABLE T AS SELECT STRAIGHT_JOIN COL1 .... FROM T1, T2...