
Only 1 map task while extracting data from Salesforce


Several extract jobs in Datameer (a rapid ETL/BI tool that sits on top of Hadoop) read data out of Salesforce objects. The largest extract is 1.4 GB (the Task object) and the smallest is 96 MB (the Account object). Datameer uses a REST API based connector: a SOQL query is supplied to the connector and records are fetched accordingly (https://documentation.datameer.com/documentation/display/DAS60/Salesforce).

Datameer compiles the job and hands execution over to the execution framework (Tez).

All the Salesforce extract jobs run with only 1 map task.

However, other extract jobs in Datameer that read flat files (50-200 MB) from an SFTP server use between 3 and 5 map tasks.

About SOQL: https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_cha...

SOQL pulls a maximum of 2000 records per batch.

My questions:

  1. Given that the flat-file extracts run with multiple map tasks, is the issue related to the SOQL batch size, which pulls only 2000 records per request and therefore results in only 1 mapper being allocated?
  2. How does an MR program determine the total size of the input when the source is something like Salesforce or, for that matter, a database?

Environment information: Hortonworks 2.7.1

Cores Per Data node=8

RAM per Data node=64GB

No of datanodes = 6

Block Size : 128 MB

Input Split info:

mapreduce.input.fileinputformat.split.maxsize=5368709120 (5 GB)

mapreduce.input.fileinputformat.split.minsize=16777216 (16 MB)
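For reference, here is a minimal sketch of how the stock Hadoop FileInputFormat would turn the block size and the two split settings above into a split size and a split count. It only covers file-based inputs and assumes the default 1.1 slop factor; Datameer may plan splits differently, and a REST source such as Salesforce does not go through this path at all. The function names are my own, not Hadoop's.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Same rule as FileInputFormat.computeSplitSize:
    # clamp the block size between the configured min and max.
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size, slop=1.1):
    # Cut full-size splits while the remainder is more than
    # `slop` times the split size; the tail becomes the last split.
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits

# Values from this post: 128 MB blocks, minsize 16 MB, maxsize 5 GB.
split_size = compute_split_size(128 * MB, 16777216, 5368709120)

if __name__ == "__main__":
    print("split size (MB):", split_size // MB)
    for size_mb in (50, 96, 200):
        print(size_mb, "MB file ->", count_splits(size_mb * MB, split_size), "split(s)")
```

With these settings the split size equals the block size, so under this rule a single splittable file of 50-200 MB would give only 1 or 2 splits. The 3-5 map tasks seen for the SFTP extracts therefore suggest that the number of mappers is not decided by these two properties alone.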

Execution Framework: Tez

Memory sizes:

<property> <name>mapreduce.map.memory.mb</name> <value>1536</value> </property>
<property> <name>mapreduce.reduce.memory.mb</name> <value>2048</value> </property>
<property> <name>mapreduce.map.java.opts</name> <value>-Xmx1228m</value> </property>
<property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx1638m</value> </property>
<property> <name>yarn.app.mapreduce.am.resource.mb</name> <value>1024</value> </property>
<property> <name>yarn.app.mapreduce.am.command-opts</name> <value>-Xmx819m -Dhdp.version=${hdp.version}</value> </property>
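As a quick sanity check on the memory settings above, the sketch below computes the ratio of each JVM heap (-Xmx) to its container size. The dictionary layout and the function name are illustrative only; the numbers are the ones listed in this post.

```python
# (container size in MB, JVM -Xmx in MB), taken from the properties above.
SETTINGS = {
    "map": (1536, 1228),
    "reduce": (2048, 1638),
    "app master": (1024, 819),
}

def heap_fraction(container_mb, xmx_mb):
    # Fraction of the YARN container that the JVM heap may occupy.
    return xmx_mb / container_mb

if __name__ == "__main__":
    for role, (container_mb, xmx_mb) in SETTINGS.items():
        print(f"{role}: {heap_fraction(container_mb, xmx_mb):.2f}")
```

All three heaps come out at roughly 80% of their containers, so the memory configuration looks internally consistent and is unlikely to be what limits the Salesforce extracts to a single map task.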

Compression is enabled:

<property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value> </property>

mapreduce.output.fileoutputformat.compress=true

mapreduce.output.fileoutputformat.compress.type=BLOCK

mapreduce.map.output.compress=true

mapred.map.output.compression.type=BLOCK
