Distcp job after Hive job

New Member

Hello, I currently have a very simple workflow with a Hive script. When I run the workflow, everything runs properly, but at the end of each Hive query inside my Hive action, a "distcp" job starts.

This job is not part of my workflow, and I do not understand why it is launched.

If I run the same Hive query from Hue or anywhere else, there is no distcp job at the end.

Update:

The problem occurs even when I run Oozie from the command line.

The coordinator :

<coordinator-app 
        name="coord_l****" 
        frequency="0 4 * * *" 
        start="${startTime}" 
        end="${endTime}" 
        timezone="UTC" 
        xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <timeout>${my_timeout}</timeout>
        <concurrency>${my_concurrency}</concurrency>
        <execution>${execution_order}</execution>
        <throttle>${materialization_throttle}</throttle>
    </controls>
    <action>
        <workflow>
            <app-path>${nameNode}/**/workflow.xml</app-path>
            <configuration>
                <property>
                    <name>year</name>
                    <value>${coord:formatTime(coord:actualTime(),'yyyy')}</value>
                </property>
                <property>
                    <name>month</name>
                    <value>${coord:formatTime(coord:actualTime(),'MM')}</value>
                </property>
                <property>
                    <name>day</name>
                    <value>${coord:formatTime(coord:actualTime(),'dd')}</value>
                </property>
                <property>
                    <name>j_30_mprec_year</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -30, 'DAY'), 'yyyy')}</value>
                </property>
                <property>
                    <name>j_30_mprec_month</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -30, 'DAY'), 'MM')}</value>
                </property>
                <property>
                    <name>j_30_mprec_day</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -30, 'DAY'), 'dd')}</value>
                </property>
                <property>
                    <name>j_7_mprec_year</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -7, 'DAY'), 'yyyy')}</value>
                </property>
                <property>
                    <name>j_7_mprec_month</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -7, 'DAY'), 'MM')}</value>
                </property>
                <property>
                    <name>j_7_mprec_day</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -7, 'DAY'), 'dd')}</value>
                </property>
                <property>
                    <name>j_3_mprec_year</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'DAY'), 'yyyy')}</value>
                </property>
                <property>
                    <name>j_3_mprec_month</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'DAY'), 'MM')}</value>
                </property>
                <property>
                    <name>j_3_mprec_day</name>
                    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -3, 'DAY'), 'dd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>


The workflow :

<workflow-app name="wf_lab" xmlns="uri:oozie:workflow:0.4">
    <credentials>
        <credential name="hcat" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://****</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value></value>
            </property>
        </credential>
    </credentials>
    <start to="shell_date"/>
    
    <action name="shell_date" cred="hcat">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>**.sh</exec>
            <file>**.sh</file>
            <capture-output/>
        </shell>
        <ok to="maj_t"/>
        <error to="kill"/>
    </action>
    
    <action name="maj_t" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/apps/hive/conf/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/apps/hive/conf/hive-site.xml</value>
                </property>
                <property>
                    <name>tez.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.hive.log.level</name>
                    <value>INFO</value>
                </property>
                <property>
                    <name>hive.execution.engine</name>
                    <value>tez</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>**.hql</script>
            <param>workflowStartYearDate=${year}</param>
            <param>workflowStartMonthDate=${month}</param>
            <param>workflowStartDayDate=${day}</param>
            <param>j_30_mprec_year=${j_30_mprec_year}</param>
            <param>j_30_mprec_month=${j_30_mprec_month}</param>
            <param>j_30_mprec_day=${j_30_mprec_day}</param>  
            <param>j_7_mprec_year=${j_7_mprec_year}</param>
            <param>j_7_mprec_month=${j_7_mprec_month}</param>
            <param>j_7_mprec_day=${j_7_mprec_day}</param>  
            <param>j_3_mprec_year=${j_3_mprec_year}</param>
            <param>j_3_mprec_month=${j_3_mprec_month}</param>
            <param>j_3_mprec_day=${j_3_mprec_day}</param>  
            <param>workflowOldDay7=${wf:actionData('shell_date')['sub_7']}</param>
            <param>workflowOldDay3=${wf:actionData('shell_date')['sub_3']}</param>
        </hive>
        <ok to="maj_after"/>
        <error to="kill"/>
    </action>
    <action name="maj_after" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/apps/hive/conf/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/apps/hive/conf/hive-site.xml</value>
                </property>
                <property>
                    <name>tez.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.hive.log.level</name>
                    <value>INFO</value>
                </property>
                <property>
                    <name>hive.execution.engine</name>
                    <value>tez</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>**.hql</script>
            <param>workflowStartYearDate=${year}</param>
            <param>workflowStartMonthDate=${month}</param>
            <param>workflowStartDayDate=${day}</param>
        </hive>
        <ok to="maj_to"/>
        <error to="kill"/>
    </action>
    <action name="maj_to" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/apps/hive/conf/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/apps/hive/conf/hive-site.xml</value>
                </property>
                <property>
                    <name>tez.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.hive.log.level</name>
                    <value>INFO</value>
                </property>
                <property>
                    <name>hive.execution.engine</name>
                    <value>tez</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>***.hql</script>
            <param>workflowStartYearDate=${year}</param>
            <param>workflowStartMonthDate=${month}</param>
            <param>workflowStartDayDate=${day}</param>
        </hive>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Picture of the job browser:

3814-workflow-job.jpg

As this picture shows, the "distcp" job runs during my Hive action and starts at the end of each Hive query in my Hive script.

Thanks

1 ACCEPTED SOLUTION

Hi @Ludovic Rouleau. There is one known bug that could be causing this. Can you rule this out:

COMPONENT: Hive

VERSION: HDP 2.2.4 (Hive 0.14 + patches)

REFERENCE: BUG-35305

PROBLEM: With a CTAS or INSERT INTO TABLE query, after the job finishes, the data is moved into the destination table by a hadoop distcp job.

IMPACT: Hive insert queries get slow

SYMPTOMS: Hive insert queries get slow

WORK AROUND: N/A

SOLUTION: The property fs.hdfs.impl.disable.cache defaults to false from HDP 2.2.4 onward.

This issue is observed after upgrading to HDP 2.2.4 if the following property is still set to true in hive-site.xml. Set it to false:

fs.hdfs.impl.disable.cache=false

Setting this property to true was recommended for HDP 2.2.0 to avoid a HiveServer2 OutOfMemory issue.

REPLIES

Expert Contributor

Hi @Ludovic Rouleau,

I think the only place where a distcp could be launched is the first action: action name="shell_date"

Check the script that it loads; you should find something like: hadoop distcp hdfs://nn1:8020/xxx hdfs://nn2:8020/xxx

Or you could try skipping the first action and starting the workflow directly from the second action:

from: <start to="shell_date"/> to: <start to="maj_t"/>

New Member

Hi @Alessio Ubaldi

Unfortunately, there is no Hadoop command in my shell action, just a very simple date calculation. I added a picture of the job browser to better illustrate my point.

Thanks.

New Member

Thanks for your reply @bpreachuk. I see exactly the behavior you described.

I checked the hive-site.xml file (the one I specified in my workflow with <job-xml>) and the setting was already correct:

        <property>
          <name>fs.hdfs.impl.disable.cache</name>
          <value>false</value>
        </property>

I also added "SET fs.hdfs.impl.disable.cache = false;" to my HQL script (I don't know whether that is allowed), but I still get this distcp job.

Maybe Oozie uses another hive-site.xml?
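
One way to test that idea is to set the property directly in the Hive action's own <configuration> block, because Oozie passes those properties to the Hive job and they take precedence over values loaded through <job-xml>. The fragment below is a minimal sketch based on the maj_t action above, with the other properties and the <param> elements left out for brevity. Whether this override actually removes the distcp job on this cluster is an assumption; nothing in this thread confirms it.

```xml
<action name="maj_t" cred="hcat">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>/apps/hive/conf/hive-site.xml</job-xml>
        <configuration>
            <!-- Assumption: forcing the value here rules out a different
                 hive-site.xml silently setting it to true -->
            <property>
                <name>fs.hdfs.impl.disable.cache</name>
                <value>false</value>
            </property>
        </configuration>
        <script>**.hql</script>
    </hive>
    <ok to="maj_after"/>
    <error to="kill"/>
</action>
```

If the distcp job still appears with this override in place, the effective value seen by the Hive job is probably not coming from the workflow at all, and the launcher logs of the Hive action would be the next place to look.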