New Contributor
Posts: 5
Registered: ‎03-13-2018

where is final Output being stored of job/query executed on HDFS data.

Let's suppose a user has executed a Hive query on HDFS data and it returns an output. Where is this output stored?

Is it on HDFS or on the local file system?

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: where is final Output being stored of job/query executed on HDFS data.

If you're asking about standard SELECT queries in Hive, the final results are
stored in a temporary scratch directory on HDFS and streamed to the client.
This directory is deleted after the Hive session terminates or when the
statement is marked closed.
New Contributor
Posts: 5
Registered: ‎03-13-2018

Re: where is final Output being stored of job/query executed on HDFS data.

Hi Harsh,

Thanks for your response.

That's correct, but what if it's not a SELECT statement, but rather some output that is to be used as input to another job/query, or presented to stakeholders?

Regards, Nakul Manhas

New Contributor
Posts: 5
Registered: ‎03-13-2018

Re: where is final Output being stored of job/query executed on HDFS data.

Hi Harsh,

Could you please reply to my query?

Thanks and regards,
Nakul Manhas
Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: where is final Output being stored of job/query executed on HDFS data.

All query results in Hive are stored on HDFS, not on the local filesystem.

Once the jobs end, the local directories of the containers used in YARN do not persist, so results cannot be held on local filesystems.

I'm not certain exactly what you mean by 'used as an input of other job/query', but if you mean inter-stage data in a single query where two or more distinct YARN jobs are involved, then that stage result data is also on HDFS.

May I ask where the question/concern stems from?
New Contributor
Posts: 5
Registered: ‎03-13-2018

Re: where is final Output being stored of job/query executed on HDFS data.

Hi Harsh,

Thanks for your input.

As you mentioned, all Hive results and inter-stage data are stored in HDFS. Does that mean the data, whether final output or inter-stage, will be replicated 3 times (or as per the configuration)?

If all query output (Hive, Impala, etc.) is stored in HDFS, then why do we configure a path such as data/1/impala/impalad for Impala and YARN?

Thanks in Advance.

Regards,
Nakul Manhas
Posts: 1,754
Kudos: 371
Solutions: 279
Registered: ‎07-31-2013

Re: where is final Output being stored of job/query executed on HDFS data.

All of the answers here were for Hive, not Impala.
Impala architecturally handles things very differently than Hive, and does
not leverage YARN for its execution, nor does it store its
spill/intermediate data on HDFS.

The local storage paths for YARN are used for job-intermediate (not
query-intermediate, which is of a higher level) data. For example, for a
map phase to sort its data and send it to the reduce phase, for the reducer
to merge incoming map data and sort it, etc. The storage is meant for a
container's transient data that does not need to persist beyond the life of
the job.

Query stage and final results need to be persisted for a finite time, so
they go on HDFS until automatically cleaned up by the query execution logic.
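
A rough way to summarise the distinction above is the Python sketch below. The
dictionary keys and the function name are made up for illustration. The
settings shown (yarn.nodemanager.local-dirs, hive.exec.scratchdir and the
Impala --scratch_dirs flag) are the usual ones for these locations but are not
named above, so treat them as assumptions and check your own cluster
configuration.

```python
from enum import Enum


class Storage(Enum):
    LOCAL_DISK = "node-local disk"
    HDFS = "HDFS"


# Illustrative lookup only: the keys are made-up labels, and the settings
# listed are assumed to be the relevant ones for each location.
DATA_LOCATIONS = {
    # Map-side sort and reduce-side merge data of a single YARN job.
    "job_intermediate": (Storage.LOCAL_DISK, "yarn.nodemanager.local-dirs"),
    # Output of one query stage that feeds the next stage of the same query.
    "query_stage": (Storage.HDFS, "hive.exec.scratchdir"),
    # Final SELECT results, kept until fetched or the statement/session closes.
    "final_result": (Storage.HDFS, "hive.exec.scratchdir"),
    # Impala spill data, written only when memory is insufficient.
    "impala_spill": (Storage.LOCAL_DISK, "--scratch_dirs"),
}


def where_is(kind: str) -> Storage:
    """Return the storage tier for one of the labels above."""
    storage, _setting = DATA_LOCATIONS[kind]
    return storage


for kind, (storage, setting) in DATA_LOCATIONS.items():
    print(f"{kind:17} -> {storage.value:15} ({setting})")
```

Only the first and last rows touch local disk, which is why those local paths
still have to be configured even though Hive results live on HDFS.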

On the topic of replication: yes, the 'temporary' HDFS data from inter-stage
phases of queries may be replicated, but it is cleaned up once the query
reaches any completion state. Final query results are also deleted after
the results are extracted, or when the query/session is marked closed by the
user/app issuing it.
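
To put a number on the replication cost, here is a minimal sketch of the
arithmetic. It assumes the temporary files are written with the cluster's
configured replication factor (3 is the usual dfs.replication default), and
the function name is hypothetical.

```python
GIB = 1024 ** 3


def raw_hdfs_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk space consumed across the cluster while the data exists."""
    if logical_bytes < 0 or replication < 1:
        raise ValueError("size must be >= 0 and replication must be >= 1")
    return logical_bytes * replication


# 10 GiB of inter-stage data at replication 3 occupies 30 GiB of raw disk
# until the query completes and the scratch data is removed.
print(raw_hdfs_bytes(10 * GIB) / GIB)  # prints 30.0
```

The space is only held temporarily, so it matters mostly for peak usage while
large queries are running.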

FWIW, you could use more RAM to reduce disk usage. Hive with MR is very
disk-oriented because each stage is its own job and the jobs use local
storage when running, but Hive with Spark may use much less disk space
during a query due to better stage transitions (all within the same app,
instead of separate apps). Likewise, Impala uses up RAM, but does not
impact your local disk storage unless it finds inadequate memory to execute
the query.

Hope this helps.