
mapreduce or local mode of the execution

Expert Contributor

I use a simple Pig script that reads an input .txt file and adds a new field (a row number) to each line.

The output relation is then stored into avro.

Is there any benefit to running such a script in mapreduce mode compared to local mode?

Thank you

7 REPLIES

Re: mapreduce or local mode of the execution

I didn't get what you are adding (a new field?), but if the input .txt file is huge, a MapReduce job will process it in parallel on the cluster. Local mode runs everything in the memory of your Pig client (grunt).

Re: mapreduce or local mode of the execution

Mentor

@John Smith you get even better performance if you execute it in Tez mode ("pig -x tez"). I'm also a bit confused, please share the script. In general, in local mode you are dealing with the local filesystem (file:///), while in mapreduce or tez mode you are dealing with hdfs://. There is also tez_local. In any mode you need to be aware which mode you are in, and you can still access either filesystem as long as you provide its scheme.
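To illustrate the scheme point, here is a minimal sketch (the paths below are placeholders): a scheme-qualified path overrides the default filesystem of whatever mode you launched Pig in, whether that was "pig -x local", "pig -x mapreduce", or "pig -x tez".

```pig
-- Explicit URI schemes let you reach either filesystem from any execution mode.
-- Both paths are hypothetical examples.
local_data = LOAD 'file:///tmp/input.txt' USING PigStorage(',');
hdfs_data  = LOAD 'hdfs:///user/me/input.txt' USING PigStorage(',');
```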

Re: mapreduce or local mode of the execution

Expert Contributor

But is Hadoop/MapReduce able to split the source .txt file so that each node processes only a specific part of it?

Re: mapreduce or local mode of the execution

Mentor

The logic wouldn't be different from any other mode; local mode just does everything in the memory of one node, while the other modes distribute the work. @John Smith, show an example and we can provide a draft.

Re: mapreduce or local mode of the execution

Expert Contributor

But if you have a source file which is 100GB and the memory of the machine is only 32GB, would it work?

In case mapreduce is used and more nodes are processing the file ... is there any benefit from the parallelism?

script is simple:

x = load 'file.txt' using PigStorage(',');
x = rank x;
store x using ....
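Putting the pieces together, a complete version of that script might look like the sketch below. The input and output paths are placeholders, and it assumes the built-in AvroStorage available in recent Pig releases (older installs need to REGISTER the piggybank Avro jars instead).

```pig
-- Hedged sketch of the script described above; paths are hypothetical.
lines  = LOAD 'hdfs:///user/me/input.txt' USING PigStorage(',');
ranked = RANK lines;   -- prepends a 1-based row number to every tuple
STORE ranked INTO 'hdfs:///user/me/output' USING AvroStorage();
```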

Re: mapreduce or local mode of the execution

Once data is uploaded into HDFS it is split into 128MB blocks (the default block size). MapReduce programs execute one task per block, in parallel. That's the reason you have a Hadoop cluster: for a 100GB file, that is roughly 100 × 1024 / 128 = 800 blocks, so up to about 800 map tasks can run in parallel.

If you run it on a local file (or in local mode), the full 100GB is processed by one task. It will normally still not run out of memory, since Pig streams the data rather than keeping it all in memory, but it will take a very long time.

Re: mapreduce or local mode of the execution

Mentor

@John Smith you're better off processing this file in tez/mapred mode. Doing it in local mode kind of defeats the purpose of using Pig altogether; you might as well use any scripting language to parse the file, or even Spark in local mode.
