I use a simple Pig script that reads an input .txt file and adds a new field (the row number) to each line.
The output relation is then stored as Avro.
Is there any benefit to running such a script in MapReduce mode compared to local mode?
I didn't get what you were adding (a new field?), but if the input .txt file is huge, a MapReduce program will execute it in parallel on the cluster. Local mode will run it in the memory of your Pig client (grunt).
@John Smith you get even better performance if you execute it in Tez mode ("pig -x tez"). I'm also a bit confused, please share the script. In general, in local mode you are dealing with the local filesystem ("file:///"), while in mapreduce or tez mode you are dealing with "hdfs://". There is also tez_local. In any mode you need to be aware of which mode you are in, and you can still access either filesystem by providing its scheme.
But if the source file is 100GB and the machine only has 32GB of memory, would it work?
And in case MapReduce is used and more nodes are processing the file, is there any benefit from the parallelism?
The script is simple:
x = load 'file.txt' using PigStorage(',');
x = rank x;
store x using ....
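For reference, a minimal sketch of how such a script could be launched in each execution mode (the filename rank.pig is an assumption; the -x flag is the standard Pig launcher option for selecting the execution engine):

```shell
pig -x local rank.pig      # single JVM on the client, reads file:/// by default
pig -x mapreduce rank.pig  # runs as MapReduce jobs on the cluster, reads hdfs://
pig -x tez rank.pig        # runs as a Tez DAG on the cluster, usually faster than MapReduce
```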
Once data is uploaded into HDFS it is split into 128MB blocks (the default block size). MapReduce programs process the blocks in parallel, one map task per block. That's the reason you have a Hadoop cluster.
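To make the parallelism concrete, here is a back-of-the-envelope calculation (assuming the default 128 MB HDFS block size, which is configurable via dfs.blocksize) of how many map tasks a 100 GB input would be split into:

```python
import math

# Assumed figures: a 100 GB input file and the default 128 MB HDFS block size.
file_size_mb = 100 * 1024   # 100 GB expressed in MB
block_size_mb = 128

# Each HDFS block becomes one input split, i.e. one map task.
num_map_tasks = math.ceil(file_size_mb / block_size_mb)
print(num_map_tasks)  # 800 map tasks, running in parallel across the cluster
```

In local mode, by contrast, that same 100 GB would be streamed through a single task, so the work is 800x less parallel even though it still completes without loading everything into memory.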
If you run it on a local file (or in local mode), the full 100GB is processed by a single task. It will normally still not run out of memory, since Pig does not keep all the data in memory, but it will take a very long time.
@John Smith you're better off processing this file in tez/mapreduce mode. Doing it in local mode defeats the purpose of using Pig altogether; you might as well use any scripting language to parse the file, or even Spark in local mode.