Support Questions

Find answers, ask questions, and share your expertise

Limits of Pig LOAD statement?

avatar
Contributor
Simple question: What is the memory limitation of the Pig LOAD statement?

More detailed question: Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?

Fact: A research project is using a Pig script that is trying to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I finally was able to review the code.

-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4} as (src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);

Details:

CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode

TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported: 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s

TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s

TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s

This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4, running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
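For anyone who wants to test the one-file-at-a-time suspicion at small scale, here is a minimal sketch. It assumes only that Pig LOAD paths accept Hadoop glob patterns and that parameters can be passed with pig -param; the FILE_GLOB parameter name and the file-name pattern are made-up placeholders, and the loader arguments are abbreviated exactly as in the statement above, so both would need to be replaced with the project's real values.

-- Hypothetical: load only a slice of the 12,000+ files per run, e.g.
--   pig -param I_WATS_DIR=/path/to/wats -param FILE_GLOB='*part-000*' script.pig
Orig = LOAD '$I_WATS_DIR/$FILE_GLOB'
       USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...')
       AS (src:chararray, timestamp:chararray, html_base:chararray, relative:chararray,
           path:chararray, text:chararray, alt:chararray);

Running a few such slices and comparing resident memory to input size, the way Test Jobs 1-3 already do, should show whether memory keeps growing roughly linearly before the full 891GB is attempted.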
1 ACCEPTED SOLUTION

avatar
Contributor

I have no answer to this question yet, so I will resolve it with some sections from the O'Reilly book Programming Pig.

 

Pig Latin, a Parallel Dataflow Language

Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data. This means that Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. For information on how to integrate the data flow described by a Pig Latin script with control flow, see Chapter 9.

 

Gates, Alan (2011-09-29). Programming Pig (p. 4). O'Reilly Media. Kindle Edition.
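The "word count example given previously" that this excerpt refers to is not included in the quote. A minimal Pig Latin version of it, shown only to illustrate what a pure dataflow with no if statements or for loops looks like (the input and output paths are placeholders):

-- Each assignment is one node in the dataflow DAG; Pig compiles the whole
-- graph into MapReduce jobs rather than executing it statement by statement.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';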

 

One point that is implicit in everything I have said so far is that Pig (like MapReduce) is oriented around the batch processing of data. If you need to process gigabytes or terabytes of data, Pig is a good choice. But it expects to read all the records of a file and write all of its output sequentially. For workloads that require writing single or small groups of records, or looking up many different records in random order, Pig (like MapReduce) is not a good choice. See NoSQL Databases for a discussion of applications that are good for these use cases.

 

Gates, Alan (2011-09-29). Programming Pig (p. 9). O'Reilly Media. Kindle Edition.

 

MEMORY REQUIREMENTS OF PIG DATA TYPES

In the previous sections I often referenced the size of the value stored for each type (four bytes for integer, eight bytes for long, etc.). This tells you how large (or small) a value those types can hold. However, this does not tell you how much memory is actually used by objects of those types. Because Pig uses Java objects to represent these values internally, there is an additional overhead. This overhead depends on your JVM, but it is usually eight bytes per object. It is even worse for chararrays because Java’s String uses two bytes per character rather than one. So, if you are trying to figure out how much memory you need in Pig to hold all of your data (e.g., if you are going to do a join that needs to hold a hash table in memory), do not count the bytes on disk and assume that is how much memory you need. The multiplication factor between disk and memory is dependent on your data, whether your data is compressed on disk, your disk storage format, etc. As a rule of thumb, it takes about four times as much memory as it does disk to represent the uncompressed data.

 

Gates, Alan (2011-09-29). Programming Pig (p. 26). O'Reilly Media. Kindle Edition.
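Applying that rule of thumb to the figures in the question gives a rough, back-of-the-envelope picture only (the WAT files are gzipped, so the true uncompressed size and compression ratio are unknown):

Test Job 1: 1.2GB resident / 138MB input   ≈ 8.7x
Test Job 2: 16.4GB resident / 3.5GB input  ≈ 4.7x
Test Job 3: 41.4GB resident / 10.9GB input ≈ 3.8x

891GB full input x ~4 (observed ratio)  ≈ 3.5TB of implied memory
3 compute nodes x 128GB RAM             = 384GB available

If resident memory kept scaling the way Test Jobs 1-3 suggest, the full 891GB directory would imply several terabytes of memory against 384GB of RAM on the compute nodes, which would be consistent with the observed hangs and with processing the input in smaller slices rather than one LOAD.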

 

 

 


5 REPLIES

avatar
Contributor
Simple question:
What is the memory limitation of the Pig LOAD statement?

More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?

avatar
Contributor
Scenario:

A research project is using a Pig script that is trying to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I finally was able to review the code.

avatar
Contributor
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4} as (src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
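Assuming the statement above parses, two cheap sanity checks from the Grunt shell are shown below; neither launches a MapReduce job, so they are safe to try even with the full 891GB directory as input.

DESCRIBE Orig;  -- prints the schema declared in the AS clause
EXPLAIN Orig;   -- prints the logical, physical, and MapReduce plans Pig would run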

avatar
Contributor