Member since: 10-20-2013
Posts: 13
Kudos Received: 1
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 7137 | 10-30-2013 05:19 AM |
10-30-2013
05:20 AM
You can also refer to this related thread on Stack Overflow: http://stackoverflow.com/questions/19480762/specific-memory-limitations-of-pig-load-statement
10-30-2013
05:19 AM
I have no answer to this question yet, so I will resolve it with some sections from the O'Reilly book Programming Pig.

Pig Latin, a Parallel Dataflow Language

Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. These data flows can be simple linear flows like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined, and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.

This means that Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow. For information on how to integrate the data flow described by a Pig Latin script with control flow, see Chapter 9.

Gates, Alan (2011-09-29). Programming Pig (p. 4). O'Reilly Media. Kindle Edition.

One point that is implicit in everything I have said so far is that Pig (like MapReduce) is oriented around the batch processing of data. If you need to process gigabytes or terabytes of data, Pig is a good choice. But it expects to read all the records of a file and write all of its output sequentially. For workloads that require writing single or small groups of records, or looking up many different records in random order, Pig (like MapReduce) is not a good choice. See NoSQL Databases for a discussion of applications that are good for these use cases.

Gates, Alan (2011-09-29). Programming Pig (p. 9). O'Reilly Media. Kindle Edition.

MEMORY REQUIREMENTS OF PIG DATA TYPES

In the previous sections I often referenced the size of the value stored for each type (four bytes for integer, eight bytes for long, etc.). This tells you how large (or small) a value those types can hold. However, this does not tell you how much memory is actually used by objects of those types. Because Pig uses Java objects to represent these values internally, there is an additional overhead. This overhead depends on your JVM, but it is usually eight bytes per object. It is even worse for chararrays because Java's String uses two bytes per character rather than one. So, if you are trying to figure out how much memory you need in Pig to hold all of your data (e.g., if you are going to do a join that needs to hold a hash table in memory), do not count the bytes on disk and assume that is how much memory you need. The multiplication factor between disk and memory is dependent on your data, whether your data is compressed on disk, your disk storage format, etc. As a rule of thumb, it takes about four times as much memory as it does disk to represent the uncompressed data.

Gates, Alan (2011-09-29). Programming Pig (p. 26). O'Reilly Media. Kindle Edition.
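For anyone skimming this thread later, here is a minimal sketch of the kind of simple linear dataflow the book describes (the classic word count); the input path, output path, and relation names are only illustrative:

lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_output';

Each statement only adds a node to the dataflow graph; nothing actually runs until a STORE (or DUMP) is reached. And applying the book's rule of thumb to this thread: holding the 891GB input (gzipped on disk, so larger once uncompressed) entirely in memory would take something on the order of 3.5TB or more, while the cluster described in the original question has roughly 448GB of RAM in total, so any plan that tries to materialize all of it at once is going to struggle.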
10-30-2013
05:15 AM
1 Kudo
FYI for future readers, this can be set under Services > Service hdfs1 > Configuration > View and Edit > search for "safety valve". In the results, the field to use is "HDFS Service Configuration Safety Valve for hdfs-site.xml".
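To make that concrete, the snippet pasted into that safety valve field (or added inside the <configuration> element of hdfs-site.xml if you edit the file directly, as described further down this page) would look something like the following. The property names are the ones discussed in this thread; the values shown simply turn the replace-datanode-on-failure behaviour off, so adjust them to whatever actually works for your cluster:

<!-- stop DFSClient from trying to replace a failed datanode in the write pipeline;
     intended for very small clusters, per the hdfs-default.xml documentation -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>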
10-24-2013
11:33 AM
Andrew, thanks. I am currently working on purchasing Cloudera Enterprise licenses so we can get official support, and I will look for the area you are indicating.

For anyone else reading the thread, the reason we needed to set these variables is that MapReduce jobs (launched via Pig scripts) were failing at the Reduce phase. Based on system logs, I was able to trace the failures to these variables. The Hadoop documentation considers a 3 node cluster "extremely" small, and that is exactly what we are running - 1 namenode and 3 datanodes. Also, replication is set to 3, meaning all datanodes must be operational for HDFS to be healthy. In retrospect, with such a small cluster we would decrease replication to 2.

To be clear, we had to set both variables, even though the documentation suggests that setting only the first one ("enable", i.e. disabling the feature) should be enough. In fact, we found that our problem was not fixed until we also set "policy" to NEVER.

dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true.
- ALWAYS: always add a new datanode when an existing datanode is removed.
- NEVER: never add a new datanode.
- DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended.
10-24-2013
05:16 AM
Answer: For my cluster, I directly modified the /etc/hadoop/conf/hdfs-site.xml file on all 4 nodes, including the namenode and datanodes.

I am able to locate other dfs.client variables in Cloudera Manager (host:7180/cmf/services/19/config), but the variable that I added manually does not show up in Cloudera Manager as far as I can see.

Another variable to set in conjunction with this one is dfs.client.block.write.replace-datanode-on-failure.policy. See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
10-22-2013
09:24 AM
I have a small 3 node cluster and am experiencing total failure when running Reduce jobs. I searched through syslog and found errors pointing to this variable:

dfs.client.block.write.replace-datanode-on-failure.enable
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

How can I set this variable from the CDH4 Cloudera Manager interface? Or do I need to manually edit an XML file?
Labels:
- Cloudera Manager
10-20-2013
11:01 AM
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') as (src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
10-20-2013
11:00 AM
Scenario: A research project is using a Pig script that tries to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I was finally able to review the code.
10-20-2013
10:58 AM
Simple question: What is the memory limitation of the Pig LOAD statement? More detailed question: Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
10-20-2013
10:40 AM
Simple question: What is the memory limitation of the Pig LOAD statement?

More detailed question: Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?

Fact: A research project is using a Pig script that tries to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I was finally able to review the code.

-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') as (src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);

Details:

CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode

TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported: 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s

TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s

TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s

This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4 and running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
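On the "load a portion at a time" idea mentioned above: one low-effort way to test incrementally is that Pig's LOAD path accepts Hadoop glob patterns, so the same statement can be pointed at a subset of the directory without any other changes. A rough sketch, where the part-0000* filename pattern is purely hypothetical and would need to match however the WAT files are actually named:

-- hypothetical: restrict the load to a subset of the directory via a glob on the filename
Orig = LOAD '$I_WATS_DIR/part-0000*' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') as (src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);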
Labels:
- Apache Hadoop
- Apache Pig