Member since: 08-22-2014
Posts: 45
Kudos Received: 9
Solutions: 5
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 250 | 08-20-2019 08:56 AM |
 | 16146 | 10-14-2016 10:25 AM |
 | 16171 | 10-11-2016 02:00 PM |
 | 1245 | 05-29-2015 10:43 AM |
 | 26034 | 05-12-2015 09:59 AM |
08-20-2019
08:56 AM
1 Kudo
Hi VikramD, Thanks for reaching out. Regarding your question:
> If the rack awareness topology is changed, would HDFS kick-off immediate Block movement
> or the new topology will only be effective for NEW blocks written on the cluster?
The new topology is applied as new blocks are written; changing it does not automatically trigger movement of all existing blocks. However, depending on your existing configuration, the HDFS Balancer will use the current rack awareness configuration, so when the Balancer is run, it will read the existing blocks and rebalance them according to the current rack awareness topology.
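For reference, a minimal command-line sketch of that rebalancing step (the -threshold value below is illustrative and should be tuned for your cluster):
# Verify the rack topology HDFS currently sees:
$ hdfs dfsadmin -printTopology
# Redistribute existing blocks according to the current topology;
# -threshold is the permitted deviation (in % of disk utilization):
$ hdfs balancer -threshold 10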
11-30-2018
12:34 PM
1 Kudo
Hi Bart, By design, Altus Data Engineering and Data Warehouse clusters intend for their long-term storage to reside in the cloud provider's object store, such as AWS S3 or Microsoft's ADLS. If you're looking for a cloud-based cluster with the ability for traditional HDFS usage, Altus Director may be better suited for your use case, as Director allows creation of a full-fledged CDH cluster residing within the cloud that can utilize both S3/ADLS as well as cluster-local HDFS. Listed below are a few links regarding Altus Director in the event you would like to know more about its features and functionality, as well as how to install and configure it:
https://www.cloudera.com/products/product-components/cloudera-director.html
https://www.cloudera.com/documentation/director/latest/topics/director_intro.html
10-15-2018
02:33 PM
Hi @Vik, As a follow-up to my colleagues' posts, please note that installing 3rd party software managed by Cloudera Manager (via CSDs, Custom Service Descriptors) is not currently supported on Altus Data Engineering, Altus SDX, or Altus Data Warehouse clusters; it is currently supported within the Altus Director portion of our Altus offerings. If you have a particular use case for deploying 3rd party software onto the Altus platform apart from Altus Director, please reach out to Cloudera Support or your Account Manager and we can work with you to understand your use case and help identify solutions to best fit your needs. Kind Regards, Anthony
07-25-2018
06:58 PM
Glad to hear, have a wonderful day!
07-24-2018
07:04 AM
Hi Kostas, As a follow-up to my colleague's post, were you able to create a cluster on Azure successfully?
04-09-2018
10:03 AM
2 Kudos
Symptoms: When attempting to create a Cloudera Altus cluster in AWS, creation fails with INTERNAL_CM_CMD_FAIL.
Diagnosis: CM deployment is failing at the creation/deployment of CM services due to incorrect or unexpected DNS hostnames on the AWS EC2 instances created for Cloudera Altus. Cloudera Altus requires that the DNS hostnames and DNS resolution options are enabled (set to yes) within the AWS VPC settings.
Solution: Set and confirm that DNS hostnames and DNS resolution are set to yes in the AWS VPC used for Cloudera Altus. Details for configuring DNS in AWS VPCs are available here: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html Once the AWS VPC DNS settings of the Cloudera Altus environment are enabled, be sure to delete the previous cluster creation attempt's EC2 instances, both within the Cloudera Altus console and through the AWS EC2 interface, then retry creating a new Cloudera Altus cluster with the updated settings.
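For those who prefer the AWS CLI over the VPC console, a hedged sketch of the same change (the VPC ID below is a placeholder; substitute your own):
# Enable DNS resolution and DNS hostnames on the VPC used by Cloudera Altus:
$ aws ec2 modify-vpc-attribute --vpc-id vpc-1a2b3c4d --enable-dns-support "{\"Value\":true}"
$ aws ec2 modify-vpc-attribute --vpc-id vpc-1a2b3c4d --enable-dns-hostnames "{\"Value\":true}"
# Verify the settings took effect:
$ aws ec2 describe-vpc-attribute --vpc-id vpc-1a2b3c4d --attribute enableDnsHostnames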
12-18-2017
01:02 PM
1 Kudo
Hi TldnSysadmin,
Thanks for reaching out to the Cloudera Community and raising this to our attention!
The issue you describe, where the instance types reported within the Altus Web UI show specs that aren't aligned to the corresponding Azure instance types, is a bug. I'll go ahead and file this on your behalf to get it resolved in an upcoming update.
In the meantime, regarding the reported symptom of "compute resource(s) not available", this is a separate issue and we can certainly work with you on resolving that. I will reach out via PM (private message) so that we can further assist with troubleshooting this issue.
Kind Regards,
Anthony
08-29-2017
05:54 AM
Hi @uzubair,
Thanks for raising this to our attention.
You may be able to check the S3 bucket (if logging is configured) for log output that could help determine why the described symptoms occurred.
Given the nature of this issue, are you able to create a support case through the Altus Web UI (Altus Web UI -> Support -> Support Center -> Technical Support; Component = Altus Data Engineering; Sub-Component = Clusters)?
Would you be able to provide some additional details within the support case, in particular:
The output of the following command (if you have the Altus CLI installed):
$ altus dataeng describe-clusters --cluster-name <cluster_name>
Cluster Creation time
# of Workers and Compute workers created, and whether spot instances were used
Was there an Instance Bootstrap Script used? If so, can you attach that to the support case as well?
Have any other cluster creations failed in this same manner recently, or has this been a single occurrence?
Kind Regards,
Anthony
04-25-2017
07:27 AM
4 Kudos
Symptoms: When attempting to access a running cluster via SSH, the connection times out or is rejected.
Applies To: All versions of clusters managed by Cloudera Altus.
Cause: The AWS security group needs an update, as the incoming SSH connection originating from the client machine may not be included in the defined cluster's AWS security group inbound rules.
Troubleshooting Steps: This issue typically occurs if the AWS security group configured for the cluster does not contain an up-to-date IP address for the connecting client. To alleviate this, perform the following steps, which add the IP address(es) needed to connect to the appropriate AWS security group (a CLI equivalent is sketched after the notes):
1) Log in to AWS.
2) Select the appropriate region in the top-right corner (i.e. US West (Oregon)).
3) Navigate to EC2.
4) Under Resources, click on Security Groups.
5) Locate the corresponding Cloudera Altus security group; for a particular cluster, narrow down the options by filtering on Group ID.
6) Click on the security group, which lists the Description of the group in the lower portion of the window.
7) Click on the Inbound tab, then Edit.
8) Click on Add Rule and select SSH as the Type, which also sets TCP as the Protocol and the Port Range to 22. Adjust these if needed.
9) Under Source, select either My IP or Custom, depending on the usage requirements (see notes below). NOTE: Please review the current inbound list if there are multiple My IP entries, as some of these may be outdated and can be removed; check with your AWS administrator if needed.
10) Click Save; the changes take effect immediately.
NOTES: My IP is good for setting the single IP address identified from the existing web connection to the AWS Web UI; this will be the external address of your current machine. If more flexibility is needed, consider using Custom. While My IP is useful for setting a single IP, it can create administration challenges when the incoming IP address is dynamically allocated, requiring an update to the AWS security group each time the client workstation's IP changes. If the DHCP pool or list of IP addresses is known, a Custom setting can reduce the administration required to keep the security group up to date. When using a Custom setting, be sure to set the CIDR block for the given range of known IP addresses. Per AWS: Specify a single IP address, or an IP address range in CIDR notation (for example, 203.0.113.5/32). If connecting from behind a firewall, you'll need the IP address range used by the client computers.
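The same inbound rule can also be added via the AWS CLI; a minimal sketch (the group ID is a placeholder, and the CIDR reuses the example address above):
# Add an inbound SSH rule for a single client IP to the cluster's security group:
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.5/32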
04-19-2017
03:02 PM
Question: What steps should a Cloudera Altus administrator take if a Cloudera Altus end user leaves their company, or should no longer have access to Cloudera Altus?
Answer: If a user leaves their company, a Cloudera Altus administrator will need to delete the user's Cloudera Altus CLI keys and remove their access from the Web UI to prevent unauthorized access.
04-19-2017
02:59 PM
1 Kudo
Question: Where are Cloudera Altus clusters created, and how are Data Engineering jobs run?
Answer: Clusters are spun up and reside within the customer's public cloud account. For example, in the case of AWS, EC2 instances are spun up and managed by Cloudera Altus in the subnet and VPC specified by the customer-created environment. These clusters handle specific Data Engineering workloads (Hive-on-Spark, Spark-on-YARN, MR2 on YARN, etc.).
04-19-2017
02:53 PM
1 Kudo
Question: What are the various Cloudera Altus interfaces available, and how can I access them?
Answer: Cloudera Altus is accessible via a command line interface (CLI), a console / graphical user interface (GUI), and a software development kit (SDK) for use with Java. The Cloudera Altus Web console is available via: https://console.altus.cloudera.com/ Instructions for installing the Cloudera Altus CLI are available here (link to Cloudera Altus CLI Documentation). Additional details for the Cloudera Altus SDK and coding examples are available here (link to Cloudera Altus SDK). Additional details on interfaces are available in the Cloudera Altus Interfaces section of the Cloudera Altus documentation.
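As an illustrative sketch only (the pip package name and install flow are assumptions; verify against the CLI documentation linked above):
# Assumed installation of the Altus CLI via its Python package:
$ pip install altuscli
# Example call, as used elsewhere in these posts:
$ altus dataeng describe-clusters --cluster-name <cluster_name>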
04-19-2017
02:43 PM
Question: Are there any current limitations with Cloudera Altus?
Answer: Commercial accounts have the following default soft limits:
- Up to 10 active clusters at any given time
- Limit of 100 users
- Limit of 10 defined Cloudera Altus environments
If your needs go beyond the current soft limits, please reach out to Cloudera Support to request an increase. Users given access as part of a free trial have no ability to create additional users and may have smaller limits.
04-19-2017
02:38 PM
Question: Can I change an environment once it is created in Cloudera Altus?
Answer: By design, Cloudera Altus environments are not modifiable once created. To make a change to an environment, we recommend cloning it into a new environment with the appropriate changes.
04-19-2017
12:47 PM
Question: Is logging enabled by default for jobs run in Cloudera Altus Data Engineering?
Answer: While we recommend enabling logging for troubleshooting Data Engineering jobs, it is not enabled by default. Details for enabling logging are available in the online doc Cloudera Altus Environment Setup: https://console.altus.cloudera.com/support/documentation.html?type=topics&page=ag_dataengr_get_start_env_setup NOTE: If logging is not enabled, cluster bundles that are automatically generated during shutdown of a Cloudera Altus-managed cluster will be lost.
02-17-2017
06:45 AM
Thanks for the update MSharma, glad to hear that you were able to resolve the issue!
02-07-2017
11:15 AM
Hi MSharma, Would you be able to provide additional context regarding the failure / permission issue you're experiencing? If there's a specific error message or symptom, could you provide more details as to what is happening?
12-01-2016
01:50 PM
Hi Nickk, If you are looking for the features available for YARN resource accounting, we have two metrics available within the YARN API, as well as a more robust reporting capability within Cloudera Manager 5.7 onward. The following are the definitions of memorySeconds and vcoreSeconds, which provide a very basic measurement of utilization in YARN[1]:
memorySeconds = The aggregated amount of memory (in megabytes) the application has allocated, times the number of seconds the application has been running.
vcoreSeconds = The aggregated number of vcores the application has allocated, times the number of seconds the application has been running.
The memorySeconds value can be used loosely for generically measuring the amount of resource a job consumed; for example, job 1 used X memorySeconds as compared to job 2, which used Y memorySeconds. Any further calculations attempting to extrapolate deeper insight from this measure aren't recommended.
There are additional reporting efforts being worked on, and one is now available in CM: starting with CM 5.7, CM offers cluster utilization reporting, which can help provide per-tenant/user cluster usage reporting. Further details regarding cluster utilization reporting in CM are available here[2].
References:
[1] ApplicationResourceUsageReport.java (part of the YARN API) in the Apache Hadoop source code: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationResourceUsageReport.java
[2] Cloudera documentation on CM's cluster utilization reporting: http://www.cloudera.com/documentation/enterprise/5-7-x/topics/admin_cluster_util_report.html
Hope this helps!
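If it helps, both counters can also be read from the ResourceManager REST API; a minimal sketch (the hostname below is a placeholder):
# Query per-application usage from the RM REST API; each app entry
# includes the memorySeconds and vcoreSeconds fields defined above:
$ curl "http://<rm-host>:8088/ws/v1/cluster/apps?states=FINISHED" | python -m json.tool | grep -E '"(memorySeconds|vcoreSeconds)"'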
10-14-2016
10:25 AM
Regarding the questions asked:
> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the Avro name rules[1], essentially:
1) Must start with [A-Za-z_]
2) Subsequently contain only [A-Za-z0-9_]
> Is there anyway to get the alias information once it's loaded to Dataframe? There's a "printSchema"
> API that lets you print the schema names, but there's not a counterpart for printing the aliases. Is
> it possible to get the mapping from name to aliases from DF?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when they are loaded into Spark.
To work around this, we recommend using the original name of the field in the table rather than the alias, since the Avro aliases are stripped during loading into Spark.
References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion:
The spark-avro library supports conversion for all Avro data types:
boolean -> BooleanType
int -> IntegerType
long -> LongType
float -> FloatType
double -> DoubleType
bytes -> BinaryType
string -> StringType
record -> StructType
enum -> StringType
array -> ArrayType
map -> MapType
fixed -> BinaryType
The spark-avro library supports the following union types:
union(int, long) -> LongType
union(float, double) -> DoubleType
union(any, null) -> any
The spark-avro library does not support complex union types.
All doc, aliases, and other fields are stripped when they are loaded into Spark.
10-11-2016
02:00 PM
Hi Msun, Regarding your questions:
> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the Avro name rules[1], essentially:
1) Must start with [A-Za-z_]
2) Subsequently contain only [A-Za-z0-9_]
> Is there anyway to get the alias information once it's loaded to Dataframe? There's a "printSchema"
> API that lets you print the schema names, but there's not a counterpart for printing the aliases. Is
> it possible to get the mapping from name to aliases from DF?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when they are loaded into Spark. To work around this, we recommend using the original name of the field in the table rather than the alias, since the Avro aliases are stripped during loading into Spark.
References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion: The spark-avro library supports conversion for all Avro data types: boolean -> BooleanType, int -> IntegerType, long -> LongType, float -> FloatType, double -> DoubleType, bytes -> BinaryType, string -> StringType, record -> StructType, enum -> StringType, array -> ArrayType, map -> MapType, fixed -> BinaryType. The spark-avro library supports the following union types: union(int, long) -> LongType, union(float, double) -> DoubleType, union(any, null) -> any. The spark-avro library does not support complex union types. All doc, aliases, and other fields are stripped when they are loaded into Spark.
08-02-2016
08:37 AM
Hi @acer, It appears that you're running an Ubuntu / Debian OS; have you had a chance to run the CM upgrade wizard, which upgrades the CM Agents? You can access the upgrade wizard by navigating to the following URL (please modify the hostname to suit your CM installation): http://cloudera-manager-node.yourcompany.com:7180/cmf/upgrade-wizard/welcome
Please try going through the CM wizard to upgrade the CM Agents. If this doesn't resolve the issue, you can also manually upgrade the CM Agents on each node. On each node containing an outdated CM Agent, perform the following steps:
1) Verify the current version of the CM Agent: $ dpkg -s cloudera-manager-agent | grep Version
2) Update the apt-get cache: $ sudo apt-get update
3) Upgrade the CM Agent package: $ sudo apt-get install --only-upgrade cloudera-manager-agent
4) Verify that the CM Agent was upgraded by confirming the version of the installed package: $ dpkg -s cloudera-manager-agent | grep Version
Kind Regards, Anthony
06-10-2016
08:58 AM
Hi Eyal, Regarding your questions:
> But how does those resources split?
Assuming there are no granular limitations being hit (i.e. max # of jobs per pool/queue/user), the resources are calculated by summing the weights of the queues/pools with running jobs and dividing 100% by that sum. In your example, two queues are running, development (weight 1) and production (weight 5), which sums to 6; 100% divided by 6 allocates 16.6667% per unit of weight. So the development pool would get 16.6667% of the cluster resources, and the production pool would get 83.3333%. If the default pool then receives a job submission, the weighting is recalculated over 7 units instead of 6; depending on how the various other Fair Scheduler properties are set up, various fairness actions would then be taken (i.e. preemption, preemption timeouts, etc.).
> Is it going to any user based on the fair scheduler?
No; as mentioned above, if no jobs are running in the default pool, it is not counted in the resource calculation. Only when there's a job submitted/running in the default pool does it get counted.
> Or is it being split 5:1 (production:development, respectively) between the active jobs in the other resource pools?
Correct 🙂 Hope this helps! -- Anthony
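For illustration, a minimal fair-scheduler.xml sketch consistent with the 5:1 weighting described above (queue names and weights come from the example; all other settings are left at defaults):
<!-- fair-scheduler.xml: production carries 5x the weight of development -->
<allocations>
  <queue name="production">
    <weight>5.0</weight>
  </queue>
  <queue name="development">
    <weight>1.0</weight>
  </queue>
</allocations>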
06-10-2016
08:45 AM
Hi Doni, That's a great question! There currently isn't a straightforward way to verify the actual runtime value of the Capacity Scheduler's specific properties via the Web UI; however, it is possible to confirm the value each specific process was given at the time of YARN service/role startup. Since the Capacity Scheduler resource calculator runs on the Resource Manager and all the Node Managers, you can spot-check the property given to these roles by doing the following:
For CM-managed clusters:
1) Navigate to CM -> Clusters -> YARN -> Configuration -> Search for yarn.resourcemanager.scheduler.class
2) Confirm that yarn.resourcemanager.scheduler.class has been set to the Capacity Scheduler using the value org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler. Save changes if needed.
3) Navigate to Instances -> (click on Resource Manager or Node Manager) -> Processes -> click on capacity-scheduler.xml under Configuration Files. NOTE: This property needs to be set on all Resource Manager(s) and all Node Managers to be effective.
4) Search for the property yarn.scheduler.capacity.resource-calculator. If the property is not present here, the default value of org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator is being used.
5) Confirm that the intended value is set, i.e. org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
6) If any changes were made, save them, restart the YARN services accordingly, and reconfirm the above starting from Step 1.
For CDH-only (no CM) clusters:
1) Confirm that /etc/hadoop/conf/yarn-site.xml for all Resource Manager(s) and Node Managers contains the entry: <property> <name>yarn.resourcemanager.scheduler.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> </property>
2) Verify that the yarn.scheduler.capacity.resource-calculator property is set to the intended value (for example: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator) in /etc/hadoop/conf/capacity-scheduler.xml on each of the Resource Manager(s) and all Node Managers, as sketched below.
Hope this helps! -- Anthony
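For reference, the capacity-scheduler.xml entry being verified in Step 2 would look like this (using the DominantResourceCalculator value from the example above):
<!-- capacity-scheduler.xml: the resource calculator entry to verify/set -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>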
05-26-2016
10:22 AM
Hi JimAcxiom, Some interesting details about the counters provided:
- The reduce shuffle bytes difference is roughly 102 GB between the two clusters, which yielded an 885-minute difference as well (the newer cluster taking much longer)
- The number of spilled records was much higher on the old cluster: around 243 million more records were spilled on the old cluster
- There's a difference of about 50 GB for physical memory snapshot bytes, consumed more by the old (prod) cluster (which includes the spills)
- The old cluster used 36 GB more virtual memory

Description | Diff | Old (prod) cluster | New cluster |
---|---|---|---|
Launched Map Tasks | | 171 | 171 |
Launched Red Tasks | | 37 | 37 |
Data local map tasks | 10 | 150 | 160 |
Rack local map tasks | 10 | 20 | 10 |
Time spent all maps in occupied slots (ms) | 39 mins | 62483634 | 60096803 |
Time spent all reds in occupied slots (ms) | 767 mins | 55419788 | 101496628 |
MR Framework CPU time spent (ms) | 885 mins | 147935890 | 201093460 |
MR Framework Input Split bytes | 2 KB | 104451 | 102405 |
MR Framework Phy memory snapshot bytes | 49 GB | 563,320,909,824 | 513,454,596,096 |
MR Framework Red Shuffle Bytes | 102 GB | 50,991,891,912 | 153,054,745,975 |
MR Framework Spilled Records | 243392028 | 731615420 | 488223392 |
MR Framework Total committed heap usage | 69 GB | 557672038400 | 626664927232 |
MR Framework Virt Mem bytes snapshot | 36 GB | 1070805831680 | 1034891923456 |

Given that the job run on the prod cluster consumed more physical memory and spilled many more records, it may be worth ruling out hardware and existing settings as part of the culprit. Have you had a chance to review (or would you be able to provide some additional details on) both of the clusters, in particular:
- The hardware differences (number of HDD spindles, RAM, CPU type and number of cores) between both clusters' worker nodes
- The various compression properties used in both clusters
- MR-specific properties, such as: mapred.reduce.slowstart.completed.maps, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.map.child.ulimit
- The job.xmls of both jobs for comparison
Kind Regards, Anthony
04-07-2016
12:33 PM
In CM 5.0, the YARN client configuration had a bug that did not propagate the topology.py file within the client configuration deployment or download; this was fixed in CM 5.1 onward. For those still running CM 5.0 who are hitting this issue, we recommend copying the topology.* files from a known good location (a DataNode, for example) to the affected node. In general, for any nodes that will be submitting Spark-on-YARN jobs, we recommend that these nodes contain Gateway roles for both Spark and YARN, and that the client configuration has been deployed (either by Deploy Client Configuration from the Cluster Actions drop-down, or by each individual component, ensuring that all components have their client configurations deployed).
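A hedged sketch of that copy step (the hostname and the /etc/hadoop/conf path are assumptions; confirm the actual client configuration directory on your nodes first):
# Copy the topology files from a known-good node to the affected node:
$ scp root@good-datanode.example.com:/etc/hadoop/conf/topology.* /etc/hadoop/conf/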
02-25-2016
08:35 AM
Hi enelso and hkumar449, That stacktrace is typically indicative of a version mismatch between the CM Agents and the CM Server, or of an unsupported JDK being used. Can you confirm that the CM Agent versions match the CM Server version, and the JDK type and version used on the nodes?
02-19-2016
09:12 AM
Hi Grenoble, Thanks for your reply! It looks like even the simple Pi job is stalling, so let's take Pig out of the equation and focus on the YARN configuration settings to get at least a successful Pi job running first. Judging by the information provided so far, it seems that no MR jobs are able to run properly with the current settings, which will need to be reviewed to establish a basic starting point. There are a few pieces still missing; can you kindly provide the following info?
- How many NodeManagers are in the cluster?
- How much physical memory is available on each NodeManager?
- How many CPU cores are available on each NodeManager?
- What are all the services configured on each NodeManager (i.e. HBase RegionServer, DataNode, etc.)?
02-12-2016
10:04 AM
Thanks for the additional feedback Fewcents! To clarify, the steps provided (CM -> YARN -> Instances -> Add Role Instances -> Gateway) added Node 1 as a YARN Gateway, which effectively pushes the YARN client configurations to Node 1 and allows it to properly submit YARN jobs to the cluster. Adding Node 1 as a NodeManager isn't required for submitting jobs from it; it only needs to be designated as a YARN Gateway (unless you also want Node 1 to run containers as part of the cluster).
02-12-2016
09:53 AM
Hi Grenoble, Can you provide some additional details regarding the behavior on your cluster, in particular:
- How are all the services distributed on your cluster, and how much memory is allocated for each?
- What are the values set for the following properties in YARN?
ApplicationMaster Java Maximum Heap Size
mapreduce.map.java.opts.max.heap
mapreduce.map.memory.mb
mapreduce.reduce.java.opts.max.heap
mapreduce.reduce.memory.mb
yarn.app.mapreduce.am.resource.mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.increment-allocation-mb
yarn.scheduler.maximum-allocation-mb
- Are you able to run a simple Pi job successfully?
For parcel installs: $ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
For package-based installs: $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
- Do all Pig jobs fail?
- Can you provide the Resource Manager service log snippet along with the Application Master log containing the Pig job that is reportedly hanging?
02-12-2016
09:37 AM
Hi Grenoble, Alxrud, Thanks for bringing this to our attention. Given the details of the reported problem, I attempted to reproduce the issue by spinning up a cluster, and was able to successfully run the Pig script in Hue (as shown below, including cluster configuration steps):
Repro details:
1) Set up a test cluster with Cloudera Express CM 5.4.1, CDH 5.3.0 (Parcels)
2) Configured the Core Hadoop services in the following fashion (for testing):
Master (16GB RAM): CM, NN, SNN, Hue, Sqoop, RM, JHS, Hive Gateway
Worker 1 (8GB RAM): DN, NM, ZK, Hive Gateway, HMS
Worker 2 (8GB RAM): DN, NM, ZK, Hive Gateway, Oozie
Worker 3 (8GB RAM): DN, NM, ZK, Hive Gateway, HS2
3) Installed all Hue application examples as the Hue admin user
4) Created a regular user account in Hue
5) Logged in as the regular (non-admin) user in Hue
6) Ran the test query via Hue -> Query Editors -> Pig, pasting the following:
data = LOAD '/user/hue/pig/examples/data/midsummer.txt' as (text:CHARARRAY);
upper_case = FOREACH data GENERATE UPPER(text);
STORE upper_case INTO '$output' ;
7) Clicked on Submit
8) Output filename = test3
9) Confirmed the workflow output was successful:
2016-02-12 09:02:21,226 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-02-12 09:02:43,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
Heart beat
2016-02-12 09:02:46,386 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-02-12 09:02:46,435 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
More Information Needed: To better understand the cause of the reported behavior, kindly provide responses to the following:
1) How are all the services distributed on your cluster, and how much memory is allocated for each?
2) Are you able to run a simple Pi job successfully?
For parcel installs: $ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
For package-based installs: $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5 5
3) Can you provide the Resource Manager service log along with the Application Master log of the Pig job that is reportedly hanging?
Gathering the above information will help narrow down where the culprit resides. If a simple Pi job does not work, then further attention is needed on the YARN configuration, ensuring that the AM, map (and reduce, if applicable) containers are properly launched.