Member since: 01-24-2016
Posts: 47
Kudos Received: 11
Solutions: 2

My Accepted Solutions
Title | Views | Posted
--- | --- | ---
  | 1546 | 07-02-2018 01:22 PM
  | 7608 | 07-22-2016 06:04 PM
07-22-2020
01:36 AM
Considering the amount of effort required to get Cloudera to work, would it not just be easier to install and configure Hadoop on your own?
02-14-2020
07:45 AM
Hi, so I ran into the same exception on an HDP 3.1.0 cluster for my own external tables, as well as for the default and sys tables of Hive. My value for fs.defaultFS was hdfs://my_host.com:8020.
Using ALTER TABLE ... SET LOCATION did not work, since it needs the original location of the table in order to change it, which of course isn't possible here. So I went into the Hive Metastore and changed the locations of the tables manually. Two Metastore tables are relevant:
- hive.DBS stores the location of databases in column DB_LOCATION_URI
- hive.SDS stores the location of tables in column LOCATION
I then updated those values so each URI contains the port only once, using e.g. "UPDATE SDS SET LOCATION = 'example' WHERE SD_ID = 1". After that I changed fs.defaultFS to hdfs://my_host.com (without the port number), and everything worked. fs.defaultFS not containing a port number didn't seem to break anything, but be aware.
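A minimal sketch of that metastore edit, assuming a MySQL-backed metastore database named hive and that the broken URIs repeat the port as hdfs://my_host.com:8020:8020 (the exact duplicated form is an assumption; the table and column names are the ones from the post). Back up the metastore database first.

  # Inspect the current locations before touching anything
  mysql -u hive -p hive -e "SELECT DB_ID, DB_LOCATION_URI FROM DBS;"
  mysql -u hive -p hive -e "SELECT SD_ID, LOCATION FROM SDS;"
  # Collapse the duplicated port so each URI contains it only once
  # (the ':8020:8020' pattern is an assumption - verify it against the SELECTs above)
  mysql -u hive -p hive -e "UPDATE DBS SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, ':8020:8020', ':8020');"
  mysql -u hive -p hive -e "UPDATE SDS SET LOCATION = REPLACE(LOCATION, ':8020:8020', ':8020');"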
11-04-2018
07:21 PM
1 Kudo
Please do the steps below:

  alternatives --list | awk '{print $1}' | xargs -n1 alternatives --display | grep "^/" > all_alternatives.txt
  awk -F ' ' -v NCOLS=4 'NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}' all_alternatives.txt
  cd /var/lib/alternatives/
  rm -Rf libjavaplugin.so.x86_64 jre_1.8.0 jre_1.8.0_openjdk jre_openjdk jre_1.7.0 jre_1.7.0_openjdk
  alternatives --install /usr/bin/java java /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre/bin/java 1800191
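If it helps, a quick way to verify the result afterwards (standard commands, not from the original post):

  alternatives --display java
  java -version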
07-02-2018
01:22 PM
I resolved this by installing an earlier version of Spark2:
1. Deleted the Spark2 service
2. Deactivated and removed the distributed parcel 2.3.0.cloudera2-1.cdh5.13.3.p0.316101
3. Got the earlier parcel 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
4. Downloaded, distributed, and activated this parcel
5. Got the CSD for it and put it into /opt/cloudera/csd
6. Installed Spark2 from CM
Spark2 is up and running on my cluster!
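For anyone repeating step 5, a hedged sketch of placing the CSD and restarting Cloudera Manager; the jar name below is illustrative (not from the original post), so use the CSD that matches the parcel you activated:

  # jar name is illustrative - pick the CSD matching your Spark2 parcel version
  sudo cp SPARK2_ON_YARN-2.2.0.cloudera2.jar /opt/cloudera/csd/
  sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.2.0.cloudera2.jar
  sudo systemctl restart cloudera-scm-server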
05-23-2018
01:09 AM
Great to hear that our platform helps with your research, Sanjay. Keep up your great work on your app and book!
01-24-2018
05:53 AM
Could you please help me with the steps? If you have any documentation, please let me know. Thanks,
10-24-2017
10:02 AM
@jjjjjjhao, the error snippets provided don't tell enough of the story to indicate what may be wrong. I would run:

  service cloudera-scm-agent restart

and then see what happens in the agent log. Also, what is the actual problem? What is wrong in Cloudera Manager, etc.? It is unclear what you are trying to do or see and what actually happens. Once that is clarified, the community can help.
Ben
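For example (the log path below is the default agent log location; adjust if yours differs):

  sudo service cloudera-scm-agent restart
  sudo tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log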
09-01-2017
09:38 AM
Migrating from Hive on MR to Hive on Spark: I'm wondering why a Hive query run through an Oozie action [oozie:hive2-action:0.1] on Spark [set hive.execution.engine=spark] is so much slower than Hive on MapReduce. Note: I included set hive.execution.engine=spark; in my queries, and in Oozie I included hive2-action:0.1 in the [xmlns] and provided the JDBC [url]. The code runs successfully and I saw the logs, but it takes much more wall-clock time than the usual MR run.
08-08-2017
07:45 AM
Check if your NTP is synchronised.
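A couple of quick checks (a sketch; ntpq applies to hosts running ntpd, timedatectl to systemd hosts):

  ntpq -p
  timedatectl status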
05-25-2017
02:20 PM
1 Kudo
That query probably has multiple big joins and aggregations and needs more memory to complete. A very rough rule of thumb for minimum memory in releases CDH 5.9-CDH 5.12 is the following:
- For each hash join, the minimum of 150MB or the amount of data on the right side of the node (e.g. if you have a few thousand rows on the right side, maybe a MB or two).
- For each merge aggregation, the minimum of 300MB or the size of the grouped data in memory (e.g. if you only have a few thousand groups, maybe a MB or two).
- For each sort, about 50-60MB.
- For each analytic, about 20MB.
If you add all those up and add another 25%, you'll get a ballpark number for how much memory the query will require to execute. I'm working on reducing those numbers and making the system give a clearer yes/no answer on whether it can run the query before it starts executing.
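As a hypothetical worked example (the query shape is invented for illustration, not taken from this thread): a query with 2 large hash joins, 1 large merge aggregation and 1 sort would need roughly (2 x 150MB) + 300MB + 60MB = 660MB, plus 25%, so about 825MB to execute.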
04-15-2017
05:13 AM
>> SSLError: unknown protocol
>> [15/Apr/2017 07:59:58 +0000] 13770 MainThread agent ERROR Heartbeating to 10.32.38.218:7182 failed.
I am getting the same error during the initial installation progress.
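One thing worth checking when heartbeating fails with "SSLError: unknown protocol" (a sketch; the config path and use_tls flag are the ones referenced elsewhere in these posts):

  grep use_tls /etc/cloudera-scm-agent/config.ini
  # the agent's use_tls setting should match whether TLS is enabled for agents on the Cloudera Manager server (port 7182)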
02-19-2017
10:26 PM
I would still love to know the solution. I decided to go for the nuclear option 😞 Walked on fire for 8 hours today, but I am back again!
1. Stopped all cron jobs
2. Backed up the 1TB of HDFS data
3. Backed up the Hive metastore MySQL database
4. Uninstalled CDH and CM
5. Upgraded all my nodes to 14.04 LTS
6. Fresh-installed CM and CDH 5.10.1
warmly
sanjay
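For steps 2 and 3, a hedged sketch of the backups (the database name, credentials, and paths are all assumptions, not from the original post):

  # Hive metastore (MySQL) dump - 'metastore' is an assumed database name
  mysqldump -u root -p metastore > /backup/hive_metastore_$(date +%F).sql
  # Copy HDFS data to another cluster or to S3 before the reinstall - paths are illustrative
  hadoop distcp hdfs://namenode:8020/user s3a://my-backup-bucket/hdfs-backup/user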
11-21-2016
02:02 PM
1 Kudo
Make sure you can connect to localhost on port 19001 (the supervisord listening port) and that the supervisor did start up and is listening on port 19001. If you did an OS upgrade, perhaps things like "iptables" and "selinux" are interfering here. Determining whether the supervisor is not starting, or whether it starts but the agent cannot find a route to communicate with it, is an important troubleshooting step.
Ben
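A quick way to test both halves of that (a sketch; 19001 is the supervisord port named above):

  sudo ss -tlnp | grep 19001          # is supervisord actually listening?
  curl -sS http://localhost:19001/    # can this host reach it over localhost at all?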
10-11-2016
07:31 PM
Hi guys, I have 4 datanodes in my cluster that have 6 x 1TB each. I wanted to downsize each datanode to 3 x 1TB, so essentially remove 3 x 1TB per datanode. This is the process I followed. Please tell me if it's correct or not.
1. On a running cluster, go to DN1
2. Edit /etc/fstab. Remove the disk6 mountpoint and save.
3. Reboot DN1
4. Log back in to DN1 and run "hdfs fsck /"
5. Make sure of the following:
   Over-replicated blocks: 0 (0.0 %)
   Under-replicated blocks: 0 (0.0 %)
   Mis-replicated blocks: 0 (0.0 %)
   Corrupt blocks: 0
6. Repeat steps 1-5 on DN2
7. Then remove disk5 from DN1, DN2, DN3, DN4 by following steps 1-5 for each datanode
8. Then remove disk4 from DN1, DN2, DN3, DN4 by following steps 1-5 for each datanode
9. All good till now
10. Go to Cloudera Manager and change dfs.datanode.failed.volumes.tolerated = 1 (from 3)
11. Modify dfs.data.dir, dfs.datanode.data.dir (remove the three disks you removed)
12. Restart the Hadoop cluster
13. This is where I observed 24 blocks corrupt or missing
Why is this happening? Please advise a better process that will result in 0 corrupt/missing blocks.
warmly
sanjay
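For reference, two standard HDFS commands (not from the original post) that give a clearer picture of block health before and after each disk removal than the fsck summary alone:

  hdfs fsck / -list-corruptfileblocks
  hdfs dfsadmin -report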
08-02-2016
10:33 AM
That's awesome, thanks a ton Mike. Early this morning, before your mails came in, I grew impatient 🙂 as is my nature, and gave the Cloudera Director scripts a shot as-is-where-is.
1. Used the CloudFormation template here: https://s3.amazonaws.com/quickstart-reference/cloudera/latest/templates/Template2-Cloudera-AWS-ExistingVPC.template
2. Created a "ClusterLauncher Instance" on AWS
3. SSH'd to the "ClusterLauncher Instance" and ran the bootstrap:

cloudera-director bootstrap cloudera/setup-default/aws.reference.conf
Process logs can be found at /home/ec2-user/.cloudera-director/logs/application.log
Plugins will be loaded from /var/lib/cloudera-director-plugins
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=256M; support was removed in 8.0
Cloudera Director 2.1.0 initializing ...
The configuration file aws.reference.conf is not present or cannot be read.

[ec2-user@ip-10-219-178-74 ~]$ cloudera-director bootstrap cloudera/setup-default/aws.reference.conf
Process logs can be found at /home/ec2-user/.cloudera-director/logs/application.log
Plugins will be loaded from /var/lib/cloudera-director-plugins
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=256M; support was removed in 8.0
Cloudera Director 2.1.0 initializing ...
Installing Cloudera Manager ...
* Starting ..... done
* Requesting an instance for Cloudera Manager .......................... done
* Installing screen package (1/1) ....... done
* Running custom bootstrap script on [10.219.177.189, ip-10-219-177-189.us-west-2.compute.internal, 52.43.22.181, ec2-52-43-22-181.us-west-2.compute.amazonaws.com] .......... done
* Waiting for SSH access to [10.219.177.189, ip-10-219-177-189.us-west-2.compute.internal, 52.43.22.181, ec2-52-43-22-181.us-west-2.compute.amazonaws.com], default port 22 ..... done
* Inspecting capabilities of 10.219.177.189 .......... done
* Normalizing a3508870-3de7-4dc0-84a5-f69c77610c89 ..... done
* Installing ntp package (1/4) ..... done
* Installing curl package (2/4) ..... done
* Installing nscd package (3/4) ..... done
* Installing gdisk package (4/4) ..................... done
* Resizing instance root partition ......... done
* Mounting all instance disk drives ............ done
* Waiting for new external database servers to start running ........ done
* Installing repositories for Cloudera Manager ....... done
* Installing oracle-j2sdk1.7 package (1/3) ..... done
* Installing cloudera-manager-daemons package (2/3) ..... done
* Installing cloudera-manager-server package (3/3) ...... done
* Setting up embedded PostgreSQL database for Cloudera Manager ...... done
* Installing cloudera-manager-server-db-2 package (1/1) ..... done
* Starting embedded PostgreSQL database ...... done
* Starting Cloudera Manager server ... done
* Waiting for Cloudera Manager server to start ..... done
* Setting Cloudera Manager License ... done
* Enabling Enterprise Trial ... done
* Configuring Cloudera Manager ..... done
* Deploying Cloudera Manager agent ...... done
* Waiting for Cloudera Manager to deploy agent on 10.219.177.189 ... done
* Setting up Cloudera Management Services ............ done
* Backing up Cloudera Manager Server configuration ...... done
* Inspecting capabilities of 10.219.177.189 ...... done
* Done ...
Cloudera Manager ready.
Creating cluster C5-Reference-AWS ...
* Starting ..... done
* Requesting 11 instance(s) in 3 group(s) ....................................... done
* Preparing instances in parallel (20 at a time) .............................................................. done
* Waiting for Cloudera Manager installation to complete ... done
* Installing Cloudera Manager agents on all instances in parallel (20 at a time) ........ done
* Waiting for new external database servers to start running ... done
* Creating CDH5 cluster using the new instances ... done
* Creating cluster: C5-Reference-AWS .... done
* Downloading parcels: CDH-5.7.2-1.cdh5.7.2.p0.18,KAFKA-2.0.2-1.2.0.2.p0.5 ... done
* Distributing parcels: KAFKA-2.0.2-1.2.0.2.p0.5,CDH-5.7.2-1.cdh5.7.2.p0.18 ... done
* Activating parcels: KAFKA-2.0.2-1.2.0.2.p0.5,CDH-5.7.2-1.cdh5.7.2.p0.18 ...... done
* Configuring Hive to use Sentry ... done
* Creating Sentry Database ... done
* Calling firstRun on cluster C5-Reference-AWS ... done
* Waiting for firstRun on cluster C5-Reference-AWS .... done
* Running cluster post creation scripts ...... done
* Adjusting health thresholds to take into account optional instances. ... done
* Done ...
07-25-2016
10:25 PM
2 Kudos
Greetings my beloved friends at Cloudera and all CDH users, I am so excited to announce that I just finished upgrading our 9-node non-prod cluster (on AWS) to CM 5.8.1 and CDH parcels 5.8.0. It went off without a single issue. As I have always proudly announced, I have been a starving-developer-version user of CDH for the past 5 years! We are heavy users of Hive on Spark, HBase and Impala for creating curated datasets for our machine learning models, and we can truly say we do not know where we would be without Cloudera! So a big thank you to the whole Cloudera team, with our hands folded and heads bent in respect and gratitude. I also want to say a big thank you to the ever-responsive Cloudera community for answering my questions and clarifying problems. We would not have come this far without your die-hard positive attitude and hard-core community support.
Warmly and appreciatively,
sanjay
07-19-2016
10:10 AM
Hi guys, I have a 3 node cluster with static IPs: h1, h2, h3 (all Ubuntu 12.04 LTS Trusty).

FIRST TRY OF INSTALL
====================
- On machine h1, I did a fresh install of Cloudera Manager 5.7 using cloudera-manager-installer.bin
- Through h1:7180 I tried installing parcels on h1, h2, h3
- h1 and h2 succeeded.
- h3 reports the following error:
Installation failed. Failed to receive heartbeat from agent.
Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager Server (check firewall rules).
Ensure that ports 9000 and 9001 are not in use on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added. (Some of the logs can be found in the installation details.)
If Use TLS Encryption for Agents is enabled in Cloudera Manager (Administration -> Settings -> Security), ensure that /etc/cloudera-scm-agent/config.ini has use_tls=1 on the host being added. Restart the corresponding agent and click the Retry link here.

SECOND TRY OF INSTALL
=====================
- I uninstalled all CDH components on h1, h2, h3
- Now on machine h3, I did a fresh install of Cloudera Manager 5.7 using cloudera-manager-installer.bin
- Through h3:7180 I tried installing parcels on h1, h2, h3
- Now h3 and h2 succeed and h1 errors out with the same error as above

Any help or suggestions? PING and SSH work between the machines. The user on all three machines has passwordless sudo..... I have installed clusters for 4 years now 🙂 but this one is beating me! Please put forth your ideas and recommendations... I would be super grateful.
warmly
sanjay
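A few checks that cover the items in that error message, run on the failing host (a sketch; cm_host is a placeholder for whichever node runs Cloudera Manager):

  hostname -f                                  # does the host report the FQDN that CM expects?
  nc -vz cm_host 7182                          # can the agent reach the CM server port? ('cm_host' is a placeholder)
  sudo ss -tlnp | grep -E ':9000|:9001'        # are ports 9000/9001 already taken?
  sudo tail -n 50 /var/log/cloudera-scm-agent/cloudera-scm-agent.log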
07-19-2016
12:56 AM
@Deepesh The following query returns values in both Tez and MR mode:

  select split(ln,',')[0] as cid, split(ln,',')[1] as zip from utils.file1 where fn='foo1.csv' limit 1000;
05-11-2016
08:33 AM
As a comparison, I wrote Java MR code that does exactly what the query does, and it ran in 1m 30s! So something does not seem right with Hive.
04-07-2016
09:38 AM
Hey guys, on CDH 5.6.0 (I have been happily and successfully using the CDH starving-developers version since 2012), this is how you can create and use tables with the data location pointing to S3 on AWS. I am sure there are possibly better and more elegant ways to do this (and guys, please educate me if so), but this is one way that works successfully... so here goes.

[1] In the HDFS configuration in Cloudera Manager
=================================================
SECTION = "HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml"
Add the following:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>

[2] Create the table in Hive
============================
set hive.execution.engine=mr;
use openpv;
CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs(
  zipcode STRING,
  state STRING,
  sizekw DOUBLE,
  cost DOUBLE,
  date_ STRING,
  lat DOUBLE,
  lon DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

[3] Set the data location (note that I am pointing to the AWS S3 bucket and subfolder, not to a specific file)
==============================================================================================================
NOTE: If you want the data, you can go to https://openpv.nrel.gov/search, hit search with no criteria defined, and then download the CSV.
set hive.execution.engine=mr;
use openpv;
ALTER TABLE solar_installs SET LOCATION 's3a://some-aws-bucket-name/openpv';

[4] Run a query in Hive using MR as the execution engine
========================================================
set hive.execution.engine=mr;
use openpv;
select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;

[5] Run a query in Hive using Spark as the execution engine
===========================================================
use openpv;
select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;

[6] Run a query in Impala
=========================
impala-shell -q "invalidate metadata"
impala-shell -q "use openpv; select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc"

[7] Results comparison
======================
These were run on a 3 node cluster running under my cube: 1 NN + 1 DN on one node, DN2 on node 2, and DN3 on node 3. Each node is an 8-core HP 8300 with 32GB RAM.
Impala = 3.81s
Hive-on-Spark = 27.582 seconds
Hive-on-MR = 46.774 seconds
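One quick sanity check worth adding before step [2], to confirm the cluster can actually reach the S3 location with the credentials from step [1] (a sketch; the bucket name is the placeholder used in the post):

  hadoop fs -ls s3a://some-aws-bucket-name/openpv/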
03-24-2016
11:35 AM
OK guys, something is wrong and maybe there is a clue here.

Query: select count(*) from sansub01.benji_bc_q1
+------------+
| count(*)   |
+------------+
| 2344418707 |
+------------+
Fetched 1 row(s) in 716.47s

The tables I INNER JOINed have 9 million and 7 million rows... so an inner join should return fewer than 9 or 7 million rows!
03-15-2016
11:03 AM
Thanks Michalis for the clarification. Of course 🙂 I have always shouted out loud that I love the Cloudera distribution the best, and the open source community is super awesomely responsive. I have put the community version of CDH successfully into production in two places I have worked 🙂 and I always keep my clusters up to date within weeks of a new version announcement... and I have a 3-node CDH cluster at home under my desk for the data crunching I do for my not-for-profit www.medicalsidefx.org!
warmly
sanjay
03-03-2016
12:14 PM
You are most welcome, Clint. I wish there were some way to migrate all of the Google Groups cloudera/cdh discussions into this community, because that is still my reference 🙂 It is where I go to see, for example, what Romain said about that Hue setting, or what Robert (Kanter) said about a Hive Server question I had posted 🙂 At this point we are heavily relying on Hive (on Spark), Impala and HBase to create multiple sliced-and-diced versions of input data sets, which we feed into our modeling processes. We have REST services based on HBase behind which one can now get input datasets on demand. It's pretty cool. I love it. Machine learning is still 80% data wrangling. Just as Clapton won't sound good with rusted guitar strings or a detuned guitar, machine learning won't work with bad data!
thanks once again as always,
warmly
sanjay
02-01-2016
06:10 PM
YARN needed tuning for Spark to take advantage of it.

YARN MEMORY CALCULATIONS - 3 node cluster
=========================================
Each of my nodes is 32GB RAM, 8 cores, 2 x 2TB 7200RPM disks (souped-up HP 8300 Elite desktops).

# of Containers = minimum of (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
                = MIN(2 * 8, 1.8 * 2, (32 - 8) / 2) = MIN(16, 3.6, 12) = 4
MIN_CONTAINER_SIZE = 2GB
RAM-per-Container = maximum of (MIN_CONTAINER_SIZE, (Total Available RAM) / Containers)
                  = MIN(2, 24/4) = 2

yarn.nodemanager.resource.memory-mb  = Containers * RAM-per-Container = 4 * 2 = 8 GB
yarn.scheduler.minimum-allocation-mb = RAM-per-Container = 2 GB
yarn.scheduler.maximum-allocation-mb = Containers * RAM-per-Container = 8 GB
mapreduce.map.memory.mb              = RAM-per-Container = 2 GB
mapreduce.reduce.memory.mb           = 2 * RAM-per-Container = 2 x 2 = 4 GB
mapreduce.map.java.opts              = 0.8 * RAM-per-Container = 0.8 * 2 GB = 1.6 GB
mapreduce.reduce.java.opts           = 0.8 * 2 * RAM-per-Container = 0.8 * 2 * 2 GB = 3.2 GB
yarn.app.mapreduce.am.resource.mb    = 2 * RAM-per-Container = 2 x 2 = 4 GB
yarn.app.mapreduce.am.command-opts   = 0.8 * 2 * RAM-per-Container = 0.8 x 2 x 2 = 3.2 GB

Hive query = 187.175 seconds
set hive.execution.engine=mr;
SET mapreduce.job.reduces=8;
SET mapreduce.tasktracker.map.tasks.maximum=12;
SET mapreduce.tasktracker.reduce.tasks.maximum=8;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;
select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view

Spark query = 128.089 seconds
set hive.execution.engine=spark;
select count(*) from cdr.cdr_mjp_uni_srch_usr_hist_view
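For anyone who wants to replay the container arithmetic at the top of this post, a tiny awk sketch (inputs are the numbers from the post; the post rounds the 3.6 result up to 4 whole containers):

  awk -v cores=8 -v disks=2 -v avail_gb=24 -v min_gb=2 'BEGIN {
    a = 2 * cores; b = 1.8 * disks; c = avail_gb / min_gb;
    m = a; if (b < m) m = b; if (c < m) m = c;
    printf "min(%d, %.1f, %d) = %.1f containers\n", a, b, c, m
  }'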
01-25-2016
06:34 PM
Based on your feedback, the two issues initially reported in this thread seem to have been resolved. For the additional "... MANAGED and EXTERNAL tables this is an issue", let's move our conversation to the thread you've linked previously.