Member since
09-02-2016
523
Posts
89
Kudos Received
42
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1634 | 08-28-2018 02:00 AM
 | 1247 | 07-31-2018 06:55 AM
 | 3210 | 07-26-2018 03:02 AM
 | 1363 | 07-19-2018 02:30 AM
 | 3750 | 05-21-2018 03:42 AM
06-09-2018
01:46 AM
@AcharkiMed Basically this is a recommendation; whether it is treated as mandatory or optional depends on the environment you are using. Ex: *) For a Prod env - it is effectively mandatory, otherwise you will see a performance difference whenever there is a switch between the active and standby NN *) For a test/POC env - it is optional if you don't have a better choice
05-30-2018
03:40 AM
@gerasimos Option 3: I meant the Linux file system as local storage. Also, just to make sure: is the 10 TB you mentioned before replication or after replication?
05-24-2018
02:25 AM
1 Kudo
@gerasimos There are different approaches. First let us see the available options and their pros & cons, then you can choose (or, if possible, combine) them as needed.
1. Cloudera Manager -> Backup menu option (or) distcp option
Pros: a. Easy to take a backup
Cons: a. It works between two different clusters, so it may not be suitable for your requirement
2. Export/Import option
Step 1: Execute the below command and export the working db.table to an HDFS path, then move it to local as needed. It will export both data & metadata.
> hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"
Step 2: Run the below import command twice. The first import will throw an error because the table doesn't exist and will create it; the second import will import the data too.
> hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';"
Note: You can hard-code the $ variables with the actual path/file/table names.
Pros: a. Export/Import takes care of both data & metadata, so you don't need to handle metadata separately
Cons: a. I've used it long back for non-partitioned tables; I am not sure how well it supports partitioned tables, please double check b. You need to apply export/import for each table (see the small loop sketch after this post)
3. Move HDFS data to local, local to tape, and take the metadata backup separately. Ex: MySQL - many links are available online about how to take a MySQL backup
Pros: a. A metadata backup is possible for the entire db
Cons: a. You may need to move the HDFS data file by file, depending on your local FS capacity
There could be other options too; please update below if you/anyone find something.
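Since option 2 has to be run per table, a minimal wrapper loop is sketched below. It assumes a hypothetical plain-text file tables.txt holding one "db.table" entry per line and an example HDFS export path; adjust both to your environment.
# export_tables.sh - loop the hive export over a list of tables (sketch only)
while read tbl; do
  hive -S -e "export table ${tbl} to '/backup/hive_export/${tbl}';"
done < tables.txt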
05-21-2018
05:08 AM
@kckrishna Please use the below link to get the latest update http://cloudera.github.io/hue/latest/sdk/sdk.html#new-application
05-21-2018
03:42 AM
1 Kudo
@sim6 I hope you have more than 3 data nodes. Generally, two types of "data missing" issues are possible, for many reasons:
a. ReplicaNotFoundException
b. BlockMissingException
If your issue is related to BlockMissingException and you have backup data in your DR environment, then you are good; otherwise it might be a problem. For ReplicaNotFoundException, please make sure all your datanodes are healthy and in the commissioned state. In fact, the namenode is supposed to handle this automatically whenever that data is accessed; if not, an HDFS rebalance (or) a NameNode restart may also fix the issue, but you don't need to try those options unless a user reports an issue on that particular data. In your case no one has reported it yet and you found it yourself, so you can ignore it for now.
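If you want to see exactly which files (if any) are affected before deciding, a quick check from the command line is below; the paths are just examples:
$ hdfs fsck / -list-corruptfileblocks                      # files that currently have missing/corrupt blocks
$ hdfs fsck /path/of/interest -files -blocks -locations    # block-to-datanode mapping for one path
$ hdfs dfsadmin -report                                    # confirm all datanodes are live and in service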
05-16-2018
10:49 AM
@Teradil Before we go into detail, I want to make sure that you have the Apache Sentry service configured in your cluster. If so, go to Hue and open Sentry from the top menu, then
1. add a role with write access to the required db and assign the role to your UID or to a group that you are part of (or)
2. identify who already has write permission, get their group id (the one which has write access to the db), and become part of that group
05-16-2018
07:37 AM
1 Kudo
@Teradil If you can log in with the keytab file then you are good on the Kerberos part. You also mentioned that you can read the table but have an issue with write access, so this should be controlled via Apache Sentry, which enforces precise levels of privileges on data. Either the Sentry admin has to give write access to your user id for that particular DB (or) you have to be part of a group which already has write access to that DB.
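For reference, the Sentry-admin side of that usually looks like the statements below, run from Hive/beeline as an admin; the role, group and database names are placeholders:
create role etl_writers;
grant all on database mydb to role etl_writers;
grant role etl_writers to group etl_team;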
05-14-2018
03:48 AM
@voruganti_vishw Yes, we can get some general ideas from various online links, but Cloudera recommends following the steps mentioned in this link. The link below refers to the CDH 5.6 version; you can pick any suitable version. In fact, I don't see any major difference in the details between versions. The difference between the other online links and the one below is that you need to download the spreadsheet template provided in the link and carefully fill in your current configuration, so that it gives you the recommended YARN configuration output based on your current configuration. If you add extra nodes or decommission any existing nodes, you can recalculate and update your configuration. https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_yarn_tuning.html
05-13-2018
10:14 AM
@Mobula Run the kinit command and give the required password. Also run the klist command and make sure you have a valid ticket. Run those commands before you log in to the hbase shell, then try again.
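A minimal sketch of that sequence (the principal name is a placeholder):
$ kinit myuser@EXAMPLE.COM    # enter the password when prompted
$ klist                       # verify a valid, non-expired ticket is shown
$ hbase shell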
05-09-2018
10:13 AM
@hendry There could be two possibilities for this scenario:
1. Maybe the Hive and Impala tables are referring to two different files. The chances of this are low unless there is a minor mistake in one of the tables (or) some other internal error. You can confirm it by running
> describe formatted db.tablename
from both Hive and Impala, then get the location from each and compare.
2. Your file has duplicate records. I mean some key values are the same but other columns have different values, so the query may return a different value when you filter. So check your data in detail.
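For example, to pull the Location line from both engines and compare (the database/table and the Impala daemon host are placeholders):
$ hive -e "describe formatted mydb.mytable;" | grep -i location
$ impala-shell -i <impalad-host> -q "describe formatted mydb.mytable;" | grep -i location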
05-08-2018
03:48 AM
@hendry Please apply INVALIDATE METADATA and try again: INVALIDATE METADATA [[db_name.]table_name]
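For example, from impala-shell (the host and db/table names are placeholders):
$ impala-shell -i <impalad-host> -q "INVALIDATE METADATA mydb.mytable;"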
05-08-2018
03:45 AM
1 Kudo
@krishnap It is based on the below configurations:
1. CM -> HDFS -> Configuration -> DataNode Block Count Thresholds -> the default value is 500,000
2. CM -> HDFS -> Configuration -> HDFS Block Size -> it may be 64 MB, 128 MB or 256 MB
As a solution, you can try the below:
1. CM -> HDFS -> Action -> Rebalance
2. Increase the "DataNode Block Count Thresholds" value based on your capacity. NOTE: it requires a service restart (or)
3. In some environments you can also ignore this warning unless you are really running out of space
The above block count threshold applies to datanodes; the file descriptor threshold is for other daemons like the namenode, secondary namenode, journal node, etc. I have never explored the file descriptor setting before, so I don't want to comment on it.
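To see where the count currently stands before touching the threshold, a quick example check is below; the per-DataNode figure is also visible on the NameNode web UI's Datanodes tab.
$ hdfs fsck / | grep -i 'Total blocks'    # cluster-wide validated block count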
05-02-2018
04:44 AM
@balajivsn You can stop the cluster using CM -> top-left cluster menu -> Stop. Yes, it will stop all the available services, e.g. Hue, Hive, Spark, Flume, YARN, HDFS, ZooKeeper, etc., and it won't disturb your hosts or the Cloudera Management Service. Note: you don't need to handle daemons like the namenode separately for this.
04-30-2018
12:31 PM
@balajivsn The link you are referring to belongs to 5.4.x; please refer to the below links (5.14.x) for a little more detail. There are two types of backup:
1. HDFS metadata backup https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cm_mc_hdfs_metadata_backup.html - you need to follow all the steps, including "Stop the cluster. It is particularly important that the NameNode role process is not running so that you can make a consistent backup"
2. NameNode metadata backup https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cm_mc_nn_metadata_backup.html - can be done using $ hdfs dfsadmin -fetchImage backup_dir
Now to answer your question: the first link says "Cloudera recommends backing up HDFS metadata before a major upgrade". So in a real production cluster we perform the HDFS metadata backup and the major upgrade during the same downtime, and the given steps are the recommended way to get a consistent backup. But if your situation is just a matter of backing up the namenode at a regular interval, then I believe you are correct: you can switch on safe mode, take a backup, and leave safe mode (or) you can try the option from the 2nd link. Note: please make sure to test it in lower environments before applying it in prod.
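A minimal sketch of that safe-mode variant, run as the HDFS superuser (the backup directory is a placeholder):
$ hdfs dfsadmin -safemode enter           # block new writes
$ hdfs dfsadmin -saveNamespace            # merge edits into a fresh fsimage
$ hdfs dfsadmin -fetchImage /backup/nn    # download the latest fsimage from the active NameNode
$ hdfs dfsadmin -safemode leave           # resume normal operation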
04-30-2018
06:05 AM
@Alan-H There could be different ways, but I tried the below steps and they are working for me.
Step 1: using a select clause with hardcoded values
create table default.mytest(col1 string, col2 int);
insert into default.mytest
select 'For testing single quote\'s', 1;
insert into default.mytest
select 'For testing double quote\"s', 2;
select * from default.mytest;
Step 2: using a select clause, passing the value in a parameter
set hivevar:col1 = 'For testing single quote\'s';
set hivevar:col2 = 3;
insert into default.mytest
select ${hivevar:col1}, ${hivevar:col2};
select * from default.mytest;
Step 3: using a select clause, passing the value in a parameter
set hivevar:col1 = 'For testing double quote\"s';
set hivevar:col2 = 4;
insert into default.mytest
select ${hivevar:col1}, ${hivevar:col2};
select * from default.mytest;
Step 4: cleanup
drop table default.mytest;
04-30-2018
04:10 AM
@bhaveshsharma03 In fact, there is no standard answer to this question, as it is purely based on your business model, cluster size, sqoop export/import frequency, data volume, hardware capacity, etc. I can give a few points based on my experience; hope they help you.
1. 75% of the sqoop scripts (non-priority) use the default number of mappers, for various reasons, as we don't want to use all the available resources for sqoop alone.
2. Also, we don't want to apply all the possible performance tuning methods to those non-priority jobs, as it may disturb the RDBMS (source/target) too.
3. Get in touch with the RDBMS owner to find their non-busy hours, identify the priority sqoop scripts (based on your business model), and apply the performance tuning methods to the priority scripts based on data volume (not only rows; hundreds of columns also matter). Repeat this if you have more than one database.
4. Regarding who is responsible: in most cases, if you have a small cluster used by very few teams, then developers and the admin can work together, but if you have a very large cluster used by many teams, then it is out of the admin's scope... again, it depends.
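For illustration, the mapper count is just a flag on the sqoop job itself; everything below (connection string, table, split column) is a placeholder:
$ sqoop import \
    --connect jdbc:mysql://<db-host>/<db> \
    --username <user> -P \
    --table orders \
    --split-by order_id \
    --num-mappers 8    # default is 4; raise it only for priority jobs during the RDBMS's quiet hours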
04-24-2018
02:25 AM
@ps40 The below link is for the enterprise edition; I believe it should be the same for other editions too. https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_vd.html
1. The first point: according to the above link, Ubuntu Xenial 16.04 is supported by CDH 5.12.2 or above. So if you have decided to upgrade Ubuntu, then you have to upgrade CDH/CM as well.
2. The second point: according to the below link, "If you are upgrading CDH or Cloudera Manager as well as the OS, upgrade the OS first" https://www.cloudera.com/documentation/enterprise/5-11-x/topics/cm_ag_upgrading_os.html
Hope this gives you some insight!
04-23-2018
03:00 AM
@s_l There are two possibilities for this issue:
1. Kerberos - but you are sure it is not related to Kerberos.
2. An environment variable set to a wrong path (or) to an old version - I can see from your code that you have used a few environment variables. Go to the below-mentioned paths and make sure the (binary) file you are referring to is actually available there; if you have upgraded any of your software, multiple versions may be kept side by side, so specify the correct one. I've included JAVA_HOME as well. 'SPARK_HOME' = "/cloudera/parcels/SPARK2/lib/spark2/"
'PYSPARK_PYTHON' = "./xxx/bin/python"
'PYSPARK_PYTHON_DRIVER' = "/home/xxx/python/xxx/bin/python"
'PYTHONPATH' = "/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.4-src.zip:/hadoop/cloudera/parcels/SPARK2/lib/spark2/python/"
JAVA_HOME=/usr/java
04-18-2018
03:37 AM
@Apoorva06 Have you disabled SELinux? If not, please do it; it may help you.
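To check and disable it on a RHEL/CentOS style host (the permanent change needs a reboot):
$ getenforce            # shows Enforcing / Permissive / Disabled
$ sudo setenforce 0     # Permissive until the next reboot
# for a permanent change, set SELINUX=disabled in /etc/selinux/config and reboot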
04-17-2018
07:56 AM
@dpugazhe Below is the usual exercise we follow to reduce log history, but...
a. it purely depends on your client's business; if they are not demanding a longer log history, then you can try this
b. I've given a few samples below; you don't need to reduce the history for all the logs. Please do your own research to see which history files are taking more space, and act on those by reducing the max size and number of history files.
CM -> HDFS -> Configuration -> search for the below:
1. navigator.client.max_num_audit_log -> the default value is 10 - you can reduce it to 8 or 6 (it is recommended to keep more history in general)
2. navigator.audit_log_max_file_size -> the default value is 100 MB - you can reduce it to 80 MB or 50 MB
Note: you can try both --or-- any one
3. DataNode Max Log Size -> the default value is 200 MB - reduce as needed
4. DataNode Maximum Log File Backups -> the default value is 10 - reduce as needed
5. NameNode Max Log Size -> the default value is 200 MB - reduce as needed
6. NameNode Maximum Log File Backups -> the default value is 300 - reduce as needed
NOTE: I am repeating again, please consider points a & b before you take action.
04-17-2018
06:01 AM
1 Kudo
@ronnie10 The issue you are getting is not related to Kerberos. I think you don't have access to /user/root under the below path; please try to access your own home dir instead, it may help you: ' http://192.168.1.7:14000/webhdfs/v1/user/root/t?op=LISTSTATUS '
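For example, pointing the same request at your own home directory (the username is a placeholder; with simple authentication the user.name parameter applies, while a Kerberized HttpFS endpoint would need curl --negotiate -u : instead):
$ curl -i "http://192.168.1.7:14000/webhdfs/v1/user/<your-username>?op=LISTSTATUS&user.name=<your-username>"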
04-16-2018
05:21 AM
@ludof Yes, in general developers will not have access to create a keytab; you have to contact your admin for that (mostly the admin should have permission to create one for you, but some organizations have a dedicated security team to handle LDAP, AD, Kerberos, etc. It depends upon your organization, but you should start with your admin).
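For reference, on an MIT KDC the admin side of this typically looks like the below (the principal and paths are placeholders; Active Directory shops use different tooling such as ktpass):
$ kadmin -p admin/admin -q "xst -k /tmp/myuser.keytab myuser@EXAMPLE.COM"   # note: xst regenerates the principal's keys unless -norandkey is used
$ kinit -kt /tmp/myuser.keytab myuser@EXAMPLE.COM                           # verify the keytab works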
04-16-2018
05:12 AM
@Johnny_Bach As mentioned in this link https://stackoverflow.com/questions/6532273/unrecognized-ssl-message-plaintext-connection-exception please try swapping between the two; it may help you http://<url>:7180 https://<url>:7183
04-16-2018
05:00 AM
@ludof Please follow the input from the below link (read all the comments till the end by pressing "show more"); a similar issue has been discussed there and it may help you https://stackoverflow.com/questions/44376334/how-to-fix-delegation-token-can-be-issued-only-with-kerberos-or-web-authenticat
04-16-2018
04:05 AM
@null_pointer For some reason I cannot see the image you uploaded, but I got your point and will try to answer your question. We cannot always match/compare the memory usage from CM vs Linux, for various reasons:
1. Yes, as you said, CM only counts memory used by Hadoop components; it won't consider any other applications running on the local Linux host, as CM is designed to monitor only Hadoop and dependent services.
2. (I am not sure whether you are getting the CM report from the Host Monitor.) There are practical difficulties in getting the memory usage of every node in a single report. Ex: consider you have 100+ nodes and each node has a different memory capacity like 100 GB, 200 GB, 250 GB, 300 GB, etc.; it is difficult to generate a single report of memory usage per node.
Still, if the default report available in CM does not meet your requirement, you can try to build a custom chart from CM -> Chart (menu) -> your tsquery https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_cluster_util_custom.html
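As a hedged example of such a tsquery (verify the metric names in the chart builder's metric list, as they can vary between CM versions):
SELECT physical_memory_used, physical_memory_total WHERE category = HOST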
04-15-2018
09:19 AM
1 Kudo
@Aedulla Here you go:
http://www.bayareabikeshare.com/open-data
https://grouplens.org/datasets/movielens/
https://www.nyse.com/market-data/historical
You can also use the free Hue demo below (login uid: demo, pwd: demo), where you can find some pre-existing data for Hive, Impala, HBase, etc. Note: if you get an exception after login, please try again after some time or raise a ticket so that someone from the Hue team can fix the issue.
http://demo.gethue.com
04-11-2018
05:11 AM
@nandakumar You can use the adquery commands, e.g.
adquery user <username>
adquery group <groupname>
etc.
04-11-2018
05:08 AM
@bukangarii As long as you have JDBC connectivity to your legacy system, it is possible to export the Parquet Hive table to it. Please check the Sqoop user guide to understand the supported data types.
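A sketch of such an export via HCatalog, which is the usual way Sqoop reads a Parquet-backed Hive table (connection details and names are placeholders, and the target table must already exist on the RDBMS side):
$ sqoop export \
    --connect jdbc:oracle:thin:@//<db-host>:1521/<service> \
    --username <user> -P \
    --table TARGET_TABLE \
    --hcatalog-database mydb \
    --hcatalog-table my_parquet_table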
04-10-2018
11:37 PM
@nandakumar It looks like a Sentry issue. Have you recently added/enabled the Sentry service? If so, you may have to grant the necessary access on your dbs to the user's group. This can be done via Hue, or you can log in to Hive as admin and try the below commands. Ex: consider your user belongs to <my_group>
## role creation:
create role <my_role>;
## grant access to my_role
grant all on database <my_db1> to role <my_role>;
grant select on database <my_db2> to role <my_role>;
## grant role to group
grant role <my_role> to group <my_group>;