Member since: 01-19-2017
Posts: 3598
Kudos Received: 593
Solutions: 359
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 119 | 10-26-2022 12:35 PM |
| 273 | 09-27-2022 12:49 PM |
| 341 | 05-27-2022 12:02 AM |
| 276 | 05-26-2022 12:07 AM |
| 467 | 01-16-2022 09:53 AM |
07-26-2021
02:53 PM
1 Kudo
@sipocootap2 Here is a walkthrough on how to delete a snapshot.
Created a directory:
$ hdfs dfs -mkdir -p /app/tomtest
Changed the owner:
$ hdfs dfs -chown -R tom:developer /app/tomtest
To be able to create a snapshot, the directory has to be snapshottable:
$ hdfs dfsadmin -allowSnapshot /app/tomtest
Allowing snaphot on /app/tomtest succeeded
Now I created 3 snapshots:
$ hdfs dfs -createSnapshot /app/tomtest sipo
Created snapshot /app/tomtest/.snapshot/sipo
$ hdfs dfs -createSnapshot /app/tomtest coo
Created snapshot /app/tomtest/.snapshot/coo
$ hdfs dfs -createSnapshot /app/tomtest tap2
Created snapshot /app/tomtest/.snapshot/tap2
Confirm the directory is snapshottable:
$ hdfs lsSnapshottableDir
drwxr-xr-x 0 tom developer 0 2021-07-26 23:14 3 65536 /app/tomtest
List all the snapshots in the directory:
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 3 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/coo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2
Now I need to delete the snapshot coo:
$ hdfs dfs -deleteSnapshot /app/tomtest/ coo
Confirm the snapshot is gone:
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 2 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2
Voila! To delete a snapshot the format is:
hdfs dfs -deleteSnapshot <path> <snapshotName>
i.e.
hdfs dfs -deleteSnapshot /app/tomtest/ coo
Notice the space and the omission of .snapshot: like all dot files, the snapshot directory is not visible with normal hdfs commands. The plain -ls command gives 0 results:
$ hdfs dfs -ls /app/tomtest/
The special command shows the 2 remaining snapshots:
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 2 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2
Is there a command to disallow snapshots for all the subdirectories? Yes there is, but only after you have deleted all the snapshots therein (demo below); or, better, you can disallow snapshots at directory creation time:
$ hdfs dfsadmin -disallowSnapshot /app/tomtest/
disallowSnapshot: The directory /app/tomtest has snapshot(s). Please redo the operation after removing all the snapshots.
The only way I have found which works for me, and permits me to have a cup of coffee, is to first list all the snapshots and copy-paste the deletes. Even if there are 60 snapshots it works, and I only come back when the snapshots are gone, or better still do something else while the deletion is going on. It is not automated, though; the commands below would run one after the other:
hdfs dfs -deleteSnapshot /app/tomtest/ sipo
.....
....
hdfs dfs -deleteSnapshot /app/tomtest/ tap2
-deleteSnapshot skips trash by default! Happy hadooping
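PS: if you would rather not copy-paste, a small shell loop can do the same job. This is only a minimal sketch, assuming /app/tomtest is the snapshottable directory; it deletes whatever snapshots appear under .snapshot and then disallows snapshots:
#!/bin/sh
# Sketch only: delete every snapshot under the directory, then disallow snapshots.
DIR=/app/tomtest
for snap in $(hdfs dfs -ls "$DIR/.snapshot" | awk 'NR>1 {n=split($NF,a,"/"); print a[n]}'); do
  hdfs dfs -deleteSnapshot "$DIR" "$snap"
done
hdfs dfsadmin -disallowSnapshot "$DIR"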
07-26-2021
01:20 PM
@enirys As suggested, we need more details and there is no silver bullet. A piece of advice from experience: it's better you open a new thread and give as many details as possible:
OS
HDP version
Ambari version
MIT or AD Kerberos
Documented steps or official document reference
Your Kerberos config: krb5.conf, kdc.conf, kadm5.acl
Hosts files
Node number [single or multi node]
Just any information that reduces the back-and-forth of posts and gives members the info needed to help. Cheers
07-26-2021
02:27 AM
@ambari275 Great! Please accept the answer so the thread can be closed and referenced by other users. Happy hadooping !!!
07-26-2021
01:24 AM
@ambari275 These are the steps to follow, see below.
Assumptions: logged in as root, clustername=test, REALM=DOMAIN.COM, hostname=host1.
Switch to the user hdfs, the HDFS superuser:
[root@host1]# su - hdfs
Check the HDFS keytab that was generated:
[hdfs@host1 ~]$ cd /etc/security/keytabs/
[hdfs@host1 keytabs]$ ls
Sample output:
atlas.service.keytab hdfs.headless.keytab knox.service.keytab oozie.service.keytab
Now use the hdfs.headless.keytab to get the associated principal:
[hdfs@host1 keytabs]$ klist -kt /etc/security/keytabs/hdfs.headless.keytab
Expected output:
Keytab name: FILE:/etc/security/keytabs/hdfs.headless.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
Grab a Kerberos ticket by using the keytab + principal, like a username/password, to authenticate to the KDC:
[hdfs@host1 keytabs]$ kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-test@DOMAIN.COM
Check you now have a valid Kerberos ticket:
[hdfs@host1 keytabs]$ klist
Sample output:
Ticket cache: FILE:/tmp/krb5cc_1013
Default principal: hdfs-test@DOMAIN.COM
Valid starting Expires Service principal
07/26/2021 10:03:17 07/27/2021 10:03:17 krbtgt/DOMAIN.COM@DOMAIN.COM
Now you can successfully list the HDFS directories. Remember the -ls; it seems you forgot it in your earlier command:
[hdfs@host1 keytabs]$ hdfs dfs -ls /
Found 9 items
drwxrwxrwx - yarn hadoop 0 2018-09-24 00:31 /app-logs
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:22 /apps
drwxr-xr-x - yarn hadoop 0 2018-09-24 00:12 /ats
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:12 /hdp
drwxr-xr-x - mapred hdfs 0 2018-09-24 00:12 /mapred
drwxrwxrwx - mapred hadoop 0 2018-09-24 00:12 /mr-history
drwxrwxrwx - spark hadoop 0 2021-07-26 10:04 /spark2-history
drwxrwxrwx - hdfs hdfs 0 2021-07-26 00:57 /tmp
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:23 /user
Voila! Happy hadooping, and remember to accept the best response so other users can reference it.
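PS: if you do this often, the kinit step can be scripted. A minimal sketch, assuming the usual HDP keytab location, that pulls the first principal out of the keytab so you never have to type it by hand:
#!/bin/sh
# Sketch only: kinit as the HDFS headless principal found in the keytab.
KEYTAB=/etc/security/keytabs/hdfs.headless.keytab
PRINCIPAL=$(klist -kt "$KEYTAB" | awk 'NR>3 {print $NF; exit}')
kinit -kt "$KEYTAB" "$PRINCIPAL"
klist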
07-25-2021
02:15 PM
@ambari275 I have gone through the logs and here are my observations.
Error:
WARNING: A HTTP GET method, public javax.ws.rs.core.Response org.apache.ambari.server.api.services.ExtensionsService.getExtensionVersions(java.lang.String,javax.ws.rs.core.HttpHeaders,javax.ws.rs.core.UriInfo,java.lang.String), should not consume any entity.
Solution: To fix the issue:
# cat /etc/ambari-server/conf/ambari.properties | grep client.threadpool.size.max
client.threadpool.size.max=25
The client.threadpool.size.max property indicates the number of parallel threads servicing client requests. To find the number of cores on the server, issue the Linux command nproc:
# nproc
25
1) Edit the /etc/ambari-server/conf/ambari.properties file and change the default value of client.threadpool.size.max to the number of cores on your machine: client.threadpool.size.max=25
2) Restart ambari-server:
# ambari-server restart
Error:
2021-07-23 12:43:42,673 WARN [Stack Version Loading Thread] RepoVdfCallable:142 - Could not load version definition for HDP-3.0 identified by https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml. Server returned HTTP response code: 401 for URL: https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml
java.io.IOException: Server returned HTTP response code: 401 for URL: https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml
Reason: 401 means "Unauthorized", so there must be something wrong with your credentials; this is purely an authorization issue. It seems your access to the HDP repos is the problem.
Your krb5.conf should look something like this:
# cat /etc/krb5.conf
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
dns_lookup_realm = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
default_realm = DOMAIN.COM
default_ccache_name = KEYRING:persistent:%{uid}
[realms]
DOMAIN.COM = {
kdc = [FQDN 10.1.1.150]
admin_server =[FQDN 10.1.1.150]
}
[domain_realm]
.domain.com = DOMAIN.COM
domain.com = DOMAIN.COM
Your /etc/hosts: I think I remember once having issues with hostnames containing a hyphen, so try using host1 instead of ESXI-host2, etc. Also, please don't comment out the IPv6 entry; it can cause network connectivity issues, so remove the # on the second line.
x.x.x.x localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.x.x.x FQDN server
x.x.x.x host1
x.x.x.x host2
x.x.x.x host3
The Kerberos service uses DNS to resolve hostnames, so DNS must be enabled on all hosts. With DNS, the principal must contain the fully qualified domain name (FQDN) of each host. For example, if the hostname is host1, the DNS domain name is domain.com, and the realm name is DOMAIN.COM, then the principal name for the host would be host/host1.domain.com@DOMAIN.COM. The examples in this guide require that DNS is configured and that the FQDN is used for each host. Also, ensure the ambari-agent is installed on all hosts, including the ambari-server! Ensure that on all hosts the agent configuration points to the Ambari server:
[server]
hostname=<FQDN_oF_Ambari_server>
url_port=8440
secured_url_port=8441
connect_retry_delay=10
max_reconnect_retry_delay=30
Please revert
07-25-2021
01:34 PM
@USMAN_HAIDER There is this step below; did you perform it? Kerberos must be specified as the security mechanism for the Hadoop infrastructure, starting with the HDFS service. Enable Cloudera Manager Server security for the cluster on an HDFS service. After you do so, the Cloudera Manager Server automatically enables Hadoop security on the MapReduce and YARN services associated with that HDFS service. In the Cloudera Manager Admin Console:
1. Select Clusters > HDFS-n.
2. Click the Configuration tab.
3. Select HDFS-n for the Scope filter.
4. Select Security for the Category filter.
5. Scroll (or search) to find the Hadoop Secure Authentication property.
6. Click the Kerberos button to select Kerberos.
Please revert
07-23-2021
10:57 AM
@ambari275 You can set up the Kerberos server anywhere on the network, provided it can be accessed by the hosts in your cluster. I suspect there is something wrong with your Ambari server. Can you share your /var/log/ambari-server/ambari-server.log? I asked for a couple of files but you only shared the krb5.conf; I will need the rest of the files to understand and determine what the issue could be. Can you describe your setup? Number of nodes, network, OS, etc.
07-22-2021
12:38 PM
@ambari275 From the onset, I see you left the defaults and I doubt whether that really maps to your cluster. Here is a list of outputs I need to validate:
$ hostname -f [where you installed the Kerberos server]
/etc/hosts
/var/kerberos/krb5kdc/kadm5.acl
/var/kerberos/krb5kdc/kdc.conf
On the Kerberos server can you run:
# kadmin.local
then list_principals, and q to quit.
The hostname -f output on the Kerberos server should replace kdc and admin_server in krb5.conf. Here is an example. OS: CentOS 7, cluster realm HOTEL.COM. My hosts entry is for a class C network so yours could be different, but your hostname must be resolvable by DNS.
[root@test ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.153 test.hotel.com test
[root@test ~]# hostname -f
test.hotel.com
[root@test ~]# cat /var/kerberos/krb5kdc/kadm5.acl
*/admin@HOTEL.COM *
[root@test ~]# cat /etc/krb5.conf
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
dns_lookup_realm = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
default_realm = HOTEL.COM
default_ccache_name = KEYRING:persistent:%{uid}
[realms]
HOTEL.COM = {
kdc = test.hotel.com
admin_server =test.hotel.com
}
[domain_realm]
.hotel.com = HOTEL.COM
hotel.com = HOTEL.COM
[root@test ~]# cat /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
HOTEL.COM = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
[realms]
HOTEL.COM = {
master_key_type = des-cbc-crc
database_name = /var/kerberos/krb5kdc/principal
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = des-cbc-crc:normal des3-cbc-raw:normal des3-cbc-sha1:normal des-cbc-crc:v4 des-cbc-crc:afs3
kadmind_port = 749
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/dict/words
}
Once you share the above I can figure out where the issue could be. Happy hadooping
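PS: the principal listing can also be pulled non-interactively, which is easier to copy into a reply; a minimal sketch (-q passes a single query to kadmin.local and exits):
# kadmin.local -q "list_principals"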
07-22-2021
10:46 AM
@USMAN_HAIDER When you create a new principal on the master KDC you should also have a crontab that will propagate the database to the slave KDC(s):
#!/bin/sh
# /var/kerberos/kdc-master-propogate.sh
kdclist="slave-kdc.customer.com"
/sbin/kdb5_util dump /usr/local/var/krb5kdc/master_datatrans
for kdc in $kdclist
do
  /sbin/kprop -f /usr/local/var/krb5kdc/master_datatrans $kdc
done
This way the principals will be sync'ed.
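To run it on a schedule, a hypothetical crontab entry could look like the line below (add it with crontab -e; hourly is just an assumption, pick whatever propagation delay you can tolerate):
0 * * * * /bin/sh /var/kerberos/kdc-master-propogate.sh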
07-18-2021
02:10 PM
@mike_bronson7 Are you using the default Capacity Scheduler settings? No queues/leaves created? Is what you shared the current setting?
07-11-2021
10:32 AM
@srinivasp I am wondering whether your Ranger policies are also in place. Please explicitly give the correct permissions to the group/user in Ranger, as beeline authorization now depends on Ranger 🙂 Happy hadooping.
07-09-2021
03:51 AM
@enirys That's correct. To successfully set up HMS HA, you MUST ensure the metadata DB has been set up following the steps in the official documents: Configuring High Availability for the Hive Metastore and High Availability for Hive Metastore. That should help you sort out the stale metadata issue.
07-08-2021
02:33 PM
@SparkNewbie Bingo, you are using the Derby DB, which is only recommended for testing. There are three modes for Hive Metastore deployment: Embedded Metastore, Local Metastore, and Remote Metastore. By default, the Hive metastore service runs in the same JVM as the Hive service and uses an embedded Derby database stored on the local file system. This mode has the limitation that only one embedded Derby database can access the database files on disk at any one time, so only one Hive session can be open at a time.
21/07/07 23:07:56 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
Derby is an embedded relational database written in Java, used for online transaction processing, with a 3.5 MB disk-space footprint. Depending on your software (HDP or Cloudera/CDH), ensure the Hive DB is plugged into an external MySQL database (for CDH use MySQL, for HDP use MySQL). Check your current Hive backend metastore database! After installing MySQL you should toggle the Hive config to point to the external MySQL database. Once done, your commands and the refresh should succeed. Please let me know if you need help. Happy hadooping
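PS: to see which backing database the metastore currently points at, one quick check is the JDBC connection URL in hive-site.xml; a minimal sketch, assuming the usual /etc/hive/conf location:
# The ConnectionURL shows whether the metastore uses Derby (jdbc:derby:...) or MySQL (jdbc:mysql://...)
grep -A1 'javax.jdo.option.ConnectionURL' /etc/hive/conf/hive-site.xml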
07-07-2021
11:12 AM
@SparkNewbie Can you add the highlighted line between your 2 commands:
spark.sql("create table test1_0522 location 's3a://<test-bucket>/data/test1_0522' stored as PARQUET as select * from my_temp_table")
spark.sql("REFRESH TABLE test1_0522")
spark.sql("SHOW TABLES").show()
That should resolve the problem. Happy hadooping
07-05-2021
10:51 AM
@t1 I tried out the sqoop list-databases and my output looks correct:
[root@bern ~]# sqoop list-databases \
> --connect jdbc:mysql://localhost:3306/ \
> --username root \
> --password welcome
.........
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/05 17:28:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.1.4.0-315
21/07/05 17:28:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/07/05 17:28:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
ambari
druid
harsh8
hive
mysql
oozie
performance_schema
ranger
rangerkms
superset
Then I ran exactly the same sqoop import and it succeeded, but I think it's the underlying table format:
[hdfs@bern ~]$ sqoop import --connect jdbc:mysql://localhost:3306/harsh8 --username root --table staff2 --hive-import --fields-terminated-by "," --hive-import --create-hive-table --hive-table staff2_backup --m 1
SLF4J: Class path contains multiple SLF4J bindings.
.....
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/05 18:36:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.1.4.0-315
21/07/05 18:36:41 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/07/05 18:36:41 INFO tool.CodeGenTool: Beginning code generation
21/07/05 18:36:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/07/05 18:36:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/07/05 18:36:43 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/3.1.4.0-315/hadoop-mapreduce
21/07/05 18:36:48 WARN orm.CompilationManager: Could not rename /tmp/sqoop-hdfs/compile/358a7be0c1aae1ac531284e68ae3679e/staff2.java to /home/hdfs/./staff2.java. Error: Destination '/home/hdfs/./staff2.java' already exists
21/07/05 18:36:48 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/358a7be0c1aae1ac531284e68ae3679e/staff2.jar
21/07/05 18:36:49 WARN manager.MySQLManager: It looks like you are importing from mysql.
21/07/05 18:36:49 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
21/07/05 18:36:49 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
21/07/05 18:36:49 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
21/07/05 18:36:49 INFO mapreduce.ImportJobBase: Beginning import of staff2
21/07/05 18:36:59 INFO client.RMProxy: Connecting to ResourceManager at bern.swiss.ch/192.168.0.139:8050
21/07/05 18:37:09 INFO client.AHSProxy: Connecting to Application History server at bern.swiss.ch/192.168.0.139:10200
21/07/05 18:38:28 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hdfs/.staging/job_1625500722080_0001
21/07/05 18:40:05 INFO db.DBInputFormat: Using read commited transaction isolation
21/07/05 18:40:23 INFO mapreduce.JobSubmitter: number of splits:1
21/07/05 18:40:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1625500722080_0001
21/07/05 18:40:32 INFO mapreduce.JobSubmitter: Executing with tokens: []
21/07/05 18:40:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
21/07/05 18:40:37 INFO impl.YarnClientImpl: Submitted application application_1625500722080_0001
21/07/05 18:40:37 INFO mapreduce.Job: The url to track the job: http://bern.swiss.ch:8088/proxy/application_1625500722080_0001/
21/07/05 18:40:37 INFO mapreduce.Job: Running job: job_1625500722080_0001
21/07/05 18:46:55 INFO mapreduce.Job: Job job_1625500722080_0001 running in uber mode : false
21/07/05 18:46:55 INFO mapreduce.Job: map 0% reduce 0%
21/07/05 18:50:56 INFO mapreduce.Job: map 100% reduce 0%
21/07/05 18:51:09 INFO mapreduce.Job: Job job_1625500722080_0001 completed successfully
21/07/05 18:51:10 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
.............
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=385416
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=192708
Total vcore-milliseconds taken by all map tasks=192708
Total megabyte-milliseconds taken by all map tasks=394665984
Map-Reduce Framework
Map input records=6
Map output records=6
............
Physical memory (bytes) snapshot=152813568
Virtual memory (bytes) snapshot=3237081088
Total committed heap usage (bytes)=81788928
Peak Map Physical memory (bytes)=152813568
Peak Map Virtual memory (bytes)=3237081088
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=223
21/07/05 18:51:10 INFO mapreduce.ImportJobBase: Transferred 223 bytes in 852.4312 seconds (0.2616 bytes/sec)
21/07/05 18:51:10 INFO mapreduce.ImportJobBase: Retrieved 6 records.
21/07/05 18:51:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/07/05 18:51:10 INFO hive.HiveImport: Loading uploaded data into Hive
My table structure:
MariaDB [harsh8]> describe staff2;
+------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| Name | varchar(20) | YES | | NULL | |
| Position | varchar(20) | YES | | NULL | |
| Salary | int(11) | YES | | NULL | |
| Department | varchar(10) | YES | | NULL | |
+------------+-------------+------+-----+---------+-------+
My test table contents:
MariaDB [harsh8]> select * from staff2;
+-----+------------+-------------------+--------+------------+
| id | Name | Position | Salary | Department |
+-----+------------+-------------------+--------+------------+
| 100 | Geoffrey | manager | 50000 | Admin |
| 101 | Thomas | Oracle Consultant | 15000 | IT |
| 102 | Biden | Project Manager | 28000 | PM |
| 103 | Carmicheal | Bigdata developer | 30000 | BDS |
| 104 | Johnson | Treasurer | 21000 | Accounts |
| 105 | Gerald | Director | 30000 | Management |
+-----+------------+-------------------+--------+------------+
6 rows in set (0.09 sec)
This is what my comma-delimited source file looks like:
[hdfs@bern ~]$ hdfs dfs -cat /tmp/sqoop/hr.txt
100,Geoffrey,manager,50000,Admin
101,Thomas,Oracle Consultant,15000,IT
102,Biden,Project Manager,28000,PM
103,Carmicheal,Bigdata developer,30000,BDS
104,Johnson,Treasurer,21000,Accounts
105,Gerald,Director,30000,Management
106,Paul,Director,30000,Management
105,Mark,CEO,90000,Management
105,Edward,Janitor,30000,Housing
105,Richard,Farm Manager,31000,Agriculture
105,Albert,System Engineer,21000,IT
My dataset looks like the above. Is your table format AVRO? Happy hadooping
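PS: one quick way to check the storage format of an existing Hive table is to look at its DDL; a minimal sketch (the database and table names are just examples, substitute yours):
hive -e "SHOW CREATE TABLE default.staff2_backup"
The STORED AS / SerDe lines in the output tell you whether the table is text, Parquet, ORC, or Avro.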
07-05-2021
08:10 AM
1 Kudo
@t1 Is there a way I can re-create your tables? I could try the same commands, since I also have MySQL/MariaDB, and keep you posted!
07-05-2021
06:14 AM
@t1 How does the root user authenticate against the database? If it is username/password, then I don't see the prompt for the password. Can you run the below and re-share the output? I added -P so you are prompted for the password:
sqoop import \
 --connect jdbc:mysql://localhost:3306/sample \
 --username root -P \
 --table test \
 --hive-import \
 --fields-terminated-by "," --create-hive-table --hive-table sample.tesr --m 4
07-05-2021
06:08 AM
@Guarupe Can you share your steps, please? Are you using HUE to run your commands? Did you use the Impala editor?
07-04-2021
11:03 PM
@ask_bill_brooks Thanks for the addendum and official context. Happy hadooping
07-04-2021
10:54 AM
@t1 Can you share the whole command plus its output?
07-04-2021
04:07 AM
@Guarupe I responded to a similar question ("Warm up Impala"). You will need to run INVALIDATE METADATA [[db_name.]table_name]. The error is precise; Impala uses the Hive Metastore [HMS] to build efficient queries:
CAUSED BY: MetaException: Column mycolumn doesn't exist in table mytable in database myschema
In your case: INVALIDATE METADATA myschema.mytable
INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more. Metadata loading for tables is triggered by any subsequent queries. After running this in the impala-shell you should be able to compute statistics successfully. Happy hadooping
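PS: if you want to script it rather than type it into the shell, a minimal sketch (impala-shell -q runs a single statement; the schema and table names here are just the ones from your error message):
impala-shell -q "INVALIDATE METADATA myschema.mytable"
impala-shell -q "COMPUTE STATS myschema.mytable"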
07-03-2021
02:39 PM
@harsh8 Happy to help with that question. The simple answer is YES; below I am demonstrating with the table staff created in the previous post!
Before the import:
[hdfs@bern ~]$ hdfs dfs -ls /tmp
Found 5 items
drwxrwxr-x - druid hadoop 0 2020-07-06 02:04 /tmp/druid-indexing
drwxr-xr-x - hdfs hdfs 0 2020-07-06 01:50 /tmp/entity-file-history
drwx-wx-wx - hive hdfs 0 2020-07-06 01:59 /tmp/hive
-rw-r--r-- 3 hdfs hdfs 1024 2020-07-06 01:57 /tmp/ida8c04300_date570620
drwxr-xr-x - hdfs hdfs 0 2021-06-29 10:14 /tmp/sqoop
When you run the sqoop import, ensure the destination directory sqoop_harsh8 doesn't already exist in HDFS:
$ sqoop import --connect jdbc:mysql://localhost/harsh8 --table staff --username root -P --target-dir /tmp/sqoop_harsh8 -m 1
Here I am importing the table harsh8.staff I created in the previous session. The sqoop import will create 2 files, _SUCCESS and part-m-00000, in the HDFS directory as shown below. After the import the directory /tmp/sqoop_harsh8 is newly created:
[hdfs@bern ~]$ hdfs dfs -ls /tmp
Found 5 items
drwxrwxr-x - druid hadoop 0 2020-07-06 02:04 /tmp/druid-indexing
drwxr-xr-x - hdfs hdfs 0 2020-07-06 01:50 /tmp/entity-file-history
drwx-wx-wx - hive hdfs 0 2020-07-06 01:59 /tmp/hive
-rw-r--r-- 3 hdfs hdfs 1024 2020-07-06 01:57 /tmp/ida8c04300_date570620
drwxr-xr-x - hdfs hdfs 0 2021-06-29 10:14 /tmp/sqoop
-rw-r--r-- 3 hdfs hdfs 0 2021-07-03 22:04 /tmp/sqoop_harsh8
Check the contents of /tmp/sqoop_harsh8:
[hdfs@bern ~]$ hdfs dfs -ls /tmp/sqoop_harsh8
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2021-07-03 22:04 /tmp/sqoop_harsh8/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 223 2021-07-03 22:04 /tmp/sqoop_harsh8/part-m-00000
The _SUCCESS file is just a marker, so cat the contents of part-m-00000; this is the data from our table harsh8.staff:
[hdfs@bern ~]$ hdfs dfs -cat /tmp/sqoop_harsh8/part-m-00000
100,Geoffrey,manager,50000,Admin
101,Thomas,Oracle Consultant,15000,IT
102,Biden,Project Manager,28000,PM
103,Carmicheal,Bigdata developer,30000,BDS
104,Johnson,Treasurer,21000,Accounts
105,Gerald,Director,30000,Management
I piped the contents to a text file hr2.txt in my local tmp directory so I could run the sqoop export with an acceptable format:
[hdfs@bern ~]$ hdfs dfs -cat /tmp/sqoop_harsh8/part-m-00000 > /tmp/hr2.txt
Validate the hr2.txt contents:
[hdfs@bern ~]$ cat /tmp/hr2.txt
100,Geoffrey,manager,50000,Admin
101,Thomas,Oracle Consultant,15000,IT
102,Biden,Project Manager,28000,PM
103,Carmicheal,Bigdata developer,30000,BDS
104,Johnson,Treasurer,21000,Accounts
105,Gerald,Director,30000,Management
Copied hr2.txt to HDFS and validated the file was copied:
[hdfs@bern ~]$ hdfs dfs -copyFromLocal /tmp/hr2.txt /tmp
Validation:
[hdfs@bern ~]$ hdfs dfs -ls /tmp
Found 7 items
drwxrwxr-x - druid hadoop 0 2020-07-06 02:04 /tmp/druid-indexing
drwxr-xr-x - hdfs hdfs 0 2020-07-06 01:50 /tmp/entity-file-history
drwx-wx-wx - hive hdfs 0 2020-07-06 01:59 /tmp/hive
-rw-r--r-- 3 hdfs hdfs 223 2021-07-03 22:41 /tmp/hr2.txt
-rw-r--r-- 3 hdfs hdfs 1024 2020-07-06 01:57 /tmp/ida8c04300_date570620
drwxr-xr-x - hdfs hdfs 0 2021-06-29 10:14 /tmp/sqoop
drwxr-xr-x - hdfs hdfs 0 2021-07-03 22:04 /tmp/sqoop_harsh8
Connected to MySQL and switched to the harsh8 database:
[root@bern ~]# mysql -uroot -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 179
Server version: 5.5.65-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> use harsh8;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
Check the existing tables in the harsh8 database before the export:
MariaDB [harsh8]> show tables;
+------------------+
| Tables_in_harsh8 |
+------------------+
| staff |
+------------------+
1 row in set (0.00 sec)
Pre-create a table staff2 to receive the hr2.txt data:
MariaDB [harsh8]> CREATE TABLE staff2 ( id INT NOT NULL PRIMARY KEY, Name VARCHAR(20), Position VARCHAR(20), Salary INT, Department VARCHAR(10));
Query OK, 0 rows affected (0.57 sec)
MariaDB [harsh8]> show tables;
+------------------+
| Tables_in_harsh8 |
+------------------+
| staff |
| staff2 |
+------------------+
2 rows in set (0.00 sec)
Load data into staff2 with a sqoop export:
[hdfs@bern ~]$ sqoop export --connect jdbc:mysql://localhost/harsh8 --username root --password 'w3lc0m31' --table staff2 --export-dir /tmp/hr2.txt
...
21/07/03 22:44:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.1.4.0-315
21/07/03 22:44:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/07/03 22:44:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/07/03 22:44:37 INFO tool.CodeGenTool: Beginning code generation
21/07/03 22:44:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/07/03 22:44:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/07/03 22:44:40 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/3.1.4.0-315/hadoop-mapreduce
21/07/03 22:45:48 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/a53bb813b88ab155201196658f3ee001/staff2.jar
21/07/03 22:45:48 INFO mapreduce.ExportJobBase: Beginning export of staff2
21/07/03 22:47:59 INFO client.RMProxy: Connecting to ResourceManager at bern.swiss.ch/192.168.0.139:8050
21/07/03 22:48:07 INFO client.AHSProxy: Connecting to Application History server at bern.swiss.ch/192.168.0.139:10200
21/07/03 22:48:18 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hdfs/.staging/job_1625340048377_0003
21/07/03 22:49:05 INFO input.FileInputFormat: Total input files to process : 1
21/07/03 22:49:05 INFO input.FileInputFormat: Total input files to process : 1
21/07/03 22:49:12 INFO mapreduce.JobSubmitter: number of splits:4
21/07/03 22:49:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1625340048377_0003
21/07/03 22:49:24 INFO mapreduce.JobSubmitter: Executing with tokens: []
21/07/03 22:49:26 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
21/07/03 22:49:32 INFO impl.YarnClientImpl: Submitted application application_1625340048377_0003
21/07/03 22:49:33 INFO mapreduce.Job: The url to track the job: http://bern.swiss.ch:8088/proxy/application_1625340048377_0003/
21/07/03 22:49:33 INFO mapreduce.Job: Running job: job_1625340048377_0003
21/07/03 22:52:15 INFO mapreduce.Job: Job job_1625340048377_0003 running in uber mode : false
21/07/03 22:52:15 INFO mapreduce.Job: map 0% reduce 0%
21/07/03 22:56:45 INFO mapreduce.Job: map 75% reduce 0%
21/07/03 22:58:10 INFO mapreduce.Job: map 100% reduce 0%
21/07/03 22:58:13 INFO mapreduce.Job: Job job_1625340048377_0003 completed successfully
21/07/03 22:58:14 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=971832
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1132
HDFS: Number of bytes written=0
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=1733674
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=866837
Total vcore-milliseconds taken by all map tasks=866837
Total megabyte-milliseconds taken by all map tasks=1775282176
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=526
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1565
CPU time spent (ms)=6710
Physical memory (bytes) snapshot=661999616
Virtual memory (bytes) snapshot=12958916608
Total committed heap usage (bytes)=462422016
Peak Map Physical memory (bytes)=202506240
Peak Map Virtual memory (bytes)=3244965888
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
21/07/03 22:58:14 INFO mapreduce.ExportJobBase: Transferred 1.1055 KB in 630.3928 seconds (1.7957 bytes/sec)
21/07/03 22:58:14 INFO mapreduce.ExportJobBase: Exported 6 records.
Log onto MariaDB, switch to the harsh8 database, and query the new table staff2:
[root@bern ~]# mysql -uroot -pwelcome1
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 266
Server version: 5.5.65-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> use harsh8;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [harsh8]> show tables;
+------------------+
| Tables_in_harsh8 |
+------------------+
| staff |
| staff2 |
+------------------+
2 rows in set (2.64 sec)
MariaDB [harsh8]> SELECT NOW();
+---------------------+
| NOW() |
+---------------------+
| 2021-07-03 23:02:38 |
+---------------------+
1 row in set (0.00 sec)
After the export, staff2 now has the data:
MariaDB [harsh8]> select * from staff2;
+-----+------------+-------------------+--------+------------+
| id | Name | Position | Salary | Department |
+-----+------------+-------------------+--------+------------+
| 100 | Geoffrey | manager | 50000 | Admin |
| 101 | Thomas | Oracle Consultant | 15000 | IT |
| 102 | Biden | Project Manager | 28000 | PM |
| 103 | Carmicheal | Bigdata developer | 30000 | BDS |
| 104 | Johnson | Treasurer | 21000 | Accounts |
| 105 | Gerald | Director | 30000 | Management |
+-----+------------+-------------------+--------+------------+
6 rows in set (0.00 sec)
Check against the source table used in the export; note the timestamp!
MariaDB [harsh8]> SELECT NOW();
+---------------------+
| NOW() |
+---------------------+
| 2021-07-03 23:04:50 |
+---------------------+
1 row in set (0.00 sec)
Comparison:
MariaDB [harsh8]> select * from staff;
+-----+------------+-------------------+--------+------------+
| id | Name | Position | Salary | Department |
+-----+------------+-------------------+--------+------------+
| 100 | Geoffrey | manager | 50000 | Admin |
| 101 | Thomas | Oracle Consultant | 15000 | IT |
| 102 | Biden | Project Manager | 28000 | PM |
| 103 | Carmicheal | Bigdata developer | 30000 | BDS |
| 104 | Johnson | Treasurer | 21000 | Accounts |
| 105 | Gerald | Director | 30000 | Management |
+-----+------------+-------------------+--------+------------+
You have successfully created a table from a Sqoop export! Et voila, the conversion from part-m-00000 to txt did the trick. This proves it's doable, so your question is answered 🙂 You can revalidate by following my steps. Happy hadooping !
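PS: condensed, the round trip above is just the commands below (same database, paths and table names as in the walkthrough; -P prompts for the MySQL password instead of putting it on the command line):
sqoop import --connect jdbc:mysql://localhost/harsh8 --username root -P \
  --table staff --target-dir /tmp/sqoop_harsh8 -m 1
hdfs dfs -cat /tmp/sqoop_harsh8/part-m-00000 > /tmp/hr2.txt
hdfs dfs -copyFromLocal /tmp/hr2.txt /tmp
sqoop export --connect jdbc:mysql://localhost/harsh8 --username root -P \
  --table staff2 --export-dir /tmp/hr2.txt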
07-02-2021
03:03 AM
@mike_bronson7 Waiting for your response with the logs.
07-02-2021
02:51 AM
@dooby There is a Jira out there; see the solution: https://issues.apache.org/jira/browse/SPARK-32536
07-01-2021
01:12 PM
@Faizan123 The NameNode [master] and DataNode [slave] are part of HDFS, which is the storage layer, and the ResourceManager [master] and NodeManager [slave] are part of YARN, which is the resource negotiator. HDFS and YARN usually work together but are quite independent in design and architecture; their slave processes, the DataNode and NodeManager, run together on the compute nodes. Picture a high-level architecture of the RM and NM, with the RM being the brain, and a standard layout of a Hadoop cluster (we could easily add a second RM for HA): on the 12 compute nodes the NM and DN are co-located for localized processing. It's illogical to separate the DN and NM onto different nodes. The NodeManager is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster: updating the ResourceManager (RM) with the status of jobs running on the node, overseeing container life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications. DataNode is the name of the daemon that manages and stores data in a Hadoop cluster. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data. That's the reason the DN and NM are co-located on the same VM/host. It would be very interesting to see a screenshot of the roles co-located with your data nodes. Hope that gives you a clearer picture.
07-01-2021
12:12 PM
@harsh8 Any updates? Please let me know if you still need help. Happy hadooping
07-01-2021
11:23 AM
1 Kudo
@drgenious First, Impala shares metadata [data about data] with the Hive Metastore [HMS]. Impala uses HDFS caching to provide performance and scalability benefits in production environments where Impala queries and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache.
Having said that, when you restart Impala you discard all the cached metadata [location of tables, permissions, query execution plans, statistics] that makes it efficient. That explains why your queries are so slow after the restart. Impala is very efficient if it reads data that is pinned in memory through HDFS caching. It takes advantage of the HDFS API and reads the data from memory rather than from disk, whether the data files are pinned using Impala DDL statements or using the command-line mechanism where you specify HDFS paths. There is no better source of Impala information than Cloudera; I urge you to take the time to read the documentation below to pin the option in your memory 🙂
Using HDFS Caching with Impala
Configuring HDFS Caching for Impala
There are 2 other options that you should think of as less expensive than restarting Impala (I can't imagine you have more than 70 data nodes):
INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more. Metadata loading for tables is triggered by any subsequent queries.
REFRESH reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated. REFRESH cannot detect changes in block locations triggered by operations like the HDFS balancer, hence causing remote reads during query execution with negative performance implications.
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible prefer REFRESH rather than INVALIDATE METADATA. INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive clients such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad. I hope that explains why, and gives you options to use rather than restarting Impala. If you know which table you want to query, run this beforehand, qualifying it with the db name and table name; this has saved me time with my data scientists, and encapsulating it in their scripts is a good thing:
INVALIDATE METADATA [[db_name.]table_name]
Recomputing the statistics is another solution:
COMPUTE STATS <table_name>;
The COMPUTE STATS statement gathers information about the volume and distribution of data in a table and all associated columns and partitions. The information is stored in the Hive metastore database and used by Impala to help optimize queries. Hope that enlightens you.
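For the common case where only one table's data changed, the lighter-weight per-table refresh can also be scripted; a minimal sketch (the database and table names here are hypothetical placeholders):
impala-shell -q "REFRESH sales_db.orders"
impala-shell -q "COMPUTE STATS sales_db.orders"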
07-01-2021
04:44 AM
@mike_bronson7 Can you share the below files from /var/log/hadoop-yarn/yarn:
hadoop-yarn-resourcemanager-{hostname}.log
hadoop-yarn-resourcemanager-{hostname}.out
Happy hadooping
... View more
06-30-2021
03:49 PM
1 Kudo
@dmharshit It's difficult to explain in 3 minutes, but the Capacity Scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users can share a large cluster. Every company running its own private cluster leads to poor resource utilization: the cluster may provide enough resources to meet peak demand, but that peak may not occur very frequently, resulting in poor resource utilization the rest of the time. Thus sharing clusters among companies is a more cost-effective idea. However, companies are concerned about sharing a cluster because they worry they may not get enough resources at the time of peak utilization. The CapacityScheduler in YARN mitigates that concern by giving each company capacity guarantees.
Capacity Scheduler functionality in YARN: the Capacity Scheduler in Hadoop works on the concept of queues. For example, each department gets its own dedicated queue with a percentage of the total cluster capacity for its own use. If two departments share the cluster, one department may be given 60% of the cluster capacity and the other 40%. On top of that, to provide further control and predictability over the sharing of resources, the CapacityScheduler supports hierarchical queues. A company can further divide its allocated cluster capacity into separate sub-queues for separate sets of users within a department.
The Capacity Scheduler is also flexible and allows the allocation of free resources to any queue beyond its capacity. This provides elasticity in a cost-effective manner: when the queue to which those resources actually belong has increased demand, the resources are allocated back to it as they are released by other queues. This is a fantastic write-up: YARN the Capacity Scheduler.
The maximum capacity is an elastic-like capacity that allows queues to make use of resources that are not being used to fill minimum-capacity demand in other queues. Child queues inherit the resources of their parent queue. For example, with the Preference branch, the Low leaf queue gets 20% of the Preference queue's 20% minimum capacity while the High leaf gets 80% of that 20% minimum capacity. Minimum capacity always has to add up to 100% across all the leaves under a parent. I didn't have the opportunity tonight to build a cluster to mirror the above setup and share the capacity scheduler config to give you a better understanding.