Member since: 03-06-2017
Posts: 11
Kudos Received: 1
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
19932 | 10-22-2018 02:44 PM |
10-02-2019 06:18 AM
@lvazquez Maybe you can run "kinit" directly to submit your user's credentials to your LDAP. I managed to authenticate users from AD while the cluster is Kerberized through a FreeIPA server. Here is a sample command:
%sh
echo "password" | kinit foo@hortonworks.local
hdfs dfs -ls /
Found 12 items
drwxrwxrwt - yarn hadoop 0 2019-10-02 13:53 /app-logs
drwxr-xr-x - hdfs hdfs 0 2019-10-01 15:27 /apps
drwxr-xr-x - yarn hadoop 0 2019-10-01 14:06 /ats
drwxr-xr-x - hdfs hdfs 0 2019-10-01 14:08 /atsv2
drwxr-xr-x - hdfs hdfs 0 2019-10-01 14:06 /hdp
drwx------ - livy hdfs 0 2019-10-02 11:35 /livy2-recovery
drwxr-xr-x - mapred hdfs 0 2019-10-01 14:06 /mapred
drwxrwxrwx - mapred hadoop 0 2019-10-01 14:08 /mr-history
drwxrwxrwx - spark hadoop 0 2019-10-02 15:08 /spark2-history
drwxrwxrwx - hdfs hdfs 0 2019-10-01 15:31 /tmp
drwxr-xr-x - hdfs hdfs 0 2019-10-02 14:23 /user
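drwxr-xr-x - hdfs hdfs 0 2019-10-01 15:14 /warehouse
I think this approach is really ugly, but at least it works. Do not forget to update the auth_to_local rules in your HDFS configuration (the hadoop.security.auth_to_local property, which normally lives in core-site rather than hdfs-site):
RULE:[1:$1@$0](.*@HORTONWORKS.LOCAL)s/@.*//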
RULE:[1:$1@$0](.*@IPA.HORTONWORKS.LOCAL)s/@.*//
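For readers unfamiliar with auth_to_local, the effect of the two rules above can be approximated in plain shell. This is only a sketch: the function name `strip_realm` is hypothetical, the `sed` expression reproduces just the `s/@.*//` substitution part of each rule, and the real evaluation is done by Hadoop itself.

```shell
#!/bin/sh
# Hypothetical helper: mimics only the "s/@.*//" substitution of the
# auth_to_local rules, i.e. it drops everything from the first "@" onward.
# Hadoop also matches the realm pattern before substituting; this sketch does not.
strip_realm() {
  printf '%s\n' "$1" | sed 's/@.*//'
}

strip_realm "foo@HORTONWORKS.LOCAL"
strip_realm "foo@IPA.HORTONWORKS.LOCAL"
```

To see what a cluster actually resolves, `hadoop org.apache.hadoop.security.HadoopKerberosName foo@HORTONWORKS.LOCAL` prints the short name that Hadoop derives from the configured rules.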
04-24-2019 08:42 AM
I managed to retrieve the group named "ad_sshaccess_users" from the LDAP directory into Ambari, but the group shows "0 members". In Active Directory, however, I created 2 users in the group that is mapped in FreeIPA. Do you know whether Ambari can retrieve AD users through a FreeIPA server that handles the LDAP part? I'm not sure it can.
04-17-2019 03:23 PM
I have a Kerberized HDP 3.1 cluster set up with a FreeIPA server, and the trust between Active Directory and the FreeIPA server is already in place. Now I would like to add to Ambari the members of a group created in Active Directory, which I have mapped to the FreeIPA server.
I created an Active Directory group called "FreeIPA-Member" containing some users: hdp-test and toto.
I mapped the FreeIPA-Member group from Active Directory to the FreeIPA server using the following commands:
ipa group-add --desc='AD users external for FreeIPA-Members' ad_users_external_freeipa --external
Then I created the POSIX group ad_sshaccess_users in FreeIPA:
ipa group-add --desc='AD SSH access users' ad_sshaccess_users
ipa group-add-member ad_users_external_freeipa --external "Ad\FreeIPA-Members"
ipa group-add-member ad_sshaccess_users --groups ad_users_external_freeipa
Now I have the ad_sshaccess_users group, which is mapped to the external Active Directory group containing the Active Directory users that I want to use to log in to the Ambari Web UI.
I also set up the LDAP part on the Ambari server:
ambari-server setup-ldap
Using python /usr/bin/python
Enter Ambari Admin login: admin
Enter Ambari Admin password:
Fetching LDAP configuration from DB.
Primary LDAP Host (ipaserverhostname.ipadomain):
Primary LDAP Port (636):
Secondary LDAP Host :
Secondary LDAP Port :
Use SSL [true/false] (True):
Disable endpoint identification during SSL handshake [true/false] (False):
Do you want to provide custom TrustStore for Ambari [y/n] (n)?
User object class (posixAccount):
User ID attribute (uid):
Group object class (posixAccount):
Group name attribute (cn):
Group member attribute (member):
Distinguished name attribute (dn):
Search Base (cn=groups,cn=accounts,dc=ipa,dc=domain,dc=name,dc=com):
Referral method [follow/ignore] (follow):
Bind anonymously [true/false] (False):
Bind DN (uid=hadoopadmin,cn=users,cn=accounts,dc=ipa,dc=domain,dc=name,dc=com):
Enter Bind DN Password:
Confirm Bind DN Password:
Handling behavior for username collisions [convert/skip] for LDAP sync (skip):
Force lower-case user names [true/false] (True):
Results from LDAP are paginated when requested [true/false] (False):
ambari-server restart
I followed the HDP documentation to synchronize users and groups with the Ambari server:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/ambari-authentication-ldap-ad/content/authe_ldapad_synchronizing_ldap_users_and_groups.html
I tried adding the ad_sshaccess_users group to a text file:
echo "ad_sshaccess_users" > /tmp/groups.txt
and then running the sync-ldap command on the Ambari server:
ambari-server sync-ldap --ldap-sync-admin-name=admin --ldap-sync-admin-password=admin --groups=/tmp/groups.txt
I get the following error, which means that the Ambari server can't find the group in the LDAP directory:
Using python /usr/bin/python
Syncing with LDAP...
Fetching LDAP configuration from DB.
Syncing specified users and groups...ERROR: Exiting with exit code 1.
REASON: Caught exception running LDAP sync. Couldn't sync LDAP group ad_sshaccess_users, it doesn't exist
I can kinit with a user from the LDAP:
kinit hdp-test@AD.DOMAIN
Password for hdp-test@AD.DOMAIN:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdp-test@AD.DOMAIN
Valid starting Expires Service principal
04/17/2019 15:04:28 04/18/2019 01:04:28 krbtgt/AD.DOMAIN@AD.DOMAIN
renew until 04/24/2019 15:04:25
If you have any solutions or suggestions, please do not hesitate to share them. Thanks in advance.
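The error says Ambari could not find the group, so one way to narrow it down is to run the equivalent lookup by hand. The sketch below only builds the search filter. The function name `group_filter` is hypothetical, and `posixGroup` is an assumption about how the group might be stored; the setup transcript above shows `posixAccount` as the group object class, so it may be worth comparing both.

```shell
#!/bin/sh
# Hypothetical helper: builds an LDAP filter that matches a group entry by
# object class and name. Arguments: <object class> <group name>.
group_filter() {
  printf '(&(objectClass=%s)(cn=%s))\n' "$1" "$2"
}

# Filter using the object class shown in the setup-ldap transcript.
group_filter posixAccount ad_sshaccess_users
# Filter using posixGroup, an assumption about how the group is stored.
group_filter posixGroup ad_sshaccess_users
```

Either filter can then be passed to `ldapsearch` with the host, bind DN, and search base from the setup, for example `ldapsearch -H ldaps://ipaserverhostname.ipadomain:636 -D "<bind DN>" -W -b "<search base>" "<filter>"`. If only one of the two filters returns the entry, that points at the group object class setting.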
10-22-2018 02:44 PM
1 Kudo
One way to import your data as Parquet files while still handling the TIMESTAMP and DATE types that come from an RDBMS such as IBM DB2 or MySQL is to run sqoop import with --as-parquetfile and use --map-column-java to map each TIMESTAMP and DATE column to the Java String type.
After that, you should be able to query the Hive database through a SparkSession by setting spark.sql.hive.convertMetastoreParquet to false in the session configuration. Spark SQL will then use the Hive SerDe to read Parquet tables instead of its built-in support.
spark.sql.hive.convertMetastoreParquet false
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder()
.appName("test interrogate Hive parquet file using Spark")
.config("spark.sql.parquet.compression.codec", "snappy")
.config("spark.sql.warehouse.dir","/apps/hive/warehouse")
.config("hive.metastore.uris","thrift://sdsl-hdp-01.mycluster:9083")
.config("spark.sql.hive.convertMetastoreParquet", false)
.enableHiveSupport()
.getOrCreate()
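// Import from the session created above so the snippet also works outside spark-shell
import sparkSession.implicits._
import sparkSession.sql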
val df = sql("SELECT CAST(COL1 AS TIMESTAMP), COL2, COL3, CAST(COL4 AS TIMESTAMP), COL5 FROM db.mytable")
df.printSchema
root
|-- COL1: timestamp (nullable = true)
|-- COL2: string (nullable = true)
|-- COL3: string (nullable = true)
|-- COL4: timestamp (nullable = true)
|-- COL5: integer (nullable = true)
df.show(5, false)
+--------------------------+--------+--------+--------------------------+------+
|COL1 |COL2 |COL3 |COL4 |COL5|
+--------------------------+--------+--------+--------------------------+------+
|2003-01-01 00:00:00.100001| |00001 |2003-01-01 00:00:00.10361 |1 |
|2003-01-01 00:00:00.100002| |00002 |2003-01-01 00:00:00.100002|2 |
|2003-01-01 00:00:00.100003| |00003 |2003-01-01 00:00:00.100003|3 |
|2003-01-01 00:00:00.100004| |00004 |2003-01-01 00:00:00.100004|4 |
|2003-01-01 00:00:00.100005| |00005 |2003-01-01 00:00:00.100005|5 |
+--------------------------+--------+--------+--------------------------+------+
only showing top 5 rows
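The --map-column-java option takes a comma-separated list of COLUMN=Type pairs, which gets tedious when many TIMESTAMP and DATE columns are involved. A minimal sketch of building that value, assuming a hypothetical helper named `map_columns_to_string` and using two column names from the TAG002_AGENT example elsewhere in this thread:

```shell
#!/bin/sh
# Hypothetical helper: turns a list of column names into the value expected
# by --map-column-java, mapping every listed column to the Java String type.
map_columns_to_string() {
  result=""
  for col in "$@"; do
    # Append "COLUMN=String", comma-separated after the first entry.
    result="${result:+$result,}$col=String"
  done
  printf '%s\n' "$result"
}

# Column names are illustrative; list your own TIMESTAMP and DATE columns.
map_columns_to_string SDAGRESP SDINTERMED
```

The result can be passed straight to the import, for example `sqoop import ... --as-parquetfile --map-column-java "$(map_columns_to_string SDAGRESP SDINTERMED)"`, with the remaining options unchanged.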
10-22-2018 12:40 PM
@Rahul Soni I totally agree with you, especially when working with Hive tables. But the customer is working with Spark and requires Parquet files as input data for the Spark jobs. I already checked the reported issue and made the necessary changes to the access permissions on /tmp/parquet-0.log and /tmp/parquet-0.log.lock for my user (which is not hive in my case).
10-22-2018 12:40 PM
Actual STDOUT output:
[2018-03-23 09:07:48,433] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG sqoop.ConnFactory: Loaded manager factory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
[2018-03-23 09:07:48,445] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG sqoop.ConnFactory: Loaded manager factory: com.cloudera.sqoop.manager.DefaultManagerFactory
[2018-03-23 09:07:48,445] {bash_operator.py:101} INFO - 18/03/23 09:07:48 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
[2018-03-23 09:07:48,457] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO manager.SqlManager: Using default fetchSize of 1000
[2018-03-23 09:07:48,457] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO tool.CodeGenTool: Will generate java class as codegen_TAG002_AGENT
[2018-03-23 09:07:48,494] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Execute getColumnInfoRawQuery : SELECT t.* FROM TAG002_AGENT AS t WHERE 1=0
[2018-03-23 09:07:48,555] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: No connection paramenters specified. Using regular API for making connection.
[2018-03-23 09:07:48,920] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Using fetchSize for next query: 1000
[2018-03-23 09:07:48,920] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TAG002_AGENT AS t WHERE 1=0
[2018-03-23 09:07:48,959] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column SDINTERMED of type [93, 26, 6]
[2018-03-23 09:07:48,959] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column CSTATRED of type [1, 3, 0]
[2018-03-23 09:07:48,959] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column VIDAGENT of type [1, 5, 0]
[2018-03-23 09:07:48,959] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column SDAGRESP of type [93, 26, 6]
[2018-03-23 09:07:48,959] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column CONTAG of type [5, 5, 0]
[2018-03-23 09:07:48,960] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Using fetchSize for next query: 1000
[2018-03-23 09:07:48,960] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TAG002_AGENT AS t WHERE 1=0
[2018-03-23 09:07:48,967] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column SDINTERMED
[2018-03-23 09:07:48,967] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column CSTATRED
[2018-03-23 09:07:48,967] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column VIDAGENT
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column SDAGRESP
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG manager.SqlManager: Found column CONTAG
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: selected columns:
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: SDINTERMED
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: CSTATRED
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: VIDAGENT
[2018-03-23 09:07:48,968] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: SDAGRESP
[2018-03-23 09:07:48,969] {bash_operator.py:101} INFO - 18/03/23 09:07:48 DEBUG orm.ClassWriter: CONTAG
[2018-03-23 09:07:48,982] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO orm.ClassWriter: Overriding type of column SDINTERMED to Timestamp
[2018-03-23 09:07:48,985] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO orm.ClassWriter: Overriding type of column SDINTERMED to Timestamp
[2018-03-23 09:07:48,985] {bash_operator.py:101} INFO - 18/03/23 09:07:48 ERROR orm.ClassWriter: No ResultSet method for Java type Timestamp
[2018-03-23 09:07:48,985] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO orm.ClassWriter: Overriding type of column SDAGRESP to Timestamp
[2018-03-23 09:07:48,985] {bash_operator.py:101} INFO - 18/03/23 09:07:48 ERROR orm.ClassWriter: No ResultSet method for Java type Timestamp
[2018-03-23 09:07:48,986] {bash_operator.py:101} INFO - 18/03/23 09:07:48 INFO orm.ClassWriter: Overriding type of column SDINTERMED to Timestamp
[2018-03-23 09:07:48,986] {bash_operator.py:101} INFO - 18/03/23 09:07:48 ERROR tool.ImportTool: Imported Failed: No ResultSet method for Java type Timestamp
10-22-2018 12:40 PM
Sqoop picks up the DATETIME correctly, but when you import data from a database in Parquet format with Sqoop, you run into issues with the Parquet schema saved in Hive/HDFS.
The structure of the table is the following:
TAG002_AGENT
SDINTERMED (TIMESTAMP PK FK)
CSTATRED (CHAR(3))
VIDAGENT (TIMESTAMP Nullable FK)
CONTAG (SMALLINT)
I also added the options --verbose and --map-column-java SDINTERMED=Timestamp (and also tried SDINTERMED=java.sql.Timestamp).
This is the command that I launch:
sqoop import -Dmapreduce.job.user.classpath.first=true -Dhadoop.security.credential.provider.path=jceks://hdfs/user/airflow/credentials.jceks -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:db2://MVSTST1.lefoyer.lu:5047/DB2B:currentSchema=DEVB;" --username TAPBDB2B --password-alias db2.password -m 2 --as-parquetfile --outdir /tmp/java --driver com.ibm.db2.jcc.DB2Driver --target-dir /user/airflow/db2b/TAG002_AGENT --delete-target-dir --table TAG002_AGENT --map-column-java SDAGRESP=Timestamp,SDINTERMED=Timestamp --verbose
10-22-2018 12:40 PM
I have imported a table from DB2 into HDFS in Parquet format using Sqoop 1.4.6.2:
sqoop import -Dmapreduce.job.user.classpath.first=true -Dhadoop.security.credential.provider.path=jceks://hdfs/user/toto/creds.jceks -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:db2://myserver:myport/db:currentSchema=myschema;" --username <username> --password-alias <password> -m 2 --as-parquetfile --outdir /tmp/java --driver com.ibm.db2.jcc.DB2Driver --target-dir /user/toto/db/mytable --delete-target-dir --table mytable
The table is imported directly into HDFS. I then try to read it using Spark 2.1.1.2 and Scala 2.11.8 and change the type of the column that is a TIMESTAMP in DB2 but was imported by Sqoop as BIGINT. I have to modify the column type using Spark and then replace the old Parquet file with the new one:
spark-shell
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("/user/toto/db/mytable")
// withColumn returns a new DataFrame, so keep the result and show that one
val converted = df.withColumn("col_bigint", ($"col_bigint" / 1000).cast(TimestampType))
converted.show(5, false)
This throws a warning:
WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
org.apache.parquet.VersionParsers$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*))?\(build ?(.*)\)
With Sqoop, it is impossible to get the desired TIMESTAMP column type directly; the column is automatically written as BIGINT. So I want to read these Parquet files and transform them to get the right type.
I already tried adding every parquet-* JAR file to /usr/hdp/current/sqoop-cli/lib/:
parquet-column-1.8.1.jar
parquet-common-1.8.1.jar
parquet-encoding-1.8.1.jar
parquet-format-2.3.0-incubating.jar
parquet-generator-1.8.1.jar
parquet-hadoop-1.8.1.jar
parquet-hadoop-bundle-1.6.0.jar
parquet-jackson-1.8.1.jar
parquet-avro-1.6.0.jar
If I replace parquet-avro-1.6.0.jar with parquet-avro-1.8.1.jar, Sqoop cannot process the import because it can't find the method AvroWriter.
Initially, every JAR file in the Sqoop CLI library was at version 1.6.0, but I replaced them with the same versions as in my spark2 jars folder.
If anyone can find a way to make it work, I will be very grateful.
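The division by 1000 in the Spark snippet makes sense if the BIGINT holds epoch milliseconds, which is an assumption here rather than something the post states. The arithmetic can be sanity-checked outside Spark with a small sketch; the helper name `millis_to_utc` and the sample value are made up for illustration, and the `-d @SECONDS` form requires GNU date.

```shell
#!/bin/sh
# Hypothetical helper: converts epoch milliseconds to a UTC timestamp string.
# Assumes GNU date (the "-d @SECONDS" form is not portable to BSD/macOS date).
millis_to_utc() {
  seconds=$(( $1 / 1000 ))   # integer division drops the millisecond part
  date -u -d "@$seconds" '+%Y-%m-%d %H:%M:%S'
}

# Sample value chosen for illustration: midnight UTC on 2003-01-01.
millis_to_utc 1041379200000
```

If a value taken from the imported column converts to the timestamp stored in DB2, the cast in Spark is doing the right thing and only the fractional seconds need separate handling.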
Labels: Apache Spark, Apache Sqoop