Member since
06-04-2019
9
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
3791 | 01-09-2024 03:40 AM |
01-31-2024
06:16 AM
Hello ChethanYM, Could you please provide some link where I can find documentation about this conf (spark.sql.htl.check=false) ? I could not find anythin in https://spark.apache.org/doc Regards, Guilherme C P
... View more
01-09-2024
03:40 AM
As I was already using the Hadoop Credential Provider, I found a solution that does not require decrypting the password as follows: PySpark code: # Spark session
spark = SparkSession.builder \
.config("spark.yarn.keytab=/etc/security/keytabs/<APPLICATION_USER>.keytab") \
.appName('SPARK_TEST') \
.master("yarn") \
.getOrCreate()
credential_provider_path = 'jceks://hdfs/<PATH>/<CREDENTIAL_FILE>.jceks'
credential_name = 'PASSWORD.ALIAS'
# Hadoop credential
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set('hadoop.security.credential.provider.path',credential_provider_path)
credential_raw = conf.getPassword(credential_name)
for i in range(credential_raw.__len__()):
password = password + str(credential_raw.__getitem__(i)) The important point above is the .config() line in SparkSession. You must enter the keytab to access the password. Otherwise you will get the encrypted value. I can't say that I'm very happy with being able to directly manipulate the password value in the code. I would like to delegate this to some component in a way that the programmer does not have direct access to the password value. Maybe what I'm looking for is some kind of authentication provider, but for now the solution above works for me.
... View more
12-26-2023
03:31 AM
This way the password is provided to the connection is exposed in plain text?
... View more
12-22-2023
01:05 PM
Hello, Studying the documentation below I found a good example of how to use Spark with JDBC to connect to external databases. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html I already tested the example and it worked very well. However, when I looked carefully I saw that the password (its text) was being manipulated directly by the program. This way, from the tests I did, the password is open both in the Spark log and the connection, if it is intercepted on the network, the password will be open in open text. I'm looking for a way to make this connection safely. One mechanism that comes to mind is the hadoop credential provider that I already used with good old sqoop. So in sqoop the password was protected and no visible in any form, even if the connection was intercepted. Could anyone tell me what mechanisms I can use to make this connection securely, without exposing credentials? It can be a credential provider or something else. I didn't find anything like that in the Spark documentation.
... View more
Labels:
- Labels:
-
Apache Spark
06-09-2022
12:25 PM
Hello everyone! Context: data ingestion made with Sqoop in HDP Cluster. I would like to know if there is or what is the best way to manage connection strings used by sqoop to connect to databases. My case is as follows: I have to do many data ingestions from rdbms to the Datalake of my company. We have many. With certain frequence the hosts of these databases changes to a new host. So when it hapens I have to change the jdbc string connection in my .properties file (that is passed to Sqoop) and make a new deploy of it to keep my ingestion working. I would like to have these jdbc strings externalized to something in the responsability of our operation team. I would like to have theses strings as an alias or something else that I could only refer and have the operation team taking care of the right host/vip to point to when sqoop opens a connection. Maybe a a set of entrys in a configuration file could be done in Ambari to store theses jdbc strings?
... View more
Labels:
- Labels:
-
Apache Ambari
-
Apache Hadoop
-
Apache Sqoop
12-24-2021
06:33 AM
Hello everybody, Recently my DBA team called me informing that Sqoop was sending queries to the database without parameter binding. Looking in the Sqoop documentation I got confirmation of the use of Prepared Statement by Sqoop only in the export command, with --batch parameter. With import command there is not this option to use. Could you help me identify how I could (if possible) use Sqoop's parameters to make it use binding variables in import querys? Any thoughts will be appreciated. Thanks! PS: Above is the description of the --batch parameter. --batch Use batch mode for underlying statement execution.
... View more
Labels:
- Labels:
-
Apache Sqoop
06-15-2021
11:41 AM
Hi @Bender, Thanks for your replay. The Faq wasn't useful. It's aswer is: "With BDA V2.0 Sqoop automatically supports Oracle Database and MySQL. Hence connect strings beginning with jdbc:oracle or jdbc:mysql:// are handled with no additional setup." 😞
... View more
06-04-2021
04:42 PM
Hello everybody,
I am looking for ways to increase speed of data ingestion in my Company's Datalake with, specifically, use of Oracle Data Connector in Sqoop.
Our HortonWorks(Now Cloudera) environment uses Sqoop 1.4.6.
Following theses docs in sqoop site I asked our dba's Team a user with this features:
A) create and alter session privileges
B) Object privileges:
select on dba_segments — Required for Sqoop imports only
select on dba_constraints — Required for Sqoop imports only
select on v_$database — Required for Sqoop imports only
select on v_$parameter — Required for Sqoop imports only
Playing with this Sqoop feature I ran this command:
Sqoop import -Doraoop.jdbc.url.verbatim=true --direct --connect jdbc:oracle:thin:@myserver.myserver.myserver:1521/mydb --username myuser --password-alias ******** --table myschema.mytable --target-dir /mypath/mytable --verbose
And get this output:
>>> Invoking Sqoop command line now >>>
2021-06-04 18:05:04,323 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2021-06-04 18:05:04,365 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6.2.6.5.0-292
2021-06-04 18:05:04,388 [main] DEBUG org.apache.sqoop.tool.BaseSqoopTool - Enabled debug logging.
2021-06-04 18:05:04,390 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Reflecting credential provider classes and methods
2021-06-04 18:05:04,391 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found org.apache.hadoop.security.alias.CredentialProvider
2021-06-04 18:05:04,399 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found org.apache.hadoop.security.alias.CredentialProviderFactory
2021-06-04 18:05:04,400 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found CredentialProvider#createCredentialEntry
2021-06-04 18:05:04,400 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found CredentialProvider#flush
2021-06-04 18:05:04,401 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found Configuration#getPassword
2021-06-04 18:05:04,401 [main] DEBUG org.apache.sqoop.util.password.CredentialProviderHelper - Found CredentialProviderFactory#getProviders
2021-06-04 18:05:04,402 [main] DEBUGorg.apache.sqoop.util.password.CredentialProviderHelper - Resolving alias with credential provider path set to jceks://hdfs/mypath/my.jceks
2021-06-04 18:05:04,672 [main] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2021-06-04 18:05:04,685 [main] DEBUG org.apache.sqoop.ConnFactory - Loaded manager factory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
2021-06-04 18:05:04,709 [main] DEBUG org.apache.sqoop.ConnFactory - Loaded manager factory: com.cloudera.sqoop.manager.DefaultManagerFactory
2021-06-04 18:05:04,709 [main] DEBUG org.apache.sqoop.ConnFactory - Trying ManagerFactory: org.apache.sqoop.manager.oracle.OraOopManagerFactory
2021-06-04 18:05:04,752 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop can be called by Sqoop!
2021-06-04 18:05:04,771 [main] INFO org.apache.sqoop.manager.SqlManager - Using default fetchSize of 1000
2021-06-04 18:05:05,699 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopUtilities - The Oracle table context has been derived from:oracleConnectionUserName = myuser tableStr = myschema.mytable as: owner : myschema table : mytable
2021-06-04 18:05:06,333 [main] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory -
**************************************************
*** Using Data Connector for Oracle and Hadoop ***
**************************************************
2021-06-04 18:05:06,334 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopUtilities - Updated mapred.child.java.opts from "-Xmx200m" to "-Djava.security.egd=file:///dev/urandom -Xmx200m"
2021-06-04 18:05:06,353 [main] ERROR org.apache.sqoop.manager.oracle.OraOopManagerFactory - Unable to obtain the Oracle database version.
java.sql.SQLException: Result set after last row
at oracle.jdbc.driver.GeneratedScrollableResultSet.getInt(GeneratedScrollableResultSet.java:570)
[...]
2021-06-04 18:05:06,377 [main] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory - This Oracle database is a RAC.
2021-06-04 18:05:06,377 [main] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop will not use dynamically generated JDBC URLs - this feature has been disabled.
2021-06-04 18:05:06,377 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
2021-06-04 18:05:06,377 [main] DEBUG org.apache.sqoop.ConnFactory - Instantiated ConnManager org.apache.sqoop.manager.oracle.OraOopConnManager@20134094
2021-06-04 18:05:06,378 [main] INFO org.apache.sqoop.tool.CodeGenTool - Beginning code generation
2021-06-04 18:05:06,391 [main] DEBUG org.apache.sqoop.manager.oracle.OraOopOracleQueries - getTableColumns() : sql =
SELECT column_name, data_type FROM dba_tab_columns WHERE owner = ? and table_name = ? and (DATA_TYPE IN ('BINARY_DOUBLE','BINARY_FLOAT','BLOB','CHAR','CLOB','DATE','FLOAT','LONG','NCHAR','NCLOB','NUMBER','NVARCHAR2','RAW','ROWID','URITYPE','VARCHAR2') OR DATA_TYPE LIKE 'INTERVAL YEAR(%) TO MONTH' OR DATA_TYPE LIKE 'INTERVAL DAY(%) TO SECOND(%)' OR DATA_TYPE LIKE 'TIMESTAMP(%)' OR DATA_TYPE LIKE 'TIMESTAMP(%) WITH TIME ZONE' OR DATA_TYPE LIKE 'TIMESTAMP(%) WITH LOCAL TIME ZONE') ORDER BY column_id
2021-06-04 18:05:06,605 [main] ERROR org.apache.sqoop.Sqoop - Got exception running Sqoop: java.lang.RuntimeException: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
<<< Invocation of Sqoop command completed <<<
After some googling about the message "Unable to obtain the Oracle database version." I arrived on this link where I can see how OraOop identifys Oracle Database Version between lines 163 and 190 with this query:
SELECT v.banner, rtrim(v.version) full_version, rtrim(v.version_bit) version_bit, substr(v.version, 1, instr(v.version, '.', 1, 1) - 1) major, substr(v.version, instr(v.version, '.', 1, 1) + 1, instr(v.version, '.', 1, 2) - instr(v.version, '.', 1, 1) - 1) minor, substr(v.version, instr(v.version, '.', 1, 2) + 1, instr(v.version, '.', 1, 3) - instr(v.version, '.', 1, 2) - 1) version, substr(v.version, instr(v.version, '.', 1, 3) + 1, instr(v.version, '.', 1, 4) - instr(v.version, '.', 1, 3) - 1) patch, decode(instr(v.banner, '64bit'), 0, 'False', 'True') isdb64bit, decode(instr(b.banner, 'HPUX'), 0, 'False', 'True') ishpux FROM ( SELECT ROWNUM row_num, banner, substr(substr(banner, instr(banner, 'Release ') + 8), 1) version_bit, substr(substr(banner, instr(banner, 'Release ') + 8), 1, instr(substr(banner, instr(banner, 'Release ') + 8), ' ')) version FROM v$version WHERE banner LIKE 'Oracle%' OR banner LIKE 'Personal Oracle%' ) v, v$version b WHERE v.row_num = 1 and b.banner like 'TNS%' ;
When I run the query in Database it returned me no lines, but when I run the query without the bold part above i receve this line as aswer:
BANNER FULL_VERSION VERSION_BIT MAJOR MINOR VERSION PATCH ISDB6 ISHPU ------------------------------------------------------------------------- -------------- ------------------------- ------- ------- --------- ------ ----- ----- Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production 18.0.0.0.0 18.0.0.0.0 - Production 18 0 0 0 False False
So the OraOop code will only work when this query return at least one line, but I cannot modidy this class code for my needs.... after this long way now I m looking for a compatibility matrix between sqoop 1.4.6 versus Oracle Database versions, but couldn't find one.
Anyone can help me with it?
I'm in the right way to solve this problem?
Any thoughts will be appreciated.
Thanks!
... View more
Labels:
- Labels:
-
Apache Sqoop