Member since
06-02-2020
331
Posts
64
Kudos Received
49
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1092 | 07-11-2024 01:55 AM
 | 3102 | 07-09-2024 11:18 PM
 | 2666 | 07-09-2024 04:26 AM
 | 2026 | 07-09-2024 03:38 AM
 | 2324 | 06-05-2024 02:03 AM
07-06-2021
05:37 AM
@roshanbi Have you resolved your issue? If so, would you mind sharing the solution and marking this thread as solved? If you are still experiencing the issue, could you provide the information @RangaReddy has requested?
06-28-2021
02:59 AM
Hi @KhASQ For watermarking, use any framework/DB to persist the values once the job completes successfully. If you are using Kafka, you can store the watermark (for example, committed offsets) in Kafka itself. Otherwise, choose any RDBMS or an HBase table.
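For illustration, a minimal sketch (assuming a MySQL watermark table; the URL, credentials, and table/column names are hypothetical placeholders) of persisting a watermark after a successful run:

import java.sql.DriverManager

// Minimal sketch, assuming a MySQL table created as:
//   CREATE TABLE watermarks (job_name VARCHAR(64) PRIMARY KEY, last_offset BIGINT);
def saveWatermark(jobName: String, offset: Long): Unit = {
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "password")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO watermarks (job_name, last_offset) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE last_offset = VALUES(last_offset)")
    stmt.setString(1, jobName)
    stmt.setLong(2, offset)
    stmt.executeUpdate()
    stmt.close()
  } finally {
    conn.close()
  }
}

// Record the watermark only after the job has completed successfully.
saveWatermark("my_ingest_job", 42L)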
06-27-2021
09:04 AM
Hi @roshanbi Please find the difference:

val textFileDF : Dataset[String] = spark.read.textFile("/path")      // returns a Dataset object
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns an RDD object

If you are satisfied, please Accept as Solution.
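For illustration, a short usage sketch (the path is hypothetical), assuming a spark-shell session:

val ds  = spark.read.textFile("/tmp/input.txt")          // Dataset[String]
val rdd = spark.sparkContext.textFile("/tmp/input.txt")  // RDD[String]

ds.filter(_.nonEmpty).count()   // goes through the Catalyst optimizer/Tungsten
rdd.filter(_.nonEmpty).count()  // plain RDD lineage, no query optimization
val backToRdd = ds.rdd          // a Dataset can always be dropped down to an RDD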
06-24-2021
08:23 AM
Hi @roshanbi If you are satisfied with my answer, please Accept as Solution.
06-22-2021
02:19 AM
Hi @magicchu Please raise a Hive escalation; our Hive team will work on this issue.
06-04-2021
04:30 AM
Hi @jamesstarks Could you please enable Spark authentication using the following parameter:
--conf spark.authenticate=true
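For reference, a minimal sketch of setting the same property programmatically (assuming a YARN deployment, where the shared secret is generated and distributed automatically):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: enable authentication for Spark's internal connections.
val conf = new SparkConf().set("spark.authenticate", "true")
val spark = SparkSession.builder().config(conf).getOrCreate()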
05-18-2021
08:22 PM
2 Kudos
Introduction
In this article, we will learn how to query Hive table data in Spark using regular expressions in place of column names.
Assume we have a table with columns named col1, col2, col3, col4, col5, etc. To select the data, we would normally write select col1, col2, col3, col4, col5 from the table. Instead of listing each column, we can use a regular expression in the column position, such as select `col.*` from table.
In Hive, this feature is available from Hive 0.13.0 onwards. It is disabled by default; to enable it, run set hive.support.quoted.identifiers=none.
In Spark, this feature is available from Spark 2.3.0 onwards. It is disabled by default; to enable it, run set spark.sql.parser.quotedRegexColumnNames=true.
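As a quick sketch before the detailed steps (the table name here is hypothetical), the Spark setting can also be applied at runtime from a SparkSession:

// Runtime SQL configuration; no restart needed.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
// `col.*` is now interpreted as a regular expression over column names.
spark.sql("select `col.*` from some_db.some_table").show()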
Steps
Hive
Create the Hive table and insert the data:
beeline -u jdbc:hive2://host:port
> create database if not exists regex_test;
> create external table if not exists regex_test.regex_test_tbl (col1 int, col2 string, col3 float, `timestamp` timestamp) stored as PARQUET;
> insert into regex_test.regex_test_tbl values (1, 'Ranga', 23000, cast('1988-01-06 00:59:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (2, 'Nishanth', 38000, cast('2018-05-29 17:32:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (3, 'Raja', 18000, cast('2067-05-29 17:32:59.345' as timestamp));
Select the regex_test.regex_test_tbl table data:
> SET hive.cli.print.header=true;
> select * from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+---------------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 | regex_test_tbl.timestamp |
+----------------------+----------------------+----------------------+---------------------------+
| 1 | Ranga | 23000.0 | 1988-01-06 00:59:59.345 |
| 2 | Nishanth | 38000.0 | 2018-05-29 17:32:59.345 |
| 3 | Raja | 18000.0 | 2067-05-29 17:32:59.345 |
+----------------------+----------------------+----------------------+---------------------------+
Without setting hive.support.quoted.identifiers=none, try to run the query using regular expressions. The following error is noticed:
> select `col.*` from regex_test.regex_test_tbl;
FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'col.*': (possible column names are: col1, col2, col3, timestamp)
Now set hive.support.quoted.identifiers=none and execute the above query:
> set hive.support.quoted.identifiers=none;
> select `col.*` from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 |
+----------------------+----------------------+----------------------+
| 1 | Ranga | 23000.0 |
| 2 | Nishanth | 38000.0 |
| 3 | Raja | 18000.0 |
+----------------------+----------------------+----------------------+
We can also select a specific column name while querying the data:
> select `timestamp` from regex_test.regex_test_tbl;
+--------------------------+
| timestamp |
+--------------------------+
| 1988-01-06 00:59:59.345 |
| 2018-05-29 17:32:59.345 |
| 2067-05-29 17:32:59.345 |
+--------------------------+
Spark
Launch the spark-shell and execute the following query to select the regex_test.regex_test_tbl table data:
# spark-shell
scala> spark.sql("select * from regex_test.regex_test_tbl").show(truncate=false)
+----+--------+-------+-----------------------+
|col1|col2 |col3 |timestamp |
+----+--------+-------+-----------------------+
|2 |Nishanth|38000.0|2018-05-29 17:32:59.345|
|1 |Ranga |23000.0|1988-01-06 00:59:59.345|
|3 |Raja |18000.0|2067-05-29 17:32:59.345|
+----+--------+-------+-----------------------+
Without setting spark.sql.parser.quotedRegexColumnNames=true, try to run the query using regular expressions. We will get the following error:
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: cannot resolve '`col.*`' given input columns: [spark_catalog.regex_test.regex_test_tbl.col1, spark_catalog.regex_test.regex_test_tbl.col2, spark_catalog.regex_test.regex_test_tbl.col3, spark_catalog.regex_test.regex_test_tbl.timestamp]; line 1 pos 7;
'Project ['`col.*`]
+- SubqueryAlias spark_catalog.regex_test.regex_test_tbl
+- Relation[col1#203,col2#204,col3#205,timestamp#206] parquet
Now set spark.sql.parser.quotedRegexColumnNames=true and execute the above query:
scala> spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show(false)
+---------------------------------------+-----+
|key |value|
+---------------------------------------+-----+
|spark.sql.parser.quotedRegexColumnNames|true |
+---------------------------------------+-----+
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
+----+--------+-------+
|col1|col2 |col3 |
+----+--------+-------+
|2 |Nishanth|38000.0|
|1 |Ranga |23000.0|
|3 |Raja |18000.0|
+----+--------+-------+
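Since the quoted identifier is treated as an ordinary Java regular expression, narrower patterns should also work; a small sketch (this example and its output are not from the original article):

// Assumes spark.sql.parser.quotedRegexColumnNames=true from the previous step.
// The character class col[13] should match only col1 and col3.
spark.sql("select `col[13]` from regex_test.regex_test_tbl").show(false)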
We can also select a specific column name while querying the data:
scala> spark.sql("select `timestamp` from regex_test.regex_test_tbl").show(false)
+-----------------------+
|timestamp |
+-----------------------+
|2018-05-29 17:32:59.345|
|1988-01-06 00:59:59.345|
|2067-05-29 17:32:59.345|
+-----------------------+
Note: Currently in Spark, there is a limitation when aliasing a column selected with a quoted identifier: because the quoted name is treated as a regular expression, the parser expands it like a star, and star expansion cannot be aliased.
scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
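A possible workaround sketch (an assumption, not from the original article), assuming a spark-shell session where spark.implicits._ is in scope:

// Unquoted identifiers are not treated as regular expressions, so a plain
// alias still works even with the regex option enabled.
spark.sql("select col1 as col from regex_test.regex_test_tbl").show(false)

// Alternatively, alias through the DataFrame API, bypassing the SQL parser.
spark.table("regex_test.regex_test_tbl").select($"col1".as("col")).show(false)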
References:
REGEX Column Specification
SPARK-12139
Thanks for visiting this article, and happy learning!
05-10-2021
11:15 PM
4 Kudos
In this article, we will learn how to integrate Spark with Hive JdbcStorageHandler in CDP.
JdbcStorageHandler
By using JdbcStorageHandler, we can connect Apache Hive to JDBC data sources (MySQL, PostgreSQL, Oracle, DB2, Derby, etc.). Currently, writing to a JDBC data source is not supported. To use it, we need to create an external table stored by JdbcStorageHandler.
Example:
CREATE EXTERNAL TABLE employee
(
id BIGINT,
name STRING,
age INT,
salary DOUBLE
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/test",
"hive.sql.dbcp.username" = "username",
"hive.sql.dbcp.password" = "password",
"hive.sql.table" = "EMPLOYEE",
"hive.sql.dbcp.maxActive" = "1"
);
Supported Data Types:
The column data types for a Hive JdbcStorageHandler table can be:
Numeric data types: byte, short, int, long, float, double
Decimal with scale and precision
String data types: string, char, varchar
Date
Timestamp
Note: Complex data types (struct, map, array) are not supported.
The following are the steps to integrate with a MySQL database:
Create the employee MySQL table:
mysql> create database test;
Query OK, 1 row affected (0.00 sec)
mysql> use test;
Database changed
mysql> drop table if EXISTS test.EMPLOYEE;
mysql> CREATE TABLE test.EMPLOYEE(
id INT,
name varchar(255),
salary DECIMAL,
dob DATE NOT NULL DEFAULT '2021-05-01',
doj TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
PRIMARY KEY (id)
);
Query OK, 0 rows affected (0.01 sec)
Insert the data into the employee table:
mysql> INSERT INTO test.EMPLOYEE (id, name, salary, dob, doj) VALUES (1, "Ranga", 10000.00, '1988-06-01', '2020-03-16 09:00:01.000000');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO test.EMPLOYEE (id, name, salary,dob) VALUES (2, "Nishanth", 50000.00, '2018-05-29');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO test.EMPLOYEE (id, name, salary) VALUES (3, "Raja", 30000.00);
Query OK, 1 row affected (0.01 sec)
Check the data in the MySQL employee table:
mysql> select * from test.EMPLOYEE;
+----+----------+--------+------------+---------------------+
| id | name | salary | dob | doj |
+----+----------+--------+------------+---------------------+
| 1 | Ranga | 10000 | 1988-06-01 | 2020-03-16 09:00:01 |
| 2 | Nishanth | 50000 | 2018-05-29 | 2021-05-01 09:02:33 |
| 3 | Raja | 30000 | 2021-05-01 | 2021-05-01 09:02:45 |
+----+----------+--------+------------+---------------------+
3 rows in set (0.00 sec)
Log in to Beeline:
# beeline -u jdbc:hive2://<hiveserver2_host>:<port>/<db>
beeline -u jdbc:hive2://localhost:10000/default
Create an employee table in Hive using JdbcStorageHandler:
> create database if not exists db_test;
INFO : OK
> use db_test;
INFO : OK
> DROP TABLE IF EXISTS db_test.employee;
> CREATE EXTERNAL TABLE db_test.employee(
id INT,
name STRING,
salary DOUBLE,
dob DATE,
doj TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/test",
"hive.sql.dbcp.username" = "ranga",
"hive.sql.dbcp.password" = "ranga",
"hive.sql.query" = "select * from test.EMPLOYEE",
"hive.sql.dbcp.maxActive" = "1"
);
> show tables;
INFO : OK
+-----------+
| tab_name |
+-----------+
| employee |
+-----------+
Check the employee data in the Hive table:
> select * from db_test.employee;
INFO : OK
+--------------+----------------+------------------+---------------+------------------------+
| employee.id | employee.name | employee.salary | employee.dob | employee.doj |
+--------------+----------------+------------------+---------------+------------------------+
| 1 | Ranga | 10000.0 | 1988-06-01 | 2020-03-16 09:00:01.0 |
| 2 | Nishanth | 50000.0 | 2018-05-29 | 2021-05-01 09:02:33.0 |
| 3 | Raja | 30000.0 | 2021-05-01 | 2021-05-01 09:02:45.0 |
+--------------+----------------+------------------+---------------+------------------------+
3 rows selected (0.349 seconds)
> !exit
Launch the spark-shell in CDP.
Note: Copy the mysql-connector-java.jar to any location and provide it to spark-shell. In my case, I have copied mysql-connector-java.jar to the /usr/share/java/ location:
spark-shell --master yarn --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc-handler.jar,/usr/share/java/mysql-connector-java.jar
Select the employee table data using Spark SQL:
scala> spark.sql("select * from db_test.employee").show(truncate=false)
Hive Session ID = 4e572b40-76c2-4991-b41e-ee7830c9bff7
+---+--------+-------+----------+-------------------+
|id |name |salary |dob |doj |
+---+--------+-------+----------+-------------------+
|1 |Ranga |10000.0|1988-06-01|2020-03-16 09:00:01|
|2 |Nishanth|50000.0|2018-05-29|2021-05-01 09:02:33|
|3 |Raja |30000.0|2021-05-01|2021-05-01 09:02:45|
+---+--------+-------+----------+-------------------+
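Once the table resolves in Spark, ordinary Spark SQL queries should work against it as well; a brief illustrative sketch (these queries and their output are not from the original article):

// Filter and aggregate over the JDBC-backed Hive table.
spark.sql("select name, salary from db_test.employee where salary > 20000.0").show(false)
spark.sql("select count(*) as cnt from db_test.employee").show(false)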
References:
JDBC Storage Handler
Hive query SQL using jdbcstoragehandler
Thank You!
05-04-2021
02:18 AM
Hi @Sugumar As we discussed in the https://community.cloudera.com/t5/Support-Questions/Java-Spark-driver-and-executor-logs-in-cluster-mode/m-p/315859#M226599 post, if your cluster is DSE, please check with the DSE team. If it is YARN, you can use the following command:
yarn logs -applicationId <Application_ID> > application_id.log
05-04-2021
01:48 AM
Hi @Sugumar I don't have much experience with DSE clusters. Please check the following link; it may help: https://docs.datastax.com/en/dse/6.7/dse-admin/datastax_enterprise/spark/sparkLogging.html