Member since
06-02-2020
331
Posts
64
Kudos Received
49
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1092 | 07-11-2024 01:55 AM
 | 3102 | 07-09-2024 11:18 PM
 | 2666 | 07-09-2024 04:26 AM
 | 2026 | 07-09-2024 03:38 AM
 | 2324 | 06-05-2024 02:03 AM
07-06-2021
05:37 AM
@roshanbi Have you resolved your issue? If so, would you mind sharing the solution and marking this thread as solved? If you are still experiencing the issue, could you provide the information @RangaReddy has requested?
06-28-2021
02:59 AM
Hi @KhASQ For watermarking, use any framework/DB to persist the values once the job completes successfully. If you are using Kafka, you can store the watermark (for example, committed offsets) in Kafka itself. Otherwise, choose any RDBMS or an HBase table.
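For illustration, a minimal sketch (assuming a MySQL watermark table; the URL, credentials, and table/column names are hypothetical placeholders) of persisting a watermark after a successful run:

import java.sql.DriverManager

// Minimal sketch, assuming a MySQL table created as:
//   CREATE TABLE watermarks (job_name VARCHAR(64) PRIMARY KEY, last_offset BIGINT);
def saveWatermark(jobName: String, offset: Long): Unit = {
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "password")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO watermarks (job_name, last_offset) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE last_offset = VALUES(last_offset)")
    stmt.setString(1, jobName)
    stmt.setLong(2, offset)
    stmt.executeUpdate()
    stmt.close()
  } finally {
    conn.close()
  }
}

// Record the watermark only after the job has completed successfully.
saveWatermark("my_ingest_job", 42L)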
06-27-2021
09:04 AM
Hi @roshanbi Please find the difference:

val textFileDF : Dataset[String] = spark.read.textFile("/path")      // returns a Dataset object
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns an RDD object

If you are satisfied, please Accept as Solution.
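For illustration, a short usage sketch (the path is hypothetical), assuming a spark-shell session:

val ds  = spark.read.textFile("/tmp/input.txt")          // Dataset[String]
val rdd = spark.sparkContext.textFile("/tmp/input.txt")  // RDD[String]

ds.filter(_.nonEmpty).count()   // goes through the Catalyst optimizer/Tungsten
rdd.filter(_.nonEmpty).count()  // plain RDD lineage, no query optimization
val backToRdd = ds.rdd          // a Dataset can always be dropped down to an RDD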
06-24-2021
08:23 AM
Hi @roshanbi If you are satisfied with my answer, please Accept as Solution.
06-22-2021
02:19 AM
Hi @magicchu Please raise a Hive escalation; our Hive team will work on this issue.
06-04-2021
04:30 AM
Hi @jamesstarks Could you please enable Spark authentication using the following parameter:
--conf spark.authenticate=true
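For reference, a minimal sketch of setting the same property programmatically (assuming a YARN deployment, where the shared secret is generated and distributed automatically):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: enable authentication for Spark's internal connections.
val conf = new SparkConf().set("spark.authenticate", "true")
val spark = SparkSession.builder().config(conf).getOrCreate()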
05-18-2021
08:22 PM
2 Kudos
Introduction
In this article, we will learn how to query Hive table data in Spark using regular expressions in place of column names.
Assume we have a table with columns named col1, col2, col3, col4, col5, etc. To select the data, we would normally write select col1, col2, col3, col4, col5 from the table. Instead of listing each column, we can use a regular expression in the column position, such as select `col.*` from table.
In Hive, this feature is available from Hive 0.13.0 onwards. It is disabled by default; to enable it, run set hive.support.quoted.identifiers=none.
In Spark, this feature is available from Spark 2.3.0 onwards. It is disabled by default; to enable it, run set spark.sql.parser.quotedRegexColumnNames=true.
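As a quick sketch before the detailed steps (the table name here is hypothetical), the Spark setting can also be applied at runtime from a SparkSession:

// Runtime SQL configuration; no restart needed.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
// `col.*` is now interpreted as a regular expression over column names.
spark.sql("select `col.*` from some_db.some_table").show()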
Steps
Hive
Create the Hive table and insert the data:
beeline -u jdbc:hive2://host:port
> create database if not exists regex_test;
> create external table if not exists regex_test.regex_test_tbl (col1 int, col2 string, col3 float, `timestamp` timestamp) stored as PARQUET;
> insert into regex_test.regex_test_tbl values (1, 'Ranga', 23000, cast('1988-01-06 00:59:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (2, 'Nishanth', 38000, cast('2018-05-29 17:32:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (3, 'Raja', 18000, cast('2067-05-29 17:32:59.345' as timestamp));
Select the regex_test.regex_test_tbl table data:
> SET hive.cli.print.header=true;
> select * from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+---------------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 | regex_test_tbl.timestamp |
+----------------------+----------------------+----------------------+---------------------------+
| 1 | Ranga | 23000.0 | 1988-01-06 00:59:59.345 |
| 2 | Nishanth | 38000.0 | 2018-05-29 17:32:59.345 |
| 3 | Raja | 18000.0 | 2067-05-29 17:32:59.345 |
+----------------------+----------------------+----------------------+---------------------------+
Without setting hive.support.quoted.identifiers=none, try to run the query using regular expressions. The following error is noticed:
> select `col.*` from regex_test.regex_test_tbl;
FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'col.*': (possible column names are: col1, col2, col3, timestamp)
Now set hive.support.quoted.identifiers=none and execute the above query:
> set hive.support.quoted.identifiers=none;
> select `col.*` from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 |
+----------------------+----------------------+----------------------+
| 1 | Ranga | 23000.0 |
| 2 | Nishanth | 38000.0 |
| 3 | Raja | 18000.0 |
+----------------------+----------------------+----------------------+
We can also select a specific column name while querying the data:
> select `timestamp` from regex_test.regex_test_tbl;
+--------------------------+
| timestamp |
+--------------------------+
| 1988-01-06 00:59:59.345 |
| 2018-05-29 17:32:59.345 |
| 2067-05-29 17:32:59.345 |
+--------------------------+
Spark
Launch the spark-shell and execute the following query to select the regex_test.regex_test_tbl table data:
# spark-shell
scala> spark.sql("select * from regex_test.regex_test_tbl").show(truncate=false)
+----+--------+-------+-----------------------+
|col1|col2 |col3 |timestamp |
+----+--------+-------+-----------------------+
|2 |Nishanth|38000.0|2018-05-29 17:32:59.345|
|1 |Ranga |23000.0|1988-01-06 00:59:59.345|
|3 |Raja |18000.0|2067-05-29 17:32:59.345|
+----+--------+-------+-----------------------+
Without setting spark.sql.parser.quotedRegexColumnNames=true, try to run the query using regular expressions. We will get the following error:
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: cannot resolve '`col.*`' given input columns: [spark_catalog.regex_test.regex_test_tbl.col1, spark_catalog.regex_test.regex_test_tbl.col2, spark_catalog.regex_test.regex_test_tbl.col3, spark_catalog.regex_test.regex_test_tbl.timestamp]; line 1 pos 7;
'Project ['`col.*`]
+- SubqueryAlias spark_catalog.regex_test.regex_test_tbl
+- Relation[col1#203,col2#204,col3#205,timestamp#206] parquet
Now set spark.sql.parser.quotedRegexColumnNames=true and execute the above query:
scala> spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show(false)
+---------------------------------------+-----+
|key |value|
+---------------------------------------+-----+
|spark.sql.parser.quotedRegexColumnNames|true |
+---------------------------------------+-----+
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
+----+--------+-------+
|col1|col2 |col3 |
+----+--------+-------+
|2 |Nishanth|38000.0|
|1 |Ranga |23000.0|
|3 |Raja |18000.0|
+----+--------+-------+
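Since the quoted identifier is treated as an ordinary Java regular expression, narrower patterns should also work; a small sketch (this example and its output are not from the original article):

// Assumes spark.sql.parser.quotedRegexColumnNames=true from the previous step.
// The character class col[13] should match only col1 and col3.
spark.sql("select `col[13]` from regex_test.regex_test_tbl").show(false)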
We can also select a specific column name while querying the data:
scala> spark.sql("select `timestamp` from regex_test.regex_test_tbl").show(false)
+-----------------------+
|timestamp |
+-----------------------+
|2018-05-29 17:32:59.345|
|1988-01-06 00:59:59.345|
|2067-05-29 17:32:59.345|
+-----------------------+
Note: Currently in Spark, there is a limitation when aliasing a column selected with a quoted identifier: because the quoted name is treated as a regular expression, the parser expands it like a star, and star expansion cannot be aliased.
scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
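A possible workaround sketch (an assumption, not from the original article), assuming a spark-shell session where spark.implicits._ is in scope:

// Unquoted identifiers are not treated as regular expressions, so a plain
// alias still works even with the regex option enabled.
spark.sql("select col1 as col from regex_test.regex_test_tbl").show(false)

// Alternatively, alias through the DataFrame API, bypassing the SQL parser.
spark.table("regex_test.regex_test_tbl").select($"col1".as("col")).show(false)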
References:
REGEX Column Specification
SPARK-12139
Thanks for visiting this article, and happy learning!
05-10-2021
11:15 PM
4 Kudos
In this article, we will learn how to integrate Spark with Hive JdbcStorageHandler in CDP.
JdbcStorageHandler
By using JdbcStorageHandler, we can connect Apache Hive to JDBC data sources (MySQL, PostgreSQL, Oracle, DB2, Derby, etc.). Currently, writing to a JDBC data source is not supported. To use it, we need to create an external table stored by JdbcStorageHandler.
Example:
CREATE EXTERNAL TABLE employee
(
id BIGINT,
name STRING,
age INT,
salary DOUBLE
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/test",
"hive.sql.dbcp.username" = "username",
"hive.sql.dbcp.password" = "password",
"hive.sql.table" = "EMPLOYEE",
"hive.sql.dbcp.maxActive" = "1"
);
Supported Data Types:
The column data types for a Hive JdbcStorageHandler table can be:
Numeric data types: byte, short, int, long, float, double
Decimal with scale and precision
String data types: string, char, varchar
Date
Timestamp
Note: Complex data types (struct, map, array) are not supported.
The following are the steps to integrate with a MySQL database:
Create the employee MySQL table:
mysql> create database test;
Query OK, 1 row affected (0.00 sec)
mysql> use test;
Database changed
mysql> drop table if EXISTS test.EMPLOYEE;
mysql> CREATE TABLE test.EMPLOYEE(
id INT,
name varchar(255),
salary DECIMAL,
dob DATE NOT NULL DEFAULT '2021-05-01',
doj TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
PRIMARY KEY (id)
);
Query OK, 0 rows affected (0.01 sec)
Insert the data into the employee table:
mysql> INSERT INTO test.EMPLOYEE (id, name, salary, dob, doj) VALUES (1, "Ranga", 10000.00, '1988-06-01', '2020-03-16 09:00:01.000000');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO test.EMPLOYEE (id, name, salary,dob) VALUES (2, "Nishanth", 50000.00, '2018-05-29');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO test.EMPLOYEE (id, name, salary) VALUES (3, "Raja", 30000.00);
Query OK, 1 row affected (0.01 sec)
Check the data in the MySQL employee table:
mysql> select * from test.EMPLOYEE;
+----+----------+--------+------------+---------------------+
| id | name | salary | dob | doj |
+----+----------+--------+------------+---------------------+
| 1 | Ranga | 10000 | 1988-06-01 | 2020-03-16 09:00:01 |
| 2 | Nishanth | 50000 | 2018-05-29 | 2021-05-01 09:02:33 |
| 3 | Raja | 30000 | 2021-05-01 | 2021-05-01 09:02:45 |
+----+----------+--------+------------+---------------------+
3 rows in set (0.00 sec)
Log in to Beeline:
# beeline -u jdbc:hive2://<hiveserver2_host>:<port>/<db>
beeline -u jdbc:hive2://localhost:10000/default
Create an employee table in Hive using JdbcStorageHandler:
> create database if not exists db_test;
INFO : OK
> use db_test;
INFO : OK
> DROP TABLE IF EXISTS db_test.employee;
> CREATE EXTERNAL TABLE db_test.employee(
id INT,
name STRING,
salary DOUBLE,
dob DATE,
doj TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/test",
"hive.sql.dbcp.username" = "ranga",
"hive.sql.dbcp.password" = "ranga",
"hive.sql.query" = "select * from test.EMPLOYEE",
"hive.sql.dbcp.maxActive" = "1"
);
> show tables;
INFO : OK
+-----------+
| tab_name |
+-----------+
| employee |
+-----------+
Check the employee data in the Hive table:
> select * from db_test.employee;
INFO : OK
+--------------+----------------+------------------+---------------+------------------------+
| employee.id | employee.name | employee.salary | employee.dob | employee.doj |
+--------------+----------------+------------------+---------------+------------------------+
| 1 | Ranga | 10000.0 | 1988-06-01 | 2020-03-16 09:00:01.0 |
| 2 | Nishanth | 50000.0 | 2018-05-29 | 2021-05-01 09:02:33.0 |
| 3 | Raja | 30000.0 | 2021-05-01 | 2021-05-01 09:02:45.0 |
+--------------+----------------+------------------+---------------+------------------------+
3 rows selected (0.349 seconds)
> !exit
Launch the spark-shell in CDP.
Note: Copy the mysql-connector-java.jar to any location and provide it to spark-shell. In my case, I have copied mysql-connector-java.jar to the /usr/share/java/ location:
spark-shell --master yarn --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc-handler.jar,/usr/share/java/mysql-connector-java.jar
Select the employee table data using Spark SQL:
scala> spark.sql("select * from db_test.employee").show(truncate=false)
Hive Session ID = 4e572b40-76c2-4991-b41e-ee7830c9bff7
+---+--------+-------+----------+-------------------+
|id |name |salary |dob |doj |
+---+--------+-------+----------+-------------------+
|1 |Ranga |10000.0|1988-06-01|2020-03-16 09:00:01|
|2 |Nishanth|50000.0|2018-05-29|2021-05-01 09:02:33|
|3 |Raja |30000.0|2021-05-01|2021-05-01 09:02:45|
+---+--------+-------+----------+-------------------+
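Once the table resolves in Spark, ordinary Spark SQL queries should work against it as well; a brief illustrative sketch (these queries and their output are not from the original article):

// Filter and aggregate over the JDBC-backed Hive table.
spark.sql("select name, salary from db_test.employee where salary > 20000.0").show(false)
spark.sql("select count(*) as cnt from db_test.employee").show(false)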
References:
JDBC Storage Handler
Hive query SQL using jdbcstoragehandler
Thank You!
05-04-2021
02:18 AM
Hi @Sugumar As we discussed in the https://community.cloudera.com/t5/Support-Questions/Java-Spark-driver-and-executor-logs-in-cluster-mode/m-p/315859#M226599 post, if your cluster is DSE, please check with the DSE team. If it is YARN, you can use the following command:
yarn logs -applicationId <Application_ID> > application_id.log
05-04-2021
01:48 AM
Hi @Sugumar I don't have much experience with DSE clusters. Please check the following link; it may help: https://docs.datastax.com/en/dse/6.7/dse-admin/datastax_enterprise/spark/sparkLogging.html