Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to the sys database, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+----------------------+----------------------+--------------------+
| tables.table_catalog | tables.table_schema  | tables.table_name  |
+----------------------+----------------------+--------------------+
| default              | default              | students2          |
| default              | default              | t3                 |
+----------------------+----------------------+--------------------+
Now we will try to access the following table under the information_schema database from Spark:
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because the restrict_information_schema function is available in Hive but not registered in Spark. We can check the functions available in Spark using the following command:
spark.sql("show functions").show()
We can solve the above error by passing hive-exec.jar to spark-shell and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to determine up front how many such functions we would need to register.
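For example, the current_user error above can be cleared the same way. The class name below is the Hive UDF that backs current_user() in hive-exec (assuming current_user is not already a built-in in your Spark version; other information_schema views may depend on further UDFs, such as current_authorizer):

// Register the UDF backing current_user() from hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION current_user AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFCurrentUser'
""")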
To avoid registering functions one by one, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
  "url" -> "jdbc:hive2://localhost:10000/default;", // HiveServer2 JDBC URL
  "driver" -> "org.apache.hive.jdbc.HiveDriver",    // Hive JDBC driver class
  "dbtable" -> "information_schema.schemata",       // table to read
  "user" -> "hive_user",
  "password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
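Notice that in the output above every row simply echoes the column names. This is a known quirk when Spark reads through the Hive JDBC driver: Spark's generic JDBC dialect quotes column names with double quotes, which HiveServer2 interprets as string literals. A minimal sketch of a workaround is to register a Hive-aware dialect before the read (HiveDialect is our own name; canHandle and quoteIdentifier are the standard Spark JdbcDialect hooks):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks so HiveServer2 treats them as column
// references instead of string literals; split on dots so qualified names
// like schemata.schema_name stay valid.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String =
    colName.split('.').map(part => s"`$part`").mkString(".")
}
JdbcDialects.registerDialect(HiveDialect)

After registering the dialect, re-running the same spark.read.format("jdbc") load should return the actual schema rows.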