Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to the sys database, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+----------------------+----------------------+--------------------+
| tables.table_catalog | tables.table_schema  | tables.table_name  |
+----------------------+----------------------+--------------------+
| default              | default              | students2          |
| default              | default              | t3                 |
+----------------------+----------------------+--------------------+
Now we will try to access the following table under the information_schema database from Spark:
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because the restrict_information_schema function is available in Hive but not registered in Spark. We can check the functions available in Spark using the following command:
spark.sql("show functions").show()
We can solve the above error by passing hive-exec.jar to spark-shell and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to determine up front how many such functions we would need to register.
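For example, the current_user error above can be cleared the same way. The class name below is the Hive UDF that backs current_user() in hive-exec (assuming current_user is not already a built-in in your Spark version; other information_schema views may depend on further UDFs, such as current_authorizer):

// Register the UDF backing current_user() from hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION current_user AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFCurrentUser'
""")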
To avoid registering functions one by one, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
  "url" -> "jdbc:hive2://localhost:10000/default;", // HiveServer2 JDBC URL
  "driver" -> "org.apache.hive.jdbc.HiveDriver",    // Hive JDBC driver class
  "dbtable" -> "information_schema.schemata",       // table to read
  "user" -> "hive_user",
  "password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
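Notice that in the output above every row simply echoes the column names. This is a known quirk when Spark reads through the Hive JDBC driver: Spark's generic JDBC dialect quotes column names with double quotes, which HiveServer2 interprets as string literals. A minimal sketch of a workaround is to register a Hive-aware dialect before the read (HiveDialect is our own name; canHandle and quoteIdentifier are the standard Spark JdbcDialect hooks):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks so HiveServer2 treats them as column
// references instead of string literals; split on dots so qualified names
// like schemata.schema_name stay valid.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String =
    colName.split('.').map(part => s"`$part`").mkString(".")
}
JdbcDialects.registerDialect(HiveDialect)

After registering the dialect, re-running the same spark.read.format("jdbc") load should return the actual schema rows.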