Member since
06-02-2020
331
Posts
67
Kudos Received
49
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2798 | 07-11-2024 01:55 AM |
| | 7859 | 07-09-2024 11:18 PM |
| | 6569 | 07-09-2024 04:26 AM |
| | 5903 | 07-09-2024 03:38 AM |
| | 5598 | 06-05-2024 02:03 AM |
10-14-2022
06:10 AM
Let's assume we have a Hive table named test, stored under the /tmp directory. The data in the test table is laid out as follows:

hdfs dfs -ls -R /tmp/test
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0

Data like this is generally produced by UNION ALL operations in Hive. If we try to load the Hive table data using Spark, we get the following exception:

scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....

By default, Spark will not read the table data if it contains subdirectories. To solve this issue, we need to set the following parameter:

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

We can get the same kind of exception while reading the data in Hive. To solve the issue in Hive, we need to set the following two parameters:

hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;

These two parameters can also be set in hive-site.xml.
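For completeness, this is how the Spark-side setting looks in spark-shell, applied right before querying the example table (a minimal sketch based on the example above):

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sql("SELECT * FROM test").show()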
10-14-2022
04:14 AM
Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+--------------------+-------------------+-----------------
|tables.table_catalog|tables.table_schema|tables.table_name
+--------------------+-------------------+-----------------
|default |default |students2
|default |default |t3
Now we will try to access the following table under the information_schema database.
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because the restrict_information_schema function is not available in Spark, while it is available in Hive. We can check the available functions using the following command:
spark.sql("show functions").show()
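If your Spark version supports it, the list can also be narrowed with a pattern instead of scanning the full output (a small sketch; the pattern is just an example):

spark.sql("SHOW FUNCTIONS LIKE 'restrict*'").show(false)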
We can get past this error by passing hive-exec.jar to spark-shell and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to find out in advance how many such functions we need to register.
To avoid registering functions, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
"url" -> "jdbc:hive2://localhost:10000/default;",
"driver" -> "org.apache.hive.jdbc.HiveDriver",
"dbtable" -> "information_schema.schemata",
"user" -> "hive_user",
"password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
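If your Spark version supports the JDBC query option (Spark 2.4 and later), a query can also be pushed down to HiveServer2 instead of pulling the whole table. A minimal sketch, with the URL and credentials as placeholders:

val schemataDF = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://localhost:10000/default;")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("query", "SELECT schema_name, schema_owner FROM information_schema.schemata")
  .option("user", "hive_user")
  .option("password", "hive_password")
  .load()
schemataDF.show()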
References:
1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/using-hiveql/topics/hive_query_information_schema.html
10-11-2022
02:13 AM
Hi @Ploeplse If you are still facing the issue, could you share the requested information (i.e., the code and the Impala table creation script)?
10-07-2022
01:58 AM
@imule, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
09-26-2022
04:29 AM
@poorva, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
09-21-2022
11:38 PM
Hello @Boron I believe you are using HDP 3.x. Note that Spark 1.x is not available in HDP 3; we need to use Spark 2.x. Set SPARK_HOME to the Spark 2 client:
export SPARK_HOME=/usr/hdp/current/spark2-client
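After exporting the variable, a quick way to confirm which Spark version the client picks up:

spark-submit --version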
08-31-2022
10:53 PM
Hi @Yosieam Please avoid calling the read_file_log.collect() method. It brings the whole dataset to the driver, and the driver then needs enough memory to hold all of that data. Please check the modified code:

move_to_rdd = sc.textFile("datalog2.log").map(lambda row : row.split("time=")).filter(lambda x : x != "")
ReSymbol = move_to_rdd.map(lambda x : re.sub(r'\t', ' ', x)).map(lambda x : re.sub(r'\n', ' ', x)).map(lambda x : re.sub(r' +', ' ', x))
08-31-2022
10:48 PM
Hi @mmk You have shared the following information: 7 nodes, each with 250 GB of memory and 32 vCPUs, and in spark-defaults.conf:

spark.executor.memory=100g
spark.executor.memoryOverhead=49g
spark.driver.memoryOverhead=200g
spark.driver.memory=500g

You have at most 250 GB per node, yet you have specified 500 GB of driver memory plus 200 GB of overhead. How can the driver get 700 GB? In general, driver/executor memory should not exceed the physical memory YARN can allocate on a node.
Coming to the actual problem: please avoid using show() to print 8,000,000 records. If you need to print all the values, implement pagination logic that processes 1000 records at a time and moves to the next 1000 records on each iteration, as in the sketch below.
https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query
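A rough Scala sketch of that pagination idea (the table name and the ordering column "id" are assumptions; note that a window without partitionBy pulls all rows into a single partition, so treat this only as a starting point):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = spark.table("big_table")                 // placeholder table name
val w = Window.orderBy("id")                      // assumed ordering column
val indexed = df.withColumn("_rn", row_number().over(w)).cache()
val pageSize = 1000
val total = indexed.count()
var start = 1L
while (start <= total) {
  // show one page of pageSize rows, then move on to the next page
  indexed.filter($"_rn" >= start && $"_rn" < start + pageSize)
         .drop("_rn")
         .show(pageSize, false)
  start += pageSize
}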
08-31-2022
09:45 PM
Hi @mmk By default, Hive loads all the SerDe libraries under the hive/lib location, which is why the create/insert/select operations work from Hive. To read a Hive table created with a custom or external SerDe from Spark, we need to provide the SerDe library to Spark so that it can use it internally to load the Hive table data. If the SerDe is not provided, you can see the following exception:

org.apache.hadoop.hive.serde2.SerDeException

Please add the following library to the spark-submit command:

json-serde-<version>.jar
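For example, a minimal sketch of passing the jar (the path and version are placeholders; point them at the SerDe jar available in your environment):

spark-submit --jars /path/to/json-serde-<version>.jar <your-application-jar>

The same --jars option also works with spark-shell.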
08-31-2022
09:36 PM
Hi @suri789 I don't think you have shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.