Member since
06-02-2020
331
Posts
67
Kudos Received
49
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2798 | 07-11-2024 01:55 AM |
| | 7859 | 07-09-2024 11:18 PM |
| | 6569 | 07-09-2024 04:26 AM |
| | 5903 | 07-09-2024 03:38 AM |
| | 5598 | 06-05-2024 02:03 AM |
10-14-2022
06:10 AM
Let's assume we have a Hive table named test, stored under the /tmp directory. The data in the test table is laid out as follows:

hdfs dfs -ls -R /tmp/test
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir1
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir1/000000_0
drwxr-xr-x - hive hive 0 2022-08-24 09:15 /tmp/test/dir2
-rw-r--r-- 3 hive hive 685 2022-08-24 09:15 /tmp/test/dir2/000000_0

Data like this is generally produced by UNION ALL operations in Hive. If we try to load the Hive table data using Spark, we get the following exception:

scala> spark.sql("SELECT * FROM test").show()
java.io.IOException: Not a file: hdfs://localhost:8020/tmp/test/dir1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
.....

By default, Spark will not read the table data if it contains subdirectories. To solve this issue, we need to set the following parameter:

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

We can get the same kind of exception while reading the data in Hive. To solve the issue in Hive, we need to set the following two parameters:

hive> set mapred.input.dir.recursive=true;
hive> set hive.mapred.supports.subdirectories=true;

These two parameters can also be set in hive-site.xml.
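For completeness, this is how the Spark-side setting looks in spark-shell, applied right before querying the example table (a minimal sketch based on the example above):

spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sql("SELECT * FROM test").show()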
10-14-2022
04:14 AM
Let's understand the information_schema database:
Hive supports the ANSI-standard information_schema database, which we can query for information about tables, views, columns, and our Hive privileges. The information_schema data reveals the state of the system, similar to sys database data, but in a user-friendly, read-only way.
Example:
SELECT * FROM information_schema.tables WHERE is_insertable_into='YES' limit 2;
...
+--------------------+-------------------+-----------------
|tables.table_catalog|tables.table_schema|tables.table_name
+--------------------+-------------------+-----------------
|default |default |students2
|default |default |t3
Now we will try to access the following table under the information_schema database.
spark.sql("select * from information_schema.schemata").show()
We will get the following exception:
org.apache.spark.sql.AnalysisException: Undefined function: 'restrict_information_schema'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
We get the above exception because the restrict_information_schema function is not available in Spark, while it is available in Hive. We can check the available functions using the following command:
spark.sql("show functions").show()
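If your Spark version supports it, the list can also be narrowed with a pattern instead of scanning the full output (a small sketch; the pattern is just an example):

spark.sql("SHOW FUNCTIONS LIKE 'restrict*'").show(false)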
We can get past this error by passing hive-exec.jar to spark-shell and creating a temporary function:
spark-shell --jars /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hive/lib/hive-exec.jar
spark.sql("""
CREATE TEMPORARY FUNCTION restrict_information_schema AS
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFRestrictInformationSchema'
""")
After registering the function, if we try to access the table data, we will get another error like the one below:
scala> spark.sql("select * from information_schema.schemata").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'current_user'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 632
It is difficult to find out in advance how many such functions we need to register.
To avoid registering functions, we can use the Spark JDBC API to read the tables under information_schema.
spark-shell --jars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar
val options = Map(
"url" -> "jdbc:hive2://localhost:10000/default;",
"driver" -> "org.apache.hive.jdbc.HiveDriver",
"dbtable" -> "information_schema.schemata",
"user" -> "hive_user",
"password" -> "hive_password"
)
val df = spark.read.format("jdbc").options(options).load()
df.show()
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
|schemata.catalog_name|schemata.schema_name|schemata.schema_owner|schemata.default_character_set_catalog|schemata.default_character_set_schema|schemata.default_character_set_name|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
| schemata.catalog_...|schemata.schema_name| schemata.schema_o...| schemata.default_...| schemata.default_...| schemata.default_...|schemata.sql_path|
+---------------------+--------------------+---------------------+--------------------------------------+-------------------------------------+-----------------------------------+-----------------+
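If your Spark version supports the JDBC query option (Spark 2.4 and later), a query can also be pushed down to HiveServer2 instead of pulling the whole table. A minimal sketch, with the URL and credentials as placeholders:

val schemataDF = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://localhost:10000/default;")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("query", "SELECT schema_name, schema_owner FROM information_schema.schemata")
  .option("user", "hive_user")
  .option("password", "hive_password")
  .load()
schemataDF.show()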
References:
1. https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/using-hiveql/topics/hive_query_information_schema.html
10-11-2022
02:13 AM
Hi @Ploeplse If you are still facing the issue, could you share the requested information (i.e., the code and the Impala table creation script)?
10-07-2022
01:58 AM
@imule, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
09-26-2022
04:29 AM
@poorva, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
09-21-2022
11:38 PM
Hello @Boron I believe you are using HDP 3.x. Note that Spark 1.x is not available in HDP 3; we need to use Spark 2.x. Set SPARK_HOME to the Spark 2 client:
export SPARK_HOME=/usr/hdp/current/spark2-client
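After exporting the variable, a quick way to confirm which Spark version the client picks up:

spark-submit --version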
08-31-2022
10:53 PM
Hi @Yosieam Please avoid calling the read_file_log.collect() method. It brings the whole dataset to the driver, and the driver then needs enough memory to hold all of that data. Please check the modified code:

move_to_rdd = sc.textFile("datalog2.log").map(lambda row : row.split("time=")).filter(lambda x : x != "")
ReSymbol = move_to_rdd.map(lambda x : re.sub(r'\t', ' ', x)).map(lambda x : re.sub(r'\n', ' ', x)).map(lambda x : re.sub(r' +', ' ', x))
08-31-2022
10:48 PM
Hi @mmk You have shared the following information: 7 nodes, each with 250 GB of memory and 32 vCPUs, and in spark-defaults.conf:

spark.executor.memory=100g
spark.executor.memoryOverhead=49g
spark.driver.memoryOverhead=200g
spark.driver.memory=500g

You have at most 250 GB per node, yet you have specified 500 GB of driver memory plus 200 GB of overhead. How can the driver get 700 GB? In general, driver/executor memory should not exceed the physical memory YARN can allocate on a node.
Coming to the actual problem: please avoid using show() to print 8,000,000 records. If you need to print all the values, implement pagination logic that processes 1000 records at a time and moves to the next 1000 records on each iteration, as in the sketch below.
https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query
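A rough Scala sketch of that pagination idea (the table name and the ordering column "id" are assumptions; note that a window without partitionBy pulls all rows into a single partition, so treat this only as a starting point):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = spark.table("big_table")                 // placeholder table name
val w = Window.orderBy("id")                      // assumed ordering column
val indexed = df.withColumn("_rn", row_number().over(w)).cache()
val pageSize = 1000
val total = indexed.count()
var start = 1L
while (start <= total) {
  // show one page of pageSize rows, then move on to the next page
  indexed.filter($"_rn" >= start && $"_rn" < start + pageSize)
         .drop("_rn")
         .show(pageSize, false)
  start += pageSize
}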
08-31-2022
09:45 PM
Hi @mmk By default, Hive loads all the SerDe libraries under the hive/lib location, which is why the create/insert/select operations work from Hive. To read a Hive table created with a custom or external SerDe from Spark, we need to provide the SerDe library to Spark so that it can use it internally to load the Hive table data. If the SerDe is not provided, you can see the following exception:

org.apache.hadoop.hive.serde2.SerDeException

Please add the following library to the spark-submit command:

json-serde-<version>.jar
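For example, a minimal sketch of passing the jar (the path and version are placeholders; point them at the SerDe jar available in your environment):

spark-submit --jars /path/to/json-serde-<version>.jar <your-application-jar>

The same --jars option also works with spark-shell.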
08-31-2022
09:36 PM
Hi @suri789 I don't think you have shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.