Member since 04-12-2023 · 4 Posts · 0 Kudos Received · 0 Solutions
04-16-2023
08:15 AM
Hi all, this community is teaching me a lot of things I was not aware of before. I am working on a project where I need to import data from Oracle and store it in HDFS. I am using PySpark to load the data into a DataFrame and then write it to HDFS. However, the Oracle tables are big (one has 420,000,000 records). I want to read those tables in parallel, but there are multiple tables and I am not able to define a partition column for each of them. Is there any way to read the data in parallel when you don't know the partition column?
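One option when no natural partition column exists is to pass Spark's `predicates` argument to `spark.read.jdbc`, with each predicate bucketing rows by a hash of the Oracle ROWID; Spark then runs one query (and one task) per predicate. A minimal sketch, assuming an Oracle source and the thin JDBC driver (the table name and bucket count below are placeholders, not from this thread):

```python
def rowid_hash_predicates(num_buckets):
    """Build mutually exclusive WHERE-clause predicates that bucket rows
    by a hash of their Oracle ROWID, so no natural partition column is
    needed. Spark runs one JDBC query per predicate, in parallel."""
    return [f"MOD(ORA_HASH(ROWID), {num_buckets}) = {i}" for i in range(num_buckets)]

# Usage (sketch; requires a live SparkSession, Oracle connectivity,
# and connection_details as in the code posted earlier in this thread):
# df = spark.read.jdbc(url=jdbc_url,
#                      table="sgms.MY_TABLE",       # placeholder table
#                      predicates=rowid_hash_predicates(8),
#                      properties=connection_details)

print(rowid_hash_predicates(2))
```

Because the predicates are disjoint and cover every row, each row is read exactly once; the bucket count controls the read parallelism per table.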
Labels:
Apache Spark
04-16-2023
08:09 AM
Thanks, I have disabled dynamic allocation and it is working now.
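For reference, a minimal sketch of a submit command with dynamic allocation disabled so that `--num-executors` is honored (executor counts, memory sizes, and the script name are placeholders, not values confirmed in this thread):

```shell
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  oracle-example.py
```

With dynamic allocation enabled, YARN may scale executors up and down on its own, which can make a fixed `--num-executors` setting appear to be ignored.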
04-12-2023
10:13 PM
Hi, I have applied the repartition, but still only one executor is running at a time. Could you please help me with this? Also, for the write step, could you share the syntax where I can give the path to save the data in HDFS?
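On the write-syntax part of the question: a DataFrame writer takes the destination path directly, so the HDFS directory can be built per table. A minimal sketch, assuming the base path from this thread; the helper name is hypothetical:

```python
def hdfs_output_path(base, table):
    """Hypothetical helper: join a base HDFS directory and a table name."""
    return base.rstrip("/") + "/" + table

# Usage (sketch; `df` would come from spark.read.jdbc as elsewhere in
# this thread, and the write requires a live cluster):
# df.repartition(4) \
#   .write.mode("overwrite") \
#   .option("header", "true") \
#   .csv(hdfs_output_path("hdfs:///rajsampark/sgms/", "MY_TABLE"))

print(hdfs_output_path("hdfs:///rajsampark/sgms/", "MY_TABLE"))
```

Note that the number of output files matches the number of DataFrame partitions at write time, so `repartition` also controls write parallelism.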
04-12-2023
04:41 AM
Hi all, I am trying to import data from an Oracle database and write it to HDFS using PySpark. The Oracle schema has 480 tables, and I am looping over the list of tables, but the write to HDFS is taking too much time. When I check the logs, only one executor is running, even though I passed --num-executors 4. Here is my code:

# oracle-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Oracle Example"
master = "yarn"

spark = SparkSession.builder.master(master).appName(appName).enableHiveSupport().getOrCreate()
spark.sparkContext.getConf().getAll()

# Query to get the list of tables present in the schema
sql = "SELECT table_name FROM all_tables WHERE owner = '**'"
user = "**"
password = "**"
jdbc_url = "jdbc:oracle:thin:@****/**"  # Change this to your Oracle's details accordingly
jdbcDriver = "oracle.jdbc.OracleDriver"

# Read the list of tables present in the schema from Oracle via JDBC
tablelist = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("query", sql) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", jdbcDriver) \
    .load().select("table_name")

connection_details = {
    "user": user,
    "password": password,
    "driver": jdbcDriver,
}

tablelist = [row.table_name for row in tablelist.collect()]

# Copy each table from Oracle to HDFS
for table in tablelist:
    df = spark.read.jdbc(url=jdbc_url, table='sgms.' + table, properties=connection_details)
    df.write.save('hdfs:/rajsampark/sgms/' + table, format='csv', mode='overwrite')
    print("Written successfully for table " + table)

I am submitting the code using spark-submit. Please help.
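The loop above reads each table through a single JDBC connection, which yields one partition and therefore keeps one executor busy no matter what `--num-executors` is set to. A hedged sketch of the standard fix, Spark's partitioned-JDBC read options (the partition column, bounds, and counts below are placeholders, not values known from this thread):

```python
def partitioned_jdbc_options(url, table, user, password,
                             column, lower, upper, num_partitions):
    """Build the option map for a parallel JDBC read. Spark splits the
    range [lower, upper] on `column` into `num_partitions` slices and
    opens one connection (one task) per slice."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": "10000",  # larger fetch size reduces JDBC round trips
    }

# Usage (sketch; needs a live SparkSession and Oracle connectivity):
# df = spark.read.format("jdbc") \
#     .options(**partitioned_jdbc_options(jdbc_url, "sgms.MY_TABLE",
#                                         user, password,
#                                         "ID", 1, 420_000_000, 8)) \
#     .load()

opts = partitioned_jdbc_options("jdbc:oracle:thin:@host/svc", "sgms.T1",
                                "u", "p", "ID", 1, 1000, 4)
print(opts["numPartitions"])
```

This requires a roughly evenly distributed numeric, date, or timestamp column per table; when no such column is known, a predicate-based read is an alternative.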
Labels:
Apache Spark