Member since
03-28-2017
38
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 8477 | 06-27-2017 06:32 AM
03-06-2024
02:16 AM
1 Kudo
Hi @Sidhartha, could you please try the following sample?

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Map a Hive type name to the corresponding Spark SQL DataType
def convertDatatype(datatype: String): DataType = {
  datatype match {
    case "string" => StringType
    case "short" => ShortType
    case "int" => IntegerType
    case "bigint" => LongType
    case "float" => FloatType
    case "double" => DoubleType
    case "decimal" => DecimalType(38, 30)
    case "date" => TimestampType
    case "boolean" => BooleanType
    case "timestamp" => TimestampType
    case other => throw new IllegalArgumentException(s"Unsupported type: $other")
  }
}
val input_data = List(Row(1L, "Ranga", 27, BigDecimal(306.000000000000000000)), Row(2L, "Nishanth", 6, BigDecimal(606.000000000000000000)))
val input_rdd = spark.sparkContext.parallelize(input_data)
val hiveCols="id:bigint,name:string,age:int,salary:decimal"
val schemaList = hiveCols.split(",")
val schemaStructType = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))
val myDF = spark.createDataFrame(input_rdd, schemaStructType)
myDF.printSchema()
myDF.show()
val myDF2 = myDF.withColumn("new_salary", col("salary").cast("double"))
myDF2.printSchema()
myDF2.show()
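For reference, if the snippet runs as-is, the two printed schemas should look roughly like this (output reconstructed here for illustration):

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: decimal(38,30) (nullable = true)

and, after the cast, the same schema with an extra column:

 |-- new_salary: double (nullable = true)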
01-26-2021
07:29 AM
I'm also getting the same exception while fetching the database details using PySpark:

from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = (SparkConf().set("spark.kryoserializer.buffer.max", "512m"))
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
sqlContext = SQLContext(sc)
sqlContext.sql("use default")

I can execute the above code up to line 6; from line 7 onwards I get the exception. My cluster came with a PostgreSQL database by default; after that I manually installed MySQL, which is not integrated with Cloudera. Could that be why I'm getting this exception? Can you help me with this, please?
07-04-2019
09:55 PM
1 Kudo
Hi @Sidhartha, the error means that the destination table base.customers already has the partition source_name=ORACLE, as the message indicates: Partition already exists [customers(source_name=ORACLE)]. Can you run SHOW PARTITIONS base.customers; to confirm? If so, try dropping that partition and then run EXCHANGE PARTITION again. Cheers, Eric
04-09-2019
07:41 AM
1 Kudo
There is something very unusual happening here. Based on your outputs, values are not only ending up in the wrong columns, you are even getting different values: in the 'correct' record you have 5686.76, and in the 'wrong' record you have -5686.76. My first guess was that there is a mistake in how you send data to the appropriate columns, but I don't see how that would explain a minus sign changing position.

To troubleshoot something like this, it is really important to dig into the details. I would therefore recommend bringing your question down to a 'minimal reproducible example', eliminating any complexity that is not causing the unexpected results. For example: you show a load command to get data into Spark; consider replacing it with an actual hardcoded string (and make sure to check whether that still reproduces the problem). You also show two writes, but if we have the exact input and code that reproduce the problem, the write that behaves correctly is probably not relevant. Also, you use some code to list the columns; consider hardcoding them first, as in the sketch below.

As mentioned, really try to take out all complexity until you land on the minimal amount that still reproduces the problem. Hopefully you will already see the answer once you have eliminated all the distractions, and if not, you will have a fully trimmed-down version which you can use to update your question here!
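To make the suggestion concrete, here is a minimal sketch of what 'hardcoding the input' could look like; the column names and values are purely hypothetical placeholders, not your actual data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("mre").getOrCreate()
import spark.implicits._

// Hardcode the one record that misbehaves instead of loading it from a file,
// and hardcode the target columns instead of deriving them in code
val df = Seq(("rec1", 5686.76), ("rec2", -5686.76)).toDF("id", "amount")
df.show()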
09-03-2018
12:20 AM
1 Kudo
Hi, I also faced similar issues while applying regexp_replace() to only the string columns of a dataframe. The trick is to build a regex pattern (in my case "pattern") that resolves inside the double quotes, with the escape characters applied.

val pattern = "\"\\\\[\\\\x{FFFD}\\\\x{0}\\\\x{1F}\\\\x{81}\\\\x{8D}\\\\x{8F}\\\\x{90}\\\\x{9D}\\\\x{A0}\\\\x{0380}\\\\]\""

// Build a regexp_replace expression for every string column, and keep other columns as-is
val regExpr = parq_out_df_temp.schema.fields.map(x =>
  if (x.dataType == StringType) s"regexp_replace(%s, $pattern, '') as %s".format(x.name, x.name)
  else x.name
)

val parq_out = parq_out_df_temp.selectExpr(regExpr: _*)

This worked fine for me!
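For anyone preferring the Column API over selectExpr, a roughly equivalent sketch could look like this (assuming df is the input DataFrame and cleanPattern holds the regex without the surrounding SQL quotes):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.StringType

// Apply the replacement to every string column, leaving other columns untouched
def stripPattern(df: DataFrame, cleanPattern: String): DataFrame =
  df.schema.fields.filter(_.dataType == StringType).foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, regexp_replace(acc(field.name), cleanPattern, ""))
  }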
07-18-2018
01:29 AM
1 Kudo
According to the error, it is looking for the Java 7 installed by Cloudera. You should define JAVA_HOME={path_to_your_jdk8_installation} in your .bashrc.
03-20-2018
08:11 AM
1 Kudo
The error complains about the value of "hadoop.security.authentication". You have set it to "Kerberos" while the accepted values are "simple" and "kerberos" (all letters in lowercase).
01-01-2018
10:30 PM
The files will not be in a specific order. Would this be a solution: load all the files into Spark and create a dataframe out of them, then split this main dataframe into smaller ones using the delimiter ("...") that is present at the end of each file. Once this is done, map the dataframes by checking whether the third line of each file contains the words "SEVERE: Error" and group/merge those together. Similarly, follow the same approach for the other cases, and finally have three separate dataframes, one exclusive to each case. Is this approach viable, or is there a better way I can follow?
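A rough sketch of what I have in mind, just to illustrate the idea (the input path is a placeholder; "SEVERE: Error" is one of the markers mentioned above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Read each file as a whole (path, content) pair
val files = spark.sparkContext.wholeTextFiles("/path/to/input/*")

// Route a file by whether its third line contains the marker
val severe = files.filter { case (_, content) =>
  content.split("\n").lift(2).exists(_.contains("SEVERE: Error")) }
val rest = files.subtract(severe)

val severeDF = severe.toDF("path", "content")
severeDF.show()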
07-04-2017
11:58 PM
Which user are you running the Spark job as? And is the path you are referring to, /home/cloudera/partfile, in HDFS or on the local filesystem? Perform this and let me know if the files get listed: hadoop fs -ls /home/cloudera/
06-27-2017
06:32 AM
The mistake I made was that the case class should be outside main and inside the object. Also, in this line: val partfile = sparkSession.read.text("partfile").as[String], I used read.text("..") to get the file into Spark, where we can instead use read.textFile("..."), which returns a Dataset[String] directly.
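A minimal sketch of both points, with hypothetical object and case class names ("partfile" is the path from the original code):

import org.apache.spark.sql.SparkSession

object PartfileJob {
  // Case class defined at the object level, outside main
  case class Part(value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("partfile").getOrCreate()
    import spark.implicits._

    // read.textFile returns a Dataset[String] directly,
    // whereas read.text returns a DataFrame with a single "value" column
    val partfile = spark.read.textFile("partfile")
    val parts = partfile.map(Part(_))
    parts.show()

    spark.stop()
  }
}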