Member since: 07-14-2015
Posts: 11
Kudos Received: 0
Solutions: 0
11-14-2017
09:09 AM
Thank you for your response, AutoIN. It looks like this workaround will work for what I need.
11-13-2017
02:06 PM
Thank you for the suggestion, EB9185. Unfortunately that doesn't work in Scala, though. In Scala, saveAsTable() is only designed to take the table name as input, and the mode() and format() calls have to be chained onto the DataFrameWriter separately.
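For clarity, the chained form I mean looks like this (a minimal sketch; dataFrame stands in for my actual DataFrame), and it still hits the same format-mismatch error:

dataFrame.write
  .format("parquet")
  .mode("append")
  .saveAsTable("test_parquet_spark")  // still fails: HiveFileFormat vs ParquetFileFormat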
11-13-2017
12:47 PM
Also, if it helps, this is on Spark 2.2.0 and CDH 5.12.0.
11-13-2017
12:36 PM
After doing more testing, I'm finding that this happens on tables that aren't even Parquet tables, and even when the write command isn't specifying Parquet as the write format. The error does not happen with the insertInto() command, but that is not a good workaround for my use case, because saveAsTable() matches columns by name whereas insertInto() is purely positional.

Info on the table I'm trying to write to:

$ hive -e "describe formatted test_parquet_spark"
# col_name data_type comment
col1 string
col2 string
# Detailed Table Information
Database: default
CreateTime: Fri Nov 10 22:54:20 GMT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Table Type: MANAGED_TABLE
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1

Testing in spark2-shell:

scala> val spark = SparkSession.builder.config("spark.sql.warehouse.dir", "/user/hive/warehouse").enableHiveSupport.getOrCreate
17/11/13 20:17:05 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2524f9bf
scala> val schema = new StructType().add(StructField("col1", StringType)).add(StructField("col2", StringType))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(col1,StringType,true), StructField(col2,StringType,true))
scala> val dataRDD = spark.sparkContext.parallelize(Seq(Row("foobar", "barfoo")))
dataRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val dataDF = spark.createDataFrame(dataRDD, schema)
dataDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dataDF.write.insertInto("test_parquet_spark")
$ hive -e "select * from test_parquet_spark"
OK
foobar barfoo
scala> dataDF.write.mode("append").saveAsTable("test_parquet_spark")
org.apache.spark.sql.AnalysisException: The format of the existing table default.test_parquet_spark is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:420)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:399)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  ... 48 elided

Why does saveAsTable() not work while insertInto() does in this example? And why does saveAsTable() not recognize Parquet tables as being Parquet tables? If anyone knows what is going on, I would really appreciate it. Thanks.
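(Not a confirmed fix, but a possible middle ground I'm considering: reorder the DataFrame's columns to the table's declared order by name, then use the insertInto() path that does work. A rough sketch, assuming the target table can be read back with spark.table():)

import org.apache.spark.sql.functions.col

// Rough sketch only: align dataDF's columns to the table's column order by name,
// then fall back to the positional insertInto() that succeeds above.
val target = "test_parquet_spark"
val tableCols = spark.table(target).columns            // column names in the table's declared order
val aligned = dataDF.select(tableCols.map(col): _*)    // fails fast if a column name is missing
aligned.write.insertInto(target)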
11-10-2017
11:59 AM
I'm trying to write a DataFrame to a Parquet Hive table and keep getting an error saying that the table is HiveFileFormat and not ParquetFileFormat. The table is definitely a Parquet table.
Here's how I'm creating the sparkSession:
val spark = SparkSession
.builder()
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.sql.sources.maxConcurrentWrites","1")
.config("spark.sql.parquet.compression.codec", "snappy")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("parquet.compression", "SNAPPY")
.config("hive.exec.max.dynamic.partitions", "3000")
.config("parquet.enable.dictionary", "false")
.config("hive.support.concurrency", "true")
.enableHiveSupport()
.getOrCreate()
Here's the code that fails and the error message:
dataFrame.write.format("parquet").mode(saveMode).partitionBy(partitionCol).saveAsTable(tableName)
org.apache.spark.sql.AnalysisException: The format of the existing table tableName is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
Here's the table storage info:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
This same code worked before upgrading to Spark 2; for some reason it isn't recognizing the table as Parquet now.
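In case it helps with diagnosis, here is roughly what I can check and try from the Spark side (a sketch, not a confirmed fix; tableName and saveMode are the same variables as above):

// Sketch only: see how Spark's own catalog reports the table (SerDe / InputFormat rows),
// since the AnalysisException comes from Spark's view of the table, not Hive's.
spark.sql(s"DESCRIBE FORMATTED $tableName").show(100, truncate = false)

// Possible fallback (untested here): insertInto() matches columns by position and skips the
// format check. It can't be combined with partitionBy(), so the partition column has to be
// the last column of the DataFrame and dynamic partitioning handles the rest.
dataFrame.write.mode(saveMode).insertInto(tableName)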
Labels: Apache Hive, Apache Spark
01-19-2016
10:02 AM
I'm trying to use Sqoop to pull data from Oracle into HDFS using multiple mappers. The problem I'm running into is that when Sqoop creates the boundaries for the mappers, it generates them with milliseconds and then tries to compare them to a DATE field in Oracle, which doesn't work and throws errors.

The query from a shell file:

sqoop import --connect jdbc:oracle:thin:{CONNECTION ADDRESS} \
--username {USERNAME} \
--password {PASSWORD} \
--verbose \
--table {ORACLE TABLE} \
--where "orig_dt >= to_date('2015-12-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and orig_dt < to_date('2015-12-01 01:00:00', 'YYYY-MM-DD HH24:MI:SS')" \
-z \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--as-parquetfile \
--target-dir {HDFS LOCATION} \
--split-by ORIG_DT \
-m 2

The WHERE clause that Sqoop generates for the first mapper:

WHERE ( orig_dt >= to_date('2015-12-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and orig_dt < to_date('2015-12-01 01:00:00', 'YYYY-MM-DD HH24:MI:SS') ) AND ( ORIG_DT >= '2015-12-01 00:00:00.0' ) AND ( ORIG_DT < '2015-12-01 00:29:59.5' )

The error I'm receiving from this is:

ORA-01861: literal does not match format string

After doing some research I found this Sqoop issue, which seems to match the problem I'm having: https://issues.apache.org/jira/browse/SQOOP-1946. I've implemented the workaround it recommends, but it now generates a different error:

ORA-01830: date format picture ends before converting entire input string

This new error seems to be caused by the milliseconds Sqoop keeps generating in the splits: running the WHERE clause Sqoop generated directly against the Oracle server throws the ORA-01830 error with milliseconds but runs properly when I take the milliseconds off. (Side note: before adding the workaround from SQOOP-1946, running the WHERE clause Sqoop generated directly against the Oracle server without milliseconds would still throw the ORA-01861 error.)

So the main question is this: is there any way to prevent Sqoop from using milliseconds when generating splits on a date column while moving data from Oracle to HDFS? Or is there some other way to solve this problem?

It may be worth noting that this query worked fine on CDH 5.4.8 and Sqoop 1.4.5, but after upgrading to CDH 5.5.1 and Sqoop 1.4.6 these errors are thrown. Also, the query does work with a single mapper; only multiple mappers throw these errors.
07-17-2015
09:50 AM
Updating the ShareLib worked! =D I did have to bounce Oozie before the fix took effect, but this solution works.
07-16-2015
10:24 AM
Would there be any way to tell Oozie not to use Kite when Sqooping, so as to avoid this error entirely?
07-15-2015
08:06 AM
[Research I've done: it looks like the registerDatasetRepository method was included in kite-data-core.jar in version 0.14.1 but was removed in version 0.15.0. We are currently using a kite-data-core.jar newer than 0.15.0, and I am unable to fall back to the older version since other things on HDFS use that jar. Is there any way to tell Sqoop not to use Kite, to add another jar that includes this method, or any other solution to this error?]

We want to use the newest version (and as far as I know, we currently are), but the code generated by the Sqoop command wants to use the registerDatasetRepository method from kite-data-core.jar, and it looks like that method no longer exists in the newer versions of kite-data-core.jar.
07-14-2015
11:04 AM
I'm currently trying to Sqoop some data from an Oracle DB to HDFS, using Oozie to schedule the Sqoop workflow.

My Sqoop version: Sqoop 1.4.5-cdh5.4.2

My Sqoop code:

import --connect {JDBCpath} \
--username {Username} \
--password {Password} \
--verbose \
--table {Table} \
--where "{Query}" \
-z \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--as-parquetfile \
--target-dir {TargetDirectory} \
--split-by {columnToSplitBy} \
-m 14

The error I have been getting:

java.lang.NoSuchMethodError: org.kitesdk.data.impl.Accessor.registerDatasetRepository(Lorg/kitesdk/data/spi/URIPattern;Lorg/kitesdk/data/spi/OptionBuilder;)V

[Edited to focus on my main problem, rather than the research I've done into it, which may or may not be the correct path to solving this. That information will be reposted in a reply.]
Labels: Apache Sqoop