Member since: 07-14-2015
Posts: 11
Kudos Received: 0
Solutions: 0
11-14-2017
09:09 AM
Thank you for your response, AutoIN. It looks like this workaround will work for what I need.
11-13-2017
02:06 PM
Thank you for the suggestion, EB9185. Unfortunately that doesn't work in Scala, though. In Scala, saveAsTable() is only designed to take the table name as input, and the mode() and format() calls have to be chained onto the DataFrameWriter separately.
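For clarity, the chained form I mean looks like this (a minimal sketch; dataFrame stands in for my actual DataFrame), and it still hits the same format-mismatch error:

dataFrame.write
  .format("parquet")
  .mode("append")
  .saveAsTable("test_parquet_spark")  // still fails: HiveFileFormat vs ParquetFileFormat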
11-13-2017
12:47 PM
Also, if it helps, this is on Spark 2.2.0 and CDH 5.12.0.
11-13-2017
12:36 PM
After doing more testing, I'm finding that this happens on tables that aren't even Parquet tables, and even when the write command isn't specifying Parquet as the write format. The error does not happen with the insertInto() command, but that is not a good workaround for my use case, because saveAsTable() matches columns by name whereas insertInto() is purely positional.

Info on the table I'm trying to write to:

$ hive -e "describe formatted test_parquet_spark"
# col_name data_type comment
col1 string
col2 string
# Detailed Table Information
Database: default
CreateTime: Fri Nov 10 22:54:20 GMT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Table Type: MANAGED_TABLE
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1

Testing in spark2-shell:

scala> val spark = SparkSession.builder.config("spark.sql.warehouse.dir", "/user/hive/warehouse").enableHiveSupport.getOrCreate
17/11/13 20:17:05 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2524f9bf
scala> val schema = new StructType().add(StructField("col1", StringType)).add(StructField("col2", StringType))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(col1,StringType,true), StructField(col2,StringType,true))
scala> val dataRDD = spark.sparkContext.parallelize(Seq(Row("foobar", "barfoo")))
dataRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val dataDF = spark.createDataFrame(dataRDD, schema)
dataDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dataDF.write.insertInto("test_parquet_spark")
$ hive -e "select * from test_parquet_spark"
OK
foobar barfoo
scala> dataDF.write.mode("append").saveAsTable("test_parquet_spark")
org.apache.spark.sql.AnalysisException: The format of the existing table default.test_parquet_spark is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:420)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:399)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  ... 48 elided

Why does saveAsTable() not work while insertInto() does in this example? And why does saveAsTable() not recognize Parquet tables as being Parquet tables? If anyone knows what is going on, I would really appreciate it. Thanks.
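(Not a confirmed fix, but a possible middle ground I'm considering: reorder the DataFrame's columns to the table's declared order by name, then use the insertInto() path that does work. A rough sketch, assuming the target table can be read back with spark.table():)

import org.apache.spark.sql.functions.col

// Rough sketch only: align dataDF's columns to the table's column order by name,
// then fall back to the positional insertInto() that succeeds above.
val target = "test_parquet_spark"
val tableCols = spark.table(target).columns            // column names in the table's declared order
val aligned = dataDF.select(tableCols.map(col): _*)    // fails fast if a column name is missing
aligned.write.insertInto(target)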
11-10-2017
11:59 AM
I'm trying to write a DataFrame to a Parquet Hive table and keep getting an error saying that the table is HiveFileFormat and not ParquetFileFormat. The table is definitely a Parquet table.
Here's how I'm creating the sparkSession:
val spark = SparkSession
.builder()
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.sql.sources.maxConcurrentWrites","1")
.config("spark.sql.parquet.compression.codec", "snappy")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("parquet.compression", "SNAPPY")
.config("hive.exec.max.dynamic.partitions", "3000")
.config("parquet.enable.dictionary", "false")
.config("hive.support.concurrency", "true")
.enableHiveSupport()
.getOrCreate()
Here's the code that fails and the error message:
dataFrame.write.format("parquet").mode(saveMode).partitionBy(partitionCol).saveAsTable(tableName)
org.apache.spark.sql.AnalysisException: The format of the existing table tableName is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
Here's the table storage info:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
This same code worked before upgrading to Spark 2; for some reason it isn't recognizing the table as Parquet now.
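In case it helps with diagnosis, here is roughly what I can check and try from the Spark side (a sketch, not a confirmed fix; tableName and saveMode are the same variables as above):

// Sketch only: see how Spark's own catalog reports the table (SerDe / InputFormat rows),
// since the AnalysisException comes from Spark's view of the table, not Hive's.
spark.sql(s"DESCRIBE FORMATTED $tableName").show(100, truncate = false)

// Possible fallback (untested here): insertInto() matches columns by position and skips the
// format check. It can't be combined with partitionBy(), so the partition column has to be
// the last column of the DataFrame and dynamic partitioning handles the rest.
dataFrame.write.mode(saveMode).insertInto(tableName)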
Labels: Apache Hive, Apache Spark
01-19-2016
10:02 AM
I'm trying to use Sqoop to pull data from Oracle into HDFS using multiple mappers. The problem I'm running into is that when Sqoop creates the boundaries for the mappers, it generates them with milliseconds and then tries to compare them to a DATE field in Oracle, which doesn't work and throws errors.

The query from a shell file:

sqoop import --connect jdbc:oracle:thin:{CONNECTION ADDRESS} \
--username {USERNAME} \
--password {PASSWORD} \
--verbose \
--table {ORACLE TABLE} \
--where "orig_dt >= to_date('2015-12-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and orig_dt < to_date('2015-12-01 01:00:00', 'YYYY-MM-DD HH24:MI:SS')" \
-z \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--as-parquetfile \
--target-dir {HDFS LOCATION} \
--split-by ORIG_DT \
-m 2

The WHERE clause that Sqoop generates for the first mapper:

WHERE ( orig_dt >= to_date('2015-12-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and orig_dt < to_date('2015-12-01 01:00:00', 'YYYY-MM-DD HH24:MI:SS') ) AND ( ORIG_DT >= '2015-12-01 00:00:00.0' ) AND ( ORIG_DT < '2015-12-01 00:29:59.5' )

The error I'm receiving from this is:

ORA-01861: literal does not match format string

After doing some research I found this Sqoop issue, which seems to match the problem I'm having: https://issues.apache.org/jira/browse/SQOOP-1946. I've implemented the workaround it recommends, but it now generates a different error:

ORA-01830: date format picture ends before converting entire input string

This new error seems to be caused by the milliseconds Sqoop keeps generating in the splits: running the WHERE clause Sqoop generated directly against the Oracle server throws the ORA-01830 error with milliseconds but runs properly when I take the milliseconds off. (Side note: before adding the workaround from SQOOP-1946, running the WHERE clause Sqoop generated directly against the Oracle server without milliseconds would still throw the ORA-01861 error.)

So the main question is this: is there any way to prevent Sqoop from using milliseconds when generating splits on a date column while moving data from Oracle to HDFS? Or is there some other way to solve this problem?

It may be worth noting that this query worked fine on CDH 5.4.8 and Sqoop 1.4.5, but after upgrading to CDH 5.5.1 and Sqoop 1.4.6 these errors are thrown. Also, the query does work with a single mapper; only multiple mappers throw these errors.
07-17-2015
09:50 AM
Updating the ShareLib worked! =D I did have to bounce Oozie before the fix took effect, but this solution works.
07-16-2015
10:24 AM
Would there be any way to tell Oozie not to use Kite when Sqooping, so as to avoid this error entirely?
07-15-2015
08:06 AM
[Research I've done: it looks like the registerDatasetRepository method was included in kite-data-core.jar in version 0.14.1 but was removed in version 0.15.0. We are currently using a kite-data-core.jar newer than 0.15.0, and I am unable to fall back to the older version since other things on HDFS use that jar. Is there any way to tell Sqoop not to use Kite, to add another jar that includes this method, or any other solution to this error?]

We want to use the newest version (and as far as I know, we currently are), but the code generated by the Sqoop command wants to use the registerDatasetRepository method from kite-data-core.jar, and it looks like that method no longer exists in the newer versions of kite-data-core.jar.
07-14-2015
11:04 AM
I'm currently trying to Sqoop some data from an Oracle DB to HDFS, using Oozie to schedule the Sqoop workflow.

My Sqoop version: Sqoop 1.4.5-cdh5.4.2

My Sqoop code:

import --connect {JDBCpath} \
--username {Username} \
--password {Password} \
--verbose \
--table {Table} \
--where "{Query}" \
-z \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--as-parquetfile \
--target-dir {TargetDirectory} \
--split-by {columnToSplitBy} \
-m 14

The error I have been getting:

java.lang.NoSuchMethodError: org.kitesdk.data.impl.Accessor.registerDatasetRepository(Lorg/kitesdk/data/spi/URIPattern;Lorg/kitesdk/data/spi/OptionBuilder;)V

[Edited to focus on my main problem, rather than the research I've done into it, which may or may not be the correct path to solving this. That information will be reposted in a reply.]
Labels: Apache Sqoop