03-29-2017
01:19 AM
Unfortunately, the content of this file is under NDA, so I can't provide you the file. Some information that I can give is summarized here:

Output from "hdfs dfs -ls":
-rwxrwx--x+ 3 hive hive 1093251527 2016-09-30 21:15 /path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet

- We have a _metadata and a _common_metadata file in the same directory (I tried removing them, but this did not resolve the issue).
- Compression: snappy
- It was created using: parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber}) (output from parquet-tools, version 1.9.0)
- Software used for creation: the bundled Spark 1.6.0 from CDH 5.7.1 (in the meantime we are using CDH 5.9.0)
- The file contains 713 row groups.
- The file contains 867 columns (of types int64, double and binary).

One further thing that I tried is copying the problematic file to a separate directory (without the two metadata files), creating a new table from this file with Impala and running the test there. Unfortunately this produces exactly the same behaviour: when the file is cached I get the error message, when it is not cached, everything works fine.

Let me know if this helps you in understanding this problem or if you need further information (except the contents of the file). Thanks a lot already!

Kind Regards
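For reference, the isolation test described above can be sketched roughly like this (the target directory /tmp/parquet_test and the table name test_tbl are placeholders, not our real names):

```shell
# Copy only the data file (not _metadata / _common_metadata) to a fresh directory
hdfs dfs -mkdir /tmp/parquet_test
hdfs dfs -cp /path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet /tmp/parquet_test/

# Build a table over the single file, letting Impala infer the schema from it
impala-shell -q "CREATE EXTERNAL TABLE test_tbl LIKE PARQUET '/tmp/parquet_test/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet' STORED AS PARQUET LOCATION '/tmp/parquet_test/'"

# Run the test query against the isolated file
impala-shell -q "SELECT COUNT(*) FROM test_tbl"
```

The error only appears once the underlying file is cached, so the same query has to be repeated with and without an HDFS cache directive on the directory.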
03-27-2017
02:09 AM
Hi Tim

Thanks a lot for your answer. I tried INVALIDATE METADATA <table> as well as REFRESH <table> to force refreshing the metadata of this table. Unfortunately, the problem remains exactly the same.

We want to address a specific performance problem with caching. We have one quite large table in our HDFS (in total much larger than our memory) and we want to cache some of its partitions in memory to speed up the development.
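For completeness, these are the kinds of statements I mean (mytable is a placeholder name):

```shell
# Discard and reload all cached metadata for the table
impala-shell -q "INVALIDATE METADATA mytable"

# Lighter-weight: re-scan file and block metadata only
impala-shell -q "REFRESH mytable"
```

Neither changes the outcome: the query on the cached partition still fails in about half of the attempts.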
03-21-2017
07:14 AM
Hi all

When we activate HDFS caching for a partitioned table in HDFS in our cluster (CDH 5.9.0), for some files we randomly get errors for the cached files. Here is an example case:

We have a partitioned table with many partitions (4 partition columns, with 1-20 partitions on each level). For simplicity I boiled the test case down to three of these partitions, let's take for example the months 10, 11, and 12. For the partitions 10 and 11 everything works fine. For partition 12, in 50% of all cases the simple query:

SELECT COUNT(*) FROM table WHERE month=12

fails with the error message:

File hdfs://path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet has invalid file metadata at file offset 40165660. Error = couldn't deserialize thrift msg:
TProtocolException: Invalid data

In the other 50% of the cases I get the correct result. Once I deactivate caching, the file is always read correctly.

Does anybody have an idea on the cause of this issue and on solving it? If more information is needed, just let me know.

Kind Regards
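The way we enable caching for a partition looks roughly like this (pool and table names are placeholders for our real ones):

```shell
# Create a cache pool on HDFS (done once, as an HDFS superuser)
hdfs cacheadmin -addPool impala_pool

# Pin the suspect partition into the pool from Impala
impala-shell -q "ALTER TABLE mytable PARTITION (month=12) SET CACHED IN 'impala_pool'"

# Verify the directive is fully cached (BYTES_CACHED should reach BYTES_NEEDED)
hdfs cacheadmin -listDirectives -stats

# Uncaching again makes the query reliable:
impala-shell -q "ALTER TABLE mytable PARTITION (month=12) SET UNCACHED"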
Labels:
- Apache Impala
- HDFS
03-20-2017
06:33 AM
Thanks a lot. With the given workaround at the end of the Zeppelin issue, it works for me now.
03-17-2017
08:35 AM
Hi all

We have Spark 2.0 (*) installed from the Cloudera parcel on our cluster (CDH 5.9.0). When running a quite simple app which just reads in some csv files and does a groupBy, I always receive errors. The app is submitted with:

spark2-submit --class my_class myapp-1.0-SNAPSHOT.jar

And I receive the following error message:

java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 1

I figured out that there are multiple versions of lang3 installed with the Cloudera release and modified the spark2-submit to:

spark2-submit --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --jars /var/opt/teradata/cloudera/parcels/CDH/jars/commons-lang3-3.3.2.jar --class my_class myapp-1.0-SNAPSHOT.jar

This way I could get rid of the first error message, but now I get:

java.lang.ClassCastException: cannot assign instance of org.apache.commons.lang3.time.FastDateFormat to field org.apache.spark.sql.execution.datasources.csv.CSVOptions.dateFormat of type org.apache.commons.lang3.time.FastDateFormat in instance of org.apache.spark.sql.execution.datasources.csv.CSVOptions

The app was written in Scala and compiled using Maven. The source code (**) and the Maven pom file (***) are attached at the bottom of this post. Does anybody have an idea on solving this issue? Any help is highly appreciated! Thanks a lot in advance!

Kind Regards

(*) $ spark2-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0.cloudera1
/_/
Branch HEAD
Compiled by user jenkins on 2016-12-06T18:34:13Z
Revision 2389f44e0185f33969d782ed09b41ae45fe30324

(**)

import org.apache.spark.sql.SparkSession
object my_class {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("myapp")
      .getOrCreate()

    val csv = spark.read.option("header", value = false).csv("/path/to/folder/with/some/csv/files/")
    val pivot = csv.groupBy("_c0").count()

    csv.take(10).foreach(println)
    pivot.take(10).foreach(println)

    spark.stop()
  }
}

(***)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>de.lht.datalab.ingestion</groupId>
    <artifactId>myapp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version.base>2.11</scala.version.base>
        <scala.version>${scala.version.base}.8</scala.version>
        <spark.version>2.0.0.cloudera1</spark.version>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version.base}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
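In case it helps with debugging, this is roughly how one can list the commons-lang3 versions shipped on the cluster and compare the serialVersionUID each one declares for FastDateFormat (the parcel path is from our installation and serialver is the standard JDK tool; both may differ on your setup):

```shell
# List every commons-lang3 jar shipped in the parcels
find /var/opt/teradata/cloudera/parcels -name 'commons-lang3-*.jar'

# Print the serialVersionUID each jar declares for FastDateFormat
for jar in $(find /var/opt/teradata/cloudera/parcels -name 'commons-lang3-*.jar'); do
  echo "$jar"
  serialver -classpath "$jar" org.apache.commons.lang3.time.FastDateFormat
done
```

In our case this shows more than one version on the classpath, which matches the "stream classdesc serialVersionUID = 2, local class serialVersionUID = 1" mismatch above.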
Labels:
- Apache Spark