Spark JDBC SQLite long integers truncated

I'm trying to ingest data from SQLite3 files into Parquet files on HDFS, using Spark 1.6.3 and the Xerial SQLite JDBC driver, version 3.20.0. I'm running into an issue where long integers (8-byte values) are being truncated to 4 bytes.

Here's the input data as shown by the sqlite3 shell:

sqlite> .schema mytable
CREATE TABLE `mytable` (  `rowid` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT , `timeval` INTEGER NULL DEFAULT NULL, ... );

sqlite> select rowid, timeval from mytable order by rowid limit 10;
1|1501166842023455000
2|1501166842024397000
3|1501166842024756000
4|1501166842025088000
5|1501166842025418000
6|1501166842026171000
7|1501166842026531000
8|1501177221697678000
9|1501354232016297000
10|1501414777569030000

To load the data into Spark:

scala> val df = sqlContext.read.format("jdbc").option("url", "jdbc:sqlite:/path/to/local/file.sqlite").option("dbtable", "mytable").option("driver", "org.sqlite.JDBC").load()

SQLite does not have a separate long integer datatype (its single INTEGER type stores values up to 8 bytes wide), so I think Spark is defining the DataFrame's schema with plain 32-bit integer types for these columns:

scala> df.printSchema
root
 |-- rowid: integer (nullable = false)
 |-- timeval: integer (nullable = true)
 ...
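
To confirm what the driver actually reports (my assumption is that Spark derives the DataFrame schema from the JDBC metadata), the column types can be inspected directly from the same shell. This is just a sketch against my local file; the path and column names are the ones from my setup:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:sqlite:/path/to/local/file.sqlite")
val rs = conn.createStatement().executeQuery("SELECT rowid, timeval FROM mytable LIMIT 1")
val md = rs.getMetaData
// getColumnTypeName / getColumnType show what the Xerial driver declares for each column
for (i <- 1 to md.getColumnCount) {
  println(s"${md.getColumnName(i)}: ${md.getColumnTypeName(i)} (java.sql.Types code ${md.getColumnType(i)})")
}
rs.close()
conn.close()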

When I use Spark to display the values or write them to Parquet files on HDFS, the long integers are truncated to 4 bytes:

scala> df.select("rowid", "timeval").orderBy("rowid").show(10)
+-----+-----------+
|rowid|    timeval|
+-----+-----------+
|    1|  258418968|
|    2|  259360968|
|    3|  259719968|
|    4|  260051968|
|    5|  260381968|
|    6|  261134968|
|    7|  261494968|
|    8|-1003312464|
|    9|  828136488|
|   10|  226897776|
+-----+-----------+
only showing top 10 rows

df.write.format("parquet").mode("error").save("/data/mytable")
val df2 = sqlContext.read.parquet("/data/mytable")

scala> df2.select("rowid", "timeval").orderBy("rowid").show(10)
+-----+-----------+
|rowid|    timeval|
+-----+-----------+
|    1|  258418968|
|    2|  259360968|
|    3|  259719968|
|    4|  260051968|
|    5|  260381968|
|    6|  261134968|
|    7|  261494968|
|    8|-1003312464|
|    9|  828136488|
|   10|  226897776|
+-----+-----------+
only showing top 10 rows

As seen above, some of the truncated integers end up negative, depending on what lands in the sign bit after truncation.
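
In fact, the displayed values are exactly the low 32 bits of the original longs; for example, checking rows 1 and 8 in the Scala REPL:

scala> 1501166842023455000L.toInt   // 258418968
scala> 1501177221697678000L.toInt   // -1003312464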

But when I query the data through Spark for those negative values, it appears to handle the long integers correctly:

scala> df.select("timeval").where(df("timeval") < 0).count()
res9: Long = 0
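
I haven't verified this, but I suspect the comparison above is pushed down to SQLite and evaluated there on the full 64-bit values, which would explain the zero count; the physical plan should show whether the filter reaches the JDBC source:

scala> df.select("timeval").where(df("timeval") < 0).explain()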

So it looks like Spark can interpret the full values in at least some cases. Is there a way to coerce Spark to properly interpret and preserve the values as long integers?
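
One idea I've considered but not yet verified is pushing a cast into the query that Spark sends to SQLite, by passing a subquery as the dbtable option. I don't know whether the Xerial driver would then report the column as BIGINT (and Spark map it to a long), so treat this as a sketch:

scala> val dfLong = sqlContext.read.format("jdbc").option("url", "jdbc:sqlite:/path/to/local/file.sqlite").option("dbtable", "(SELECT rowid, CAST(timeval AS BIGINT) AS timeval FROM mytable) AS t").option("driver", "org.sqlite.JDBC").load()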

By the way, this issue happens in both the Scala shell and PySpark.

Thank you for your thoughts.
