Spark binaryFiles multi-page tiff

I am trying to convert multi-page TIFF image files to text using Tess4J, the Java wrapper for the Tesseract library, in Spark so that many image files are processed in parallel. The TIFF files are in HDFS, and I read them in Spark using binaryFiles.

When I run the doOCR method from plain Java, or use the C++ Tesseract command line, it returns the text of every page in a multi-page TIFF file. But the same program run in Spark returns the text of only one page (the last).

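import java.io.ByteArrayInputStream
import javax.imageio.ImageIO
import net.sourceforge.tess4j.Tesseract
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("OCR")
val spark = SparkSession.builder().config(conf).getOrCreate()
val path = "datafiles/tif/im.tif"
// binaryFiles yields (file path, PortableDataStream) pairs
val files = spark.sparkContext.binaryFiles(path)
val data = files.mapPartitions((f) => {
  // one Tesseract instance per partition
  val tess = new Tesseract
  // ImageIO.read decodes each file's bytes into a single BufferedImage
  f.map(x => (x._1, tess.doOCR(ImageIO.read(new ByteArrayInputStream(x._2.toArray())))))
})
val r = data.first()
println(r._2)
data.saveAsTextFile("res")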

Update:

The issue is with the Array[Byte]. Both C++ Tesseract and the Java Tess4J API recognize page boundaries in a TIFF file when the input is loaded through Java file methods. In Spark, however, the files are loaded with binaryFiles, and what reaches OCR is the Array[Byte] returned by PortableDataStream.toArray.

When doOCR is called on the data from this Array[Byte], it processes only the bytes up to the end of the first page and does not proceed further. I tried using offsets to start processing from the byte after the first page, and from several offsets after that.

Also, the multi-page TIFF was generated by copying a single page multiple times, so every page has the same content. By matching against the bytes of page one, I can see in the byte array where the second page starts.
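
One thing that may be worth checking, as an alternative to searching for page offsets by hand: ImageIO.read returns a single BufferedImage, so at most one page of the TIFF can reach doOCR that way. Below is a minimal sketch of decoding every page from the byte array with an ImageReader instead. The names TiffPages and readAllPages are hypothetical, and the sketch assumes a TIFF-capable ImageIO reader is available (built into Java 9 and later; on Java 8 a plugin would have to be on the classpath). I have not confirmed that this resolves the behaviour described above.

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO
import javax.imageio.stream.MemoryCacheImageInputStream

// Hypothetical helper, not part of Spark or Tess4J.
object TiffPages {
  /** Decode every page of an image file held in memory. */
  def readAllPages(bytes: Array[Byte]): Seq[BufferedImage] = {
    val in = new MemoryCacheImageInputStream(new ByteArrayInputStream(bytes))
    try {
      // Pick the first reader that claims it can decode this data.
      val readers = ImageIO.getImageReaders(in)
      require(readers.hasNext, "no ImageIO reader found for this data")
      val reader = readers.next()
      try {
        reader.setInput(in)
        // allowSearch = true lets the reader scan the stream to count pages.
        val pageCount = reader.getNumImages(true)
        // Range.map is strict, so all pages are decoded before the stream closes.
        (0 until pageCount).map(i => reader.read(i))
      } finally reader.dispose()
    } finally in.close()
  }
}
```

Inside the mapPartitions above, the per-file call could then become something like TiffPages.readAllPages(x._2.toArray()).map(img => tess.doOCR(img)).mkString("\n"), which runs OCR once per page and joins the results.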
