Sandbox HDF NiFi: How do I convert TSV files to CSV?

Currently I have a dataflow whose GetFile processor reads from a directory of TSV files. I want to convert these TSV files to CSV so that I can later process them with the ConvertCSVToAvro processor. I wrote this Python script, with a bash wrapper, to test the conversion:

import sys
import csv

# Read tab-separated rows from stdin and write them to stdout as comma-separated rows.
tsvin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tsvin:
    commaout.writerow(row)

Bash wrapper:

for file in *.tsv
do
    # Quote the names so that files containing spaces are handled correctly.
    python tsv2csv.py < "$file" > "${file%.*}.csv"
done

I see the ExecuteScript processor as a possible option. How would I use it to run this Python script? For example, would the processor know where to import modules from? Or is there a better way to do the conversion?

1 ACCEPTED SOLUTION

Cloudera Employee

Hi, I would suggest using the record reader/writer processors. You can read the files with a CSVReader (which lets you specify tab as the delimiter) and then use ConvertRecord to write the records out through a record writer, or to convert them to another schema. You do have to define a schema for the records, though, in Avro format.
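Since the schema has to be written in Avro format, here is a minimal Python sketch of one way to draft it from the header row of a TSV file. The function name `avro_schema_from_tsv_header`, the record name `TsvRecord`, and the sample columns are hypothetical, and the sketch assumes every column can be treated as a nullable string; adjust the types to match your data.

```python
import csv
import io
import json


def avro_schema_from_tsv_header(tsv_text, record_name="TsvRecord"):
    """Build an Avro record schema (as a dict) from the first line of TSV text.

    Assumption: every column is a nullable string. Avro field names must
    start with a letter or underscore and contain only letters, digits,
    and underscores, so headers with other characters need renaming first.
    """
    header = next(csv.reader(io.StringIO(tsv_text), dialect=csv.excel_tab))
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": column, "type": ["null", "string"], "default": None}
            for column in header
        ],
    }


# Hypothetical sample data; replace with the contents of a real TSV file.
sample = "id\tname\tcity\n1\tAda\tLondon\n"
schema = avro_schema_from_tsv_header(sample)
print(json.dumps(schema, indent=2))
```

The printed JSON is what you would supply as the schema text for the reader and writer, assuming they are configured to take the schema from a text property rather than from a schema registry.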
