Comparing Data - which way to go - your opinion?


I need your help. At the moment I have 3 folders: ip, url and domain. Each folder contains several hundred CSV files in the following format (the real timestamps are numeric; I changed them just for the example):


CSV 1 - badip

ip, timestamp, vendor,today,badip,today,badip,yesterday,badip


CSV 2 - cyberteam

ip, timestamp, vendor,yesterday,cyberteam,yesterday,cyberteam


What I need is to compare all the CSV files within one folder. For example, I need to read CSV 1, take the first row and compare it against every other file. If a row matches, I need to write that row and all matching rows to a separate file and compare their timestamps. The earliest date gets a 1, the second a 2, and so on. At the end I need to count, per vendor, how many rows got a 1.
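To make the logic concrete, here is a minimal sketch in plain Python. It rests on several assumptions: every file has a header row with the columns ip, timestamp and vendor, the timestamps are numeric, and two rows "match" when their first column is equal. The function name `rank_first_seen` is made up for illustration, and ties go to whichever vendor was read first.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path


def rank_first_seen(folder):
    """Count, per vendor, how often it was the first to report a value."""
    # earliest[key][vendor] = smallest timestamp that vendor reported for key
    earliest = defaultdict(dict)
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            # skipinitialspace handles headers written as "ip, timestamp, vendor"
            for row in csv.DictReader(handle, skipinitialspace=True):
                key, vendor = row["ip"], row["vendor"]
                ts = float(row["timestamp"])  # assumption: numeric timestamps
                seen = earliest[key]
                if vendor not in seen or ts < seen[vendor]:
                    seen[vendor] = ts

    firsts = Counter()
    for key, by_vendor in earliest.items():
        if len(by_vendor) < 2:
            continue  # only one vendor has this value, so nothing to compare
        # Rank vendors by their earliest timestamp: 1 = first, 2 = second, ...
        ranked = sorted(by_vendor.items(), key=lambda item: item[1])
        for rank, (vendor, _ts) in enumerate(ranked, start=1):
            if rank == 1:
                firsts[vendor] += 1
    return firsts
```

Calling `rank_first_seen("ip")` would return a counter such as vendor name to number of "1" rows. Because everything is grouped by key in one pass, no file has to be compared against every other file row by row.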


A friendly person in this forum gave me the tip to use Spark or Flink over NiFi, and I wanted to check whether that is the best solution for this process (and which of the two is the better match). Is there another tool you could recommend? Ideally it would be a tool that is easy to set up.


Should I load everything into SQL and do the comparison there?
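As a rough sketch of what the SQL route might look like, the following uses SQLite through Python's built-in `sqlite3` module. The table name `sightings`, its column names and the function `count_first_reports` are hypothetical, and the rows are assumed to be already parsed into (ip, timestamp, vendor) tuples. In this version, vendors that share the earliest timestamp each get counted.

```python
import sqlite3


def count_first_reports(rows):
    """rows: iterable of (ip, timestamp, vendor) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sightings (ip TEXT, ts REAL, vendor TEXT)")
    con.executemany("INSERT INTO sightings VALUES (?, ?, ?)", rows)
    # Inner query: earliest timestamp per ip, only for ips seen by 2+ vendors.
    # Outer query: count, per vendor, the ips it reported at that earliest time.
    query = """
        SELECT s.vendor, COUNT(DISTINCT s.ip) AS first_reports
        FROM sightings AS s
        JOIN (SELECT ip, MIN(ts) AS first_ts
              FROM sightings
              GROUP BY ip
              HAVING COUNT(DISTINCT vendor) > 1) AS f
          ON s.ip = f.ip AND s.ts = f.first_ts
        GROUP BY s.vendor
        ORDER BY first_reports DESC, s.vendor
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

The same query shape should carry over to other SQL databases, so a few hundred CSV files per folder may not need a cluster framework at all.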


I am really afraid of picking the wrong tool and regretting it afterwards, so I am grateful for every tip you have.