I have a Hadoop location with one file containing roughly 40K XML records. What kind of changes to the file or storage do I need to make it run faster? Any pointers?
The total input size is 28 GB, and even with 10 cores and 10 executors assigned it takes a long time, around 4+ hours, to extract the high-level attributes of the XML.
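For reference, "extracting high-level attributes" of XML records can be sketched with just the Python standard library. The `<record>` element and its `id`/`type` attributes below are hypothetical placeholders, not the actual schema from the 28 GB input; the equivalent in spark-xml would select the attribute columns of whatever element is used as the row tag.

```python
# Minimal sketch of extracting top-level attributes from XML records
# using only the Python standard library. The <record> schema here is
# hypothetical -- substitute the real element and attribute names.
import xml.etree.ElementTree as ET

records = """
<records>
  <record id="1" type="a"><payload>large nested content</payload></record>
  <record id="2" type="b"><payload>more nested content</payload></record>
</records>
"""

def extract_attrs(xml_text, row_tag="record"):
    """Return the attribute dict of each element matching row_tag."""
    root = ET.fromstring(xml_text)
    return [elem.attrib for elem in root.iter(row_tag)]

rows = extract_attrs(records)
print(rows)  # [{'id': '1', 'type': 'a'}, {'id': '2', 'type': 'b'}]
```

If only the attributes are needed, a streaming approach (e.g. `ET.iterparse` with elements cleared after use) avoids materializing the full tree, which matters at this input size.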
Hi Manasee, how much slower is it compared to other text-based data sources in Spark?
Spark XML needs some improvements itself, for example avoiding type dispatching and optimizing the parsing logic. Other than that, it's quite similar to the JSON and CSV data sources in Spark. So I am not sure that external factors can notably improve the performance; I think we should mainly improve the parsing logic in Spark XML itself.