I'm wondering what the performance difference would be between using Impala to query data stored in S3 and using it to query a Kudu table. I presume Kudu stores its data on HDFS, which I think is more expensive than storing data on S3. Given that, what would be the best option?
Kudu actually doesn't store its data on HDFS. It manages its own storage on the local disks of its tablet servers and is built to be completely independent of HDFS. That said, the "best" option depends largely on your use case.
As far as I know, S3 doesn't have many optimizations for updating existing data or for indexing brand-new data as it arrives. That is exactly what Kudu is designed for. If you don't need that for your use case, then S3 is likely the winner.
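To make the trade-off concrete, here is a rough sketch of how the two kinds of table might be declared and modified from Impala. The SQL is held in Python strings so it can be passed to whatever Impala client you use. The table names, columns, bucket path, and the `statements_for` helper are all made up for illustration, and the partition count is arbitrary.

```python
# Hypothetical names throughout: events_kudu, events_s3, example-bucket.

# Kudu-backed table: needs a primary key, which is what makes row-level
# UPDATE/UPSERT/DELETE possible from Impala.
KUDU_DDL = """
CREATE TABLE events_kudu (
  id BIGINT,
  status STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
"""

# S3-backed table: Impala reads immutable files (Parquet here) under a prefix.
S3_DDL = """
CREATE EXTERNAL TABLE events_s3 (
  id BIGINT,
  status STRING
)
STORED AS PARQUET
LOCATION 's3a://example-bucket/events/'
"""

# Changing one row in place is a single statement against the Kudu table.
KUDU_UPDATE = "UPDATE events_kudu SET status = 'done' WHERE id = 42"

# For the file-based S3 table, the usual pattern is to append new rows
# (or rewrite whole partitions) rather than update rows in place.
S3_APPEND = "INSERT INTO events_s3 VALUES (42, 'done')"


def statements_for(needs_row_updates):
    """Return the DDL and a sample write statement for the chosen backend."""
    if needs_row_updates:
        return [KUDU_DDL.strip(), KUDU_UPDATE]
    return [S3_DDL.strip(), S3_APPEND]


if __name__ == "__main__":
    for stmt in statements_for(needs_row_updates=True):
        print(stmt, end="\n\n")
```

If your workload is append-only and scan-heavy, the second pair of statements is all you would need, which fits the point above that S3 is likely the better choice when you don't need updates.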