I have a large fixed-width file with 4000 columns stored in an AWS S3 bucket. I was planning on using hdfs dfs -cp to copy the file down to HDFS. Once it is in HDFS, how should I access data with that many columns? Should I use Impala, Hive, or HBase to work with it?
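Concretely, I was thinking of something along these lines (the bucket, key, and HDFS path are placeholders, and the cluster would need S3A credentials configured):

# Distributed copy from S3 to HDFS; better suited than a plain -cp
# for a large file, since it runs the copy as a cluster job.
hadoop distcp s3a://my-bucket/exports/bigfile.txt /data/raw/

# Single-process alternative using the command mentioned above:
hdfs dfs -cp s3a://my-bucket/exports/bigfile.txt /data/raw/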
The question is very generic: "should I use Impala, Hive, or HBase?"
What kind of activities are you planning with your data?
OLTP or OLAP?
If OLAP (analytical queries that scan many rows), then try Hive or Impala.
a) Hive uses MapReduce for processing by default, so its performance will be somewhat slower than Impala's.
b) Impala will be faster, but it uses more memory; if your cluster runs other high-priority workloads, it is not recommended to run Impala alongside them.
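Both engines can query the same table, so the choice mainly affects the client and execution engine. A sketch, with a hypothetical table name:

# Query through Hive (runs as MapReduce by default):
beeline -u jdbc:hive2://localhost:10000 -e "SELECT count(*) FROM fixed_width_data;"

# The same query through Impala (in-memory execution, faster but heavier on RAM):
impala-shell -q "SELECT count(*) FROM fixed_width_data;"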
If OLTP (random, low-latency reads and writes of individual records by key), then try HBase.
a) Are you comfortable with a column-oriented, non-SQL data model (rowkey plus column families instead of SQL tables)? If your answer is yes, then HBase will be suitable for your task.
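To give a feel for that access model, here is a minimal sketch via the HBase shell (table, column family, and rowkey names are invented):

# HBase is addressed by rowkey and column-family:qualifier, not SQL.
hbase shell <<'EOF'
create 'records', 'd'
put 'records', 'row-00001', 'd:col_0001', 'some value'
get 'records', 'row-00001'
EOF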
Note: I hope you are aware that after copying the data to HDFS, you still have to define a table over it (or load it into one) before you can query it.
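As a sketch of that last step (the table names, path, and column offsets below are made up): one common trick for fixed-width data is to expose each raw line as a single string column and carve fields out with substr(). With 4000 columns you would generate the DDL from the record layout with a script rather than write it by hand.

# Define an external table over the copied file, then a view
# that parses the fixed-width fields.
hive -e "
CREATE EXTERNAL TABLE raw_fixed_width (line STRING)
LOCATION '/data/raw/';

-- Offsets are examples; a real 4000-column layout would be
-- generated from the file's record specification.
CREATE VIEW fixed_width_data AS
SELECT
  substr(line, 1, 10)  AS col_0001,
  substr(line, 11, 10) AS col_0002,
  substr(line, 21, 10) AS col_0003
FROM raw_fixed_width;
"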