Explorer
Posts: 8
Registered: ‎01-11-2017

Fixed width file format with 4000 columns

Hi

 

I have a large fixed-width file with 4000 columns stored in an AWS S3 bucket. I was planning to use hdfs dfs -cp to copy the file into HDFS. Once it is in HDFS, how should I access data with such a large number of columns? Should I use Impala, Hive, or HBase to integrate this data?

 

Regards,

Peter

Posts: 642
Topics: 3
Kudos: 118
Solutions: 67
Registered: ‎08-16-2016

Re: Fixed width file format with 4000 columns

It depends more on how you plan to access the data. Do you or your users already know SQL? Do you know how to fetch data from HBase?

Is the data sparse (many empty fields)?

Are you going to be accessing single records? A group of records? Aggregating specific columns?

The number of columns, 4k, is wide but not enough to break any of these systems.
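
With 4000 columns, nobody will want to type the table definition by hand, so it is usually generated from the file layout. Below is a minimal sketch, assuming the data has already been converted to tab-delimited text and that every column is loaded as a string. The table name, column names, and HDFS location are hypothetical placeholders, and the Hive-style DDL may need adjusting for your setup.

```python
def build_ddl(table, columns, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement over tab-delimited text."""
    # One STRING column per field; real types can be cast later in queries.
    cols = ",\n".join(f"  `{name}` STRING" for name in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )

# Hypothetical column names: col_0001 ... col_4000
columns = [f"col_{i:04d}" for i in range(1, 4001)]
ddl = build_ddl("wide_table", columns, "/data/wide_table")

# Show only the start of the statement; the full text has 4000 column lines.
print(ddl[:200])
```

In practice the column names and types would come from the layout document that describes the fixed-width file rather than from a counter.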
Posts: 642
Topics: 3
Kudos: 118
Solutions: 67
Registered: ‎08-16-2016

Re: Fixed width file format with 4000 columns

I'll add that you will have to ingest the data into HBase, or convert it to the HFile format and then bulk load it. You cannot simply access data sitting in HDFS from HBase. That may help eliminate it.
Posts: 519
Topics: 14
Kudos: 91
Solutions: 45
Registered: ‎09-02-2016

Re: Fixed width file format with 4000 columns

The question "Should I use Impala, Hive, or HBase?" is very generic.

 

What kind of activities are you planning with your data?
OLAP or OLTP?

If OLAP (scans and aggregations over many records), then try Hive or Impala:
a) Hive uses MapReduce for processing by default, so it will be somewhat slower than Impala.
b) Impala will be faster, but it uses more memory. If your environment runs other high-priority tasks, it is not recommended to run Impala alongside them.

If OLTP (fast reads and writes of individual records), then try HBase:
a) Are you comfortable with a column-family data model instead of SQL? HBase is not queried with SQL natively. If your answer is yes, then HBase will be suitable for your task.

 

Note: I hope you are aware that after copying the data to HDFS, you still have to load it into the database before you can use it.
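
For a fixed-width file, the load step usually starts with cutting each record into fields. Here is a minimal sketch in Python, assuming you have the layout as a list of (column name, width) pairs. The three-column layout and the sample record are made up for illustration; a real layout would list all 4000 columns.

```python
from itertools import accumulate

# Hypothetical layout: (column name, width in characters).
LAYOUT = [("id", 5), ("name", 10), ("amount", 8)]

def make_slices(layout):
    """Turn (name, width) pairs into one slice object per column."""
    widths = [width for _, width in layout]
    ends = list(accumulate(widths))   # running end offsets: 5, 15, 23
    starts = [0] + ends[:-1]          # matching start offsets: 0, 5, 15
    return [slice(start, end) for start, end in zip(starts, ends)]

def to_delimited(line, slices, delimiter="\t"):
    """Cut one fixed-width record into fields, strip padding, and join them."""
    record = line.rstrip("\n")
    return delimiter.join(record[s].strip() for s in slices)

slices = make_slices(LAYOUT)

# Build a sample record with explicit padding so the widths are exact.
sample = "00042" + "Peter".ljust(10) + "12.50".rjust(8) + "\n"
print(to_delimited(sample, slices))
```

The resulting tab-delimited lines can then be written back to HDFS and loaded into a table. This sketch does not handle fields that themselves contain the delimiter, so pick a delimiter that cannot occur in the data.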

 

 
