
Fixed width file format with 4000 columns

Explorer

Hi

 

   I have a large fixed-width file with 4000 columns stored in an AWS S3 bucket.  I was planning to use hdfs dfs -cp to copy the file into HDFS.  Once it is in HDFS, how should I access data with this many columns?  Should I use Impala, Hive, or HBase to integrate this data?
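A minimal sketch of how a single fixed-width record can be split into fields (the column widths here are hypothetical placeholders; the real schema for this file would list 4000 widths):

```python
# A minimal sketch of slicing one fixed-width record into fields.
# The widths below are hypothetical -- the real schema for this file
# would list 4000 column widths.
def parse_fixed_width(line, widths):
    """Split a fixed-width record using cumulative column offsets."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

widths = [10, 5, 8]  # hypothetical widths: e.g. a name, an id, a date
record = "John Doe  1234520200101"
print(parse_fixed_width(record, widths))  # ['John Doe', '12345', '20200101']
```

Whatever engine is chosen, some step like this (done by a SerDe, a preprocessing job, or application code) has to map byte offsets to columns.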

 

Regards,

Peter

3 REPLIES

Re: Fixed width file format with 4000 columns

Champion
It depends more on how you plan to access the data. Do you or your users already have SQL knowledge? Do you know how to fetch data from HBase?

Is the data sparse (many empty fields)?

Are you going to be accessing single records, groups of records, or aggregating specific columns?

The number of columns, 4000, is wide but not enough to break any of these systems.

Re: Fixed width file format with 4000 columns

Champion
I'll add that, for HBase, you will have to either ingest the data into HBase or write it out in the HFile format and bulk-load it. You cannot simply access data sitting in HDFS from HBase. That may help eliminate it as an option.

Re: Fixed width file format with 4000 columns

Champion

The question is very generic: "should I use Impala, Hive, or HBase?"

 

What kind of activities are you planning with your data?
OLTP or OLAP?

 

If OLAP, then try Hive or Impala:
a) Hive uses MapReduce for processing by default, so performance will be a little slower compared to Impala.
b) Impala will be faster, but it uses more memory. If your environment runs other high-priority tasks, it is not recommended to use Impala while those tasks are running.

 

If OLTP (random reads and writes of individual records), then try HBase:
a) Are you comfortable with column-family based data modeling? If your answer is yes, then HBase will be suitable for your task.

 

Note: I hope you are aware that after copying the data to HDFS, you still have to load it into a table before use.
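To make that note concrete, one common preprocessing step is to convert the fixed-width records into a delimited file that a Hive or Impala table can then load or point at. A sketch in Python (the widths and sample record are hypothetical, not the actual 4000-column schema):

```python
import csv
import io

# Hypothetical column widths -- the real schema would list 4000 entries.
WIDTHS = [10, 5, 8]

def fixed_width_to_csv(lines, widths):
    """Convert fixed-width records into CSV text that Hive/Impala can load."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for line in lines:
        pos, row = 0, []
        for w in widths:
            row.append(line[pos:pos + w].strip())
            pos += w
        writer.writerow(row)
    return buf.getvalue()

print(fixed_width_to_csv(["John Doe  1234520200101"], WIDTHS))
```

The resulting delimited file can then be placed in the table's HDFS location or loaded with a standard LOAD DATA statement.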

 

 
