I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

New Contributor

I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

4 REPLIES

Re: I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

Super Guru

What is a mera file?

Re: I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

New Contributor

@sunile.manjee Apologies for the typo; it should be meta files.

Re: I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

Expert Contributor
@Nilesh Shrimant

You should experiment with several methods of Data -> HDFS -> Hive.

In the simplest case, if your data is already clean and simple, you can upload it to HDFS and create an external Hive table on top of it, giving the CREATE EXTERNAL TABLE statement the configuration it needs to parse your data (for example the row format, field delimiter, and location).
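
The upload-and-external-table route can be sketched in Python as below. This is only a minimal sketch under stated assumptions: the `.meta` suffix used to tell meta files from data files, the function names, the column lists, and the HDFS paths are all hypothetical and not taken from the question. It splits the archive into two local directories (a Hive table location is a directory, and the meta and data tables are separate) and then builds a CREATE EXTERNAL TABLE statement for each.

```python
import zipfile
from pathlib import Path


def split_archive(zip_path, out_dir, meta_suffix=".meta"):
    """Extract a zip into out_dir/meta and out_dir/data by file suffix.

    meta_suffix is a hypothetical naming convention; change it to however
    the meta files in your archive are actually distinguished.
    """
    out = Path(out_dir)
    counts = {"meta": 0, "data": 0}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            kind = "meta" if info.filename.endswith(meta_suffix) else "data"
            target_dir = out / kind
            target_dir.mkdir(parents=True, exist_ok=True)
            # Flatten nested paths so every file sits directly in the
            # directory that will become the table location.
            target = target_dir / Path(info.filename).name
            if target.exists():
                raise FileExistsError(f"name collision after flattening: {target}")
            target.write_bytes(archive.read(info))
            counts[kind] += 1
    return counts


def external_table_ddl(table, columns, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files.

    columns is a list of (name, hive_type) pairs; location is the HDFS
    directory the files were uploaded to.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )
```

After `split_archive("input.zip", "staging")`, each directory could be uploaded with `hdfs dfs -put` and the statement from `external_table_ddl("data_raw", [("id", "INT"), ("value", "STRING")], "/landing/data")` run through beeline. Delimited text is assumed here; other formats need a different ROW FORMAT or SerDe clause.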

If your data needs processing and preparation, I recommend NiFi. I use NiFi to do this (more than 50 million records) in several different ways. You will need to inspect the NiFi Hive processors and decide which one best fits your use case.

Re: I have a zip file with 10k Mera files and 10k data files. What is the best way to ingest this in hive? Meta and data tables are separate

Super Guru

Here are a few options:

  1. Store the data in its native format and create Hive external tables on it.
    1. Not the best performance when querying the data, but it may do the job for you.
  2. Store the data in ORC or Parquet format on HDFS and create an external table on it.
    1. Much better performance, but Hive's built-in optimizations are not available.
  3. Store the data in ORC or Parquet format and ingest it into Hive as an internal table.
    1. Best performance.
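
One way to get from option 1 to option 3 is a CREATE TABLE ... AS SELECT that copies an existing external table into an internal table stored as ORC or Parquet. The helper below is a rough sketch that only builds that statement. The function name and the table names are hypothetical, and whether a straight `SELECT *` copy is appropriate depends on your schema.

```python
def columnar_copy_statement(source_table, target_table, file_format="ORC"):
    """Build a CTAS statement that copies source_table into a new
    internal (managed) table stored in a columnar format."""
    fmt = file_format.upper()
    if fmt not in {"ORC", "PARQUET"}:
        raise ValueError(f"unsupported format: {file_format}")
    return (
        f"CREATE TABLE {target_table} "
        f"STORED AS {fmt} "
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical table names, for illustration only.
statement = columnar_copy_statement("data_raw", "data_orc")
```

The resulting string can be run through beeline or any Hive client. Because the statement has no EXTERNAL keyword, Hive creates the target as an internal table.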

Tools you can use for ingest:

  1. NiFi
    1. Super easy.
    2. You can convert data from one format to another in the pipeline.
  2. Sqoop
    1. Use it if the source is an RDBMS.
  3. Spark
    1. Super easy, but no UI.
    2. You can convert data from one format to another in the pipeline.
  4. Storm
    1. Super fast ingest, but not as easy.
    2. You can convert data from one format to another in the pipeline.