Support Questions

Find answers, ask questions, and share your expertise
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

PutHiveQL is creating a file for each line

Super Collaborator


I am forming my SQL as INSERT INTO dly_job_summary values('${now()}','${jobname}','${filename}','${status}','${comments}') from different flows and trying log in to an external table dly_job_summary which is located on HDFS. i am able to execute the SQLs using PutHiveQL , but if i go back to HDFS its creating a file for each SQL. i want all these go in to single file.

how to achieve that?



Super Guru

That is the way INSERT statements work in Hive over the JDBC driver, each statement is auto-committed so they go in as separate files. There are a couple of alternatives, but both involve merging your lines into a single larger file:

1) Use PutHiveStreaming. This processor currently accepts Avro files and inserts into ORC files on HDFS for use by Hive. However if your Avro file only has one record, this is actually a worse solution. An improvement is to have an Avro file full of many records, perhaps by converting each of a CSV line to Avro and then using MergeContent with an Avro merge strategy, or use MergeContent on CSV lines, then convert CSV to Avro.

2) Create a large CSV/Avro/ORC file by merging the lines as described above, then use PutHDFS to put the large file(s) onto HDFS. You can then use ReplaceText (to set the content to a Hive DDL line like CREATE EXTERNAL TABLE IF NOT EXISTS) -> PutHiveQL to create an external table atop your HDFS directory containing the files. Alternative to the CREATE TABLE you could send a LOAD DATA INFILE line to PutHiveQL, which would load from your HDFS temp directory into a managed Hive table, where you could set other table properties (such as being backed by Avro/ORC/Parquet).

3) If you need to append to an HDFS file from multiple incoming flow(s) or connection(s), you likely need PutHDFS with a Conflict Resolution Strategy of "append". You can then reconcile with a Hive table using the technique(s) in Option 2.

Super Collaborator

i am currently following the option 3 that you mentioned above , but was checking to see if there is any better solution.

Thank you..

Super Guru

MergeContent or MergeRecord should alleviate the need to append to an HDFS file, so I recommend option 2.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.