Created on 08-17-2016 09:16 AM - edited 09-16-2022 03:35 AM
We have a use case that want to use the binary data type in Hive table:
1. In HDFS directory (e.g /data/work/hive/test/), we have several blob files which we want to store in Hive table (table1) as binary data type.
2. We have another Hive table (table2) storing regular CSV data and row number is the same as the number of above blob files.
3. How we can combine these two tables as a new table (table3 with both tables' columns and rows)?
Created 08-26-2016 08:52 AM
Mr. Chen,
There is nothing that comes out-of-the-box for Hive that will achive this goal. However, you should be able to create a custom UDF to load the data given the file path. If you do create one, this may be an interesting component for hive-contrib.
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html
https://github.com/apache/hive/tree/master/contrib
CREATE TABLE table2( rowid INT, firstname STRING, lastname STRING, hdfs_path STRING); CREATE TABLE table3( rowid INT, mydata BINARY, firstname STRING, lastname STRING); INSERT INTO table3 SELECT rowid, LOAD_DATA(hdfs_path) AS mydata, firstname, lastname FROM table2;
Created 08-26-2016 07:58 AM
Mr. Chen,
Do the Blobs already have row numbers?
Is there a 1:1 relationship with the number of Blobs to the number of CSV entries or does every Blob belong to every CSV?
CREATE TABLE table1( rowid INT, mydata BINARY); CREATE TABLE table2( rowid INT, firstname STRING, lastname STRING); CREATE TABLE table3( rowid INT, mydata BINARY, firstname STRING, lastname STRING); INSERT INTO TABLE table1 VALUES (100, 'DataDataData'); INSERT INTO TABLE table2 VALUES (100, "John", "Doe"); INSERT INTO table3 SELECT a.rowid, a.mydata, b.firstname, b.lastname FROM table1 a INNER JOIN table2 b ON a.rowid=b.rowid; SELECT * FROM table3;
Created 08-26-2016 08:28 AM
Hi David,
Thanks a lot for the response.
The Blob table has no row numbers since we load the Blob files into Hive table from a HDFS directory. So I cannot use JOIN to combine the two tables.
And yes, there is a 1:1 relationship with the number of Blobs to the number of CSV entries.
Your solution is to use JOIN to combine two tables with same rowid but there is a problem that how can we have the rowid for the Blob tables (the row Blob data files are in one HDFS directory).
Created 08-26-2016 08:31 AM
How do you know which blob goes to which CSV entry?
Created 08-26-2016 08:36 AM
In each entry of the CSV file, the last column is the path (HDFS directory path with blob file name) of the blob file. So one blob file is related to each row entry in the CSV file. For the first concern, I just cannot load the blob files into the same Hive table based on the HDFS path.
Created 08-26-2016 08:52 AM
Mr. Chen,
There is nothing that comes out-of-the-box for Hive that will achive this goal. However, you should be able to create a custom UDF to load the data given the file path. If you do create one, this may be an interesting component for hive-contrib.
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html
https://github.com/apache/hive/tree/master/contrib
CREATE TABLE table2( rowid INT, firstname STRING, lastname STRING, hdfs_path STRING); CREATE TABLE table3( rowid INT, mydata BINARY, firstname STRING, lastname STRING); INSERT INTO table3 SELECT rowid, LOAD_DATA(hdfs_path) AS mydata, firstname, lastname FROM table2;
Created 08-26-2016 08:57 AM
David,
Thanks a lot for your information. Really appreciate it.
I will see if our team want to invest the time/resources to create a UDF in Hive. Normally, how long it takes to create a UDF like my case?
Created 08-26-2016 09:09 AM
There are many example out there about how to create a UDF and many examples about using an HDFS client library to download data. Pairing the two should be pretty straightforward. The one thing to lookout for is that this idea is predicated on the blobs not being particularly large (this will probably not work for files in the GB's of size).
Created 08-26-2016 09:10 AM
Yes, the blob file is not large and thanks again for the help.
Created 08-29-2017 08:42 AM
I know this is an old thread (so forgive me for resurrecting it). But you could use a join that joins using the INPUT__FILE__NAME (meta date column) of the blob with some regexp cast on it (if neccesary) to make the name munchable.
So let's consider a table where you load the CSV data:
CSV would read something like this:
col1, col2, col3, col4, blob_loc
some, random, data, in here. blob_loc
some2, random2, data2, in here2, blob_loc
BLOB table would be
<BLOB_content>, regexp_replace("", "", INPUT__FILE__NAME) as blob_loc
then just join on blob_loc
(typed this up in one minute, so forgive the formatting issues).