Support Questions

Mr. Chen · ‎08-17-2016

We have a use case that want to use the binary data type in Hive table:
1. In HDFS directory (e.g /data/work/hive/test/), we have several blob files which we want to store in Hive table (table1) as binary data type.
2. We have another Hive table (table2) storing regular CSV data and row number is the same as the number of above blob files.
3. How we can combine these two tables as a new table (table3 with both tables' columns and rows)?

David M. · ‎08-26-2016

Mr. Chen,

There is nothing that comes out-of-the-box for Hive that will achive this goal. However, you should be able to create a custom UDF to load the data given the file path. If you do create one, this may be an interesting component for hive-contrib.

http://stackoverflow.com/questions/27402442/read-an-hdfs-file-from-a-hive-udf-execution-error-return...

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CreatingCustom...

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html

https://github.com/apache/hive/tree/master/contrib

CREATE TABLE table2(
  rowid INT,
  firstname STRING,
  lastname STRING,
  hdfs_path STRING);

CREATE TABLE table3(
  rowid INT,
  mydata BINARY,
  firstname STRING,
  lastname STRING);

INSERT INTO table3 SELECT rowid, LOAD_DATA(hdfs_path) AS mydata, firstname, lastname FROM table2;

View solution in original post

David M. · ‎08-26-2016

Mr. Chen,

Do the Blobs already have row numbers?

Is there a 1:1 relationship with the number of Blobs to the number of CSV entries or does every Blob belong to every CSV?

CREATE TABLE table1(
  rowid INT,
  mydata BINARY);

CREATE TABLE table2(
  rowid INT,
  firstname STRING,
  lastname STRING);

CREATE TABLE table3(
  rowid INT,
  mydata BINARY,
  firstname STRING,
  lastname STRING);

INSERT INTO TABLE table1 VALUES (100, 'DataDataData');
INSERT INTO TABLE table2 VALUES (100, "John", "Doe"); 


INSERT INTO table3 SELECT a.rowid, a.mydata, b.firstname, b.lastname FROM table1 a INNER JOIN table2 b ON a.rowid=b.rowid;

SELECT * FROM table3;

Mr. Chen · ‎08-26-2016

Hi David,

Thanks a lot for the response.

The Blob table has no row numbers since we load the Blob files into Hive table from a HDFS directory. So I cannot use JOIN to combine the two tables.

And yes, there is a 1:1 relationship with the number of Blobs to the number of CSV entries.

Your solution is to use JOIN to combine two tables with same rowid but there is a problem that how can we have the rowid for the Blob tables (the row Blob data files are in one HDFS directory).

David M. · ‎08-26-2016

How do you know which blob goes to which CSV entry?

Mr. Chen · ‎08-26-2016

In each entry of the CSV file, the last column is the path (HDFS directory path with blob file name) of the blob file. So one blob file is related to each row entry in the CSV file. For the first concern, I just cannot load the blob files into the same Hive table based on the HDFS path.

David M. · ‎08-26-2016

Mr. Chen,

There is nothing that comes out-of-the-box for Hive that will achive this goal. However, you should be able to create a custom UDF to load the data given the file path. If you do create one, this may be an interesting component for hive-contrib.

http://stackoverflow.com/questions/27402442/read-an-hdfs-file-from-a-hive-udf-execution-error-return...

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CreatingCustom...

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html

https://github.com/apache/hive/tree/master/contrib

CREATE TABLE table2(
  rowid INT,
  firstname STRING,
  lastname STRING,
  hdfs_path STRING);

CREATE TABLE table3(
  rowid INT,
  mydata BINARY,
  firstname STRING,
  lastname STRING);

INSERT INTO table3 SELECT rowid, LOAD_DATA(hdfs_path) AS mydata, firstname, lastname FROM table2;

Mr. Chen · ‎08-26-2016

David,

Thanks a lot for your information. Really appreciate it.

I will see if our team want to invest the time/resources to create a UDF in Hive. Normally, how long it takes to create a UDF like my case?

David M. · ‎08-26-2016

There are many example out there about how to create a UDF and many examples about using an HDFS client library to download data. Pairing the two should be pretty straightforward. The one thing to lookout for is that this idea is predicated on the blobs not being particularly large (this will probably not work for files in the GB's of size).

Mr. Chen · ‎08-26-2016

Yes, the blob file is not large and thanks again for the help.

ChillTake · ‎08-29-2017

I know this is an old thread (so forgive me for resurrecting it). But you could use a join that joins using the INPUT__FILE__NAME (meta date column) of the blob with some regexp cast on it (if neccesary) to make the name munchable.

So let's consider a table where you load the CSV data:

CSV would read something like this:

col1, col2, col3, col4, blob_loc

some, random, data, in here. blob_loc

some2, random2, data2, in here2, blob_loc

BLOB table would be

<BLOB_content>, regexp_replace("", "", INPUT__FILE__NAME) as blob_loc

then just join on blob_loc

(typed this up in one minute, so forgive the formatting issues).

Cloudera Community

Support Questions

How to use Binary Data Type in Hive