
How to use Binary Data Type in Hive

Explorer

We have a use case that requires the binary data type in a Hive table:
1. In an HDFS directory (e.g. /data/work/hive/test/), we have several blob files which we want to store in a Hive table (table1) as the binary data type.
2. We have another Hive table (table2) storing regular CSV data, and its row count is the same as the number of blob files above.
3. How can we combine these two tables into a new table (table3, with both tables' columns and rows)?

1 ACCEPTED SOLUTION

Expert Contributor

Mr. Chen,

 

There is nothing that comes out-of-the-box in Hive that will achieve this goal.  However, you should be able to create a custom UDF that loads the data given the file path. If you do create one, it may be an interesting component for hive-contrib.


http://stackoverflow.com/questions/27402442/read-an-hdfs-file-from-a-hive-udf-execution-error-return...

 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CreatingCustom...

 

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html

 

https://github.com/apache/hive/tree/master/contrib

 

CREATE TABLE table2(
  rowid INT,
  firstname STRING,
  lastname STRING,
  hdfs_path STRING);

CREATE TABLE table3(
  rowid INT,
  mydata BINARY,
  firstname STRING,
  lastname STRING);

-- LOAD_DATA is the hypothetical custom UDF described above
INSERT INTO table3 SELECT rowid, LOAD_DATA(hdfs_path) AS mydata, firstname, lastname FROM table2;
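
Once such a UDF is packaged into a jar, registering it in Hive would look roughly like this (the jar path and class name below are hypothetical):

ADD JAR /path/to/blob-udf.jar;
CREATE TEMPORARY FUNCTION load_data AS 'com.example.hive.udf.LoadData';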


10 REPLIES

Expert Contributor

Mr. Chen,

 

Do the Blobs already have row numbers?

 

Is there a 1:1 relationship between the number of blobs and the number of CSV entries, or does every blob belong to every CSV?

 

 

CREATE TABLE table1(
  rowid INT,
  mydata BINARY);

CREATE TABLE table2(
  rowid INT,
  firstname STRING,
  lastname STRING);

CREATE TABLE table3(
  rowid INT,
  mydata BINARY,
  firstname STRING,
  lastname STRING);

INSERT INTO TABLE table1 VALUES (100, 'DataDataData');
INSERT INTO TABLE table2 VALUES (100, 'John', 'Doe');


INSERT INTO table3 SELECT a.rowid, a.mydata, b.firstname, b.lastname FROM table1 a INNER JOIN table2 b ON a.rowid=b.rowid;

SELECT * FROM table3;

Explorer

Hi David,

 

Thanks a lot for the response.

 

The blob table has no row numbers, since we load the blob files into the Hive table from an HDFS directory, so I cannot use a JOIN to combine the two tables.

 

And yes, there is a 1:1 relationship between the number of blobs and the number of CSV entries.

 

Your solution uses a JOIN to combine two tables on the same rowid, but the problem is how to get a rowid for the blob table in the first place (the raw blob data files all sit in one HDFS directory).
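
Even if we could read the blobs one per row, something like the following (using Hive's row_number window function and the INPUT__FILE__NAME virtual column; table1 here is the hypothetical blob table) would only give a usable rowid if the ordering happened to match the CSV, which we cannot guarantee:

SELECT row_number() OVER (ORDER BY INPUT__FILE__NAME) AS rowid, mydata
FROM table1;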

Expert Contributor

How do you know which blob goes to which CSV entry?

Explorer

In each entry of the CSV file, the last column is the path (HDFS directory path plus blob file name) of the blob file, so each row entry in the CSV file is related to exactly one blob file. The remaining concern is that I cannot load the blob files into the same Hive table based on the HDFS path.


Explorer

David,

 

Thanks a lot for your information. Really appreciate it.

 

I will see if our team wants to invest the time and resources to create a UDF in Hive. Normally, how long does it take to create a UDF like the one in my case?

Expert Contributor

There are many examples out there about how to create a UDF, and many examples of using an HDFS client library to download data.  Pairing the two should be pretty straightforward.  The one thing to look out for is that this idea is predicated on the blobs not being particularly large (it will probably not work for files in the GBs in size).

Explorer

Yes, the blob files are not large. Thanks again for the help.

 

 

Explorer

I know this is an old thread (so forgive me for resurrecting it), but you could use a join on the blob table's INPUT__FILE__NAME (metadata column), with some regexp applied to it (if necessary) to make the name usable.

 

So let's consider a table where you load the CSV data. The CSV would read something like this:

col1, col2, col3, col4, blob_loc

some, random, data, in here, blob_loc

some2, random2, data2, in here2, blob_loc

 

BLOB table would be

<BLOB_content>, regexp_replace(INPUT__FILE__NAME, "", "") AS blob_loc

 

Then just join on blob_loc.

 

(typed this up in one minute, so forgive the formatting issues).
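
To sketch the idea in HiveQL (table and column names here are hypothetical, and the regexp pattern would depend on how the paths are stored in the CSV — this one strips the hdfs://host:port prefix so the paths match):

SELECT c.col1, c.col2, c.col3, c.col4, b.blob_content
FROM csv_table c
JOIN (
  SELECT blob_content,
         regexp_replace(INPUT__FILE__NAME, 'hdfs://[^/]+', '') AS blob_loc
  FROM blob_table
) b
ON c.blob_loc = b.blob_loc;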