su hdfs
hadoop fs -mkdir /udf
hadoop fs -put urldetector-1.0-jar-with-dependencies.jar /udf/
hadoop fs -put libs/url-detector-0.1.15.jar /udf/
hadoop fs -chown -R hdfs /udf
hadoop fs -chgrp -R hdfs /udf
hadoop fs -chmod -R 775 /udf

Create the HDFS directory and upload the two required JARs.

CREATE FUNCTION urldetector as 'com.dataflowdeveloper.detection.URLDetector' USING JAR 'hdfs:///udf/urldetector-1.0-jar-with-dependencies.jar', JAR 'hdfs:///udf/url-detector-0.1.15.jar';

Create the Hive function from those HDFS-hosted JARs.

select http_user_agent, urldetector(remote_host) as urls, remote_host from AccessLogs limit 100;

Test the UDF via Hive QL

@Description(name="urldetector", value="_FUNC_(string) - detects urls")

public final class URLDetector extends UDF{}

Java Header for the UDF
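
The header above shows only the class declaration. As a rough illustration of the shape such a UDF body might take, here is a minimal, dependency-free sketch. It is not the author's implementation (see the repository in References for that). The class name UrlDetectorSketch, the regular expression, and the comma-joined return format are all assumptions made for illustration, and a plain regex stands in for the LinkedIn URL-Detector library so the example compiles without Hive or the library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the UDF body. In a deployed UDF the class would
// extend org.apache.hadoop.hive.ql.exec.UDF and carry the @Description annotation.
final class UrlDetectorSketch {

    // Assumption: a simple regex replaces the LinkedIn URL-Detector library here.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\"']+");

    // Hive locates a UDF's public evaluate method by reflection.
    public String evaluate(String input) {
        if (input == null) {
            return null; // SQL NULL in, NULL out
        }
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(input);
        while (m.find()) {
            urls.add(m.group());
        }
        // Assumption: detected URLs are returned as one comma-separated string.
        return String.join(",", urls);
    }

    public static void main(String[] args) {
        UrlDetectorSketch udf = new UrlDetectorSketch();
        System.out.println(udf.evaluate("GET http://example.com/a and https://hortonworks.com"));
    }
}
```

Returning null for a null input keeps the function safe on rows where the column is empty. Swapping the regex for the LinkedIn detector would change only the body of evaluate.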

set hive.cli.print.header=true;
add jar urldetector-1.0-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION urldetector as 'com.dataflowdeveloper.detection.URLDetector';
select urldetector(description) from sample_07 limit 100;

You can test with a temporary function through Hive CLI before making the function permanent.

mvn compile assembly:single

Build the Jar File for Deployment

The LinkedIn URL-Detector library (https://github.com/linkedin/URL-Detector) must be compiled first. The resulting JAR is then used as a dependency of your UDF code and deployed to Hive alongside the UDF JAR.

References

See: https://github.com/tspannhw/URLDetector for full source code.
