
I was originally going to just make a REST call to the web service used in my NiFi flow.

My example is on GitHub with full scripts and source code.

So I created a semi-useful quick prototype Hive UDF in Java called ProfanityRemover that converts many non-business-friendly terms into asterisks (*). The term list is kept small for performance (about 2,000 entries, including some variations for spacing), but it blocks the common ones. It does produce more false positives than you would like. To do this right, you could use a commercial API or write a machine learning model.
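To illustrate where the false positives come from, here is a minimal sketch of term masking in plain Java. It is not the project's Util.filterOutProfanity (that code is in the GitHub repository); the class name TermMasker, its constructor flag, and the harmless placeholder terms are all assumptions made for this example. It compares naive substring matching with whole-word matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the project's Util class.
final class TermMasker {
    private final Pattern pattern;

    // wholeWordsOnly = true wraps the term list in \b word boundaries.
    TermMasker(List<String> terms, boolean wholeWordsOnly) {
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each term so regex metacharacters are treated literally.
            alternation.append(Pattern.quote(term));
        }
        String body = "(?:" + alternation + ")";
        String regex = wholeWordsOnly ? "\\b" + body + "\\b" : body;
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Replaces every match with the same number of asterisks.
    String mask(String input) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = stars(m.group().length());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String stars(int n) {
        char[] c = new char[n];
        Arrays.fill(c, '*');
        return new String(c);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("darn", "heck"); // placeholder terms
        String text = "checkout the darn thing";
        // Substring matching hits "heck" inside "checkout": a false positive.
        System.out.println(new TermMasker(terms, false).mask(text)); // c****out the **** thing
        // Whole-word matching leaves "checkout" alone.
        System.out.println(new TermMasker(terms, true).mask(text));  // checkout the **** thing
    }
}
```

Whole-word matching reduces false positives but misses terms that are deliberately run together with other words, so a real filter has to trade one kind of error against the other.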

Warning! src/main/resources and src/test/resources in the GitHub repository contain a list of offensive words.

Building a Hive UDF

To Build an Eclipse Project

	mvn eclipse:eclipse

To Build

	./build.sh

To Build for Command-Line Usage (outside of Hive)

	./buildfirst.sh

(or)

	mvn clean compile assembly:single

generates

target/deprofaner-1.0-jar-with-dependencies.jar

Copy deprofaner*jar to the directory you will run from, or to /usr/hdp/current/hive-client/lib/

	mkdir -p /opt/demo/udf

Copy src/main/resources/terms.txt to /opt/demo/udf/terms.txt
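The UDF expects the term list at that path. As a rough sketch of how such a file could be read, assuming one term per line in UTF-8 (the actual file format and loading code are in the GitHub repository and may differ), a loader might look like the following. TermListLoader is a hypothetical name used only for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical loader for illustration; assumes one term per line.
final class TermListLoader {

    // Reads the file, trims and lower-cases each line, and skips blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> terms = new LinkedHashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String term = line.trim().toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path used in this article; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "/opt/demo/udf/terms.txt");
        System.out.println("Loaded " + load(file).size() + " terms from " + file);
    }
}
```

Loading the list once into a set and reusing it, rather than re-reading the file for every row, keeps the per-row cost of the function low.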

In Hive

	hive> set hive.cli.print.header=true;
	hive> add jar deprofaner-1.0-jar-with-dependencies.jar;
	Added [deprofaner-1.0-jar-with-dependencies.jar] to class path
	Added resources: [deprofaner-1.0-jar-with-dependencies.jar]
	hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';
	OK
	hive> select cleaner('clean this <curseword> up now') from sample_07 limit 1;
	OK
	_c0
	clean this **** up now
	Time taken: 6.279 seconds, Fetched: 1 row(s)

Check logs in /var/log/hive/hiveserver2.log

I set hive.cli.print.header=true so that column headers appear in the query output.

To make this a Permanent UDF

Run scripts/install.sh, which creates an HDFS directory with open permissions and puts our built JAR up there.

	set hive.cli.print.header=true;
	CREATE FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover' USING JAR 'hdfs:///udf/deprofaner-1.0-jar-with-dependencies.jar';

This is a working example of a Hive UDF.

The primary code is pretty short:

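import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(name = "profanityremover", value = "_FUNC_(string) - sanitizes text by replacing profanities ")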
public final class ProfanityRemover extends UDF {
	/**
	 * UDF Evaluation
	 * 
	 * @param s
	 *            Text passed in
	 * @return Text cleaned
	 */
	public Text evaluate(final Text s) {
		if (s == null) {
			return null;
		}
		String cleaned = Util.filterOutProfanity(s.toString());		
		return new Text(cleaned);
	}
}

There's not much to writing a simple UDF: you extend the UDF class and implement one method, evaluate. There are other classes to extend for more functionality (GenericUDF, for example), but for a basic function this works really well.

Then you build a Jar. See the build.sh and pom.xml for Maven build details.

Deploy the Jar.

hive> add jar deprofaner-1.0-jar-with-dependencies.jar;  

Create the function.

hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';

Use it like any other function.

Pretty cool.

Comments

The Brickhouse Collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top K lists, a distributed cache, bloom counters, JSON functions and HBase tools.

The Facebook UDF Collection (HIVE-1545) includes functions for unescaping, finding a value in an array, and finding the max across a set of columns.

Version history
Revision #:
1 of 1
Last update:
‎06-15-2016 07:20 PM