At first I was going to just make a REST call to the web service I use in NiFi.

My example is on GitHub with full scripts and source code.

So I created a semi-useful quick prototype Hive UDF in Java called ProfanityRemover that converts many non-business-friendly terms into asterisks (*). The list is kept small for performance (about 2,000 entries, including some variations for spacing), but it blocks the common ones. It does have a higher incidence of false positives than you would like. To do this right, you could use a commercial API or apply some machine learning.

Warning! src/main/resources and src/test/resources in the GitHub repository contain a list of offensive words.

Building a Hive UDF

To Build an Eclipse Project

	mvn eclipse:eclipse

To Build

	./build.sh

To Build for Command-Line Usage (outside of Hive)

	./buildfirst.sh

(or)

	mvn clean compile assembly:single

generates

target/deprofaner-1.0-jar-with-dependencies.jar

Copy deprofaner*jar to the directory you will run from, or to /usr/hdp/current/hive-client/lib/

	mkdir -p /opt/demo/udf

Copy src/main/resources/terms.txt to /opt/demo/udf/terms.txt
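
The UDF presumably picks up its word list from that path at runtime. As a rough sketch of what loading such a file could look like, assuming one term per line in UTF-8 (the actual file format and loader in the project are not shown here), something like the following would do. TermsLoaderSketch and its load method are hypothetical names for illustration, not classes from the deprofaner project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the deprofaner project.
final class TermsLoaderSketch {

    // Assumed file format: one term per line, UTF-8 encoded.
    static Set<String> load(Path file) {
        Set<String> terms = new LinkedHashSet<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                // Normalize so that later lookups can be case-insensitive.
                String term = line.trim().toLowerCase(Locale.ROOT);
                if (!term.isEmpty()) {
                    terms.add(term); // the set silently drops duplicates
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read term list: " + file, e);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Default path taken from the copy step above; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "/opt/demo/udf/terms.txt");
        System.out.println("Loaded " + load(file).size() + " terms from " + file);
    }
}
```

Loading the list once and keeping it in memory, rather than re-reading the file per row, is the usual reason for keeping the list small.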

In Hive

	hive> set hive.cli.print.header=true;
	hive> add jar deprofaner-1.0-jar-with-dependencies.jar;
	Added [deprofaner-1.0-jar-with-dependencies.jar] to class path
	Added resources: [deprofaner-1.0-jar-with-dependencies.jar]
	hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';
	OK
	hive> select cleaner('clean this <curseword> up now') from sample_07 limit 1;
	OK
	_c0
	clean this **** up now
	Time taken: 6.279 seconds, Fetched: 1 row(s)

Check logs in /var/log/hive/hiveserver2.log

I set hive.cli.print.header=true so that the column headers (such as _c0 above) are printed with the output.

To make this a Permanent UDF

Run scripts/install.sh, which creates an HDFS directory with open permissions and puts our built JAR up there.

	set hive.cli.print.header=true;
	CREATE FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover' USING JAR 'hdfs:///udf/deprofaner-1.0-jar-with-dependencies.jar';

This is a working example of a Hive UDF.

The primary code is pretty short:

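package com.dataflowdeveloper.deprofaner;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(name = "profanityremover", value = "_FUNC_(string) - sanitizes text by replacing profanities ")
public final class ProfanityRemover extends UDF {
	/**
	 * UDF Evaluation
	 * 
	 * @param s
	 *            Text passed in
	 * @return Text cleaned
	 */
	public Text evaluate(final Text s) {
		// Hive passes SQL NULL as a Java null; pass it straight through.
		if (s == null) {
			return null;
		}
		// Util is the project's helper class that does the actual masking.
		String cleaned = Util.filterOutProfanity(s.toString());
		return new Text(cleaned);
	}
}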

There's not much to writing a simple UDF: you extend the UDF class and implement one method, evaluate. There are other base classes to extend (GenericUDF, for example) when you need more functionality, but for writing a basic function this works really well.
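
The Util.filterOutProfanity helper that evaluate calls is not listed in this article. The following is only a minimal sketch of how such a helper might work, not the project's actual implementation: it builds one case-insensitive regular expression from the term list and replaces each whole-word match with one asterisk per character. The class name MaskSketch, the one-asterisk-per-character rule, and the mild placeholder terms are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the Util class from the project.
final class MaskSketch {
    private final Pattern pattern; // null when the term list is empty

    MaskSketch(Collection<String> terms) {
        List<String> cleaned = new ArrayList<>();
        for (String term : terms) {
            String t = term.trim();
            if (!t.isEmpty()) {
                cleaned.add(t);
            }
        }
        // Longest terms first, so multi-word variations win over their prefixes.
        cleaned.sort((a, b) -> Integer.compare(b.length(), a.length()));
        List<String> quoted = new ArrayList<>();
        for (String t : cleaned) {
            quoted.add(Pattern.quote(t)); // match list entries literally
        }
        // \b restricts matches to whole words, one way to limit false positives.
        this.pattern = quoted.isEmpty()
                ? null
                : Pattern.compile("\\b(?:" + String.join("|", quoted) + ")\\b",
                        Pattern.CASE_INSENSITIVE);
    }

    String mask(String text) {
        if (text == null || pattern == null) {
            return text;
        }
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: one asterisk per matched character.
            m.appendReplacement(out, stars(m.group().length()));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String stars(int n) {
        char[] c = new char[n];
        Arrays.fill(c, '*');
        return new String(c);
    }

    public static void main(String[] args) {
        MaskSketch filter = new MaskSketch(Arrays.asList("darn", "heck"));
        // prints: clean this **** up now
        System.out.println(filter.mask("clean this darn up now"));
    }
}
```

Matching on word boundaries leaves a word like "darning" untouched, whereas a plain substring search would mask part of it; that difference is one common source of the false positives mentioned earlier.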

Then you build a Jar. See the build.sh and pom.xml for Maven build details.

Deploy the Jar.

hive> add jar deprofaner-1.0-jar-with-dependencies.jar;  

Create the function.

hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';

Use it like any other function.

Pretty cool.

Comments

The Brickhouse Collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top K lists, a distributed cache, bloom counters, JSON functions and HBase tools.

The Facebook UDF Collection (HIVE-1545) includes functions for unescaping, finding an element in an array, and finding the max across a set of columns.