Created on 06-15-2016 07:20 PM
I was going to just do a REST call to the web service used in my NiFi.
My example is on GitHub with full scripts and source code.
So I created a semi-useful quick prototype Hive UDF in Java called ProfanityRemover that converts many non-business-friendly terms into asterisks (*). The term list is kept small for performance (about 2,000 entries, including some spacing variations), but it blocks the common ones. It does have a higher incidence of false positives than you would like. To do this right, you could use a commercial API or write some machine learning.
Warning! src/main/resources and src/test/resources in github contain a list of offensive words.
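To illustrate the idea of masking listed terms and why false positives show up, here is a minimal sketch. It is not the repository's Util.filterOutProfanity; the class name TermMaskerSketch, the placeholder terms, and the choice to emit one asterisk per matched character are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch, NOT the project's actual filter implementation.
final class TermMaskerSketch {

    private final Pattern pattern;

    // Builds one case-insensitive alternation from the term list.
    // wholeWordsOnly = true wraps it in \b...\b so terms embedded in
    // longer words are left alone (fewer false positives).
    TermMaskerSketch(List<String> terms, boolean wholeWordsOnly) {
        String alternation = terms.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        String regex = wholeWordsOnly
                ? "\\b(?:" + alternation + ")\\b"
                : "(?:" + alternation + ")";
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Replaces every match with asterisks of the same length (an assumption).
    String mask(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char[] stars = new char[m.group().length()];
            Arrays.fill(stars, '*');
            m.appendReplacement(out, new String(stars));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("darn", "heck"); // placeholder terms
        String text = "Darn, check the checklist";
        // Substring matching also masks the "heck" inside "check".
        System.out.println(new TermMaskerSketch(terms, false).mask(text));
        // Whole-word matching leaves "check" and "checklist" untouched.
        System.out.println(new TermMaskerSketch(terms, true).mask(text));
    }
}
```

Plain substring matching is the simplest approach but is one likely source of false positives; word-boundary matching trades some recall for fewer of them.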
To Build an Eclipse Project
mvn eclipse:eclipse
To Build
./build.sh
To Build for Command-Line Usage (outside of Hive)
./buildfirst.sh
(or)
mvn clean compile assembly:single
generates
target/deprofaner-1.0-jar-with-dependencies.jar
Copy deprofaner*jar to the directory you will run Hive from, or to /usr/hdp/current/hive-client/lib/
mkdir -p /opt/demo/udf
Copy src/main/resources/terms.txt to /opt/demo/udf/terms.txt
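The copied terms.txt is presumably read when the function runs. As a rough sketch of loading such a list, assuming one term per line (the actual file format and loading code live in the repository and may differ), with a hypothetical class name TermListLoaderSketch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; the one-term-per-line format is an assumption.
final class TermListLoaderSketch {

    // Trims and lower-cases each line, skips blank lines, and drops
    // duplicates while keeping the first-seen order.
    static Set<String> parse(BufferedReader reader) {
        return reader.lines()
                .map(line -> line.trim().toLowerCase(Locale.ROOT))
                .filter(term -> !term.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Reads the file as UTF-8 and rethrows I/O failures unchecked,
    // so a caller can load the list once, for example in a static initializer.
    static Set<String> loadFile(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read term list: " + file, e);
        }
    }
}
```

A call such as TermListLoaderSketch.loadFile(java.nio.file.Paths.get("/opt/demo/udf/terms.txt")) would then return the terms as a set. Loading once and reusing the set matters because a UDF's evaluate method is called for every row.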
In Hive
hive> set hive.cli.print.header=true;
hive> add jar deprofaner-1.0-jar-with-dependencies.jar;
Added [deprofaner-1.0-jar-with-dependencies.jar] to class path
Added resources: [deprofaner-1.0-jar-with-dependencies.jar]
hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';
OK
select cleaner('clean this <curseword> up now') from sample_07 limit 1;
OK
_c0
clean this **** up now
Time taken: 6.279 seconds, Fetched: 1 row(s)
Check logs in /var/log/hive/hiveserver2.log
I set hive.cli.print.header=true so that the query output includes the column header (the _c0 line above).
To make this a Permanent UDF
Run scripts/install.sh, which creates an HDFS directory with open permissions and puts our built JAR up there.
set hive.cli.print.header=true;
CREATE FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover' USING JAR 'hdfs:///udf/deprofaner-1.0-jar-with-dependencies.jar';
This is a working example of a Hive UDF.
The primary code is pretty short:
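package com.dataflowdeveloper.deprofaner;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(name = "profanityremover", value = "_FUNC_(string) - sanitizes text by replacing profanities ")
public final class ProfanityRemover extends UDF {

    /**
     * UDF Evaluation
     *
     * @param s
     *            Text passed in
     * @return Text cleaned
     */
    public Text evaluate(final Text s) {
        // Hive passes NULL column values as null; return null unchanged.
        if (s == null) {
            return null;
        }
        String cleaned = Util.filterOutProfanity(s.toString());
        return new Text(cleaned);
    }
}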
There's not much to writing a simple UDF (that is, one that extends the UDF class). There are other classes to extend for more functionality, but for writing a basic function this works really well. You just need to implement one method: evaluate.
Then you build a Jar. See the build.sh and pom.xml for Maven build details.
Deploy the Jar.
hive> add jar deprofaner-1.0-jar-with-dependencies.jar;
Create the function.
hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';
Use it like any other function.
Pretty cool.
Created on 06-15-2016 08:33 PM
The Brickhouse Collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top K lists, a distributed cache, bloom counters, JSON functions and HBase tools.
The Facebook UDF Collection (HIVE-1545) includes functions for unescaping, finding an element in an array, and finding the max across a set of columns.