1973 Posts | 1225 Kudos Received | 124 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2453 | 04-03-2024 06:39 AM |
| | 3802 | 01-12-2024 08:19 AM |
| | 2049 | 12-07-2023 01:49 PM |
| | 3032 | 08-02-2023 07:30 AM |
| | 4153 | 03-29-2023 01:22 PM |
06-28-2016 11:02 PM
On my Mac, this works for me:

cat /etc/hosts
127.0.0.1 localhost sandbox.hortonworks.com sandbox
06-27-2016 05:22 PM
Is there an update for 2.4?
06-17-2016 12:42 PM
Lipstick Installation

Resources:
http://www.graphviz.org/Download_linux_rhel.php
https://github.com/Netflix/Lipstick/wiki/Getting-Started

Commands:
sudo yum list available 'graphviz*'
sudo yum -y install 'graphviz*'
./gradlew assemble

I always like to rename gradlew to avengers. Then:
./gradlew run-app

Hit your browser to view http://localhost:9292/ (make sure you open that port in your firewall, etc.).

2016-06-17 02:36:44,558 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion: 2.4.0  UserId: root  StartedAt: 2016-06-17 02:36:40  FinishedAt: 2016-06-17 02:36:44  Features: HASH_JOIN,FILTER,LIMIT
Success!
Job Stats (time in seconds):
JobId: job_local2036219587_0001  Maps: 2  Reduces: 1  Map/Reduce times: n/a  Alias: fruit_names_join,fruits,limited,names  Feature: HASH_JOIN
JobId: job_local406327028_0002  Maps: 1  Reduces: 1  Map/Reduce times: n/a  Alias: fruit_names  Outputs: file:/tmp/temp195796189/tmp-2027262369
Input(s):
Successfully read 3 records from: "file:///opt/demo/certification/pig/Lipstick/quickstart/1.dat"
Successfully read 3 records from: "file:///opt/demo/certification/pig/Lipstick/quickstart/2.dat"
Output(s):
Successfully stored 1 records in: "file:/tmp/temp195796189/tmp-2027262369"
Counters:
Total records written : 1
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local2036219587_0001->job_local406327028_0002,
job_local406327028_0002
2016-06-17 02:36:44,568 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-06-17 02:36:44,571 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-06-17 02:36:44,582 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-06-17 02:36:44,583 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(orange,ORANGE)

It's a very nice-looking visualization.
06-17-2016 03:43 AM
Yes, ascending is usually the default in most languages. https://pig.apache.org/docs/r0.14.0/basic.html#order-by Usage note: ORDER BY is NOT stable; if multiple records have the same ORDER BY key, the order in which those records are returned is not defined and is not guaranteed to be the same from one run to the next.
06-16-2016 11:41 PM
Sometimes it's easy to share files: https://forums.virtualbox.org/viewtopic.php?t=15679

You just pick a directory, set Auto-mount to "Yes" and Access to "Full", and hit OK. For some of us, depending on the versions of VirtualBox, the VM, and the host operating system, things might not work. Sharing can also break when the host operating system or the VM updates. Log into the VM as root and try this (if it's the HDP Sandbox or another VM running CentOS):

cd /opt/VBoxGuestAdditions-*/init
sudo ./vboxadd setup
modprobe -a vboxguest vboxsf vboxvideo
rm -rf /media/sf_Downloads
mkdir /media/sf_Downloads
mount -t vboxsf Downloads /media/sf_Downloads

For me that worked: my Downloads directory was shared, so I could move files on and off my Sandbox for development. There are some other things you can try, and rebooting everything certainly helps.
06-16-2016 02:38 PM
6 Kudos
Most code for current big data projects, and most of the code you are going to write, is JVM-based (mostly Java and Scala). There is certainly a ton of R, Python, shell, and other languages, but for this tutorial we will focus on JVM tools. The great thing about that is that Java and Scala static code analysis tools will work for analyzing your code. JUnit tests are great for testing the basic code and for making sure you isolate functionality from Hadoop- and Spark-specific interfacing.
General Java Tools for Testing
http://junit.org/
http://checkstyle.sourceforge.net/
http://pmd.github.io/pmd-5.4.2/pmd-java/rules/index.html

Testing Hadoop (A Great Overview)
https://github.com/mfjohnson/HadoopTesting
https://www.infoq.com/articles/HadoopMRUnit

Example: I have a Hive UDF written in Java that I can test via JUnit to ensure that the main functionality works (see UtilTest):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class UtilTest {

    /**
     * Test method for
     * {@link com.dataflowdeveloper.deprofaner.ProfanityRemover#fillWithCharacter(
     * int, java.lang.String)}.
     */
    @Test
    public void testFillWithCharacterIntString() {
        assertEquals("XXXXX", Util.fillWithCharacter(5, "X"));
    }
}
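The `Util` class that the test exercises isn't shown in the post (the full source is in the project on GitHub). A minimal, hypothetical sketch of a `fillWithCharacter` consistent with the assertion above might look like this:

```java
// Hypothetical sketch of the Util helper exercised by the JUnit test above;
// the real class lives in the project's source on GitHub.
public class Util {

    // Build a string made of `count` repetitions of `fill`,
    // e.g. fillWithCharacter(5, "X") yields "XXXXX".
    public static String fillWithCharacter(int count, String fill) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(fill);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fillWithCharacter(5, "X"));
    }
}
```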
As you can see, this is just a plain old JUnit test, but it's one step in the process of making sure you can test your code before it is deployed. Jenkins and other CI tools are also great at running JUnit tests as part of their continuous build and integration process. A great way to test your application is with a small Hadoop cluster or a simulated one; testing against a Sandbox downloaded to your laptop works well too.

Testing Integration with a Mini-Cluster
https://github.com/hortonworks/mini-dev-cluster
https://github.com/sakserv/hadoop-mini-clusters

Testing HBase Applications
Artem Ervits has a great article on HBase unit testing.
https://community.hortonworks.com/repos/15674/variety-of-hbase-unit-testing-utilities.html
https://github.com/dbist/HBaseUnitTest

Testing Apache NiFi Processors
http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2.0.1/bk_DeveloperGuide/content/instantiate-testrunner.html
http://www.nifi.rocks/developing-a-custom-apache-nifi-processor-unit-tests-partI/

Testing Apache NiFi Scripts
https://github.com/mattyb149/nifi-script-tester
http://funnifi.blogspot.com/2016/06/testing-executescript-processor-scripts.html

Testing Oozie
https://oozie.apache.org/docs/4.2.0/ENG_MiniOozie.html

Testing Hive Scripts
https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL
http://hakunamapdata.com/beetest-a-simple-utility-for-testing-apache-hive-scripts-locally-for-non-java-developers/
https://github.com/klarna/HiveRunner
https://github.com/edwardcapriolo/hive_test
http://finraos.github.io/HiveQLUnit/

Testing Hive UDFs
http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
https://cwiki.apache.org/confluence/display/Hive/PluginDeveloperKit
Use org.apache.hive.pdk.HivePdkUnitTest and org.apache.hive.pdk.HivePdkUnitTests in your Hive plugin so that it will be included in unit tests.

Testing Pig Scripts
http://pig.apache.org/docs/r0.8.1/pigunit.html
http://www.slideshare.net/Skillspeed/hdfs-and-big-data-tdd-using-pig-unit-webinar
http://www.slideshare.net/SwissHUG/practical-pig-and-pig-unit-michael-noll-july-2012

Testing Apache Spark Applications
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
https://github.com/holdenk/spark-testing-base
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://developer.ibm.com/hadoop/2016/03/07/testing-your-apache-spark-code-with-junit-4-0-and-intellij/
http://www.slideshare.net/knoldus/unit-testing-of-spark-applications

Testing Apache Storm Applications
Debugging an Apache Storm Topology
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
06-15-2016 08:33 PM
The Brickhouse collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top-K lists, a distributed cache, Bloom counters, JSON functions, and HBase tools.

- Facebook UDF collection (HIVE-1545), including functions for unescape, finding a value in an array, and finding the max across a set of columns.
- UDF collection for various string distances, text classification, and other text mining.
- UDF for anonymizing data with Apache Pig.
- Hive UDF for various functions like array count.
- Curve computing UDF.
- Ngram functions UDF.
- Hive UDFs similar to Oracle functions.
- A collection of UDFs for GeocodeIP, Haversine distance, and DecodeURL.
- Hive funnel analysis UDF by Yahoo (tracking user conversion rates across actions).
- Hive UDF collection by LivingSocial for min and max date, MySQL-style LIKE, and more.
- Hive UDF with Yahoo Data Sketches, for stochastic streaming algorithms.
- Hive UDF to count business days.
- User agent string parser Hive UDF.
- Date range generator Hive UDF.
06-15-2016 07:20 PM
1 Kudo
I was going to just do a REST call to the web service used in my NiFi flow. My example is on GitHub with full scripts and source code.

So I created a semi-useful quick prototype Hive UDF in Java called ProfanityRemover that converts many non-business-friendly terms into asterisks (*). It's a small list for performance purposes (around 2,000 terms, with some variations for spacing), but it blocks the common ones. It does have a higher-than-you-would-like incidence of false positives; to do this right you could use a commercial API or write some machine learning.
Warning! src/main/resources and src/test/resources in github contain a list of offensive words.
Building a Hive UDF

To build an Eclipse project:
mvn eclipse:eclipse

To build:
./build.sh

To build for command-line usage (outside of Hive):
./buildfirst.sh
(or) mvn clean compile assembly:single

This generates:
target/deprofaner-1.0-jar-with-dependencies.jar

Copy the deprofaner*.jar to the directory you run from, or to /usr/hdp/current/hive-client/lib/.
mkdir -p /opt/demo/udf
Copy src/main/resources/terms.txt to /opt/demo/udf/terms.txt

In Hive:
hive> set hive.cli.print.header=true;
hive> add jar deprofaner-1.0-jar-with-dependencies.jar;
Added [deprofaner-1.0-jar-with-dependencies.jar] to class path
Added resources: [deprofaner-1.0-jar-with-dependencies.jar]
hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';
OK
select cleaner('clean this <curseword> up now') from sample_07 limit 1;
OK
_c0
clean this **** up now
Time taken: 6.279 seconds, Fetched: 1 row(s)
Check the logs in /var/log/hive/hiveserver2.log. I set the Hive CLI print header for more detail in the output.

To Make This a Permanent UDF

Run scripts/install.sh, which creates an HDFS directory with open permissions and puts our built JAR up there.

set hive.cli.print.header=true;
CREATE FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover' USING JAR 'hdfs:///udf/deprofaner-1.0-jar-with-dependencies.jar';

This is a working example of a Hive UDF. The primary code is pretty short:

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

@Description(name = "profanityremover", value = "_FUNC_(string) - sanitizes text by replacing profanities")
public final class ProfanityRemover extends UDF {
    /**
     * UDF evaluation.
     *
     * @param s text passed in
     * @return cleaned text
     */
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        String cleaned = Util.filterOutProfanity(s.toString());
        return new Text(cleaned);
    }
}
There's not much to writing a simple UDF (that is, one extending the UDF class); there are other classes to extend for more functionality, but for a basic function this works really well. You just need to implement one method, evaluate, and then build a JAR (see build.sh and pom.xml for the Maven build details). Deploy the JAR:

hive> add jar deprofaner-1.0-jar-with-dependencies.jar;

Create the function:

hive> CREATE TEMPORARY FUNCTION cleaner as 'com.dataflowdeveloper.deprofaner.ProfanityRemover';

Use it like any other function. Pretty cool.
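The UDF delegates the actual filtering to Util.filterOutProfanity, which isn't shown in the post. A minimal, hypothetical sketch of that helper, with a single placeholder term hard-coded where the real code loads around 2,000 entries from terms.txt:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the helper the ProfanityRemover UDF delegates to.
// The real term list is loaded from terms.txt; "curseword" stands in here.
public class Util {

    private static final List<String> TERMS = Arrays.asList("curseword");

    // Replace each banned term with asterisks of the same length,
    // case-insensitively, matching whole words only.
    public static String filterOutProfanity(String text) {
        if (text == null) {
            return null;
        }
        String cleaned = text;
        for (String term : TERMS) {
            StringBuilder stars = new StringBuilder();
            for (int i = 0; i < term.length(); i++) {
                stars.append('*');
            }
            cleaned = cleaned.replaceAll("(?i)\\b" + term + "\\b",
                    stars.toString());
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(filterOutProfanity("clean this curseword up now"));
    }
}
```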
06-10-2016 11:36 PM
1 Kudo
I call Twitter with a filter on some terms, grab the key Twitter attributes, then call a filter to remove profanities. http://www.purgomalum.com/service/plain?text= and http://www.purgomalum.com/service/json?text= both work and are free REST API services. For fun, I send the tweet as a search keyword to the Guardian using their API (you need to register for a key). http://content.guardianapis.com/search?order-by=newest&q=${tweet}&api-key=StuffNumbers
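As a rough illustration, calling the PurgoMalum plain-text endpoint from plain Java could look like the sketch below; the class name and error handling are mine, not from the post, and the sanitize call needs network access.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical client for the PurgoMalum plain-text service mentioned above.
public class PurgoMalumClient {

    // Build the request URL with the text to sanitize URL-encoded.
    static String buildUrl(String text) throws Exception {
        return "http://www.purgomalum.com/service/plain?text="
                + URLEncoder.encode(text, "UTF-8");
    }

    // Fetch the sanitized text (requires network access).
    static String sanitize(String text) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(buildUrl(text)).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            return body.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("clean this up now"));
    }
}
```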
06-10-2016 03:17 PM
It looks like you have to wait for Spark 2.0 with Structured Streaming.