06-16-2016 09:07 PM
6 Kudos
@Timothy Spann There is a JIRA for bringing Calcite into a NiFi processor: https://issues.apache.org/jira/browse/NIFI-1280. The original idea was to use it to filter out specific columns in incoming CSV data. However, as we looked at it, we found that it can do a lot more! Initially it will likely be used simply to run SQL over CSV data, with each incoming FlowFile being transformed into an outgoing FlowFile. Eventually, I would like to see additional data formats introduced, so that SQL can be run over any number of different data formats to filter, transform, etc.
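To give a flavor of what "SQL over CSV" looks like with Calcite today, outside of NiFi, here is a minimal sketch using Calcite's example CSV adapter. It assumes calcite-core and calcite-example-csv are on the classpath and that a data directory contains PEOPLE.csv with a typed header such as NAME:string,AGE:int; the file, table, and column names are all invented for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CalciteCsvSketch {
  public static void main(String[] args) throws Exception {
    // Inline Calcite model: expose every CSV file in ./data as a table in schema CSV.
    Properties props = new Properties();
    props.setProperty("model", "inline:"
        + "{\"version\":\"1.0\",\"defaultSchema\":\"CSV\","
        + "\"schemas\":[{\"name\":\"CSV\",\"type\":\"custom\","
        + "\"factory\":\"org.apache.calcite.adapter.csv.CsvSchemaFactory\","
        + "\"operand\":{\"directory\":\"data\"}}]}");
    try (Connection conn = DriverManager.getConnection("jdbc:calcite:", props);
         Statement stmt = conn.createStatement();
         // PEOPLE.csv becomes table CSV.PEOPLE; its typed header defines the columns.
         ResultSet rs = stmt.executeQuery("SELECT NAME FROM CSV.PEOPLE WHERE AGE > 30")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}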
06-16-2016 02:38 PM
6 Kudos
Most code in current big data projects, and most of the code you are going to write, is JVM based (mostly Java and Scala). There is certainly a ton of R, Python, Shell, and other languages, but this tutorial focuses on JVM tools. The great thing about that is that Java and Scala static code analysis tools will work for analyzing your code. JUnit tests are great for testing the basic code and for making sure you isolate your functionality from the Hadoop- and Spark-specific interfacing.
General Java Tools for Testing
http://junit.org/
http://checkstyle.sourceforge.net/
http://pmd.github.io/pmd-5.4.2/pmd-java/rules/index.html
Testing Hadoop (A Great Overview)
https://github.com/mfjohnson/HadoopTesting
https://www.infoq.com/articles/HadoopMRUnit
Example: I have a Hive UDF written in Java that I can test via JUnit to ensure that the main functionality works (see UtilTest):
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Util is the class under test, assumed to live in the same package.
public class UtilTest {

  /**
   * Test method for
   * {@link com.dataflowdeveloper.deprofaner.ProfanityRemover#fillWithCharacter(int, java.lang.String)}.
   */
  @Test
  public void testFillWithCharacterIntString() {
    // Masking with "X" five times should produce a five-character string.
    assertEquals("XXXXX", Util.fillWithCharacter(5, "X"));
  }
}
As you can see, this is just a plain old JUnit test, but it is one step in the process of making sure your code is tested before it is deployed. Jenkins and other CI tools are also great at running JUnit tests as part of their continuous build and integration process. Another great way to test your application is against a small Hadoop cluster, a simulated one, or a Sandbox downloaded to your laptop.
Testing Integration with a Mini-Cluster
https://github.com/hortonworks/mini-dev-cluster
https://github.com/sakserv/hadoop-mini-clusters
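For a flavor of the mini-cluster approach, here is a minimal sketch using the MiniDFSCluster that ships with Hadoop's hadoop-minicluster test artifact (not either project above); the test path is made up.
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class MiniClusterSketchTest {

  @Test
  public void testHdfsRoundTrip() throws Exception {
    // Spin up a single-datanode HDFS cluster inside the test JVM.
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path path = new Path("/tmp/test.txt");  // hypothetical test path
      fs.create(path).close();
      // The file written above should be visible through the mini cluster.
      assertTrue(fs.exists(path));
    } finally {
      cluster.shutdown();
    }
  }
}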
Testing HBase Applications
Artem Ervits has a great article on HBase unit testing.
https://community.hortonworks.com/repos/15674/variety-of-hbase-unit-testing-utilities.html
https://github.com/dbist/HBaseUnitTest
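A minimal sketch using HBaseTestingUtility from the hbase-testing-util artifact, which many of the utilities in Artem's article build on; the table, column family, and values are invented.
import static org.junit.Assert.assertArrayEquals;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class HBaseMiniClusterTest {

  @Test
  public void testPutThenGet() throws Exception {
    // Starts ZooKeeper, HDFS, and a region server inside the test JVM.
    HBaseTestingUtility utility = new HBaseTestingUtility();
    utility.startMiniCluster();
    try {
      Table table = utility.createTable(TableName.valueOf("test"), Bytes.toBytes("cf"));
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"), Bytes.toBytes("v"));
      table.put(put);
      // Read the cell back and verify the value survived the round trip.
      byte[] value = table.get(new Get(Bytes.toBytes("row1")))
          .getValue(Bytes.toBytes("cf"), Bytes.toBytes("c"));
      assertArrayEquals(Bytes.toBytes("v"), value);
    } finally {
      utility.shutdownMiniCluster();
    }
  }
}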
Testing Apache NiFi Processors
http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2.0.1/bk_DeveloperGuide/content/instantiate-testrunner.html
http://www.nifi.rocks/developing-a-custom-apache-nifi-processor-unit-tests-partI/
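The TestRunner pattern from the docs above looks roughly like this; MyProcessor and its REL_SUCCESS relationship are placeholders for your own processor class.
import org.apache.nifi.util.MockFlowFile;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Test;

public class MyProcessorTest {

  @Test
  public void testRoutesToSuccess() {
    // Instantiate the processor inside NiFi's mock framework.
    TestRunner runner = TestRunners.newTestRunner(MyProcessor.class);
    // Queue input content as if an upstream processor had sent it.
    runner.enqueue("hello,world\n".getBytes());
    runner.run();
    // Everything should land on the (placeholder) REL_SUCCESS relationship.
    runner.assertAllFlowFilesTransferred(MyProcessor.REL_SUCCESS, 1);
    MockFlowFile out = runner.getFlowFilesForRelationship(MyProcessor.REL_SUCCESS).get(0);
    out.assertContentEquals("hello,world\n");
  }
}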
Testing Apache NiFi Scripts
https://github.com/mattyb149/nifi-script-tester
http://funnifi.blogspot.com/2016/06/testing-executescript-processor-scripts.html
Testing Oozie
https://oozie.apache.org/docs/4.2.0/ENG_MiniOozie.html
Testing Hive Scripts
https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL
http://hakunamapdata.com/beetest-a-simple-utility-for-testing-apache-hive-scripts-locally-for-non-java-developers/
https://github.com/klarna/HiveRunner
https://github.com/edwardcapriolo/hive_test
http://finraos.github.io/HiveQLUnit/
Testing Hive UDF
http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
https://cwiki.apache.org/confluence/display/Hive/PluginDeveloperKit
Use org.apache.hive.pdk.HivePdkUnitTest and org.apache.hive.pdk.HivePdkUnitTests in your Hive plugin so that it is included in unit tests.
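If the PluginDeveloperKit page is still accurate, these annotations sit on the UDF class itself and embed query/result pairs; a rough, unverified sketch with an invented mask() UDF and a one-row helper table called onerow:
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hive.pdk.HivePdkUnitTest;
import org.apache.hive.pdk.HivePdkUnitTests;

@Description(name = "mask", value = "_FUNC_(str) - replaces every character with 'X'")
@HivePdkUnitTests(
    setup = "", cleanup = "",
    cases = {
      @HivePdkUnitTest(
          query = "SELECT mask('abc') FROM onerow;",
          result = "XXX")
    })
public class Mask extends UDF {
  public String evaluate(String input) {
    if (input == null) {
      return null;  // Hive convention: NULL in, NULL out
    }
    return input.replaceAll(".", "X");
  }
}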
Testing Pig Scripts
http://pig.apache.org/docs/r0.8.1/pigunit.html
http://www.slideshare.net/Skillspeed/hdfs-and-big-data-tdd-using-pig-unit-webinar
http://www.slideshare.net/SwissHUG/practical-pig-and-pig-unit-michael-noll-july-2012
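A minimal PigUnit sketch, following the shape of the example in the PigUnit docs; the top_queries.pig script, its aliases data and queries_limit, and the sample rows are placeholders.
import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

public class TopQueriesTest {

  @Test
  public void testTopQueries() throws Exception {
    // $n inside the script is bound to 2 here.
    String[] args = { "n=2" };
    PigTest test = new PigTest("top_queries.pig", args);

    // Feed the alias 'data' directly, bypassing HDFS entirely.
    String[] input = { "yahoo\t10", "twitter\t7", "facebook\t10" };
    String[] expected = { "(yahoo,10)", "(facebook,10)" };

    // Assert on the final alias instead of STOREing to a file.
    test.assertOutput("data", input, "queries_limit", expected);
  }
}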
Testing Apache Spark Applications
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
https://github.com/holdenk/spark-testing-base
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://developer.ibm.com/hadoop/2016/03/07/testing-your-apache-spark-code-with-junit-4-0-and-intellij/
http://www.slideshare.net/knoldus/unit-testing-of-spark-applications
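The common pattern in all of these links is to run Spark in local mode inside an ordinary JUnit test; a minimal sketch, where the filter logic stands in for your own transformation:
import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class WordFilterTest {

  @Test
  public void testFilterDropsShortWords() {
    // local[2]: two worker threads in-process, no cluster required.
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("unit-test");
    JavaSparkContext sc = new JavaSparkContext(conf);
    try {
      JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "spark", "is", "fast"));
      // The logic under test: keep words longer than two characters.
      long count = words.filter(w -> w.length() > 2).count();
      assertEquals(2, count);
    } finally {
      sc.stop();
    }
  }
}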
Testing Apache Storm Applications
Debugging an Apache Storm Topology
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
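For Storm, the classic approach is an in-process LocalCluster; a minimal sketch using the pre-Apache backtype.storm packages that match the linked example, where MySpout and MyBolt are placeholders for your own components:
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class LocalTopologyRunner {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new MySpout());                       // placeholder spout
    builder.setBolt("bolt", new MyBolt()).shuffleGrouping("spout"); // placeholder bolt

    Config conf = new Config();
    conf.setDebug(true);  // log every emitted tuple for inspection

    // Run the whole topology inside this JVM, no Nimbus required.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("test-topology", conf, builder.createTopology());
    Thread.sleep(10_000);  // let it process for ten seconds
    cluster.killTopology("test-topology");
    cluster.shutdown();
  }
}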
04-20-2017 01:00 PM
I had the same problem. Check https://issues.apache.org/jira/browse/HIVE-16398 and https://issues.apache.org/jira/browse/AMBARI-9821. There is a problem with hive.aux.jars.path.
06-15-2016 08:33 PM
The Brickhouse collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top-K lists, a distributed cache, Bloom counters, JSON functions, and HBase tools.
The Facebook UDF collection (HIVE-1545) includes functions for unescaping, finding an element in an array, and finding the max across a set of columns.
A UDF collection for various string distances, text classification, and other text mining.
A UDF for anonymizing data with Apache Pig.
A Hive UDF collection for various functions like array count.
A curve-computing UDF.
An n-gram functions UDF.
Hive UDFs similar to Oracle functions.
A collection of UDFs for GeocodeIP, Haversine distance, and URL decoding.
A Hive funnel-analysis UDF by Yahoo (tracking user conversion rates across actions).
A Hive UDF collection by LivingSocial for min and max date, MySQL-style LIKE, and more.
A Hive UDF wrapping Yahoo's Data Sketches library of stochastic streaming algorithms.
A Hive UDF to count business days.
A user-agent string parser Hive UDF.
A date-range generator Hive UDF.
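For context, the smallest Hive UDF of the kind these collections contain looks something like this (a sketch; the function name and behavior are invented):
package com.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Registered in Hive with something like:
//   CREATE TEMPORARY FUNCTION reverse_text AS 'com.example.ReverseText';
public class ReverseText extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;  // Hive convention: NULL in, NULL out
    }
    return new Text(new StringBuilder(input.toString()).reverse().toString());
  }
}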
09-10-2016 02:49 AM
Here is a sample Maven project that handles all dependencies and includes instructions on using and adding your own UDFs. https://github.com/cartershanklin/sample-hiveudf
01-25-2019 02:10 PM
1 Kudo
I got the same issue. Running the Hive command "desc formatted TABLE_NAME" showed:
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The error is caused by the SerDe library: the table was created with LazySimpleSerDe instead of the Parquet SerDe. After dropping and recreating the table, the SerDe library is org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe and it works.
10-17-2017 11:42 AM
Was anybody ever able to get the Spark jobs to show up on HDP 2.6?
06-12-2016 10:13 PM
2 Kudos
@Timothy Spann An in-memory data grid is much more than just a cache. Some key capabilities are:
Very granular control over the data being stored
Technology-agnostic serialization that enables access to cached data from several different tools (Java, C#, C++, etc.)
Loading of data on cache miss from any backing store
Write-through/write-behind to any backing store
The ability to offload processing of instruction sets on individual cached entries or in map/reduce-style batches
An eventing framework providing notification of changes to individual entries or of job execution
Tiered caching (on-heap, off-heap, disk)
HBase is an excellent NoSQL columnar data store, but when it comes to dealing with data in memory, all it offers is an LRU caching and eviction scheme with very little control over what data gets and stays cached. In fact, the only control knob is how much memory is allocated for caching per region server. Given that HBase actually stores data with durability, it is often a great choice for OLTP use cases, and in-memory data grids are rarely used without a backing store like HBase. However, for application acceleration, processing, and functionality offload, an in-memory data grid can provide capabilities that HBase alone cannot.
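To make the cache-miss loading point concrete, here is a minimal read-through sketch against the vendor-neutral JCache (JSR-107) API. It assumes some JSR-107 provider (Hazelcast, Ignite, Ehcache, etc.) is on the classpath; the cache name and loader are invented.
import java.util.HashMap;
import java.util.Map;

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.FactoryBuilder;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.integration.CacheLoader;

public class ReadThroughSketch {

  // On a cache miss, fetch the value from the backing store (e.g., an HBase Get).
  public static class BackingStoreLoader implements CacheLoader<String, String> {
    @Override
    public String load(String key) {
      return "value-for-" + key;  // stand-in for a real store lookup
    }

    @Override
    public Map<String, String> loadAll(Iterable<? extends String> keys) {
      Map<String, String> result = new HashMap<>();
      for (String key : keys) {
        result.put(key, load(key));
      }
      return result;
    }
  }

  public static void main(String[] args) {
    CacheManager manager = Caching.getCachingProvider().getCacheManager();
    MutableConfiguration<String, String> config =
        new MutableConfiguration<String, String>()
            .setTypes(String.class, String.class)
            .setReadThrough(true)  // misses are delegated to the loader
            .setCacheLoaderFactory(FactoryBuilder.factoryOf(BackingStoreLoader.class));
    Cache<String, String> cache = manager.createCache("users", config);

    // The first get() misses and triggers the loader; later gets hit in memory.
    System.out.println(cache.get("42"));
  }
}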
06-10-2016 11:36 PM
Could you provide the GitHub link or upload the template XML before we publish this? It would also be good to show what the tweet looks like before and after processing.