Posts: 1973
Kudos Received: 1225
Solutions: 124
12-13-2016
04:59 PM
Imgur moved from MySQL to HBase for their notifications: https://medium.com/imgur-engineering/imgur-notifications-from-mysql-to-hbase-9dba6fc44183#.x1xf6lbsz HBase and Phoenix are very easy to use for Java developers, and an HDF cluster running NiFi, Storm and Kafka requires minimal administration; all of those tools are Java-based and Java-oriented. Here are some articles I wrote on accessing Phoenix and Hive data with Java (a minimal Phoenix JDBC sketch follows below):
- https://community.hortonworks.com/articles/53629/writing-a-spring-boot-microservices-to-access-hive.html
- https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
- https://community.hortonworks.com/repos/65142/linkextractor.html
- https://community.hortonworks.com/articles/65239/mp3-jukebox-with-nifi-1x.html
You can also use Spark to leverage your HDFS, Hive and Phoenix/HBase data:
- https://databricks.com/blog/2016/02/02/an-illustrated-guide-to-advertising-analytics.html
- https://github.com/warshmellow/adtech-dash
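For illustration, here is a minimal sketch of querying a Phoenix table from plain Java over JDBC. The ZooKeeper quorum and the notifications table are assumptions, not taken from the Imgur article; the phoenix-client JAR must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the HBase ZooKeeper quorum; host and znode are assumptions
        String url = "jdbc:phoenix:zk-host-1:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, message FROM notifications LIMIT 10")) { // hypothetical table
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> " + rs.getString("message"));
                }
            }
        }
    }
}
```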
12-13-2016
03:41 PM
Update your libcurl and curl on those machines.
12-13-2016
03:34 PM
1 Kudo
This looks like a PolyBase configuration issue. See https://msdn.microsoft.com/en-us/library/dn935026.aspx: the maximum number of concurrent PolyBase queries is 32, and when 32 concurrent queries are running, each query can read a maximum of 33,000 files from the external file location (the root folder and each subfolder also count as a file). If the degree of concurrency is less than 32, the external file location can contain more than 33,000 files.
Are you accessing a single file or multiple files? What is the Hive DDL? I would recommend Hive with ORC format and fewer files (see the sketch below). See also:
- https://sqlwithmanoj.com/2016/06/09/polybase-error-in-sql-server-2016-row-size-exceeds-the-defined-maximum-dms-row-size-larger-than-the-limit-of-32768-bytes/
- https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

| Category | Description | Maximum |
|---|---|---|
| PolyBase loads | Bytes per row | 32,768 |

PolyBase loads are limited to rows smaller than 32 KB and cannot load into VARCHAR(MAX), NVARCHAR(MAX) or VARBINARY(MAX) columns. While this limit exists today, it will be removed fairly soon. See: https://blogs.msdn.microsoft.com/sqlcat/2016/06/21/polybase-setup-errors-and-possible-solutions/
Also check for errors on the Hadoop side. Which version of HDP or HDInsight? Which Hive version? Are there any memory issues? Check the logs and Ambari.
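As a hedged sketch of the "ORC with fewer files" recommendation, here is one way to build a compacted ORC copy of an existing Hive table from Java through the Hive JDBC driver; the HiveServer2 host and the sales_text/sales_orc table names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOrcCompactExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC URL; host, port and database are assumptions
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Rewrite a hypothetical text-backed table as ORC
            stmt.execute("CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text");
            // Merge small ORC files into fewer, larger ones
            stmt.execute("ALTER TABLE sales_orc CONCATENATE");
        }
    }
}
```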
12-13-2016
02:13 PM
HBase has more developers, Phoenix on top of it, and more support at this point. Accumulo has some cool features and is a fully supported part of Hadoop / Hortonworks Data Platform:
- https://accumulo.apache.org/features/
- http://hortonworks.com/apache/accumulo/
Also look at what you might run on top of each: https://twitter.com/ApacheFluo runs on Accumulo, and http://opentsdb.net/ runs on HBase. HBase has column-level security; if you need cell-level security, go with Accumulo, although HBase also supports cell-level ACLs and visibility labels (see the links and the sketch below):
- https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_dataintegration/content/hbase-cell-level-acls.html
- https://www.quora.com/How-do-we-compare-Apache-HBase-vs-Apache-Accumulo
- HBase Cell Security: https://blogs.apache.org/hbase/entry/hbase_cell_security
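If you stay on HBase, cell-level visibility labels look roughly like this from the Java client. This is a hedged sketch assuming the visibility labels coprocessor is enabled on the cluster; the table, column family and labels are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilityPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("sensitive value"));
            // Only users who hold the "admin" or "ops" visibility label can read this cell
            put.setCellVisibility(new CellVisibility("admin|ops"));
            table.put(put);
        }
    }
}
```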
12-13-2016
01:17 PM
Configure your replication factor higher: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools You probably need more replicas for faster throughput; see https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ for how to size topics and partitions. A sketch of creating a topic with a higher replication factor follows.
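As one hedged illustration, here is how a topic with more partitions and a replication factor of 3 could be created using the Kafka AdminClient API (available in newer Kafka clients); the broker address, topic name and counts are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:6667"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 so each partition has two extra copies
            NewTopic topic = new NewTopic("events", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```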
12-12-2016
06:29 PM
1 Kudo
I have gigabytes of data in them:
2.9G  494
6.8M  664
6.5M  625
5.7M  640
5.5M  661
5.5M  649
5.5M  569
5.2M  412
5.0M  678
5.0M  622
Labels:
- Apache NiFi
12-12-2016
04:22 PM
There's a lot to secure in a big platform, so knowledge is the first step. Then make sure to follow best practices for Kerberos, Atlas, HDFS and Knox:
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/understanding_data_lake_security.html
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/knox_gateway_security.html
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/configuring_for_secure_clusters_falcon.html
A security talk I did a while ago: http://www.slideshare.net/bunkertor/hadoop-security-54483815
11-23-2016
09:31 PM
1 Kudo
There are several ways to do this:
- One easy way is to wrap the lookups in a REST API and call it as a step (InvokeHTTP); a minimal sketch of such a lookup service follows this list.
- Another way is to wrap the lookups in a command-line call and call it as a step (ExecuteStreamCommand).
- Another option is a custom processor.
- Another option is to create a custom UDF in Hive that converts the data, and then run that.
- Another option is to do the ETL lookup transformations in Spark, Storm or Flink and call them via Site-to-Site or Kafka.
- Load the lookup values into the DistributedMapCache and use them for replacements (PutDistributedMapCache / FetchDistributedMapCache). Load the lookup tables via SQL, then use ExecuteScript or ExecuteStreamCommand to look up the data to replace.
- Use ReplaceTextWithMapping (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ReplaceTextWithMapping/index.html) to pull mappings from a file created from your tables.
- Related questions: https://community.hortonworks.com/questions/36464/how-to-use-nifi-to-incrementally-ingest-data-from.html and https://community.hortonworks.com/questions/37733/which-nifi-processor-can-send-a-flowfile-to-a-runn.html
- Lookup Table Service: https://github.com/aperepel/nifi-csv-bundle/blob/master/nifi-csv-processors/src/main/java/org/apache/nifi/processors/lookup/GetLookupTable.java
- Use HBase for your lookups: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.hbase.GetHBase/
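As a hedged sketch of the REST-API option, here is a tiny lookup service using the JDK's built-in HTTP server; the port, endpoint and lookup values are hypothetical, and InvokeHTTP would simply call it with the key as a query parameter.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LookupServiceExample {
    public static void main(String[] args) throws Exception {
        // In-memory lookup table; in practice this could be loaded from SQL or a file
        Map<String, String> codes = new HashMap<>();
        codes.put("NJ", "New Jersey");
        codes.put("NY", "New York");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // InvokeHTTP would call e.g. http://host:8080/lookup?key=NJ
        server.createContext("/lookup", exchange -> {
            String query = exchange.getRequestURI().getQuery();   // e.g. "key=NJ"
            String key = (query != null && query.startsWith("key=")) ? query.substring(4) : "";
            byte[] body = codes.getOrDefault(key, "UNKNOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```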
11-23-2016
04:39 PM
How about for HDP 2.5?
11-22-2016
11:00 PM
2 Kudos
Hivemall: Machine Learning on Hive, Pig and Spark SQL
Install HiveMall (https://github.com/myui/hivemall/wiki/Installation) and pick the latest release (https://github.com/myui/hivemall/releases).
# Setup Your Environment: add these two lines to $HOME/.hiverc
add jar /home/myui/tmp/hivemall-core-xxx-with-dependencies.jar;
source /home/myui/tmp/define-all.hive;
# Create a directory in HDFS for the JAR
hadoop fs -mkdir -p /apps/hivemall
hdfs dfs -chmod -R 777 /apps/hivemall
cp hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-with-dependencies.jar
hdfs dfs -put hivemall-with-dependencies.jar /apps/hivemall/
hdfs dfs -put hivemall-with-dependencies.jar /apps/hive/warehouse/
hdfs dfs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall
Then list the installed functions from beeline:
show functions "hivemall.*";
+-----------------------------------------+--+
| tab_name |
+-----------------------------------------+--+
| hivemall.add_bias |
| hivemall.add_feature_index |
| hivemall.amplify |
| hivemall.angular_distance |
| hivemall.angular_similarity |
| hivemall.argmin_kld |
| hivemall.array_avg |
...
| hivemall.x_rank |
| hivemall.zscore |
+-----------------------------------------+--+
149 rows selected (0.054 seconds)
Once installed, the hivemall database will be filled with great functions to use for general processing as well as machine learning via SQL. An example function is Base91 encoding:
select hivemall.base91(hivemall.deflate('aaaaaaaaaaaaaaaabbbbccc'));
+----------------------+--+
| _c0 |
+----------------------+--+
| AA+=kaIM|WTt!+wbGAA |
+----------------------+--+
A more useful example: I ran tokenize on the messages in a Hive table where I store tweets.
select hivemall.tokenize(tweets.msg) from tweets limit 10;
| ["water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"] |
| ["RT","@CNNNewsource","Main","water","pipe","break","causes","flooding","sinkhole","swallows","car","in","Hoboken","NJ","NE-009MO","https","//t","co/SDALHbs7kx"] |
| ["RT","@PaaSDev","#TEST","water","pipe","break","#TEST","Water","Main","Break","in","Fakeville","NJ","https","//t","co/ekbNXK1VgI"] |
| ["Water","break","on","a","mountain","run","tonight","#saopaulo","#correr","#run","sdfdf,"https","//t","co/dvND6BkXl4"] |
| ["RT","@PaaSDev","water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"] |
| ["Route","33","In","Wilton","Closed","Due","To","Water","Main","Break","https","//t","co/UQMksljRUm","https","//t","co/HRhin2QyOk"] |
| ["water","pipe","break","nj","#TEST","#TEST","#WATERMAINBREAK","https","//t","co/kvYNTG7wHf"] |
| ["water","pipe","break","nj","#TEST","test","https","//t","co/zjgjSaNvUz"] |
| ["#TEST","#watermainbreak","water","main","break","pipe","test","nj","https","//t","co/qZEdnhlgYG"] |
| ["Customers","of","Langley","Water","and","Sewer","District","under","boil","water","advisory","-","Aiken","Standard","https","//t","co/yh3COaC70M","https","//t","co/LPRHBrtaTA"] |
10 rows selected (4.848 seconds)
For more examples of usage: https://github.com/myui/hivemall/wiki/webspam-dataset
I will be using HiveMall in future projects; I expect to include it in a NiFi workflow for NLP processing and other machine learning operations.
The project has just joined Apache.