Posts: 1973
Kudos Received: 1225
Solutions: 124
12-13-2016
04:59 PM
Imgur moved from MySQL to HBase for their notifications: https://medium.com/imgur-engineering/imgur-notifications-from-mysql-to-hbase-9dba6fc44183#.x1xf6lbsz HBase and Phoenix are very easy to use for Java developers, and an HDF cluster running NiFi, Storm and Kafka requires minimal administration; all of those tools are Java-based and Java-oriented. Here are some articles I wrote on accessing Phoenix and Hive data with Java (a minimal Phoenix JDBC sketch follows below):
- https://community.hortonworks.com/articles/53629/writing-a-spring-boot-microservices-to-access-hive.html
- https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
- https://community.hortonworks.com/repos/65142/linkextractor.html
- https://community.hortonworks.com/articles/65239/mp3-jukebox-with-nifi-1x.html
You can also use Spark to leverage your HDFS, Hive and Phoenix/HBase data:
- https://databricks.com/blog/2016/02/02/an-illustrated-guide-to-advertising-analytics.html
- https://github.com/warshmellow/adtech-dash
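For illustration, here is a minimal sketch of querying a Phoenix table from plain Java over JDBC. The ZooKeeper quorum and the notifications table are assumptions, not taken from the Imgur article; the phoenix-client JAR must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the HBase ZooKeeper quorum; host and znode are assumptions
        String url = "jdbc:phoenix:zk-host-1:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, message FROM notifications LIMIT 10")) { // hypothetical table
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> " + rs.getString("message"));
                }
            }
        }
    }
}
```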
12-13-2016
03:41 PM
Update your libcurl and curl on those machines.
12-13-2016
03:34 PM
1 Kudo
This looks like a PolyBase configuration issue. See https://msdn.microsoft.com/en-us/library/dn935026.aspx: the maximum number of concurrent PolyBase queries is 32, and when 32 concurrent queries are running, each query can read a maximum of 33,000 files from the external file location (the root folder and each subfolder also count as a file). If the degree of concurrency is less than 32, the external file location can contain more than 33,000 files.
Are you accessing a single file or multiple files? What is the Hive DDL? I would recommend Hive with ORC format and fewer files (see the sketch below). See also:
- https://sqlwithmanoj.com/2016/06/09/polybase-error-in-sql-server-2016-row-size-exceeds-the-defined-maximum-dms-row-size-larger-than-the-limit-of-32768-bytes/
- https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

| Category | Description | Maximum |
|---|---|---|
| PolyBase loads | Bytes per row | 32,768 |

PolyBase loads are limited to rows smaller than 32 KB and cannot load into VARCHAR(MAX), NVARCHAR(MAX) or VARBINARY(MAX) columns. While this limit exists today, it will be removed fairly soon. See: https://blogs.msdn.microsoft.com/sqlcat/2016/06/21/polybase-setup-errors-and-possible-solutions/
Also check for errors on the Hadoop side. Which version of HDP or HDInsight? Which Hive version? Are there any memory issues? Check the logs and Ambari.
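As a hedged sketch of the "ORC with fewer files" recommendation, here is one way to build a compacted ORC copy of an existing Hive table from Java through the Hive JDBC driver; the HiveServer2 host and the sales_text/sales_orc table names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOrcCompactExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC URL; host, port and database are assumptions
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Rewrite a hypothetical text-backed table as ORC
            stmt.execute("CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text");
            // Merge small ORC files into fewer, larger ones
            stmt.execute("ALTER TABLE sales_orc CONCATENATE");
        }
    }
}
```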
12-13-2016
02:13 PM
HBase has more developers, Phoenix on top of it, and more support at this point. Accumulo has some cool features and is a fully supported part of Hadoop / Hortonworks Data Platform:
- https://accumulo.apache.org/features/
- http://hortonworks.com/apache/accumulo/
Also look at what you might run on top of each: https://twitter.com/ApacheFluo runs on Accumulo, and http://opentsdb.net/ runs on HBase. HBase has column-level security; if you need cell-level security, go with Accumulo, although HBase also supports cell-level ACLs and visibility labels (see the links and the sketch below):
- https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_dataintegration/content/hbase-cell-level-acls.html
- https://www.quora.com/How-do-we-compare-Apache-HBase-vs-Apache-Accumulo
- HBase Cell Security: https://blogs.apache.org/hbase/entry/hbase_cell_security
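If you stay on HBase, cell-level visibility labels look roughly like this from the Java client. This is a hedged sketch assuming the visibility labels coprocessor is enabled on the cluster; the table, column family and labels are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilityPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("sensitive value"));
            // Only users who hold the "admin" or "ops" visibility label can read this cell
            put.setCellVisibility(new CellVisibility("admin|ops"));
            table.put(put);
        }
    }
}
```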
12-13-2016
01:17 PM
Configure your replication factor higher: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools You probably need more replicas for faster throughput; see https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ for how to size topics and partitions. A sketch of creating a topic with a higher replication factor follows.
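As one hedged illustration, here is how a topic with more partitions and a replication factor of 3 could be created using the Kafka AdminClient API (available in newer Kafka clients); the broker address, topic name and counts are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:6667"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 so each partition has two extra copies
            NewTopic topic = new NewTopic("events", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```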
12-12-2016
06:29 PM
1 Kudo
I have gigabytes of data in them:
2.9G  494
6.8M  664
6.5M  625
5.7M  640
5.5M  661
5.5M  649
5.5M  569
5.2M  412
5.0M  678
5.0M  622
Labels:
- Apache NiFi
12-12-2016
04:22 PM
There's a lot to secure in a big platform, so knowledge is the first step. Then make sure to follow best practices for Kerberos, Atlas, HDFS and Knox:
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/understanding_data_lake_security.html
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_security/content/knox_gateway_security.html
- http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/configuring_for_secure_clusters_falcon.html
A security talk I did a while ago: http://www.slideshare.net/bunkertor/hadoop-security-54483815
11-23-2016
09:31 PM
1 Kudo
There are several ways to do this:
- One easy way is to wrap the lookups in a REST API and call it as a step (InvokeHTTP); a minimal sketch of such a lookup service follows this list.
- Another way is to wrap the lookups in a command-line call and call it as a step (ExecuteStreamCommand).
- Another option is a custom processor.
- Another option is to create a custom UDF in Hive that converts the data, and then run that.
- Another option is to do the ETL lookup transformations in Spark, Storm or Flink and call them via Site-to-Site or Kafka.
- Load the lookup values into the DistributedMapCache and use them for replacements (PutDistributedMapCache / FetchDistributedMapCache). Load the lookup tables via SQL, then use ExecuteScript or ExecuteStreamCommand to look up the data to replace.
- Use ReplaceTextWithMapping (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ReplaceTextWithMapping/index.html) to pull mappings from a file created from your tables.
- Related questions: https://community.hortonworks.com/questions/36464/how-to-use-nifi-to-incrementally-ingest-data-from.html and https://community.hortonworks.com/questions/37733/which-nifi-processor-can-send-a-flowfile-to-a-runn.html
- Lookup Table Service: https://github.com/aperepel/nifi-csv-bundle/blob/master/nifi-csv-processors/src/main/java/org/apache/nifi/processors/lookup/GetLookupTable.java
- Use HBase for your lookups: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.hbase.GetHBase/
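As a hedged sketch of the REST-API option, here is a tiny lookup service using the JDK's built-in HTTP server; the port, endpoint and lookup values are hypothetical, and InvokeHTTP would simply call it with the key as a query parameter.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LookupServiceExample {
    public static void main(String[] args) throws Exception {
        // In-memory lookup table; in practice this could be loaded from SQL or a file
        Map<String, String> codes = new HashMap<>();
        codes.put("NJ", "New Jersey");
        codes.put("NY", "New York");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // InvokeHTTP would call e.g. http://host:8080/lookup?key=NJ
        server.createContext("/lookup", exchange -> {
            String query = exchange.getRequestURI().getQuery();   // e.g. "key=NJ"
            String key = (query != null && query.startsWith("key=")) ? query.substring(4) : "";
            byte[] body = codes.getOrDefault(key, "UNKNOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```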
11-23-2016
04:39 PM
How about for HDP 2.5?
11-22-2016
11:00 PM
2 Kudos
Hivemall: Machine Learning on Hive, Pig and Spark SQL
Install HiveMall (https://github.com/myui/hivemall/wiki/Installation) and pick the latest release (https://github.com/myui/hivemall/releases).
# Setup Your Environment: add these two lines to $HOME/.hiverc
add jar /home/myui/tmp/hivemall-core-xxx-with-dependencies.jar;
source /home/myui/tmp/define-all.hive;
# Create a directory in HDFS for the JAR
hadoop fs -mkdir -p /apps/hivemall
hdfs dfs -chmod -R 777 /apps/hivemall
cp hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-with-dependencies.jar
hdfs dfs -put hivemall-with-dependencies.jar /apps/hivemall/
hdfs dfs -put hivemall-with-dependencies.jar /apps/hive/warehouse/
hdfs dfs -put hivemall-core-0.4.2-rc.2-with-dependencies.jar /apps/hivemall
Then list the installed functions from beeline:
show functions "hivemall.*";
+-----------------------------------------+--+
| tab_name |
+-----------------------------------------+--+
| hivemall.add_bias |
| hivemall.add_feature_index |
| hivemall.amplify |
| hivemall.angular_distance |
| hivemall.angular_similarity |
| hivemall.argmin_kld |
| hivemall.array_avg |
...
| hivemall.x_rank |
| hivemall.zscore |
+-----------------------------------------+--+
149 rows selected (0.054 seconds)
Once installed, the hivemall database will be filled with great functions to use for general processing as well as machine learning via SQL. An example function is Base91 encoding:
select hivemall.base91(hivemall.deflate('aaaaaaaaaaaaaaaabbbbccc'));
+----------------------+--+
| _c0 |
+----------------------+--+
| AA+=kaIM|WTt!+wbGAA |
+----------------------+--+
A more useful example: I ran tokenize on the messages in a Hive table where I store tweets.
select hivemall.tokenize(tweets.msg) from tweets limit 10;
| ["water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"] |
| ["RT","@CNNNewsource","Main","water","pipe","break","causes","flooding","sinkhole","swallows","car","in","Hoboken","NJ","NE-009MO","https","//t","co/SDALHbs7kx"] |
| ["RT","@PaaSDev","#TEST","water","pipe","break","#TEST","Water","Main","Break","in","Fakeville","NJ","https","//t","co/ekbNXK1VgI"] |
| ["Water","break","on","a","mountain","run","tonight","#saopaulo","#correr","#run","sdfdf,"https","//t","co/dvND6BkXl4"] |
| ["RT","@PaaSDev","water","pipe","break","#TEST","#TEST","#WATERMAINBREAK","FakeMockTown","NJ","https","//t","co/hLYaJnvAdH"] |
| ["Route","33","In","Wilton","Closed","Due","To","Water","Main","Break","https","//t","co/UQMksljRUm","https","//t","co/HRhin2QyOk"] |
| ["water","pipe","break","nj","#TEST","#TEST","#WATERMAINBREAK","https","//t","co/kvYNTG7wHf"] |
| ["water","pipe","break","nj","#TEST","test","https","//t","co/zjgjSaNvUz"] |
| ["#TEST","#watermainbreak","water","main","break","pipe","test","nj","https","//t","co/qZEdnhlgYG"] |
| ["Customers","of","Langley","Water","and","Sewer","District","under","boil","water","advisory","-","Aiken","Standard","https","//t","co/yh3COaC70M","https","//t","co/LPRHBrtaTA"] |
10 rows selected (4.848 seconds)
For more examples of usage: https://github.com/myui/hivemall/wiki/webspam-dataset
I will be using HiveMall in future projects; I expect to include it in a NiFi workflow for NLP processing and other machine learning operations.
The project has just joined Apache.