Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
05-30-2021
01:17 AM
[hdfs@c****-node* hive-testbench-hive14]$ ./tpcds-build.sh
Building TPC-DS Data Generator
make: Nothing to be done for `all'.
TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data.
[hdfs@c4237-node2 hive-testbench-hive14]$ ./tpcds-setup.sh 2
TPC-DS text data generation complete. Loading text data into external tables.
make: *** [time_dim] Error 1
make: *** Waiting for unfinished jobs....
make: *** [date_dim] Error 1
Data loaded into database tpcds_bin_partitioned_orc_2.

INFO : OK
+---------------------+
|    database_name    |
+---------------------+
| default             |
| information_schema  |
| sys                 |
+---------------------+
3 rows selected (1.955 seconds)
0: jdbc:hive2://c4237-node2.coelab.cloudera.c>

The tpcds_bin_partitioned_orc_2 database is not created, and I have some issues testing the TPC-DS queries. These are the steps I followed:

sudo -u hdfs -s
cd /home/hdfs
wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
unzip hive14.zip
export JAVA_HOME=/usr/jdk64/jdk1.8.0_77
export PATH=$JAVA_HOME/bin:$PATH
./tpcds-build.sh
beeline -i testbench.settings -u "jdbc:hive2://c****-node9.coe***.*****.com:10500/tpcds_bin_partitioned_orc_2"

I'm not able to test the TPC-DS queries; any help would be appreciated.
04-11-2019
01:18 PM
A very informative post, thanks for sharing! From my observation, Kafka is a network-intensive application, and with that in mind I have a question about an active-active network bond configuration with Kafka. Is this recommended, and what are the considerations if I decide to go for it? Thanks again!
02-05-2018
04:12 AM
14 Kudos
The evolution of Apache NiFi from version 1.2, included in HDF 3.0, to version 1.5, included in HDF 3.1, is significant. I often find myself puzzled when asked to describe the differences between releases; just reading the release notes history at https://cwiki.apache.org/confluence/display/NIFI/Release+Notes and scanning the latest list of NiFi processors makes it non-trivial to determine which new processors were added. I put together a matrix which I hope will help developers take advantage of the new processors to improve old flows and develop new ones. In a nutshell, the main functionality added is around:

- AzureEventHub
- Kafka 0.11 and 1.0 processors
- Record processors
- RethinkDB
- Flatten JSON
- Execute Spark Interactive
- Execute Groovy Script

My favorite improvements are around record processors, flattening JSON, and executing Spark interactively. The following is the matrix table, arranged alphabetically, for A-D. See here for the matrix table from E-J. See here for the matrix table from K-Z.

| Processor | NiFi 1.5 | NiFi 1.4 | NiFi 1.3 | NiFi 1.2 |
|---|---|---|---|---|
| AttributeRollingWindow | ✓ | ✓ | ✓ | ✓ |
| AttributesToJSON | ✓ | ✓ | ✓ | ✓ |
| Base64EncodeContent | ✓ | ✓ | ✓ | ✓ |
| CaptureChangeMySQL | ✓ | ✓ | ✓ | ✓ |
| CompareFuzzyHash | ✓ | ✓ | ✓ | ✓ |
| CompressContent | ✓ | ✓ | ✓ | ✓ |
| ConnectWebSocket | ✓ | ✓ | ✓ | ✓ |
| ConsumeAMQP | ✓ | ✓ | ✓ | ✓ |
| ConsumeAzureEventHub | ✓ |  |  |  |
| ConsumeEWS | ✓ | ✓ | ✓ | ✓ |
| ConsumeIMAP | ✓ | ✓ | ✓ | ✓ |
| ConsumeJMS | ✓ | ✓ | ✓ | ✓ |
| ConsumeKafka | ✓ | ✓ | ✓ | ✓ |
| ConsumeKafka_0_10 | ✓ | ✓ | ✓ | ✓ |
| ConsumeKafka_0_11 | ✓ | ✓ |  |  |
| ConsumeKafkaRecord_0_10 | ✓ | ✓ | ✓ | ✓ |
| ConsumeKafkaRecord_0_11 | ✓ | ✓ |  |  |
| ConsumeKafka_1_0 | ✓ |  |  |  |
| ConsumeKafkaRecord_1_0 | ✓ |  |  |  |
| ConsumeMQTT | ✓ | ✓ | ✓ | ✓ |
| ConsumePOP3 | ✓ | ✓ | ✓ | ✓ |
| ConsumeWindowsEventLog | ✓ | ✓ | ✓ | ✓ |
| ControlRate | ✓ | ✓ | ✓ | ✓ |
| ConvertAvroSchema | ✓ | ✓ | ✓ | ✓ |
| ConvertAvroToJSON | ✓ | ✓ | ✓ | ✓ |
| ConvertAvroToORC | ✓ | ✓ | ✓ | ✓ |
| ConvertCharacterSet | ✓ | ✓ | ✓ | ✓ |
| ConvertCSVToAvro | ✓ | ✓ | ✓ | ✓ |
| ConvertExcelToCSVProcessor | ✓ | ✓ | ✓ | ✓ |
| ConvertJSONToAvro | ✓ | ✓ | ✓ | ✓ |
| ConvertJSONToSQL | ✓ | ✓ | ✓ | ✓ |
| ConvertRecord | ✓ | ✓ | ✓ | ✓ |
| CreateHadoopSequenceFile | ✓ | ✓ | ✓ | ✓ |
| CountText | ✓ |  |  |  |
| DebugFlow | ✓ | ✓ | ✓ | ✓ |
| DeleteDynamoDB | ✓ | ✓ | ✓ | ✓ |
| DeleteGCSObject | ✓ | ✓ | ✓ | ✓ |
| DeleteHDFS | ✓ | ✓ | ✓ | ✓ |
| DeleteElasticsearch5 | ✓ | ✓ |  |  |
| DeleteRethinkDB | ✓ | ✓ |  |  |
| DeleteS3Object | ✓ | ✓ | ✓ | ✓ |
| DeleteMongo | ✓ |  |  |  |
| DeleteSQS | ✓ | ✓ | ✓ | ✓ |
| DetectDuplicate | ✓ | ✓ | ✓ | ✓ |
| DistributeLoad | ✓ | ✓ | ✓ | ✓ |
| DuplicateFlowFile | ✓ | ✓ | ✓ | ✓ |
10-06-2017
07:20 PM
6 Kudos
Introduction

This is a continuation of an article I wrote about a year ago: https://community.hortonworks.com/articles/60580/jmeter-setup-for-hive-load-testing-draft.html. See also: https://www.blazemeter.com/blog/windows-authentication-apache-jmeter

Steps

1) Enable Kerberos on your cluster

Perform all steps specified here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_security/content/configuring_amb_hdp_for_kerberos.html and connect successfully to the Hive service via the command line using your user keytab. That implies a valid ticket.

2) Install JMeter

See the previous article mentioned in the Introduction.

3) Set the Hive user keytab in JMETER_HOME/bin/jaas.conf

Your jaas.conf should look something like this:

JMeter {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=false
doNotPrompt=true
useKeyTab=true
keyTab="/etc/security/keytabs/hive.service.keytab"
principal="hive/server.example.com@EXAMPLE.COM"
debug=true;
};

4) JMeter Setup

There are two files under the /bin folder of the JMeter installation which are used for Kerberos configuration:

krb5.conf - a file in .ini format which contains the Kerberos configuration details (a minimal sketch appears at the end of this article)
jaas.conf - a file which holds the configuration details of the Java Authentication and Authorization Service

These files aren't used by default, so you have to tell JMeter where they are via system properties such as:

-Djava.security.krb5.conf=krb5.conf
-Djava.security.auth.login.config=jaas.conf

Alternatively, you can add the next two lines to the system.properties file, which is located in the same /bin folder:

java.security.krb5.conf=krb5.conf
java.security.auth.login.config=jaas.conf

I suggest using full paths to the files.

5) Manage Issues

If you encounter any issues:

- enable debug by adding the following to your command:

-Dsun.security.krb5.debug=true
-Djava.security.debug=gssloginconfig,configfile,configparser,logincontext

- check jmeter.log to see whether all properties are set as expected and map to existing file paths.

6) Turn off Subject Credentials

-Djavax.security.auth.useSubjectCredsOnly=false

7) Example of a JMeter Command

JVM_ARGS="-Xms1024m -Xmx1024m" bin/jmeter \
  -Dsun.security.krb5.debug=true \
  -Djavax.security.auth.useSubjectCredsOnly=false \
  -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext \
  -Djava.security.krb5.conf=/path/to/krb5.conf \
  -Djava.security.auth.login.config=/path/to/jaas.conf \
  -n -t t1.jmx -l results -e -o output

This could be simplified by adding the two lines mentioned earlier to system.properties.
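For reference, a minimal krb5.conf might look like the sketch below. The realm and KDC host are placeholders matching the jaas.conf example above; in practice you can usually copy /etc/krb5.conf from any node of the Kerberized cluster:

[libdefaults]
  default_realm = EXAMPLE.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h

[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
    admin_server = kdc.example.com
  }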
02-13-2019
07:16 PM
I installed GeoMesa 1.3.5 on a 10-node cluster. We are using Kerberos to secure the cluster. Does GeoMesa work with Kerberos?
07-19-2017
09:26 PM
9 Kudos
Overview

The following versions of Apache Kafka have been incorporated in HDP 2.2.0 through 2.6.1: 0.8.1, 0.8.2, 0.9.0, 0.10.0, and 0.10.1. Apache Kafka is now at 0.11. Hortonworks is working to make Kafka easier for enterprises to use. New focus areas include the creation of a Kafka Admin Panel to create/delete topics and manage user permissions, easier and safer distribution of security tokens, and support for multiple ways of publishing/consuming data via a Kafka REST server/API. Here are a few areas of strong contribution:

Operations:

- Rack awareness for increased resilience and availability, such that replicas are isolated and guaranteed to span multiple racks or availability zones.
- Automated replica leader election for an even distribution of leaders in a cluster: it detects uneven distribution, where some brokers serve more data than others, and makes adjustments.
- Message timestamps, so every message in Kafka now has a timestamp field that indicates the time at which the message was produced.
- SASL improvements, including external authentication servers and support for multiple types of SASL authentication on one server.
- Ambari Views for visualization of Kafka operational metrics.

Security: Kafka security encompasses multiple needs: the need to encrypt the data flowing through Kafka and to prevent rogue agents from publishing data to Kafka, as well as the ability to manage access to specific topics on an individual or group level. As a result, the latest updates in Kafka support wire encryption via SSL, Kerberos-based authentication, and granular authorization options via Apache Ranger or another pluggable authorization system.

This article lists new features beyond the Hortonworks contribution. At a high level, the following have been added by the overall community:

- Kafka Streams API
- Kafka Connect API
- New unified Consumer API
- Transport encryption using TLS/SSL
- Kerberos/SASL authentication support
- Access Control Lists
- Timestamps on messages
- Reduced client dependence on ZooKeeper (offsets stored in a Kafka topic)
- Client interceptors

(A short consumer sketch illustrating several of these additions appears at the end of this article.)

New Features Since HDP 2.2

Here is the list of new features as they appear in the release notes.

Kafka 0.8.1:
https://archive.apache.org/dist/kafka/0.8.1/RELEASE_NOTES.html
- [KAFKA-330] - Add delete topic support
- [KAFKA-554] - Move all per-topic configuration into ZK and add to the CreateTopicCommand
- [KAFKA-615] - Avoid fsync on log segment roll
- [KAFKA-657] - Add an API to commit offsets
- [KAFKA-925] - Add optional partition key override in producer
- [KAFKA-1092] - Add server config parameter to separate bind address and ZK hostname
- [KAFKA-1117] - Tool for checking the consistency among replicas

Kafka 0.8.2:
https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html

- [KAFKA-1000] - Inbuilt consumer offset management feature for Kafka
- [KAFKA-1227] - Code dump of new producer
- [KAFKA-1384] - Log broker state
- [KAFKA-1443] - Add delete topic to topic commands and update DeleteTopicCommand
- [KAFKA-1512] - Limit the maximum number of connections per IP address
- [KAFKA-1597] - New metrics: ResponseQueueSize and BeingSentResponses
- [KAFKA-1784] - Implement a ConsumerOffsetClient library

Kafka 0.9.0:
https://archive.apache.org/dist/kafka/0.9.0.0/RELEASE_NOTES.html

- [KAFKA-1499] - Broker-side compression configuration
- [KAFKA-1785] - Consumer offset checker should show the offset manager and offsets partition
- [KAFKA-2120] - Add a request timeout to NetworkClient
- [KAFKA-2187] - Introduce merge-kafka-pr.py script

Kafka 0.10.0:
https://archive.apache.org/dist/kafka/0.10.0.0/RELEASE_NOTES.html

- [KAFKA-2832] - Support exclude.internal.topics in new consumer
- [KAFKA-3046] - Add ByteBuffer Serializer & Deserializer
- [KAFKA-3490] - Multiple version support for ducktape performance tests

Kafka 0.10.0.1:
https://archive.apache.org/dist/kafka/0.10.0.1/RELEASE_NOTES.html

- [KAFKA-3538] - Abstract the creation/retrieval of Producer for stream sinks for unit testing

Kafka 0.10.1:
https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html

- [KAFKA-1464] - Add a throttling option to the Kafka replication tool
- [KAFKA-3176] - Allow console consumer to consume from particular partitions when the new consumer is used
- [KAFKA-3492] - Support quota based on authenticated user name
- [KAFKA-3776] - Unify store and downstream caching in streams
- [KAFKA-3858] - Add functions to print stream topologies
- [KAFKA-3909] - Queryable state for Kafka Streams
- [KAFKA-4015] - Change cleanup.policy config to accept a list of valid policies
- [KAFKA-4093] - Cluster id

Final Notes

Apache Kafka shines in use cases like:

- replacement for a more traditional message broker
- user activity tracking pipeline as a set of real-time publish-subscribe feeds (the original use case)
- operational monitoring data
- log aggregation
- stream processing
- event sourcing
- commit log

Apache Kafka continues to be a dynamic and extremely popular project with more and more adoption.
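To make a few of those community additions concrete (the new unified Consumer API, per-message timestamps, and TLS/Kerberos security), here is a minimal Scala sketch. The broker address, topic, and group id are placeholders, and the security settings assume a Kerberized cluster with a JAAS login configuration supplied to the JVM; treat this as an illustration, not a production client:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object NewConsumerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1.example.com:9093") // placeholder broker address
  props.put("group.id", "demo-group")                        // placeholder consumer group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  // Security features described above: TLS wire encryption plus Kerberos (SASL) authentication;
  // assumes -Djava.security.auth.login.config=... points at a valid JAAS config.
  props.put("security.protocol", "SASL_SSL")
  props.put("sasl.kerberos.service.name", "kafka")

  // New unified consumer: offsets are stored in a Kafka topic rather than in ZooKeeper.
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("demo-topic")) // placeholder topic
  try {
    val records = consumer.poll(1000L) // 0.10-era signature: poll(timeoutMs)
    for (r <- records.asScala)
      // r.timestamp is the per-message timestamp introduced in 0.10
      println(s"partition=${r.partition} offset=${r.offset} ts=${r.timestamp} value=${r.value}")
  } finally {
    consumer.close()
  }
}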
04-03-2017
12:47 AM
@Rohan Pednekar This is true also for any scan that requires evaluation before retrieving anything. I am not sure why this would be an HCC article; it is merely one paragraph of what could have been a well-written article about tips and tricks for dealing with HBase. I recommend looking at some of the featured articles in HCC and writing at that level of quality. The section you published could be very useful as part of a larger article. Thanks for your efforts.
03-06-2017
10:16 PM
13 Kudos
This article demonstrates how easy it is to create a simple data flow with NiFi, stream the data to Hive, and visualize it via Zeppelin.

Pre-requisites

- Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
- Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
- My repo for the Apache NiFi "CSVToHive.xml" template, the customer demographics data (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and the customer_demographics_orc_table_ddl.hql database and table DDLs
- Apache Hive 1.2.1, included with HDP 2.5.0
- Hive configured to support ACID transactions, with the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the template CSVToHive.xml (screenshot: screen-shot-2017-03-06-at-74106-pm.png).

Create Data Folder and Upload Data Files

In your home directory, create /home/username/customer_demographics and upload the data files specified above. Grant your NiFi user the access needed to read and process the files via the GetFile processor. Change the directory path specified in the GetFile processor to match your path. Also, change the "Keep Source File" property of the GetFile processor to false so that the file is processed once and then deleted; for test purposes, I kept it as true. You will also have to adjust the Hive Metastore URI to match your environment's host name.

Import Zeppelin Notebook

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API. As you noticed from the DDL, the Hive table is transactional. Enabling the global ACID feature of Hive and creating the table as transactional and bucketed is a requirement for this to work (a sketch of such a DDL appears at the end of this article). Also, the data format required to use the PutHiveStreaming processor is Avro, so we converted the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define it from the CSV file header; the latter option was selected for this demo.

Execute Zeppelin Notebook

During the demo you can switch from NiFi to Zeppelin, showing how the data is posted to Hive and reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, a data analyst, or a data scientist can benefit from using Zeppelin.
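If you don't have the repo handy, here is a hedged sketch of what such a transactional, bucketed ORC table looks like. The column list is abbreviated and the bucket count is illustrative; the authoritative DDL is customer_demographics_orc_table_ddl.hql from the repo:

-- Sketch only: column list abbreviated, bucket count illustrative.
-- The authoritative DDL is customer_demographics_orc_table_ddl.hql from the repo.
CREATE DATABASE IF NOT EXISTS demo;

CREATE TABLE demo.customer_demographics (
  cd_demo_sk        BIGINT,
  cd_gender         STRING,
  cd_marital_status STRING
  -- ... remaining columns as defined in customer_demographics.header
)
CLUSTERED BY (cd_demo_sk) INTO 8 BUCKETS  -- bucketing is required for Hive Streaming
STORED AS ORC                             -- ORC is required for ACID tables in this Hive version
TBLPROPERTIES ('transactional' = 'true'); -- per-table ACID flag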
02-22-2017
12:07 AM
9 Kudos
Introduction

Geospatial data is generated in huge volumes with the rise of the Internet of Things. IoT sensor networks are pushing geospatial data rates even higher. There has been an explosion of sensor networks on the ground, mobile devices carried by people or mounted on vehicles, drones flying overhead, tethered aerostats (such as Google's Project Loon), atmosats at high altitude, and microsats in orbit.

Opportunity

Geospatial analytics can provide us with the tools and methods we need to make sense of all that data and put it to use in solving problems we face at all scales.

Challenges

Geospatial work requires atypical data types (e.g., points, shapefiles, map projections), potentially many layers of detail to process and visualize, and specialized algorithms: not your typical ETL (extract, transform, load) or reporting work.

Apache Spark's Role in Geospatial Development

While Spark might seem to be influencing the evolution of accessory tools, it is also becoming a default in the geospatial analytics industry. For example, consider the development of Azavea's open source geospatial library GeoTrellis. GeoTrellis was written in Scala and designed to handle large-scale raster operations. GeoTrellis recently adopted Spark as its distributed computation engine and, in combination with Amazon Web Services, scaled the existing raster processing to support even larger datasets. Spark brings amazing scope to the GeoTrellis project, and GeoTrellis supplies the geospatial capabilities that Spark lacks. This reciprocal partnership is an important contribution to the data engineering ecosystem, and particularly to the frameworks in development for supporting Big Data.

About GeoTrellis

GeoTrellis is a Scala library and framework that uses Spark to work with raster data. It is released under the Apache 2 License. GeoTrellis reads, writes, and operates on raster data as fast as possible. It implements many Map Algebra operations as well as vector-to-raster and raster-to-vector operations. GeoTrellis also provides tools to render rasters into PNGs and to store metadata about raster files as JSON. It aims to provide raster processing at web speeds (sub-second or less) with RESTful endpoints, as well as fast batch processing of large raster data sets.

Getting Started

GeoTrellis is currently available for Scala 2.11 and Spark 2.0+. To get started with SBT, simply add the following to your build.sbt file:

libraryDependencies += "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0"

geotrellis-raster is just one submodule that you can depend on. To grab the latest snapshot build, add our snapshot repository:

resolvers += "LocationTech GeoTrellis Snapshots" at "https://repo.locationtech.org/content/repositories/geotrellis-snapshots"

GeoTrellis Modules
- geotrellis-proj4 : Coordinate Reference systems and reprojection (Scala wrapper around Proj4j)
- geotrellis-vector : Vector data types and operations (Scala wrapper around JTS)
- geotrellis-raster : Raster data types and operations
- geotrellis-spark : Geospatially enables Spark; save to and from HDFS
- geotrellis-s3 : S3 backend for geotrellis-spark
- geotrellis-accumulo : Accumulo backend for geotrellis-spark
- geotrellis-cassandra : Cassandra backend for geotrellis-spark
- geotrellis-hbase : HBase backend for geotrellis-spark
- geotrellis-spark-etl : Utilities for writing ETL (Extract-Transform-Load), or "ingest" applications for geotrellis-spark
- geotrellis-geotools : Conversions to and from GeoTools Vector and Raster data
- geotrellis-geomesa : Experimental GeoMesa integration
- geotrellis-geowave : Experimental GeoWave integration
- geotrellis-shapefile : Read shapefiles into GeoTrellis data types via GeoTools
- geotrellis-slick : Read vector data out of PostGIS via LightBend Slick
- geotrellis-vectortile : Experimental vector tile support, including reading and writing
- geotrellis-raster-testkit : Testkit for testing geotrellis-raster types
- geotrellis-vector-testkit : Testkit for testing geotrellis-vector types
- geotrellis-spark-testkit : Testkit for testing geotrellis-spark code

A more complete feature list can be found in the GeoTrellis Features section at https://github.com/locationtech/geotrellis.

Hello Raster with GeoTrellis

scala> import geotrellis.raster._
import geotrellis.raster._
scala> import geotrellis.raster.op.focal._
import geotrellis.raster.op.focal._
scala> val nd = NODATA
nd: Int = -2147483648
scala> val input = Array[Int](
     |  nd, 7, 1, 1,  3, 5, 9, 8, 2,
     |   9, 1, 1, 2,  2, 2, 4, 3, 5,
     |   3, 8, 1, 3,  3, 3, 1, 2, 2,
     |   2, 4, 7, 1, nd, 1, 8, 4, 3)
input: Array[Int] = Array(-2147483648, 7, 1, 1, 3, 5, 9, 8, 2, 9, 1, 1, 2, 2, 2, 4, 3, 5, 3, 8, 1, 3, 3, 3, 1, 2, 2, 2, 4, 7, 1, -2147483648, 1, 8, 4, 3)
scala> val iat = IntArrayTile(input, 9, 4) // 9 and 4 here specify columns and rows
iat: geotrellis.raster.IntArrayTile = IntArrayTile([I@278434d0,9,4)
// The asciiDraw method is mostly useful when you're working with small tiles
// which can be taken in at a glance
scala> iat.asciiDraw()
res0: String =
" ND 7 1 1 3 5 9 8 2
9 1 1 2 2 2 4 3 5
3 8 1 3 3 3 1 2 2
2 4 7 1 ND 1 8 4 3
"
scala> val focalNeighborhood = Square(1) // a 3x3 square neighborhood
focalNeighborhood: geotrellis.raster.op.focal.Square =
O O O
O O O
O O O
scala> val meanTile = iat.focalMean(focalNeighborhood)
meanTile: geotrellis.raster.Tile = DoubleArrayTile([D@7e31c125,9,4)
scala> meanTile.getDouble(0, 0) // Should equal (1 + 7 + 9) / 3
res1: Double = 5.666666666666667
Documentation

Further examples and documentation of GeoTrellis use cases can be found in the docs/ folder. Scaladocs for the latest version of the project can be found here: http://geotrellis.github.com/scaladocs/latest/#geotrellis.package

References

- Geospatial Data and Analysis, by Aurelia Moser, Bill Day, and Jon Bruner. Published by O'Reilly Media, Inc., 2017.
- http://geotrellis.io/
12-29-2016
07:30 AM
1 Kudo
For beginners like myself: if you add a new processor or change the processor's class name, you will need to add or change that name in the org.apache.nifi.processor.Processor services file under <Home Dir>/Documents/nifi/ChakraProcessor/HWX/nifi-demo-processors/src/main/resources/META-INF/services. If you don't do this, the processor will not be loaded.
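For illustration, assuming a hypothetical processor class com.hwx.demo.MyProcessor, the services file simply lists one fully qualified processor class name per line:

# src/main/resources/META-INF/services/org.apache.nifi.processor.Processor
com.hwx.demo.MyProcessor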