Member since
03-16-2016
707
Posts
1753
Kudos Received
203
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 5181 | 09-21-2018 09:54 PM |
|  | 6594 | 03-31-2018 03:59 AM |
|  | 2001 | 03-31-2018 03:55 AM |
|  | 2207 | 03-31-2018 03:31 AM |
|  | 4908 | 03-27-2018 03:46 PM |
03-20-2017
06:00 PM
2 Kudos
@dvt isoft Not necessarily. That would be true only if your blocks were 100% filled with data. Say you have a 1024 MB file and the block size is 128 MB: that is exactly 8 blocks at 100% usage. Now say you have a 968 MB file and the block size is 128 MB: that is still 8 blocks, but the last block is only partially used. A block already used by one file cannot be reused for a different file, which is why loading many small files can be wasteful. Imagine 100 files of 100 KB each: they will occupy 100 blocks of 128 MB, over 10x more blocks than the examples above. You need to understand your files, block usage percentage, etc. The command you executed shows blocks x size per block ... I know that is confusing 🙂 +++ If this is helpful, please vote and accept as the best answer.
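To make the arithmetic concrete, here is a minimal sketch (not HDFS code, just the calculation) that reproduces the block counts from the examples above; the sizes are the ones used in this answer:

// Minimal sketch: HDFS block count vs. actual space used, per the examples above.
object BlockCount {
  val blockSize: Long = 128L * 1024 * 1024          // 128 MB

  def blocks(fileSize: Long): Long =
    (fileSize + blockSize - 1) / blockSize          // ceiling division

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    println(blocks(1024 * mb))                      // 8 blocks, all fully used
    println(blocks(968 * mb))                       // still 8 blocks, last one only partly used
    println((1 to 100).map(_ => blocks(100 * 1024L)).sum)  // 100 small files -> 100 blocks
  }
}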
02-22-2019
10:32 PM
I am also facing the same issue in HDP 3. Has it been resolved?
03-17-2017
01:05 PM
@Constantin Stanca Thanks for the follow-up. This was very helpful in addition to Umair's answer. I changed the query to pick up the max rowid for the new records. Again, a very helpful tip!
03-19-2019
07:12 AM
Hello @Constantin Stanca, where do we set these configuration variables for them to take effect?
03-17-2017
04:46 PM
Great answer, as usual! Just tested your suggestion and it works perfectly! Thank you so much!
03-30-2017
03:48 AM
5 Kudos
It seems that the following improvement addressed this requirement in version 2.6: https://issues.apache.org/jira/browse/YARN-1051
03-06-2017
10:16 PM
13 Kudos
This article demonstrates how easy it is to create a simple data flow with NiFi, stream the data to Hive, and visualize it via Zeppelin.

Pre-requisites

Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
My repo for the Apache NiFi "CSVToHive.xml" template, the customer demographics data (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and the customer_demographics_orc_table_ddl.hql database and table DDLs
Apache Hive 1.2.1, included with HDP 2.5.0
Hive configured to support ACID transactions, and the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the template CSVToHive.xml (screenshot: screen-shot-2017-03-06-at-74106-pm.png).

Create Data Folder and Upload Data Files

In your home directory, create /home/username/customer_demographics and upload the data files listed above. Grant your NiFi user appropriate access so it can read and process the files via the GetFile processor. Change the directory path specified in the GetFile processor to match your path. Also change the "Keep Source File" property of the GetFile processor to false, so that each file is processed once and then deleted (for testing purposes I kept it set to true). You will also have to adjust the Hive Metastore URI to match your environment's host name.

Import Zeppelin Notebook

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API (see the sketch at the end of this article for what that API does under the hood). As you noticed from the DDL, the Hive table is transactional. Enabling Hive's global ACID feature and creating the table as transactional and bucketed is a requirement for this to work. Also, the PutHiveStreaming processor requires Avro input, so the flow converts the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define the CSV file header; the latter option was selected for this demo.

Execute Zeppelin Notebook

During the demo you can switch from NiFi to Zeppelin to show how the data lands in Hive and how it is reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, a data analyst, or a data scientist can benefit from using Zeppelin.
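For readers curious what PutHiveStreaming does behind the scenes, here is a minimal conceptual sketch of the Hive Streaming API it relies on. This is not part of the demo flow; the metastore URI and the column list are placeholders, and the table names are only assumed to match the demo DDL above.

// Conceptual sketch only: roughly what the PutHiveStreaming processor does internally.
// Assumes the hive-hcatalog-streaming library on the classpath and a transactional,
// bucketed table demo.customer_demographics created by the DDL above.
import org.apache.hive.hcatalog.streaming.{DelimitedInputWriter, HiveEndPoint}

object HiveStreamingSketch {
  def main(args: Array[String]): Unit = {
    val endPoint = new HiveEndPoint(
      "thrift://<metastore-host>:9083",             // Hive Metastore URI (placeholder)
      "demo", "customer_demographics", null)        // unpartitioned table
    val connection = endPoint.newConnection(true)   // true = create partition if needed

    // Column names must match the table DDL; shown here as placeholders.
    val columns = Array("cd_demo_sk", "cd_gender", "cd_marital_status")
    val writer  = new DelimitedInputWriter(columns, ",", endPoint)

    val txnBatch = connection.fetchTransactionBatch(10, writer)
    txnBatch.beginNextTransaction()
    txnBatch.write("1,M,S".getBytes("UTF-8"))       // one CSV record per write
    txnBatch.commit()
    txnBatch.close()
    connection.close()
  }
}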
02-24-2017
05:45 PM
2 Kudos
@Faruk Berksoz Kafka - YES for all scenarios. Kafka is not for storing; Kafka is for transport. Your data still needs to land somewhere, e.g. as you mentioned, in HBase via Phoenix, but it could also be HDFS or Hive.

1. Yes. Flume is fine for ingest, but you still need something else to post to Kafka (a Kafka producer), e.g. Kafka Connect.
2. No. Spark Streaming is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
3. No. Same response as for #2.
4. No. Storm is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
5. Could work, but not recommended.

The most common architectures are:
a) Flume -> Kafka Connect -> Kafka; consumer applications are built using either Storm or Spark Streaming. Other options are available, but less used.
b) NiFi -> Kafka -> Storm; consumer applications are built using Storm; this is the Hortonworks DataFlow stack.
c) Others (Attunity, Syncsort) -> Kafka -> consumer applications built in Storm or Spark Streaming.

Since I am biased, I would say go with b), with Storm or Spark Streaming, or both. I'm saying that not only because I am biased, but because each of the components scales amazingly, and because I used Flume before and don't want to go back to it after seeing what I can achieve with NiFi. Additionally, HDF will evolve into an integrated platform for stream analytics with visual definition of flows and analytics requiring the least programming. You will be amazed by the functionality provided out of the box and via visual definition, and that is only months away. Flume is used less and less. NiFi does what Flume does and much more. With NiFi, writing producers to Kafka is trivial (see the sketch below for what a plain Kafka producer involves). Think beyond your current use case: what other use cases can this enable? One more thing: for landing data in HBase you can still use NiFi and its Phoenix connector to HBase, another scalable approach.
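To make the "you still need a Kafka producer" point concrete, here is a minimal sketch of a plain producer using the standard Kafka Java client from Scala. The broker address and topic name are placeholders; NiFi's Kafka publishing processors or Kafka Connect would normally take care of all of this for you.

// Minimal sketch of a standalone Kafka producer (what Flume/NiFi would otherwise do for you).
// Broker address and topic are placeholders for illustration only.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "<broker-host>:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Each ingested record becomes one message on the topic.
      producer.send(new ProducerRecord[String, String]("ingest-topic", "key-1", "some payload"))
    } finally {
      producer.close()
    }
  }
}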
02-22-2017
12:07 AM
9 Kudos
Introduction

Geospatial data is generated in huge volumes with the rise of the Internet of Things. IoT sensor networks are pushing geospatial data rates even higher. There has been an explosion of sensor networks on the ground, mobile devices carried by people or mounted on vehicles, drones flying overhead, tethered aerostats (such as Google's Project Loon), atmosats at high altitude, and microsats in orbit.

Opportunity

Geospatial analytics can provide us with the tools and methods we need to make sense of all that data and put it to use in solving problems we face at all scales.

Challenges

Geospatial work requires atypical data types (e.g., points, shapefiles, map projections), potentially many layers of detail to process and visualize, and specialized algorithms, not your typical ETL (extract, transform, load) or reporting work.

Apache Spark Role in Geospatial Development

While Spark might seem to be influencing the evolution of accessory tools, it's also becoming a default in the geospatial analytics industry. For example, consider the development of Azavea's open source geospatial library GeoTrellis. GeoTrellis was written in Scala and designed to handle large-scale raster operations. GeoTrellis recently adopted Spark as its distributed computation engine and, in combination with Amazon Web Services, scaled the existing raster processing to support even larger datasets. Spark brings amazing scope to the GeoTrellis project, and GeoTrellis supplies the geospatial capabilities that Spark lacks. This reciprocal partnership is an important contribution to the data engineering ecosystem, and particularly to the frameworks in development for supporting Big Data.

About GeoTrellis

GeoTrellis is a Scala library and framework that uses Spark to work with raster data. It is released under the Apache 2 License. GeoTrellis reads, writes, and operates on raster data as fast as possible. It implements many Map Algebra operations as well as vector-to-raster and raster-to-vector operations. GeoTrellis also provides tools to render rasters into PNGs or to store metadata about raster files as JSON. It aims to provide raster processing at web speeds (sub-second or less) with RESTful endpoints, as well as fast batch processing of large raster data sets.

Getting Started

GeoTrellis is currently available for Scala 2.11 and Spark 2.0+. To get started with SBT, simply add the following to your build.sbt file:

libraryDependencies += "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0"

geotrellis-raster is just one submodule that you can depend on (a broader dependency sketch follows below). To grab the latest snapshot build, add our snapshot repository:

resolvers += "LocationTech GeoTrellis Snapshots" at "https://repo.locationtech.org/content/repositories/geotrellis-snapshots"
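As a sketch of what a fuller dependency list might look like when you also want the Spark-enabled module from the list below (the Spark version shown is an assumption; align it with your cluster):

// Hypothetical build.sbt excerpt: geotrellis-raster plus the Spark-enabled module.
// Versions are assumptions; match the Spark version to your environment.
libraryDependencies ++= Seq(
  "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0",
  "org.locationtech.geotrellis" %% "geotrellis-spark"  % "1.0.0",
  "org.apache.spark"            %% "spark-core"        % "2.0.2" % Provided
)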
GeoTrellis Modules

geotrellis-proj4 : Coordinate Reference Systems and reprojection (Scala wrapper around Proj4j)
geotrellis-vector : Vector data types and operations (Scala wrapper around JTS)
geotrellis-raster : Raster data types and operations
geotrellis-spark : Geospatially enables Spark; save to and from HDFS
geotrellis-s3 : S3 backend for geotrellis-spark
geotrellis-accumulo : Accumulo backend for geotrellis-spark
geotrellis-cassandra : Cassandra backend for geotrellis-spark
geotrellis-hbase : HBase backend for geotrellis-spark
geotrellis-spark-etl : Utilities for writing ETL (Extract-Transform-Load), or "ingest", applications for geotrellis-spark
geotrellis-geotools : Conversions to and from GeoTools Vector and Raster data
geotrellis-geomesa : Experimental GeoMesa integration
geotrellis-geowave : Experimental GeoWave integration
geotrellis-shapefile : Read shapefiles into GeoTrellis data types via GeoTools
geotrellis-slick : Read vector data out of PostGIS via Lightbend Slick
geotrellis-vectortile : Experimental vector tile support, including reading and writing
geotrellis-raster-testkit : Testkit for testing geotrellis-raster types
geotrellis-vector-testkit : Testkit for testing geotrellis-vector types
geotrellis-spark-testkit : Testkit for testing geotrellis-spark code

A more complete feature list can be found in the GeoTrellis Features section at https://github.com/locationtech/geotrellis.

Hello Raster with GeoTrellis

scala> import geotrellis.raster._
import geotrellis.raster._
scala> import geotrellis.raster.op.focal._
import geotrellis.raster.op.focal._
scala> val nd = NODATA
nd: Int = -2147483648
scala> val input = Array[Int](
     |        nd, 7, 1, 1, 3, 5, 9, 8, 2,
     |         9, 1, 1, 2, 2, 2, 4, 3, 5,
     |         3, 8, 1, 3, 3, 3, 1, 2, 2,
     |         2, 4, 7, 1, nd, 1, 8, 4, 3)
input: Array[Int] = Array(-2147483648, 7, 1, 1, 3, 5, 9, 8, 2, 9, 1, 1, 2, 2, 2, 4, 3, 5, 3, 8, 1, 3, 3, 3, 1, 2, 2, 2, 4, 7, 1, -2147483648, 1, 8, 4, 3)
scala> val iat = IntArrayTile(input, 9, 4) // 9 and 4 here specify columns and rows
iat: geotrellis.raster.IntArrayTile = IntArrayTile([I@278434d0,9,4)
// The asciiDraw method is mostly useful when you're working with small tiles
// which can be taken in at a glance
scala> iat.asciiDraw()
res0: String =
" ND 7 1 1 3 5 9 8 2
9 1 1 2 2 2 4 3 5
3 8 1 3 3 3 1 2 2
2 4 7 1 ND 1 8 4 3
"
scala> val focalNeighborhood = Square(1) // a 3x3 square neighborhood
focalNeighborhood: geotrellis.raster.op.focal.Square =
O O O
O O O
O O O
scala> val meanTile = iat.focalMean(focalNeighborhood)
meanTile: geotrellis.raster.Tile = DoubleArrayTile([D@7e31c125,9,4)
scala> meanTile.getDouble(0, 0) // Should equal (1 + 7 + 9) / 3
res1: Double = 5.666666666666667
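The session above shows a focal operation; to round it out, here is a hedged sketch of the local (cell-by-cell) Map Algebra operations mentioned earlier, meant to be pasted into the same REPL session. It reuses the iat tile and the imports from above; map, combine, isNoData, and NODATA all come from geotrellis.raster.

// Illustrative local Map Algebra sketch; reuses the `iat` tile defined above.

// Double every data cell, leaving NODATA cells untouched.
val doubled = iat.map { z => if (isNoData(z)) z else z * 2 }

// Cell-by-cell (local) addition of two tiles, propagating NODATA.
val summed = iat.combine(doubled) { (a, b) =>
  if (isNoData(a) || isNoData(b)) NODATA else a + b
}

println(summed.asciiDraw())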
Documentation

Further examples and documentation of GeoTrellis use cases can be found in the docs/ folder. Scaladocs for the latest version of the project can be found here: http://geotrellis.github.com/scaladocs/latest/#geotrellis.package

References

Geospatial Data and Analysis, by Aurelia Moser, Bill Day, and Jon Bruner. Published by O'Reilly Media, Inc., 2017.
http://geotrellis.io/