Member since
03-16-2016
707
Posts
1753
Kudos Received
203
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 5181 | 09-21-2018 09:54 PM |
|  | 6594 | 03-31-2018 03:59 AM |
|  | 2001 | 03-31-2018 03:55 AM |
|  | 2207 | 03-31-2018 03:31 AM |
|  | 4908 | 03-27-2018 03:46 PM |
03-20-2017
06:00 PM
2 Kudos
@dvt isoft Not necessarily. That would be true only if your blocks were 100% filled with data. Say you have a 1024 MB file and the block size is 128 MB: that is exactly 8 blocks at 100% usage. Now say you have a 968 MB file and the block size is 128 MB: that is still 8 blocks, but the last block is only partially used. A block already used by one file cannot be reused for a different file, which is why loading many small files can be wasteful. Imagine 100 files of 100 KB each: they will occupy 100 blocks of 128 MB, over 10x more blocks than the examples above. You need to understand your files, block usage percentage, etc. The command you executed shows blocks x size per block ... I know that is confusing 🙂 +++ If this is helpful, please vote and accept as the best answer.
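To make the arithmetic concrete, here is a minimal sketch (not HDFS code, just the calculation) that reproduces the block counts from the examples above; the sizes are the ones used in this answer:

// Minimal sketch: HDFS block count vs. actual space used, per the examples above.
object BlockCount {
  val blockSize: Long = 128L * 1024 * 1024          // 128 MB

  def blocks(fileSize: Long): Long =
    (fileSize + blockSize - 1) / blockSize          // ceiling division

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    println(blocks(1024 * mb))                      // 8 blocks, all fully used
    println(blocks(968 * mb))                       // still 8 blocks, last one only partly used
    println((1 to 100).map(_ => blocks(100 * 1024L)).sum)  // 100 small files -> 100 blocks
  }
}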
02-22-2019
10:32 PM
I am also facing the same issue in HDP 3. Has it been resolved?
03-17-2017
01:05 PM
@Constantin Stanca Thanks for the follow-up. This was very helpful in addition to Umair's answer. I changed the query to pick up the max rowid for the new records. Again, a very helpful tip!
03-19-2019
07:12 AM
Hello @Constantin Stanca, where do we set these configuration variables for them to take effect?
03-17-2017
04:46 PM
Great answer, as usual! Just tested your suggestion and it works perfectly! Thank you so much!
03-30-2017
03:48 AM
5 Kudos
It seems that the following improvement addressed this requirement in version 2.6: https://issues.apache.org/jira/browse/YARN-1051
03-06-2017
10:16 PM
13 Kudos
This article demonstrates how easy it is to create a simple data flow with NiFi, stream the data to Hive, and visualize it via Zeppelin.

Pre-requisites

Apache NiFi 1.1.0.2.1.0.0-165, included with Hortonworks DataFlow 2.1.0
Apache Zeppelin 0.6.0.2.5.0.0-1245, included with Hortonworks Data Platform 2.5.0
My repo for the Apache NiFi "CSVToHive.xml" template, the customer demographics data (customer_demographics.header, customer_demographics.csv), the "Customer Demographics.json" Apache Zeppelin notebook, and the customer_demographics_orc_table_ddl.hql database and table DDLs
Apache Hive 1.2.1, included with HDP 2.5.0
Hive configured to support ACID transactions, and the demo database and customer_demographics table created using customer_demographics_orc_table_ddl.hql

Steps

Import NiFi Template

Assuming NiFi is started and the UI is available at <NiFiMasterHostName>:8086/nifi, import the template CSVToHive.xml (screenshot: screen-shot-2017-03-06-at-74106-pm.png).

Create Data Folder and Upload Data Files

In your home directory, create /home/username/customer_demographics and upload the data files listed above. Grant your NiFi user appropriate access so it can read and process the files via the GetFile processor. Change the directory path specified in the GetFile processor to match your path. Also change the "Keep Source File" property of the GetFile processor to false, so that each file is processed once and then deleted (for testing purposes I kept it set to true). You will also have to adjust the Hive Metastore URI to match your environment's host name.

Import Zeppelin Notebook

Execute NiFi Flow

Start all processors, or start one processor at a time and follow the flow. The outcome is that each record of your CSV file is posted to the Hive demo.customer_demographics table via the Hive Streaming API (see the sketch at the end of this article for what that API does under the hood). As you noticed from the DDL, the Hive table is transactional. Enabling Hive's global ACID feature and creating the table as transactional and bucketed is a requirement for this to work. Also, the PutHiveStreaming processor requires Avro input, so the flow converts the CSV to Avro. At one of the intermediary steps we could either infer the Avro schema or define the CSV file header; the latter option was selected for this demo.

Execute Zeppelin Notebook

During the demo you can switch from NiFi to Zeppelin to show how the data lands in Hive and how it is reflected in Zeppelin by re-executing the HiveQL blocks. The markdown (md) and shell (sh) blocks were included only for demonstration purposes, showing how a data engineer, a data analyst, or a data scientist can benefit from using Zeppelin.
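For readers curious what PutHiveStreaming does behind the scenes, here is a minimal conceptual sketch of the Hive Streaming API it relies on. This is not part of the demo flow; the metastore URI and the column list are placeholders, and the table names are only assumed to match the demo DDL above.

// Conceptual sketch only: roughly what the PutHiveStreaming processor does internally.
// Assumes the hive-hcatalog-streaming library on the classpath and a transactional,
// bucketed table demo.customer_demographics created by the DDL above.
import org.apache.hive.hcatalog.streaming.{DelimitedInputWriter, HiveEndPoint}

object HiveStreamingSketch {
  def main(args: Array[String]): Unit = {
    val endPoint = new HiveEndPoint(
      "thrift://<metastore-host>:9083",             // Hive Metastore URI (placeholder)
      "demo", "customer_demographics", null)        // unpartitioned table
    val connection = endPoint.newConnection(true)   // true = create partition if needed

    // Column names must match the table DDL; shown here as placeholders.
    val columns = Array("cd_demo_sk", "cd_gender", "cd_marital_status")
    val writer  = new DelimitedInputWriter(columns, ",", endPoint)

    val txnBatch = connection.fetchTransactionBatch(10, writer)
    txnBatch.beginNextTransaction()
    txnBatch.write("1,M,S".getBytes("UTF-8"))       // one CSV record per write
    txnBatch.commit()
    txnBatch.close()
    connection.close()
  }
}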
02-24-2017
05:45 PM
2 Kudos
@Faruk Berksoz Kafka - YES for all scenarios. Kafka is not for storing; Kafka is for transport. Your data still needs to land somewhere, e.g. as you mentioned, in HBase via Phoenix, but it could also be HDFS or Hive.

1. Yes. Flume is fine for ingest, but you still need something else to post to Kafka (a Kafka producer), e.g. Kafka Connect.
2. No. Spark Streaming is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
3. No. Same response as for #2.
4. No. Storm is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.
5. Could work, but not recommended.

The most common architectures are:
a) Flume -> Kafka Connect -> Kafka; consumer applications are built using either Storm or Spark Streaming. Other options are available, but less used.
b) NiFi -> Kafka -> Storm; consumer applications are built using Storm; this is the Hortonworks DataFlow stack.
c) Others (Attunity, Syncsort) -> Kafka -> consumer applications built in Storm or Spark Streaming.

Since I am biased, I would say go with b), with Storm or Spark Streaming, or both. I'm saying that not only because I am biased, but because each of the components scales amazingly, and because I used Flume before and don't want to go back to it after seeing what I can achieve with NiFi. Additionally, HDF will evolve into an integrated platform for stream analytics with visual definition of flows and analytics requiring the least programming. You will be amazed by the functionality provided out of the box and via visual definition, and that is only months away. Flume is used less and less. NiFi does what Flume does and much more. With NiFi, writing producers to Kafka is trivial (see the sketch below for what a plain Kafka producer involves). Think beyond your current use case: what other use cases can this enable? One more thing: for landing data in HBase you can still use NiFi and its Phoenix connector to HBase, another scalable approach.
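To make the "you still need a Kafka producer" point concrete, here is a minimal sketch of a plain producer using the standard Kafka Java client from Scala. The broker address and topic name are placeholders; NiFi's Kafka publishing processors or Kafka Connect would normally take care of all of this for you.

// Minimal sketch of a standalone Kafka producer (what Flume/NiFi would otherwise do for you).
// Broker address and topic are placeholders for illustration only.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "<broker-host>:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Each ingested record becomes one message on the topic.
      producer.send(new ProducerRecord[String, String]("ingest-topic", "key-1", "some payload"))
    } finally {
      producer.close()
    }
  }
}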
02-22-2017
12:07 AM
9 Kudos
Introduction

Geospatial data is generated in huge volumes with the rise of the Internet of Things. IoT sensor networks are pushing geospatial data rates even higher. There has been an explosion of sensor networks on the ground, mobile devices carried by people or mounted on vehicles, drones flying overhead, tethered aerostats (such as Google's Project Loon), atmosats at high altitude, and microsats in orbit.

Opportunity

Geospatial analytics can provide us with the tools and methods we need to make sense of all that data and put it to use in solving problems we face at all scales.

Challenges

Geospatial work requires atypical data types (e.g., points, shapefiles, map projections), potentially many layers of detail to process and visualize, and specialized algorithms, not your typical ETL (extract, transform, load) or reporting work.

Apache Spark Role in Geospatial Development

While Spark might seem to be influencing the evolution of accessory tools, it's also becoming a default in the geospatial analytics industry. For example, consider the development of Azavea's open source geospatial library GeoTrellis. GeoTrellis was written in Scala and designed to handle large-scale raster operations. GeoTrellis recently adopted Spark as its distributed computation engine and, in combination with Amazon Web Services, scaled the existing raster processing to support even larger datasets. Spark brings amazing scope to the GeoTrellis project, and GeoTrellis supplies the geospatial capabilities that Spark lacks. This reciprocal partnership is an important contribution to the data engineering ecosystem, and particularly to the frameworks in development for supporting Big Data.

About GeoTrellis

GeoTrellis is a Scala library and framework that uses Spark to work with raster data. It is released under the Apache 2 License. GeoTrellis reads, writes, and operates on raster data as fast as possible. It implements many Map Algebra operations as well as vector-to-raster and raster-to-vector operations. GeoTrellis also provides tools to render rasters into PNGs or to store metadata about raster files as JSON. It aims to provide raster processing at web speeds (sub-second or less) with RESTful endpoints, as well as fast batch processing of large raster data sets.

Getting Started

GeoTrellis is currently available for Scala 2.11 and Spark 2.0+. To get started with SBT, simply add the following to your build.sbt file:

libraryDependencies += "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0"

geotrellis-raster is just one submodule that you can depend on (a broader dependency sketch follows below). To grab the latest snapshot build, add our snapshot repository:

resolvers += "LocationTech GeoTrellis Snapshots" at "https://repo.locationtech.org/content/repositories/geotrellis-snapshots"
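As a sketch of what a fuller dependency list might look like when you also want the Spark-enabled module from the list below (the Spark version shown is an assumption; align it with your cluster):

// Hypothetical build.sbt excerpt: geotrellis-raster plus the Spark-enabled module.
// Versions are assumptions; match the Spark version to your environment.
libraryDependencies ++= Seq(
  "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0",
  "org.locationtech.geotrellis" %% "geotrellis-spark"  % "1.0.0",
  "org.apache.spark"            %% "spark-core"        % "2.0.2" % Provided
)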
GeoTrellis Modules

geotrellis-proj4 : Coordinate Reference Systems and reprojection (Scala wrapper around Proj4j)
geotrellis-vector : Vector data types and operations (Scala wrapper around JTS)
geotrellis-raster : Raster data types and operations
geotrellis-spark : Geospatially enables Spark; save to and from HDFS
geotrellis-s3 : S3 backend for geotrellis-spark
geotrellis-accumulo : Accumulo backend for geotrellis-spark
geotrellis-cassandra : Cassandra backend for geotrellis-spark
geotrellis-hbase : HBase backend for geotrellis-spark
geotrellis-spark-etl : Utilities for writing ETL (Extract-Transform-Load), or "ingest", applications for geotrellis-spark
geotrellis-geotools : Conversions to and from GeoTools Vector and Raster data
geotrellis-geomesa : Experimental GeoMesa integration
geotrellis-geowave : Experimental GeoWave integration
geotrellis-shapefile : Read shapefiles into GeoTrellis data types via GeoTools
geotrellis-slick : Read vector data out of PostGIS via Lightbend Slick
geotrellis-vectortile : Experimental vector tile support, including reading and writing
geotrellis-raster-testkit : Testkit for testing geotrellis-raster types
geotrellis-vector-testkit : Testkit for testing geotrellis-vector types
geotrellis-spark-testkit : Testkit for testing geotrellis-spark code

A more complete feature list can be found in the GeoTrellis Features section at https://github.com/locationtech/geotrellis.

Hello Raster with GeoTrellis

scala> import geotrellis.raster._
import geotrellis.raster._
scala> import geotrellis.raster.op.focal._
import geotrellis.raster.op.focal._
scala> val nd = NODATA
nd: Int = -2147483648
scala> val input = Array[Int](
     |        nd, 7, 1, 1, 3, 5, 9, 8, 2,
     |         9, 1, 1, 2, 2, 2, 4, 3, 5,
     |         3, 8, 1, 3, 3, 3, 1, 2, 2,
     |         2, 4, 7, 1, nd, 1, 8, 4, 3)
input: Array[Int] = Array(-2147483648, 7, 1, 1, 3, 5, 9, 8, 2, 9, 1, 1, 2, 2, 2, 4, 3, 5, 3, 8, 1, 3, 3, 3, 1, 2, 2, 2, 4, 7, 1, -2147483648, 1, 8, 4, 3)
scala> val iat = IntArrayTile(input, 9, 4) // 9 and 4 here specify columns and rows
iat: geotrellis.raster.IntArrayTile = IntArrayTile([I@278434d0,9,4)
// The asciiDraw method is mostly useful when you're working with small tiles
// which can be taken in at a glance
scala> iat.asciiDraw()
res0: String =
" ND 7 1 1 3 5 9 8 2
9 1 1 2 2 2 4 3 5
3 8 1 3 3 3 1 2 2
2 4 7 1 ND 1 8 4 3
"
scala> val focalNeighborhood = Square(1) // a 3x3 square neighborhood
focalNeighborhood: geotrellis.raster.op.focal.Square =
O O O
O O O
O O O
scala> val meanTile = iat.focalMean(focalNeighborhood)
meanTile: geotrellis.raster.Tile = DoubleArrayTile([D@7e31c125,9,4)
scala> meanTile.getDouble(0, 0) // Should equal (1 + 7 + 9) / 3
res1: Double = 5.666666666666667
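The session above shows a focal operation; to round it out, here is a hedged sketch of the local (cell-by-cell) Map Algebra operations mentioned earlier, meant to be pasted into the same REPL session. It reuses the iat tile and the imports from above; map, combine, isNoData, and NODATA all come from geotrellis.raster.

// Illustrative local Map Algebra sketch; reuses the `iat` tile defined above.

// Double every data cell, leaving NODATA cells untouched.
val doubled = iat.map { z => if (isNoData(z)) z else z * 2 }

// Cell-by-cell (local) addition of two tiles, propagating NODATA.
val summed = iat.combine(doubled) { (a, b) =>
  if (isNoData(a) || isNoData(b)) NODATA else a + b
}

println(summed.asciiDraw())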
Documentation

Further examples and documentation of GeoTrellis use cases can be found in the docs/ folder. Scaladocs for the latest version of the project can be found here: http://geotrellis.github.com/scaladocs/latest/#geotrellis.package

References

Geospatial Data and Analysis, by Aurelia Moser, Bill Day, and Jon Bruner. Published by O'Reilly Media, Inc., 2017.
http://geotrellis.io/