Member since 03-16-2016
707 Posts
1753 Kudos Received
203 Solutions
02-22-2017
12:07 AM
9 Kudos
Introduction

Geospatial data is generated in huge volumes with the rise of the Internet of Things, and IoT sensor networks are pushing geospatial data rates even higher. There has been an explosion of sensor networks on the ground, mobile devices carried by people or mounted on vehicles, drones flying overhead, high-altitude balloons (such as Google's Project Loon) and tethered aerostats, atmosats at high altitude, and microsats in orbit.

Opportunity

Geospatial analytics can provide the tools and methods we need to make sense of all that data and put it to use in solving problems at every scale.

Challenges

Geospatial work requires atypical data types (e.g., points, shapefiles, map projections), potentially many layers of detail to process and visualize, and specialized algorithms - not your typical ETL (extract, transform, load) or reporting work.

Apache Spark Role in Geospatial Development

While Spark might seem to be influencing the evolution of
accessory tools, it’s also becoming a default in the geospatial analytics
industry. For example, consider the development of Azavea’s open source
geospatial library GeoTrellis. GeoTrellis was written in Scala and designed to
handle large-scale raster operations. GeoTrellis recently adopted Spark as its
distributed computation engine and, in combination with Amazon Web Services,
scaled the existing raster processing to support even larger datasets. Spark
brings amazing scope to the GeoTrellis project, and GeoTrellis supplies the
geospatial capabilities that Spark lacks. This reciprocal partnership is an
important contribution to the data engineering ecosystem, and particularly to
the frameworks in development for supporting Big Data.

About GeoTrellis

GeoTrellis is a Scala library and framework that uses Spark to work with raster data. It is released under the Apache 2 License. GeoTrellis reads, writes, and operates on raster data as fast as possible. It implements many Map Algebra operations as well as vector-to-raster and raster-to-vector operations. GeoTrellis also provides tools to render rasters into PNGs and to store metadata about raster files as JSON. It aims to provide raster processing at web speed (sub-second or less) with RESTful endpoints as well as fast batch processing of large raster data sets.

Getting Started

GeoTrellis is currently available for Scala 2.11 and Spark 2.0+. To get started with SBT, simply add the following to your build.sbt file:

libraryDependencies += "org.locationtech.geotrellis" %% "geotrellis-raster" % "1.0.0"

geotrellis-raster is just one submodule that you can depend on. To grab the latest snapshot build, add our snapshot repository:

resolvers += "LocationTech GeoTrellis Snapshots" at "https://repo.locationtech.org/content/repositories/geotrellis-snapshots"

GeoTrellis Modules
geotrellis-proj4 :
Coordinate reference systems and reprojection (Scala wrapper around Proj4J)
geotrellis-vector :
Vector data types and operations (Scala wrapper around JTS)
geotrellis-raster :
Raster data types and operations
geotrellis-spark :
Geospatially enables Spark; save to and from HDFS
geotrellis-s3 :
S3 backend for geotrellis-spark
geotrellis-accumulo :
Accumulo backend for geotrellis-spark
geotrellis-cassandra :
Cassandra backend for geotrellis-spark
geotrellis-hbase :
HBase backend for geotrellis-spark
geotrellis-spark-etl :
Utilities for writing ETL (Extract-Transform-Load), or "ingest"
applications for geotrellis-spark
geotrellis-geotools :
Conversions to and from GeoTools Vector and Raster data
geotrellis-geomesa :
Experimental GeoMesa integration
geotrellis-geowave :
Experimental GeoWave integration
geotrellis-shapefile :
Read shapefiles into GeoTrellis data types via GeoTools
geotrellis-slick :
Read vector data out of PostGIS via Lightbend Slick
geotrellis-vectortile :
Experimental vector tile support, including reading and writing
geotrellis-raster-testkit :
Testkit for testing geotrellis-raster types
geotrellis-vector-testkit :
Testkit for testing geotrellis-vector types
geotrellis-spark-testkit :
Testkit for testing geotrellis-spark code

A more complete feature list can be found in the GeoTrellis Features section at https://github.com/locationtech/geotrellis.

Hello Raster with GeoTrellis

scala> import geotrellis.raster._
import geotrellis.raster._
scala> import geotrellis.raster.op.focal._
import geotrellis.raster.op.focal._
scala> val nd = NODATA
nd: Int = -2147483648
scala> val input = Array[Int](
     |   nd, 7, 1, 1, 3, 5, 9, 8, 2,
     |   9, 1, 1, 2, 2, 2, 4, 3, 5,
     |   3, 8, 1, 3, 3, 3, 1, 2, 2,
     |   2, 4, 7, 1, nd, 1, 8, 4, 3)
input: Array[Int] = Array(-2147483648, 7, 1, 1, 3, 5, 9, 8, 2, 9, 1, 1, 2, 2, 2, 4, 3, 5, 3, 8, 1, 3, 3, 3, 1, 2, 2, 2, 4, 7, 1, -2147483648, 1, 8, 4, 3)
scala> val iat = IntArrayTile(input, 9, 4) // 9 and 4 here specify columns and rows
iat: geotrellis.raster.IntArrayTile = IntArrayTile([I@278434d0,9,4)
// The asciiDraw method is mostly useful when you're working with small tiles
// which can be taken in at a glance
scala> iat.asciiDraw()
res0: String =
" ND 7 1 1 3 5 9 8 2
9 1 1 2 2 2 4 3 5
3 8 1 3 3 3 1 2 2
2 4 7 1 ND 1 8 4 3
"
scala> val focalNeighborhood = Square(1) // a 3x3 square neighborhood
focalNeighborhood: geotrellis.raster.op.focal.Square =
O O O
O O O
O O O
scala> val meanTile = iat.focalMean(focalNeighborhood)
meanTile: geotrellis.raster.Tile = DoubleArrayTile([D@7e31c125,9,4)
scala> meanTile.getDouble(0, 0) // Should equal (1 + 7 + 9) / 3
res1: Double = 5.666666666666667
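Other focal (neighborhood) operations follow the same pattern as focalMean. As a quick sketch only (continuing the same REPL session; focalMax belongs to the same focal operation family, so treat the exact output formatting as illustrative):

scala> val maxTile = iat.focalMax(focalNeighborhood)  // per-cell maximum over the 3x3 window
scala> maxTile.get(0, 0)                              // maximum among the neighbors of the top-left cell: 9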
Documentation

Further examples and documentation of GeoTrellis use cases can be found in the docs/ folder of the repository. Scaladocs for the latest version of the project can be found here: http://geotrellis.github.com/scaladocs/latest/#geotrellis.package

References

Geospatial Data and Analysis by Aurelia Moser, Bill Day, and Jon Bruner, published by O'Reilly Media, Inc., 2017
http://geotrellis.io/
12-29-2016
06:15 AM
1 Kudo
This is an issue with Ambari versions prior to 2.2.0; the article should have clarified that. The JIRA specifies that it is fixed in 2.2.0. However, search engines will not surface this link in searches, so the exposure of the article is extremely limited.
12-26-2016
09:02 PM
2 Kudos
Introduction
h2o is a package for running H2O via its REST API from within R. This package allows the user to run basic H2O commands using R commands. No actual data is stored in the R workspace, and no actual work is carried out by R. R only saves the named objects, which uniquely identify the data set, model, etc. on the server. When the user makes a request, R queries the server via the REST API, which returns a JSON file with the relevant information that R then displays in the console.
Scope
I tested this installation guide on CentOS 7.2, but it
should work on similar RedHat/Fedora/CentOS distributions.
Steps
1. Install R
sudo yum install R
2. Install Java
https://www.java.com/en/download/help/linux_x64rpm_install.xml
3. Start R and install dependencies
install.packages("RCurl")
install.packages("bitops")
install.packages("rjson")
install.packages("statmod")
install.packages("tools")
4. Install h2o package and load library for use
install.packages("h2o")
library(h2o)
If this is your first time using CRAN, it will ask for a
mirror to use. If you want H2O installed site-wide (i.e., usable by all users
on that machine), run R as root, sudo R, then type
install.packages("h2o").
5. Test H2O installation
Type:
library(h2o)
If nothing complains, launch h2o:
h2o.init()
If all went well then you’ll see lots of output about how it
is starting up H2O on your behalf, and then it should tell you all about your
cluster. If not, the error message should be telling you what dependency is
missing, or what the problem is. Post a note to this article and I will get
back to you.
Tips
#1 - The version of H2O on CRAN might be up to a month or two
behind the latest and greatest. Unless you are affected by a bug that you know
has been fixed, don’t worry about it.
#2 - By default, h2o.init() will only use two cores on your machine and maybe a quarter of your system memory. To resize resources, use h2o.shutdown() and start it again:
a) using all your cores:
h2o.init(nthreads = -1)
b) using all your cores and 4 GB:
h2o.init(nthreads = -1, max_mem_size = "4g")
#3 - To run H2O on your local machine, you could call h2o.init without any
arguments, and H2O will be automatically launched at localhost:54321, where the
IP is "127.0.0.1" and the port is 54321.
#4 - If H2O is running on a
cluster, you must provide the IP and port of the remote machine as arguments to
the h2o.init() call. The operation will be done on the server associated with
the data object where H2O is running, not within the R environment.

Tutorials
H2O Tutorial on the Hortonworks Data Platform Sandbox:
http://hortonworks.com/blog/oxdata-h2o-tutorial-hortonworks-sandbox/
Walk-Though Tutorials for Web UI:
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/tutorial/top.html
12-23-2016
02:59 AM
12 Kudos
Introduction

The producer sends data directly to the broker that is the leader for the partition, without any intervening routing tier.

Optimization Approach

Batching is one of the big drivers of efficiency. To enable batching, the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and fewer, larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput. To find the optimal batch size and latency, iterative testing supported by producer statistics monitoring is needed (a configuration sketch follows at the end of this article, just before the References).

Enable Monitoring

Start the producer with the JMX parameters enabled:

JMX_PORT=10102 bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testtopic

Producer Metrics

Use the jconsole application via JMX at port 10102. Tip: run jconsole remotely to avoid impact on the broker machine. The metrics appear in the MBeans tab. The clientId parameter is the producer client ID for which you want the statistics.

kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=console-producer

This MBean gives the rate of producer requests taking place as well as the latencies involved in that process. It gives latencies as a mean and as the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles. It also gives the time taken to produce the data as a mean, a one-minute average, a five-minute average, and a fifteen-minute average, as well as the count.

kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestSize,clientId=console-producer

This MBean gives the request size for the producer: the count, mean, max, min, standard deviation, and the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles of request sizes.

kafka.producer:type=ProducerStats,name=FailedSendsPerSec,clientId=console-producer

This gives the number of failed sends per second: the count, mean rate, one-minute average, five-minute average, and fifteen-minute average of failed requests per second.

kafka.producer:type=ProducerStats,name=SerializationErrorsPerSec,clientId=console-producer

This gives the number of serialization errors per second: the count, mean rate, one-minute average, five-minute average, and fifteen-minute average of serialization errors per second.

kafka.producer:type=ProducerTopicMetrics,name=MessagesPerSec,clientId=console-producer

This gives the number of messages produced per second: the count, mean rate, one-minute average, five-minute average, and fifteen-minute average of messages produced per second.
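To relate the Optimization Approach section back to concrete settings: when writing your own producer (rather than the console producer used above), the batch size and latency bound are exposed as configuration properties. The following is a minimal, illustrative sketch using the newer Java producer API from Scala; the broker address, topic name, and the 64 KB / 10 ms values are example assumptions, not recommendations:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("batch.size", "65536")   // accumulate up to 64 KB per partition batch before sending
props.put("linger.ms", "10")       // but never wait more than 10 ms for a batch to fill

val producer = new KafkaProducer[String, String](props)
(1 to 1000).foreach { i =>
  producer.send(new ProducerRecord[String, String]("testtopic", s"key-$i", s"value-$i"))
}
producer.close()

Re-running the JMX measurements described above while varying batch.size and linger.ms is one way to carry out the iterative testing mentioned earlier.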
References

https://kafka.apache.org/documentation.html#monitoring
Apache Kafka Cookbook by Saurabh Minni, 2015
12-20-2016
03:14 AM
13 Kudos
Pre-requisites

Hortonworks Data Platform 2.5 on CentOS 7.2
Python distribution that comes with HDP 2.5 - Python 2.7.5

Download and install pip

#wget https://bootstrap.pypa.io/get-pip.py
#python get-pip.py
Install add-on package

#pip install requests

Start Python CLI (default version)

#python

Import pre-reqs

>>>import requests
>>>import json
>>>import sys

Environment Variables

Set the Ambari domain variable to the IP address or FQDN of your Ambari node.

>>>AMBARI_DOMAIN = '127.0.0.1'

Set the Ambari port, Ambari user, and password variables to match your specifics.

>>>AMBARI_PORT = '8080'
>>>AMBARI_USER_ID = 'admin'
>>>AMBARI_USER_PW = 'admin'

Set the following variable to the IP address or FQDN of your ResourceManager node.

>>>RM_DOMAIN = '127.0.0.1'

Set the Resource Manager port variable.

>>>RM_PORT = '8088'

Ambari REST API Call Examples

Let's find the cluster name, cluster version, stack, and stack version:

>>>restAPI = '/api/v1/clusters'
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>CLUSTER_NAME = json_data["items"][0]["Clusters"]["cluster_name"]
>>>print(CLUSTER_NAME)
>>>CLUSTER_VERSION = json_data["items"][0]["Clusters"]["version"]
>>>print(CLUSTER_VERSION)
>>>STACK = CLUSTER_VERSION.split('-')[0]
>>>print(STACK)
>>>STACK_VERSION = CLUSTER_VERSION.split('-')[1]
>>>print(STACK_VERSION)
>>>CLUSTER_INFO = json_data
>>>print(CLUSTER_INFO)

Let's find the HDP stack repository:

>>>restAPI = "/api/v1/stacks/"+STACK+"/versions/"+STACK_VERSION+"/operating_systems/redhat7/repositories/"+CLUSTER_VERSION
>>>url = "http://"+AMBARI_DOMAIN+":"+AMBARI_PORT+restAPI
>>>r = requests.get(url, auth=(AMBARI_USER_ID, AMBARI_USER_PW))
>>>json_data = json.loads(r.text)
>>>print(json_data)
>>>REPOSITORY_NAME = json_data["Repositories"]["latest_base_url"]
>>>print(REPOSITORY_NAME)

A more elegant approach is to create utility functions. See my repo: https://github.com/cstanca1/HDP-restAPI/. The restAPIFunctions.py script in the repo defines a number of useful functions that I have collected over time.

Run restAPIFunctions.py

The same example presented above can now be implemented with a single-line call that returns CLUSTER_NAME, CLUSTER_VERSION and CLUSTER_INFO using the getClusterVersionAndName() function:

>>>CLUSTER_NAME,CLUSTER_VERSION,CLUSTER_INFO = getClusterVersionAndName()
>>>print(CLUSTER_NAME)
>>>print(CLUSTER_VERSION)
>>>print(CLUSTER_INFO)

Resource Manager REST API Call Examples

>>>RM_INFO=getResourceManagerInfo()
>>>RM_SCHEDULER_INFO=getRMschedulerInfo()
>>>print(RM_INFO)
>>>print(RM_SCHEDULER_INFO)

Other Functions

These are other functions included in the restAPIFunctions.py script:

getServiceActualConfigurations()
getClusterRepository()
getAmbariHosts()
getResourceManagerInfo()
getRMschedulerInfo()
getAppsSummary()
getNodesSummary()
getServiceConfigTypes()
getResourceManagerMetrics()
getCheckClusterForRollingUpgrades()
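The Ambari REST API is client-agnostic, so the same calls can be made from any HTTP client, not only Python. Purely as an illustration (this is not part of the article or the repo above), here is a minimal sketch of the first example from the JVM side in Scala, using only JDK classes (Java 8 Base64); the host, port, and credentials mirror the variables set earlier:

import java.net.{HttpURLConnection, URL}
import java.util.Base64
import scala.io.Source

val ambariDomain = "127.0.0.1"   // AMBARI_DOMAIN
val ambariPort = "8080"          // AMBARI_PORT
val credentials = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))

// GET /api/v1/clusters with HTTP basic authentication, the same call as requests.get(url, auth=(...))
val url = new URL(s"http://$ambariDomain:$ambariPort/api/v1/clusters")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestProperty("Authorization", s"Basic $credentials")
val body = Source.fromInputStream(conn.getInputStream).mkString
conn.disconnect()
println(body)   // raw JSON; parse it with any JSON library to extract cluster_name, version, etc.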
12-18-2016
10:17 PM
13 Kudos
Background

Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true (or disabled in later versions by setting this to false). Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.

Goal

The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough).

Scope

Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to the JVM, LLVM, GPUs, NVRAM, etc.

Optimization Features

Off-Heap Memory Management: a binary in-memory data representation (aka the Tungsten row format) and explicit memory management
Cache Locality: cache-aware computations with cache-aware layouts for high cache hit rates
Whole-Stage Code Generation (aka CodeGen)

Design Improvements

Tungsten includes specialized in-memory data structures
tuned for the type of operations required by Spark, improved code generation,
and a specialized wire protocol. Tungsten’s representation is substantially smaller than
objects serialized using Java or even Kryo serializers. As Tungsten does not depend on Java objects, both on-heap
and off-heap allocations are supported. Not only is the format more compact, serialization times can
be substantially faster than with native serialization. Since Tungsten no longer depends on working with Java
objects, you can use either on-heap (in the JVM) or off-heap storage. If you
use off-heap storage, it is important to leave enough room in your containers
for the off-heap allocations, which you can get an approximate idea of from the web UI. Tungsten's data structures are also designed with the kind of processing for which they are used closely in mind. The classic example of this is sorting, a common and
expensive operation. The on-wire representation is implemented so that sorting
can be done without having to deserialize the data again. By
avoiding the memory and GC overhead of regular Java objects, Tungsten is able
to process larger data sets than the same hand-written aggregations.

Benefits

The following Spark jobs will benefit from Tungsten:

DataFrames: Java, Scala, Python, R
SparkSQL queries
Some RDD API programs, via general serialization and compression optimizations

Next Steps

In the future, Tungsten may make it more feasible to use certain non-JVM libraries. For many simple operations, the cost of using BLAS or similar linear algebra packages from the JVM is dominated by the cost of copying the data off-heap.
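As a concrete illustration of the Background note above, the setting can be toggled from a Spark 1.x shell and its effect observed in the physical plan. This is a minimal sketch only; the range size and column names are made up for illustration:

// spark-shell (Spark 1.5/1.6): sqlContext is provided by the shell
sqlContext.setConf("spark.sql.tungsten.enabled", "true")   // default since 1.5; set to "false" to compare

val df = sqlContext.range(0, 10000000)
val counts = df.groupBy((df("id") % 10).alias("bucket")).count()
counts.explain()   // with the flag enabled, the physical plan shows Tungsten-based operators
counts.show()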
References

Project Tungsten: Bringing Apache Spark Closer to Bare Metal
High Performance Spark by Holden Karau and Rachel Warren
Slides: Deep Dive Into Project Tungsten - Josh Rosen
Video: Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal - Josh Rosen (Databricks)
12-17-2016
01:33 AM
11 Kudos
Introduction

Many organizations have come to rely on Hadoop for dealing with the ever-increasing quantities of data that they gather. Today it is clear what problems Hadoop can solve; however, the cloud is still not the first choice for Hadoop deployment. Pros and cons for Hadoop in the cloud have been shared across multiple blogs and books, but the question always comes up in discussions with enterprises considering Hadoop in the cloud. Thus, I thought it would be useful to collate a few pros and cons, as well as mention a pragmatic approach, a hybrid cloud, for organizations that have made significant investments on-prem. For organizations coming to Hadoop for the first time, the cloud is probably a better bet, especially if they don't have a lot of IT expertise and a great stream of revenue exists and needs to be exploited immediately.

Pro Cloud

Lack of space. You don't have space to keep racks of physical servers, along with the necessary power and cooling.

Flexibility. It is much easier to reorganize instances, or expand or contract your footprint, for changing business needs. Everything is controlled through cloud provider APIs and web consoles. Changes can be scripted and put into effect manually or even automatically and dynamically based on current conditions.

New usage patterns. Cloud providers abstract computing resources such that they are not tied to physical configurations, which means they can be managed in ways that are otherwise impractical. For example, individuals could have their own instances, clusters, and even networks to work with, without much managerial overhead. The overall budget for CPU cores in your cloud provider account can be concentrated in a set of large instances, a larger set of smaller instances, or some mixture, and can even change over time. When an instance malfunctions, instead of troubleshooting what went wrong, you can just tear it down and replace it.

Worldwide availability. The largest cloud providers have data centers around the world. You can use resources close to where you work, or close to where your customers are, for the best performance. You can set up redundant clusters, or even entire computing environments, in multiple data centers, so that if local problems occur in one data center, you can shift to working elsewhere.

Data retention restrictions. If you have data that is required by law to be stored within specific geographic areas, you can keep it in clusters that are hosted in data centers in those areas.

Cloud provider features. Each major cloud provider offers an ecosystem of features to support the core functions of computing, networking, and storage. To use those features most effectively, your clusters should run in the cloud provider as well.

Capacity. Very few customers tax the infrastructure of a major cloud provider. You can establish large systems in the cloud that are not nearly as easy to put together, not to mention maintain, on-prem.

Pro On-Prem

Simplicity. Cloud providers start you off with reasonable defaults, but then it is up to you to figure out how all of their features work and when they are appropriate. It takes a lot of experience to become proficient at picking the right types of instances and arranging networks properly.

High levels of control. Beyond the general geographic locations of cloud provider data centers and the hardware specifications that providers reveal for their resources, it is not possible to have exacting, precise control over your cloud architecture. You cannot tell exactly where the physical devices sit, or what the devices near them are doing, or how data across them shares the same physical network. When the cloud provider has internal problems such as network outages, there's not much you can do but wait.

Unique hardware needs. You cannot have cloud providers attach specialized peripherals or dongles to their hardware for you. If your application requires resources that exceed what a cloud provider offers, you will need to host that part on-prem, away from your Hadoop clusters.

Saving money. For one thing, you are still paying for the resources you use. The hope is that the economy of scale that a cloud provider can achieve makes it more economical for you to pay to "rent" their hardware than to run your own. You will also still need people who understand system administration and networking to take care of your cloud infrastructure. Inefficient architectures can cost a lot of money in storage and data transfer costs, or in instances that are running idle.

Best of Both

Instead of running your clusters and associated applications completely in the cloud or completely on-prem, the overall system is split between the two - a hybrid cloud. Data channels are established between the cloud and on-prem worlds to connect the components needed to perform work.

Examples

Suppose there is a large, existing on-prem data processing system, perhaps using Hadoop clusters, which works well. In order to expand its capacity for running new analyses, rather than adding more on-prem hardware, Hadoop clusters can be created in the cloud. Data needed for the analyses is copied up to the cloud clusters, where it is analyzed, and the results are sent back on-prem. The cloud clusters can be brought up and torn down in response to demand, which helps to keep costs lower.

Assume the management of vast amounts of incoming data that needs to be centralized and processed. To avoid having one single choke point where all of the raw data is sent, a set of cloud clusters can share the load, perhaps each in a geographic location convenient to where the data is generated. These clusters can perform pre-processing of the data, such as cleaning and summarization, to reduce the work that the final centralized system must perform.

References

Moving Hadoop to the Cloud by Bill Havanki, published by O'Reilly Media, Inc., 2017.
11-28-2016
05:48 PM
@apappu You are using "code" blocks for non-code regular text. For example, you describe each step textually inside a code block; the same issue applies to the final note, which is text, not code. Also, the article should include a structure like: Problem Description, Assumptions, Steps, Conclusions. Could you clean up the article accordingly, run a spell check, and resubmit? Our articles need to have publisher quality.
11-24-2016
04:00 AM
@ambud.sharma Voted up :). Before, it was counter-intuitive.
11-24-2016
02:10 AM
10 Kudos
Behavior

The cells returned to the client are normally filtered based on the table configuration; however, when using the RAW => true parameter, you can retrieve all of the versions kept by HBase, unless there was a major compaction or a flush-to-disk event in the meantime.

Demonstration

Create a table with a single column family:

create 't1', 'f1'

Configure it to retain a maximum version count of 3:

alter 't1',NAME=>'f1',VERSIONS=>3

Perform 4 puts:

put 't1','r1','f1:c1',1
put 't1','r1','f1:c1',2
put 't1','r1','f1:c1',3
put 't1','r1','f1:c1',4

Scan with RAW=>true. I used VERSIONS as 100 as a catch-all; it could have been anything greater than 3 (the number of versions set previously). Unless specified, only the latest version is returned by the scan command.

scan 't1',{RAW=>true,VERSIONS=>100}

The above scan returns all four versions.

ROW COLUMN+CELL
r1 column=f1:c1,timestamp=1479950685181, value=4
r1 column=f1:c1,timestamp=1479950685155, value=3
r1 column=f1:c1,timestamp=1479950685132, value=2
r1 column=f1:c1,timestamp=1479950627736, value=1

Flush to disk:

flush 't1'

Then scan:

scan 't1',{RAW=>true,VERSIONS=>100}

Three versions are returned.

ROW COLUMN+CELL
r1 column=f1:c1,timestamp=1479952079260, value=4
r1 column=f1:c1,timestamp=1479952079234, value=3
r1 column=f1:c1,timestamp=1479952079209, value=2
Do four more puts:

put 't1','r1','f1:c1',5
put 't1','r1','f1:c1',6
put 't1','r1','f1:c1',7
put 't1','r1','f1:c1',8
Flush to disk:

flush 't1'

Scan:

scan 't1',{RAW=>true,VERSIONS=>100}

Six versions are returned:

ROW COLUMN+CELL
r1 column=f1:c1,timestamp=1479952349970, value=8
r1 column=f1:c1,timestamp=1479952349925, value=7
r1 column=f1:c1,timestamp=1479952349895, value=6
r1 column=f1:c1,timestamp=1479952079260, value=4
r1 column=f1:c1,timestamp=1479952079234, value=3
r1 column=f1:c1,timestamp=1479952079209, value=2
Force major compaction:

major_compact 't1'

Scan:

scan 't1',{RAW=>true,VERSIONS=>100}

Three versions are returned:

ROW COLUMN+CELL
r1 column=f1:c1,timestamp=1479952349970, value=8
r1 column=f1:c1,timestamp=1479952349925, value=7
r1 column=f1:c1,timestamp=1479952349895, value=6
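The same raw scan can also be issued programmatically. As a hedged illustration (not part of the original demonstration), here is a minimal sketch using the HBase Java client API from Scala, with class and method names per HBase 1.x as shipped with HDP 2.5; the table and column names match the shell example above:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()                   // picks up hbase-site.xml from the classpath
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("t1"))

val scan = new Scan()
scan.setRaw(true)          // equivalent of RAW => true in the shell
scan.setMaxVersions(100)   // catch-all, equivalent of VERSIONS => 100

val scanner = table.getScanner(scan)
scanner.asScala.foreach { result =>
  result.rawCells().foreach { cell =>
    println(s"ts=${cell.getTimestamp} value=${Bytes.toString(CellUtil.cloneValue(cell))}")
  }
}

scanner.close()
table.close()
connection.close()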
Conclusion

When deciding the number of versions to retain, it is best to treat that number as the minimum version count available at a given time and not as a constant. Until a flush to disk and a major compaction occur, the number of versions available can be higher than the number configured for the table.