Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8
My Accepted Solutions
Views | Posted
---|---
3790 | 07-05-2017 04:37 PM
1427 | 07-07-2016 03:40 AM
2035 | 04-28-2016 12:54 PM
2623 | 04-13-2016 02:32 AM
1660 | 04-11-2016 08:41 PM
02-26-2016
03:09 AM
5 Kudos
Would it ever make sense to put any of the NiFi cluster repositories (FlowFile, Content, Provenance) on a NAS like Isilon? I know disk can be the bottleneck, but you also want these repositories on drives with strong RAID, hence my question.
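For reference, the repository locations are set in conf/nifi.properties. A minimal sketch of what "on a NAS" would mean, assuming a hypothetical Isilon NFS mount at /mnt/isilon:
# hypothetical mount point -- substitute your own NAS path
nifi.flowfile.repository.directory=/mnt/isilon/nifi/flowfile_repository
nifi.content.repository.directory.default=/mnt/isilon/nifi/content_repository
nifi.provenance.repository.directory.default=/mnt/isilon/nifi/provenance_repository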
Labels:
- Apache NiFi
01-06-2016
06:21 PM
8 Kudos
Hey Wes - A few things to consider when sizing. Data volume is obviously the first, but the characteristics of the data matter even more for ingest performance and index sizing. For instance, free-form text, the number of attributes, the number of rows, etc. all weigh in on the indexing process and the index size. Other SOLR features such as faceting can also increase the index size. So definitely look at the shape of the data to get an idea of the index size, as well as the SOLR features you may be using that affect it (e.g. faceting). If you have a sample data set, try indexing it to see what the index size is and extrapolate from there. Also, however big your index is, make sure you have three times that on disk for commits and snapshots.
The other item to look at (the second part of your question) is the amount of concurrency / query requests. SOLR is built to return data very quickly, but lots of concurrent requests on an under-replicated index can certainly create latency, and they have more impact on the heap than indexing does. Also, bad queries are probably more often at fault for latency than SOLR itself. Indexed fields will always be returned quickly, especially if you're doing a filter query (fq=) as opposed to a general query (q=), but both are pretty fast. If you can figure out the number of requests in a 10-second window, that will help you decide how many replicas you need to respond to queries without latency.
As far as caching, OS caching (fitting the index in memory) will do more for you than tuning the Java heap. In your case, since the index will probably be rather large, you'll want to use SOLR Cloud and utilize shards and replicas to spread the index out across machines and keep it in memory.
As far as HDFS vs. local disk, there's a good post here on why to use one over the other. Also, HDFS and SOLR Cloud each do their own data replication, independently of one another, so if you're using SOLR Cloud you definitely want to make sure the indexes in HDFS have a replication factor of 1. HTH
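To illustrate the q= vs. fq= distinction, a minimal sketch using curl against a hypothetical SolrCloud collection named "logs" on a hypothetical host (the index directory path is also an assumption):
# general relevance query -- scored against the query terms
curl "http://solr-host:8983/solr/logs/select?q=message:error&rows=10"
# filter query -- fq results are cached in the filter cache and don't affect scoring
curl "http://solr-host:8983/solr/logs/select?q=*:*&fq=level:ERROR&rows=10"
# rough on-disk index size per replica (remember the ~3x headroom for commits/snapshots)
du -sh /var/solr/data/logs_shard1_replica1/data/index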
12-30-2015
07:46 PM
@Andrea D'Orio You can point an F5 to all or any of the SOLR nodes. SOLR Cloud is smart enough to distribute queries to the right shards and replicas, so round robin should be fine. Also, if you're using HDFS to store the indexes, then SOLR needs to sit on the data nodes or on nodes with the HDFS client. https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Install.html
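A quick way to see this for yourself (a sketch, assuming a hypothetical collection named collection1 and two arbitrary node hostnames): the same query answered by either node returns the same result set, because each node fans the request out to the right shards and replicas.
curl "http://solr-node1:8983/solr/collection1/select?q=*:*&rows=0"
curl "http://solr-node2:8983/solr/collection1/select?q=*:*&rows=0"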
12-29-2015
09:42 PM
8 Kudos
Kylin (pronounced "KEY LIN" / "CHI LIN") - This project brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MR and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.
A good video from the committers overviewing the project: https://www.youtube.com/watch?v=7iDcF7pNhV4
Definitions
- Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions)
- Cuboid - A "slice" or subset of a cube
- Dimensions - Think of these as alphanumeric columns that sit in the GROUP BY clause of SQL, e.g. Location, Department, Time, etc.
- Measure - Think of these as metric/numerical values that sit in the SELECT clause of SQL, e.g. Sum(value), Max(bonus), Min(effort)
Technical Overview
Kylin needs HBase, Hive and HDFS (Nice!). Regarding HDFS, Kylin does a lot of processing in MR, creating aggregate data for each N-cuboid of a cube; these jobs output HFiles for HBase. HBase then stores the cube metadata and cube aggregates, which makes sense for quick fetching of aggregate data. For the cube aggregates in HBase, dimensions become the row keys and the columns hold the measure values. Hive is used for the data modeling: the data needs to be in a star-schema-like format in Hive. Also, base-level data resides in Hive, not in the cube. The cube contains only aggregate data.
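To make that concrete, a minimal sketch of querying a cube through Kylin's REST query API (the project, table and column names are hypothetical; ADMIN/KYLIN is Kylin's out-of-the-box credential, so change it in any real setup):
# a group-by with a simple aggregate -- Kylin answers it from the pre-built aggregates in HBase
curl -X POST -H "Content-Type: application/json" -u ADMIN:KYLIN \
  -d '{"sql":"select location, sum(sales) from sales_fact group by location","project":"demo","limit":100}' \
  http://kylin-host:7070/kylin/api/query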
The Good - Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.
- ANSI SQL compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs only HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite as the query optimizer
Cautions
- MR overhead with building cubes ("query yesterday's data"). Lots of shuffling; does aggregations on the reduce side
- No cell-level security; security is at the cube and project level
- Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seems like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view
Security - There is security on projects and cubes, but no cell-level security. One idea is to create smaller cubes (i.e. segments) to provide security for users/groups. LDAP is also an option.
What's in HBASE? Metadata and cube data. If you list the tables in HBase, you’ll see this:
KYLIN_XXXXXXXXXXX (This is the Cube)
kylin_metadata
kylin_metadata_acl
kylin_metadata_user
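A quick way to see those tables, as a sketch from the HBase shell on the Kylin host:
# list all tables; the KYLIN_XXXXXXXXXXX tables hold the cube data
echo "list" | hbase shell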
Other Thoughts...
- Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with Live data as opposed to Import.
- Kylin only puts aggregates in HBase; base-level data is still in Hive (i.e. Kylin doesn't do table scans).
- eBay (26TB / 16B rows) -> 90% of queries with <5 sec latency
- MDX adoption is very low, therefore it's not currently supported
- You can build up a cube of cubes (daily -> weekly -> monthly, etc.). These are called segments. The more segments, the slower performance can get (more scans).
Roadmap
- Streaming cubes
- Spark: 1) thinking about using Spark to speed up the cubing MR jobs, 2) sourcing from SparkSQL instead of Hive, 3) routing queries to SparkSQL
12-28-2015
08:41 PM
1 Kudo
This worked for me. A few other simple things I needed to do (a rough sketch of the commands is below):
- create the /kylin folder in HDFS
- add 7070 to the port forwarding of the sandbox VM
- make sure HBase is started
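Roughly what I ran; the service user, ownership, and VM name are assumptions, so adjust for your own sandbox:
# create the Kylin working directory in HDFS and hand it to the kylin user
sudo -u hdfs hdfs dfs -mkdir -p /kylin
sudo -u hdfs hdfs dfs -chown kylin:hadoop /kylin
# VirtualBox NAT port forward for the Kylin UI (VM name is hypothetical; VM must be powered off)
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "kylin,tcp,,7070,,7070"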
12-23-2015
06:50 PM
2 Kudos
The best you can do is export from a single component (i.e. table), take a screenshot of the dashboard, or export the dashboard to load into another Banana instance. The reason you can't have an offline dashboard is that you would need the entire index. Dashboards typically contain summarized data and/or a subset of detailed records, and for the dashboard to remain interactive (search, filter, faceting, etc.) you would need the entire data set offline, because it does all of the counts/aggregations on the fly.
12-10-2015
09:54 PM
I landed on the same issue. Adding the property name here as text to make it searchable, since it only appears in an image above: oozie.authentication.kerberos.name.rules
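In case it helps anyone, a minimal sketch of a value for that property (the realm and rule are hypothetical; use your cluster's own auth_to_local-style rules):
RULE:[1:$1@$0](.*@EXAMPLE.COM)s/@.*//
DEFAULT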
11-30-2015
07:32 PM
4 Kudos
Wondering if anyone has any tools they use for monitoring Kafka besides Ambari Metrics and Yammer Metrics. For example: https://github.com/damienclaveau/kafka-graphite Do we advise customers to create their own monitoring through Yammer Metrics, or are there other tools that we utilize or recommend? Many thanks, Chris
Labels:
- Apache Kafka
11-12-2015
03:17 AM
2 Kudos
Even if the path is right, check the permissions. Typically the hdfs user should be the owner and the group should be hadoop for the data directory:
chown -R hdfs /path-to-your-datanode-data-dir
chgrp -R hadoop /path-to-your-datanode-data-dir
chmod 750 /path-to-your-datanode-data-dir
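A quick way to confirm the result afterward:
# expect owner hdfs, group hadoop, and mode drwxr-x---
ls -ld /path-to-your-datanode-data-dir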
11-12-2015
03:17 AM
2 Kudos
The following error is generated when adding a new data node to the cluster:
WARN datanode.DataNode (DataNode.java:checkStorageLocations(2407)) - Invalid dfs.datanode.data.dir
Labels:
- Apache Hadoop