Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8
My Accepted Solutions
Views | Posted
---|---
3790 | 07-05-2017 04:37 PM
1427 | 07-07-2016 03:40 AM
2035 | 04-28-2016 12:54 PM
2623 | 04-13-2016 02:32 AM
1660 | 04-11-2016 08:41 PM
02-26-2016
03:09 AM
5 Kudos
Would it ever make sense to put any of the NiFi cluster repositories (FlowFile, Content, Provenance) on a NAS like Isilon? I know disk can be the bottleneck, but you also want these repositories on drives with strong RAID, hence my question.
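For reference, the repository locations are set in conf/nifi.properties. A minimal sketch of what "on a NAS" would mean, assuming a hypothetical Isilon NFS mount at /mnt/isilon:
# hypothetical mount point -- substitute your own NAS path
nifi.flowfile.repository.directory=/mnt/isilon/nifi/flowfile_repository
nifi.content.repository.directory.default=/mnt/isilon/nifi/content_repository
nifi.provenance.repository.directory.default=/mnt/isilon/nifi/provenance_repository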
Labels:
- Apache NiFi
01-06-2016
06:21 PM
8 Kudos
Hey Wes - A few things to consider when sizing. Data volume is obviously the first, but the characteristics of the data matter even more for ingest performance and index sizing. For instance, free-form text, the number of attributes, the number of rows, etc. all weigh in on the indexing process and the index size. Other SOLR features such as faceting can also increase the index size. So definitely look at the shape of the data to get an idea of the index size, as well as the SOLR features you may be using that affect it (e.g. faceting). If you have a sample data set, try indexing it to see what the index size is and extrapolate from there. Also, however big your index is, make sure you have three times that on disk for commits and snapshots.
The other item to look at (the second part of your question) is the amount of concurrency / query requests. SOLR is built to return data very quickly, but lots of concurrent requests on an under-replicated index can certainly create latency, and they have more impact on the heap than indexing does. Also, bad queries are probably more often at fault for latency than SOLR itself. Indexed fields will always be returned quickly, especially if you're doing a filter query (fq=) as opposed to a general query (q=), but both are pretty fast. If you can figure out the number of requests in a 10-second window, that will help you decide how many replicas you need to respond to queries without latency.
As far as caching, OS caching (fitting the index in memory) will do more for you than tuning the Java heap. In your case, since the index will probably be rather large, you'll want to use SOLR Cloud and utilize shards and replicas to spread the index out across machines and keep it in memory.
As far as HDFS vs. local disk, there's a good post here on why to use one over the other. Also, HDFS and SOLR Cloud each do their own data replication, independently of one another, so if you're using SOLR Cloud you definitely want to make sure the indexes in HDFS have a replication factor of 1. HTH
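To illustrate the q= vs. fq= distinction, a minimal sketch using curl against a hypothetical SolrCloud collection named "logs" on a hypothetical host (the index directory path is also an assumption):
# general relevance query -- scored against the query terms
curl "http://solr-host:8983/solr/logs/select?q=message:error&rows=10"
# filter query -- fq results are cached in the filter cache and don't affect scoring
curl "http://solr-host:8983/solr/logs/select?q=*:*&fq=level:ERROR&rows=10"
# rough on-disk index size per replica (remember the ~3x headroom for commits/snapshots)
du -sh /var/solr/data/logs_shard1_replica1/data/index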
12-30-2015
07:46 PM
@Andrea D'Orio You can point an F5 to all or any of the SOLR nodes. SOLR Cloud is smart enough to distribute queries to the right shards and replicas, so round robin should be fine. Also, if you're using HDFS to store the indexes, then SOLR needs to sit on the data nodes or on nodes with the HDFS client. https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Install.html
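A quick way to see this for yourself (a sketch, assuming a hypothetical collection named collection1 and two arbitrary node hostnames): the same query answered by either node returns the same result set, because each node fans the request out to the right shards and replicas.
curl "http://solr-node1:8983/solr/collection1/select?q=*:*&rows=0"
curl "http://solr-node2:8983/solr/collection1/select?q=*:*&rows=0"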
12-29-2015
09:42 PM
8 Kudos
Kylin (pronounced "KEY LIN" / "CHI LIN") - This project brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MR and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.
A good video from the committers overviewing the project: https://www.youtube.com/watch?v=7iDcF7pNhV4
Definitions
- Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions)
- Cuboid - A "slice" or subset of a cube
- Dimensions - Think of these as alphanumeric columns that sit in the GROUP BY clause of SQL, e.g. Location, Department, Time, etc.
- Measure - Think of these as metric/numerical values that sit in the SELECT clause of SQL, e.g. Sum(value), Max(bonus), Min(effort)
Technical Overview
Kylin needs HBase, Hive and HDFS (Nice!). Regarding HDFS, Kylin does a lot of processing in MR, creating aggregate data for each N-cuboid of a cube; these jobs output HFiles for HBase. HBase then stores the cube metadata and cube aggregates, which makes sense for quick fetching of aggregate data. For the cube aggregates in HBase, dimensions become the row keys and the columns hold the measure values. Hive is used for the data modeling: the data needs to be in a star-schema-like format in Hive. Also, base-level data resides in Hive, not in the cube. The cube contains only aggregate data.
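To make that concrete, a minimal sketch of querying a cube through Kylin's REST query API (the project, table and column names are hypothetical; ADMIN/KYLIN is Kylin's out-of-the-box credential, so change it in any real setup):
# a group-by with a simple aggregate -- Kylin answers it from the pre-built aggregates in HBase
curl -X POST -H "Content-Type: application/json" -u ADMIN:KYLIN \
  -d '{"sql":"select location, sum(sales) from sales_fact group by location","project":"demo","limit":100}' \
  http://kylin-host:7070/kylin/api/query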
The Good - Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.
- ANSI SQL compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs only HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite as the query optimizer
Cautions
- MR overhead with building cubes ("query yesterday's data"). Lots of shuffling; does aggregations on the reduce side
- No cell-level security; security is at the cube and project level
- Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seems like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view
Security - There is security on projects and cubes, but no cell-level security. One idea is to create smaller cubes (i.e. segments) to provide security for users/groups. LDAP is also an option.
What's in HBASE? Metadata and cube data. If you list the tables in HBase, you’ll see this:
KYLIN_XXXXXXXXXXX (This is the Cube)
kylin_metadata
kylin_metadata_acl
kylin_metadata_user
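A quick way to see those tables, as a sketch from the HBase shell on the Kylin host:
# list all tables; the KYLIN_XXXXXXXXXXX tables hold the cube data
echo "list" | hbase shell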
Other Thoughts...
- Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with Live data as opposed to Import.
- Kylin only puts aggregates in HBase; base-level data is still in Hive (i.e. Kylin doesn't do table scans).
- eBay (26TB / 16B rows) -> 90% of queries with <5 sec latency
- MDX adoption is very low, therefore it's not currently supported
- You can build up a cube of cubes (daily -> weekly -> monthly, etc.). These are called segments. The more segments, the slower performance can get (more scans).
Roadmap
- Streaming cubes
- Spark: 1) thinking about using Spark to speed up the cubing MR jobs, 2) sourcing from SparkSQL instead of Hive, 3) routing queries to SparkSQL
12-28-2015
08:41 PM
1 Kudo
This worked for me. A few other simple things I needed to do (a rough sketch of the commands is below):
- create the /kylin folder in HDFS
- add 7070 to the port forwarding of the sandbox VM
- make sure HBase is started
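Roughly what I ran; the service user, ownership, and VM name are assumptions, so adjust for your own sandbox:
# create the Kylin working directory in HDFS and hand it to the kylin user
sudo -u hdfs hdfs dfs -mkdir -p /kylin
sudo -u hdfs hdfs dfs -chown kylin:hadoop /kylin
# VirtualBox NAT port forward for the Kylin UI (VM name is hypothetical; VM must be powered off)
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "kylin,tcp,,7070,,7070"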
12-23-2015
06:50 PM
2 Kudos
The best you can do is export from a single component (i.e. table), take a screenshot of the dashboard, or export the dashboard to load into another Banana instance. The reason you can't have an offline dashboard is that you would need the entire index. Dashboards typically contain summarized data and/or a subset of detailed records, and for the dashboard to remain interactive (search, filter, faceting, etc.) you would need the entire data set offline, because it does all of the counts/aggregations on the fly.
12-10-2015
09:54 PM
I landed on the same issue. Adding the property name here as text to make it searchable, since it only appears in an image above: oozie.authentication.kerberos.name.rules
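In case it helps anyone, a minimal sketch of a value for that property (the realm and rule are hypothetical; use your cluster's own auth_to_local-style rules):
RULE:[1:$1@$0](.*@EXAMPLE.COM)s/@.*//
DEFAULT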
11-30-2015
07:32 PM
4 Kudos
Wondering if anyone has any tools they use for monitoring Kafka besides Ambari Metrics and Yammer Metrics. For example: https://github.com/damienclaveau/kafka-graphite Do we advise customers to create their own monitoring through Yammer Metrics, or are there other tools that we utilize or recommend? Many thanks, Chris
Labels:
- Apache Kafka
11-12-2015
03:17 AM
2 Kudos
Even if the path is right, check the permissions. Typically the hdfs user should be the owner and the group should be hadoop for the data directory:
chown -R hdfs /path-to-your-datanode-data-dir
chgrp -R hadoop /path-to-your-datanode-data-dir
chmod 750 /path-to-your-datanode-data-dir
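A quick way to confirm the result afterward:
# expect owner hdfs, group hadoop, and mode drwxr-x---
ls -ld /path-to-your-datanode-data-dir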
11-12-2015
03:17 AM
2 Kudos
The following error is generated when adding a new data node to the cluster:
WARN datanode.DataNode (DataNode.java:checkStorageLocations(2407)) - Invalid dfs.datanode.data.dir
Labels:
- Apache Hadoop