Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1734 | 07-05-2017 04:37 PM |
| | 699 | 07-07-2016 03:40 AM |
| | 877 | 04-28-2016 12:54 PM |
| | 1363 | 04-13-2016 02:32 AM |
| | 787 | 04-11-2016 08:41 PM |
07-05-2017
04:37 PM
1 Kudo
I've seen something similar to this before when you have a circular flow failure on a processor, especially with an InvokeHTTP processor. The bad thought is... "Ah, the request failed, so let me re-route it back to the InvokeHTTP processor because maybe it will work the 2nd, 3rd... nth time." The failure flow files loop back onto the originating processor and you get this type of behavior because the flow files typically get penalized, have long expiration periods, and get re-queued again. This becomes a continuous failure loop. A better practice is to route the failure to a new processor instead and pay attention to the number of retries, expirations, etc. @Matt Clarke has a great post on this: https://community.hortonworks.com/questions/77336/nifi-best-practices-for-error-handling.html
03-30-2017
07:54 PM
Try using the EvaluateJsonPath processor if you want to pull out key attribute-value pairs. You can then map those AVPs by creating additional fields in the properties of the PutSolrContentStream processor. The NiFi flow from this marquee post shows the design pattern: https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
03-30-2017
03:33 AM
When you started Solr, did you specify localhost:2181/solr as the ZooKeeper location (i.e. bin/solr start -c -z localhost:2181/solr)? If not, make sure the ZooKeeper location you used in the Solr setup is the same one in the PutSolr properties. I've also had times where I had to recreate the index to get this error message to go away.
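For reference, a minimal sketch of checking that a running Solr node is pointed at the same ZooKeeper chroot (assuming a Solr 5.x layout with the bin/solr script):

# Start Solr in cloud mode against the /solr chroot in ZooKeeper
bin/solr start -c -z localhost:2181/solr
# Print the status of the running node, including the ZooKeeper string it is using
bin/solr status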
03-29-2017
05:01 PM
1 Kudo
Did you add the collection name to the properties in the PutSolrContentStream processor? What are the other properties you have in there?
03-29-2017
04:51 PM
1 Kudo
In the PutSolrContentStream processor, did you include the Collection name in the properties? What are the other properties that you have set in the processor?
03-29-2017
03:50 PM
6 Kudos
A proper recommendation comes down to your comfort level and the type of ETL you're trying to do. The biggest difference is that you have fewer GUIs (but some good ones!) to work with for ETL in the Hortonworks stack. If you're comfortable with some SQL, scripting and programming, our stack is great for doing ETL at scale. Here's a breakdown of the tools and where you can use them in our stack.
ETL Options in Hortonworks
Extraction / Load - Apache Sqoop, Apache NiFi, SyncSort
Transformations - Apache Hive, Apache Spark, Apache Pig, Apache NiFi

Other items to consider for ETL work
Orchestration - Ambari Workflow Manager (Oozie UI), Apache NiFi
Data Discovery - Apache Zeppelin, Apache SOLR

Additionally, ETL takes several forms in Hadoop; ELT is the more common pattern. In a traditional Informatica ETL pattern, you would extract from source systems, transform in PowerCenter and land in the target. In Hadoop, you'll typically extract from the source, land in Hadoop, transform, and land in the target (i.e. Hive). For this pattern, we would typically recommend Sqoop for E/L and Hive, Spark or Pig for T (see the sketch at the end of this answer). EtL (little t) is another pattern, used for streaming ingest pipelines: you extract or capture the source, do light transformation (i.e. preparation, conversions, enrichment, etc.) and then land it into Hadoop. These light transformations are typically not batch oriented, and for this pattern we would typically recommend Apache NiFi.

Things that are not in the platform that you have to account for:
- Master Data Repository
- Cleansing Rules
- Enrichment Modules (i.e. address cleansing)
- Change Data Capture
- Reusable Templates (except with NiFi)

In some cases you can use external services for the items above. Or, because the beauty of open source is that it's highly extensible, build or leverage integrations into other tools that may assist with cleansing, enrichment, etc. If you go back to the days before commercial ETL tools existed, you can build all of the items mentioned above as part of your overall data management environment.
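To make the ELT pattern above concrete, here is a minimal, hedged sketch of the Sqoop-then-Hive flow. The connection string, database, table and column names (sales, orders, staging.orders, curated.orders_clean, region, amount) are hypothetical placeholders, and the target Hive table is assumed to already exist:

# E + L: pull a source table into Hive with Sqoop (4 parallel mappers)
sqoop import \
  --connect jdbc:mysql://source-db:3306/sales \
  --username etl_user -P \
  --table orders \
  --hive-import --hive-table staging.orders \
  -m 4

# T: transform inside Hadoop with Hive and land the result in a target table
hive -e "INSERT OVERWRITE TABLE curated.orders_clean
         SELECT order_id, UPPER(region) AS region, amount
         FROM staging.orders
         WHERE amount > 0;"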
03-11-2017
10:20 PM
@Scott Shaw Thanks Scott. This helps for now, in that there are other factors we have to include when sizing/estimating for concurrency.
03-08-2017
03:45 AM
5 Kudos
Outside of YARN queues, more node managers and HS2, is there a rule of thumb for scaling AtScale with more concurrent users? Do the Hybrid Query Service and Cache Manager have any scaling limits?
Labels:
- Apache Hive
09-20-2016
04:27 PM
@Bryan Bende @Artem Ervits this is helpful; I think we could be onto something. For a coprocessor, would it make sense to emit a REST call to get the transaction to NiFi, as opposed to having NiFi doing constant Gets? I'm not too familiar with HBase, but coprocessors remind me of triggers, which can be useful but slippery. For the two-phase commit, I believe the NiFi processor that receives the "triggered" data would then have to ACK with HBase before transmitting further down the flow.
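As a rough illustration of the push-plus-ACK idea (a sketch, not a worked implementation): the HBase-side hook could POST the committed row to a NiFi ListenHTTP processor, and NiFi's HTTP 200 response would serve as the acknowledgment back to HBase. The host, port and payload below are hypothetical; contentListener is ListenHTTP's default base path.

# Simulate what the HBase-side hook would send to NiFi's ListenHTTP processor
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"rowkey":"trade-123","status":"committed"}' \
  http://nifi-host:9999/contentListener
# A 200 response from ListenHTTP is the ACK the flow would report back to HBase.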
09-20-2016
04:04 AM
4 Kudos
How would you perform a two-phase commit between HBase and NiFi? Think of a trading system in FinServ. Once a piece of data is transacted (i.e. committed) in HBase (assume Omid/Tephra here), how can a push mechanism get that data into NiFi, and how can NiFi then acknowledge that it received the data from HBase?
Labels:
- Apache HBase
- Apache NiFi
09-19-2016
11:06 PM
4 Kudos
This has come up a few times. You'll sometimes notice after a Banana deployment in SOLR that you can't save your dashboards in Banana. To enable this, you have to create an index that stores these dashboards; all you need to do is run the following statement, which will create a banana-int index:

sh ${SOLR_HOME}/bin/solr create_core -c banana-int -d ../server/solr-webapp/webapp/banana/resources/banana-int-solr-5.0/conf

Then restart SOLR:

sh ${SOLR_HOME}/bin/solr restart

Then you can:
1) Save your dashboard
2) Access your saved dashboard

Happy searching!
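As an optional sanity check (a hedged suggestion, not part of the original steps), you can confirm the core exists via the cores admin API before heading back to Banana:

# Should return a STATUS block for banana-int if create_core succeeded
curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=banana-int"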
- Find more articles tagged with:
- Banana
- dashboard
- Data Science & Advanced Analytics
- How-ToTutorial
- solr
09-13-2016
04:18 AM
3 Kudos
Hi Arun - This post describes how you can use preemption with a fair policy in the capacity scheduler. It should give you a similar behavior to the fair scheduler but using a YARN capacity queue. https://community.hortonworks.com/articles/44079/yarn-pre-emption-with-spark-using-a-fair-policy.html
08-04-2016
11:04 AM
7 Kudos
Joining Collections in SOLR (Part 1)
Sometimes you may want to inner join data from one Solr collection to another. There is a facility to perform this action using a join query in SOLR. The easiest way to perform the join is by linking a single attribute in one collection to an attribute in another collection. This join works very well for standalone indexes, but does not work for distributed indexes; we'll cover the distributed case in Part II of this article.
To demonstrate, let's say we have two collections: Sales, which contains the amount of sales by region, and People, which has people categorized by their region plus a flag indicating whether they are a manager. Let's say our goal is to find all of the sales by manager. To do this, we will join the collections using region as our join key and also filter the people data on whether they are a manager.
Here is the filter query (fq) in solr on how to make this happen:
fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
Let's use an actual example to show the functionality...
First let's create a sales collection and populate it:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=sales&instanceDir=/opt/hostname-hdpsearch/solr/server/solr/sales&configSet=basic_configs"
We'll populate it with data using the Solr Admin UI. Select the sales core, then choose Documents. Document Type should be CSV; paste the values below into the text box and then click Submit Document. It's a very simple way to index sample data.
id,region_s,sales_i
1,east,100000
2,west,200000
3,north,300000
4,south,400000
Now create our second collection, people:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
You can upload the following data into the people collection the same way.
id,name_s,region_s,salary_i,mgr_s
1,chris,east,100000,yes
2,jen,west,200000,yes
3,james,east,75000,no
4,ruby,north,50000,yes
5,charlotte,west,120000,yes
Finally let’s run our join query to produce the results we are looking for.
http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
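If you prefer the command line, the same query can be issued with curl (an equivalent sketch; --data-urlencode handles the {!join ...} local params for you):

curl -G "http://localhost:8983/solr/sales/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes" \
  --data-urlencode "wt=json"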
You should see the matching sales documents returned (only the regions that have a manager).
If you would like the same functionality with compound join keys (i.e. two or more join keys), the best thing to do is concatenate those keys on ingest to create a single join key.
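For example, a hedged sketch of what that concatenation could look like at ingest time (the year_s and region_year_s fields are made up for illustration):

# Index a row whose single-valued join key is the concatenation of region and year
curl "http://localhost:8983/solr/sales/update?commit=true" \
  -H "Content-Type: application/csv" \
  --data-binary $'id,region_s,year_s,region_year_s,sales_i\n10,east,2016,east_2016,100000'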
Additionally, this functionality does not work with distributed indexes, i.e. multiple shards. If you attempt this on a distributed index with multiple shards, you'll get the following error message:

"error": {
  "msg": "SolrCloud join: multiple shards not yet supported people",
  "code": 400
}
In conclusion: joins between SOLR collections are useful but should be used with caution. As you can see, this query only works with simple, non-distributed collections. Additionally, you can only display the fields from the sales collection and not the people collection, which is a total bummer. A more common practice is to pre-join the information before it's indexed. For joining collections with multiple shards, you could also attempt this with Spark. Stay tuned for how to do this in Part II of this post.
- Find more articles tagged with:
- Data Science & Advanced Analytics
- How-ToTutorial
- solr
07-07-2016
11:25 PM
7 Kudos
Overview
I recently encountered a question where someone asked how you can do preemption across YARN queues when a Spark job is beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, but wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set it up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.

Hardware: 5 nodes of AWS EC2 r3.xlarge

Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb = 20480
yarn.scheduler.minimum-allocation-mb = 2560

High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
   - Child queue "test1" with a min capacity of 50% and a max of 100%
   - Child queue "test2" with a min capacity of 50% and a max of 100%
   - Root queue with a fair ordering policy
3. Run Spark jobs
   - Run a Spark job on test1 with max-size containers for as many Spark executors as possible
   - Run a Spark job on test2 with max-size containers using dynamic resource allocation
1) Add YARN Preemption Properties

The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:

yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1

Option 1: Manual
Backup /etc/hadoop/conf/yarn-site.xml, then update it with the parameters above. Note: you must put these settings in XML format. Restart YARN.

Option 2: Ambari
You can turn preemption on in the Settings tab (Ambari -> YARN -> Configs); this sets yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added in the Advanced config tab in Ambari under "Custom yarn-site". Click "Add Property", then add the following:

yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1

Restart YARN.
2) Create Two YARN Queues with Fair Ordering

The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the Ambari View – YARN Queue Manager.

yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo

Option 1: Manual
Backup the original file /etc/hadoop/conf/capacity-scheduler.xml and update it with the settings above. Then refresh the queues from the /etc/hadoop/conf directory:

yarn rmadmin -refreshQueues

Option 2: Ambari View
Using the YARN Queue Manager view in Ambari, you can apply the same settings to capacity-scheduler.xml through the GUI. Set up the YARN queues so that test1 and test2 have the same configuration and roll up to the root queue. The root queue should have an Ordering Policy of Fair with "Enable Size Based Weight Ordering" checked. Restart YARN.

3) Run Spark Jobs

sudo su - hdfs
cd /usr/hdp/current/spark-client

Run the following Spark job and make sure it runs over-capacity on the test1 queue. Notice how we specify 5 executors and large containers.

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000

Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it's running over-capacity in the test1 queue. Then run a second Spark job on the test2 queue. Notice how this job does not specify the number of executors; that's because we are using Dynamic Resource Allocation in Spark, available here with Spark 1.6 (see the hedged note at the end of this section).

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000

Initially you should see the first job holding more than its share of the cluster in the Resource Manager, and then... voilà: in a few seconds YARN will preempt, and the second Spark job will take some containers from the first job so that you have a fair balance of resources across the root queue.
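One hedged note that is not from the original article: for the second submit to scale its executor count automatically, dynamic allocation and the YARN shuffle service have to be enabled, either cluster-wide in spark-defaults.conf or per job. A minimal per-job sketch:

# Same job as above, with dynamic allocation requested explicitly on the command line
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster --queue test2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --executor-memory 18G --executor-cores 2 \
  lib/spark-examples*.jar 1000000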
- Find more articles tagged with:
- capacity-scheduler
- Hadoop Core
- How-ToTutorial
- preemption
- Spark
- YARN
07-07-2016
03:40 AM
1 Kudo
There's a JIRA and PR already out there for this. You might want to track it here. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-987?jql=project%20%3D%20ZEPPELIN and Pull Request: https://github.com/apache/zeppelin/pull/993
05-12-2016
12:06 PM
@drussell Thanks. I saw this list too, but I'm not sure it's least access. I can see places where you can create conditions in the policy so that you can only work in a specific VPC, such as:

"Condition": {
  "StringEquals": {
    "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
  }
}

Or you can even get really specific on resources. So instead of using "Resource": "*" in the policy, you can get it down to the instances in a certain availability zone, for example:

"Resource": "arn:aws:ec2:us-east-1::instance/*"

The list of roles is good, but the resources and conditions on these roles are just as important, especially as we discuss with the InfoSec team and justify why.
05-12-2016
04:15 AM
2 Kudos
Many financial and highly regulated institutions use least-privilege access models when rolling out software features to end users. With Cloudbreak, what is the least-privilege access model that you would have to supply in AWS to make this work? Additionally, how can you ensure that the IAM accounts can manage their own VPC deployment while still ensuring that they can't change configuration for other VPCs using the same account?
Labels:
- Hortonworks Cloudbreak
04-28-2016
12:54 PM
2 Kudos
Alex - This makes sense to me. If you're tailing files or listening to syslog, NiFi has great, easy processors that can forward this information to a downstream search tool (SOLR, ES, etc.) or even persist the information in long-term storage (HDFS). You can encrypt and compress the data as you capture it, send it over a secure wire, and do the simple event processing you need in order to route the information to the appropriate endpoint. There are also processors such as ScanContent or RouteContent that allow you to route messages based on patterns (think regex and whitelists) found in the message payload (i.e. route errors here, info there) or create priorities for those messages.

The other place where NiFi helps tremendously is around data conversion, for example converting Avro to JSON, CSV to Avro, or attributes to JSON. These help you get messages into the proper file streams to be indexed by your search tool.

The one place I would look at closely is the amount of log parsing you need to do. For unique formats, you may need to create a custom processor in NiFi to assist you with extracting log attributes. There are processors such as EvaluateXPath or EvaluateXQuery that allow you to use XPath to find attribute-value pair information in XML and JSON, which is extremely helpful and may be all you need. Otherwise, it's really easy to get started and play around with your use case to see if there's a fit.
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.

1) Install NiFi on your Mac: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html

2) Set up your machine to forward syslog messages to port 1514.

Backup your current syslog configuration:
mv /etc/syslog.conf /etc/syslog.conf.bkp

Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf

Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514

Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist

Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd

3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0

Connect the ListenSyslog processor to an output port and set the relationship to "success". Start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server. If so, try typing this on the command line and then verify the data is flowing in NiFi:
syslog -s test message
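If syslog traffic is sparse, another hedged way to poke the flow is to send a raw UDP message straight at the ListenSyslog port with netcat; the priority value <13> and the message text are arbitrary, and nc flags can vary between netcat flavors:

# Hand-craft a syslog-style line and push it over UDP to port 1514
echo "<13>$(date '+%b %d %H:%M:%S') $(hostname) test: hello nifi" | nc -u -w1 127.0.0.1 1514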
- Find more articles tagged with:
- Data Ingestion & Streaming
- How-ToTutorial
- listensyslog
- mac
- NiFi
- syslog
04-13-2016
02:32 AM
4 Kudos
Hi Babu - The more common approach is to write out a new file. HDFS is essentially an append-only system, so creating a new file that's a derivative of the original is a very common practice. You can write an MR program to output a file or use a Hive query to write query results to a new file. For example: INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable. This creates new file(s) with a modification that acts like an update; in this case, we're upper-casing the myColumn column in the myTable table.
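For reference, the same statement can be run non-interactively from the shell; this is a minimal sketch reusing the path, table and column names from the example above:

# Write the transformed result set out as new files under /user/me/output
hive -e "INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable;"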
04-11-2016
08:41 PM
2 Kudos
You should use the Kylin UI for operations and the cube build; this includes selecting tables, defining dimensions and deploying the cube build. For querying, you should use the Kylin ODBC driver to access the cubes. You can grab that driver here: http://kylin.apache.org/docs15/tutorial/odbc.html. You can use that driver with your BI tools (Tableau, Power BI, etc.). MDX is not widely used, so I don't believe it is supported with Kylin.
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the db using root and do a:

cbd delete

Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:

export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass

This will override the default settings. Then:

cbd init
cbd start

On startup, your default settings should be displayed as changed.
- Find more articles tagged with:
- Cloud & Operations
- Cloudbreak
- configuration
- FAQ
02-26-2016
03:09 AM
5 Kudos
Would it ever make sense to put any of the NiFi Cluster repositories (Flow File, Content, Provenance) in a NAS like Isilon? I know disk can be the bottleneck but you also want these repositories on drives with a strong RAID, thus my question.
Labels:
- Apache NiFi
01-06-2016
06:21 PM
8 Kudos
Hey Wes - A few things to consider when sizing. Data is obviously #1, but the characteristics of the data are even more important for thinking about ingest performance and index sizing: for instance, whether there is a lot of free-form text, the number of attributes, the number of rows, etc. All of these weigh in on the indexing process and index size. There are also other items in SOLR, such as facets, that can increase the index size. So definitely look at the shape of the data to get an idea of the index size, as well as the features of SOLR you may be using that can affect index size (i.e. faceting). If you have a sample data set, you can try indexing it to see what the index size is and extrapolate from there. Also, however big your index is, make sure you have 3 times that on disk for commits and snapshots.

The other item to look at (which is also the 2nd part of your question) is the amount of concurrency / query requests. SOLR is built to return data very quickly, but lots of concurrency/requests on an under-replicated index can certainly create latency and has more impact on the heap than indexing. Also, bad queries are probably more at fault for latency than SOLR itself. Indexed fields will always be returned quickly, especially if you're doing a filter query (fq=) as opposed to a general query (q=), but both are pretty fast. If you can figure out the number of requests in a 10 second window, this may help you work out the number of replicas you need to respond to queries without latency.

As far as caching, OS caching (fitting the index in memory) will do more for you than working with the Java heap. In your case, since the index will probably be rather large, you'll want to use SOLR Cloud and utilize shards and replicas to spread the index out across machines to try to keep the index in memory.

As far as HDFS vs local disk, there's a good post here on why to use one over the other. Also, HDFS and SOLR Cloud both have data replication and they are mutually exclusive, so if you're using SOLR Cloud, you definitely want to make sure the indexes in HDFS have a replication factor of 1. HTH
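To illustrate the fq vs. q point, here is a hedged example of the two query styles against a hypothetical collection (logs) and field (level_s). Filter queries are cached in Solr's filter cache and skip relevance scoring, which is why they tend to stay cheap on repeat requests:

# General query: the term participates in relevance scoring
curl "http://localhost:8983/solr/logs/select?q=level_s:ERROR&rows=10"

# Filter query: match everything, then filter; the fq result set is cached for reuse
curl "http://localhost:8983/solr/logs/select?q=*:*&fq=level_s:ERROR&rows=10"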
12-30-2015
07:46 PM
@Andrea D'Orio You can point an F5 to all or any of the SOLR nodes. SOLR Cloud is smart enough to distribute queries to the right shards and replicas, so round robin should be fine. Also, if you're using HDFS to store the indexes, then SOLR needs to sit on the data nodes or on nodes with the HDFS client. https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Install.html
12-29-2015
09:42 PM
8 Kudos
Kylin (pronounced "KEY LIN" / "CHI LIN") brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MR and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.
A good video from the committers giving an overview of the project: https://www.youtube.com/watch?v=7iDcF7pNhV4
Definitions
- Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions)
- Cuboid - A "slice" or subset of a cube
- Dimensions - Think of these as alphanumeric columns that sit in the GROUP BY clause of SQL, i.e. Location, Department, Time, etc.
- Measure - Think of these as metric/numerical values that sit in the SELECT clause of SQL, i.e. Sum(value), Max(bonus), Min(effort)
Technical Overview
Kylin needs HBase, Hive and HDFS (nice!). Regarding HDFS, Kylin does a lot of processing in MR by creating aggregate data for each N-cuboid of a cube; these jobs output HFiles for HBase. HBase, in turn, stores the cube metadata and cube aggregates, which makes sense for quick fetching of aggregate data. For the cube aggregate levels in HBase, dimensions are the row keys and the columns are the measure values. Hive is used for the data modeling: data needs to be in a star-schema-like format in Hive. Also, base-level data resides in Hive and not in the cube; the cube contains only aggregate data.
The Good - Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.
- ANSI SQL compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite for the query optimizer
Cautions
- MR overhead with building cubes ("query yesterday's data"). Lots of shuffling; does aggregations on the reduce side
- No cell-level security; security is at the cube and project level
- Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seems like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view

Security
There is security on projects and cubes, but no cell-level security. One idea around security is to create smaller cubes (i.e. segments) to provide security for users/groups. LDAP is also an option.
What's in HBase? Metadata and cube data. If you list the tables in HBase, you'll see this:
KYLIN_XXXXXXXXXXX (this is the cube)
kylin_metadata
kylin_metadata_acl
kylin_metadata_user
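A quick, hedged way to see those tables for yourself from a node with the HBase client installed:

# List the Kylin tables from the HBase shell
echo "list" | hbase shell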
Other Thoughts...
- Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with live data as opposed to an import.
- Kylin only puts aggregates in HBase; base-level data is still in Hive (i.e. Kylin doesn't do table scans).
- eBay (26TB / 16B rows) -> 90% of queries with <5 sec latency.
- MDX adoption is very low, therefore it's not currently supported.
- You can build up a cube of cubes (daily -> weekly -> monthly, etc.). These are called segments. The more segments, the slower performance can get (more scans).
Roadmap
- Streaming cubes
- Spark: 1) thinking about using Spark to speed up cubing MR jobs, 2) sourcing from Spark SQL instead of Hive, 3) routing queries to Spark SQL
- Find more articles tagged with:
- Data Processing
- FAQ
- kylin
- olap
12-28-2015
08:41 PM
1 Kudo
This worked for me. A few other simple things I needed to do (see the sketch below for the HDFS folder):
- create a /kylin folder in HDFS
- add 7070 to the port forwarding of the sandbox VM
- make sure HBase is started
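For the first item, a hedged sketch of creating the working directory; the kylin owner is an assumption, so adjust it to whatever user runs Kylin in your sandbox:

# Create Kylin's HDFS working directory and hand it to the kylin user (assumed user name)
sudo -u hdfs hdfs dfs -mkdir -p /kylin
sudo -u hdfs hdfs dfs -chown kylin /kylin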
12-23-2015
06:50 PM
2 Kudos
The best you can do is export from a single component (i.e. a table), take a screenshot of the dashboard, or export the dashboard to load into another Banana instance. The reason you can't do an offline dashboard is that you would need the entire index: dashboards typically contain summarized data and/or a subset of detailed records, and for the dashboard to remain interactive (search, filter, faceting, etc.) you would need the entire data set offline because it does all of the counts/aggregations on the fly.
12-15-2015
02:38 PM
6 Kudos
Repo Description
This demonstration simulates real-time set-top box data capture with simple event processing and real-time search & discovery. Products used: Java, Kafka, SOLR/Banana, NiFi

Repo Info
Github Repo URL: https://github.com/chriscasano/SetTopBox
Github account name: chriscasano
Repo name: SetTopBox