09-19-2016
11:06 PM
4 Kudos
This has come up a few times: after a Banana deployment on SOLR, you may notice that you can't save your dashboards in Banana. To enable saving, you have to create the index that stores these dashboards. All you need to do is run the following statement, which creates a banana-int core:
sh ${SOLR_HOME}/bin/solr create_core -c banana-int -d ../server/solr-webapp/webapp/banana/resources/banana-int-solr-5.0/conf
Then restart SOLR:
sh ${SOLR_HOME}/bin/solr restart
Then you can:
1) Save your dashboard
2) Access your saved dashboard
Happy searching!
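To confirm the core was created, you can hit the CoreAdmin STATUS endpoint (a quick optional check; adjust host and port to your install):
curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=banana-int"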
08-04-2016
11:04 AM
7 Kudos
Joining Collections in SOLR (Part 1)
Sometimes you may want to inner join data from one Solr collection to another. SOLR provides a facility for this via a join query. The easiest way to perform the join is to link a single attribute in one collection to an attribute in another collection. This join works very well for standalone indexes, but does not work for distributed indexes; handling the distributed case is covered in Part II of this article.
To demonstrate, let's say we have two collections: Sales, which contains the amount of sales by region, and People, which lists people by region along with a flag indicating whether they are a manager. Our goal is to find all of the sales by manager. To do this, we will join the collections using region as our join key and filter the people data on the manager flag.
Here is the filter query (fq) in solr on how to make this happen:
fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
Let's use an actual example to show the functionality...
First let's create a sales collection and populate it:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=sales&instanceDir=/opt/hostname-hdpsearch/solr/server/solr/sales&configSet=basic_configs"
We'll populate it with data using the Solr Admin UI: select the sales core, choose Documents, set Document Type to CSV, paste the values below into the text box, and click Submit Document. It's a very simple way to index sample data (a command-line alternative is sketched after the CSV).
id,region_s,sales_i
1,east,100000
2,west,200000
3,north,300000
4,south,400000
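If you prefer the command line, the same CSV can be posted to the collection's update handler (a sketch, assuming the rows above are saved locally as sales.csv; adjust host and port to your install):
curl "http://localhost:8983/solr/sales/update?commit=true" -H 'Content-type:text/csv' --data-binary @sales.csv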
Now create our second collection, people:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
This time, upload the following data into the people collection:
id,name_s,region_s,salary_i,mgr_s
1,chris,east,100000,yes
2,jen,west,200000,yes
3,james,east,75000,no
4,ruby,north,50000,yes
5,charlotte,west,120000,yes
Finally let’s run our join query to produce the results we are looking for.
http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
You should see the following results:
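The original screenshot isn't reproduced here, but with the sample data above the filter keeps only the sales documents whose region has at least one manager (east, west and north), so the response docs should look roughly like:
"docs": [
  { "id": "1", "region_s": "east",  "sales_i": 100000 },
  { "id": "2", "region_s": "west",  "sales_i": 200000 },
  { "id": "3", "region_s": "north", "sales_i": 300000 }
]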
If you would like to run the same functionality using compound join keys (i.e. two or more join keys), the best thing to do is concatenate those keys on ingest to create a single join key, as sketched below.
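For example (hypothetical fields), if you needed to join on both region and department, you could add a combined field to both collections at ingest time, e.g. a region_dept_s column with values like east_sales, and then join on that single field:
fq={!join from=region_dept_s to=region_dept_s fromIndex=people}mgr_s:yes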
Additionally, this functionality does not work with distributed indexes, i.e. multiple shards. If you attempt this on a distributed index with multiple shards, you'll get the following error message:
"error": { "msg": "SolrCloud join: multiple shards not yet supported people", "code": 400
In conclusion: joins between SOLR collections are useful but should be used with caution. As you can see, this query only works with simple, non-distributed collections. Additionally, you can only display fields from the sales collection and not the people collection, which is a total bummer. A more common practice is to pre-join the information before it's indexed. For joining collections with multiple shards, you could also attempt this with Spark. Stay tuned for how to do this in Part II of this post.
07-07-2016
11:25 PM
7 Kudos
Overview
I recently encountered a question where someone asked how to do preemption across YARN queues when a Spark job goes beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, and wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set that up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.
Hardware: 5 nodes of AWS EC2 r3.xlarge
Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb = 20480
yarn.scheduler.minimum-allocation-mb = 2560
High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
- Child queue "test1" with a min capacity of 50% and a max of 100%
- Child queue "test2" with a min capacity of 50% and a max of 100%
- Root queue with a fair ordering policy
3. Run Spark jobs
- Run a Spark job on test1 with max size containers, using as many Spark executors as possible
- Run a Spark job on test2 with max size containers, using dynamic resource allocation
1) Add YARN Preemption Properties
The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Option 1: Manual
Backup /etc/hadoop/conf/yarn-site.xml, then update it with the parameters above. Note: you must put these settings in XML format (see the snippet below). Restart YARN.
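For reference, each property goes into yarn-site.xml as a standard Hadoop property element; a minimal sketch for the first one (the others follow the same pattern):
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>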
Option 2: Ambari
The same parameters can be added through Ambari -> YARN -> Configs. You can turn preemption on in the Settings tab; this sets yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added on the Advanced config tab under "Custom yarn-site". Click "Add Property", then add the following properties:
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Restart YARN.
2) Create Two YARN Queues with Fair Ordering
The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the YARN Queue Manager Ambari View.
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo
Option 1: Manual
Backup the original file: /etc/hadoop/conf/capacity-scheduler.xml
Update the file with the settings above: /etc/hadoop/conf/capacity-scheduler.xml
Run the following command from the /etc/hadoop/conf directory to refresh the queues:
yarn rmadmin -refreshQueues
Option 2: Ambari View
Using the YARN Queue Manager in Ambari, you can also apply the settings above to capacity-scheduler.xml through the GUI. Set up the YARN queues as follows: both the test1 and test2 queues should look exactly the same and roll up to the root queue, and the test1 and test2 queues should have the same configuration as shown below. The root queue should have an Ordering Policy of Fair with Enable Size Based Weight Ordering checked. Restart YARN.
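After refreshing or restarting, you can sanity-check the queue setup from the command line (an optional check, assuming the YARN CLI is on the path):
yarn queue -status test1
yarn queue -status test2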
3) Run Spark Jobs
sudo su - hdfs
cd /usr/hdp/current/spark-client
Run the following Spark job and make sure it runs over capacity on the test1 queue. Notice how we specify 5 executors and large containers:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it's running over capacity in the test1 queue.
Run a second Spark job on the test2 queue. Notice how this job does not specify the number of executors; that's because we are using Dynamic Resource Allocation in Spark, which became available in Spark 1.6 (prerequisite settings are noted below):
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Initially you should see the following behavior in the Resource Manager. And then... voilà! In a few seconds, YARN will preempt containers, and the second Spark job will take some containers from the first job so that you have a fair balance of resources across the root queue.
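One caveat: the second job only scales out if dynamic resource allocation is enabled for the cluster. A minimal sketch of the Spark-side settings involved (in spark-defaults.conf, assuming the YARN shuffle service has already been configured per the HDP docs):
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true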
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.
1) Install NiFi on your Mac: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html
2) Set up your machine to forward syslog messages to port 1514.
Backup your current syslog configuration (cp rather than mv, so the existing rules stay in place):
sudo cp /etc/syslog.conf /etc/syslog.conf.bkp
Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf
Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514
Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist
Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd
3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0
Connect the ListenSyslog processor to an output port with the relationship set to "success", then start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server; if so, try typing this on the command line and then verify the data flowing in NiFi:
syslog -s test message
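If nothing shows up in NiFi, one quick check (an optional sketch, assuming tcpdump is available, which it is by default on OS X) is to confirm the datagrams are actually reaching the loopback interface:
sudo tcpdump -i lo0 -n udp port 1514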
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the database as root and run:
cbd delete
Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:
export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass
This will override the default settings. Then:
cbd init
cbd start
On startup, your default settings should be displayed as changed.
12-29-2015
09:42 PM
8 Kudos
Kylin (pronounced "KEY LIN" / "CHI LIN") brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MapReduce and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.
A good video from the committers overviewing the project: https://www.youtube.com/watch?v=7iDcF7pNhV4
Definitions
Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions)
Cuboid - A "slice" or subset of a cube
Dimensions - Think of these as the alphanumeric columns that sit in the GROUP BY clause of a SQL query, e.g. Location, Department, Time
Measure - Think of these as the metric/numerical values that sit in the SELECT clause of a SQL query, e.g. SUM(value), MAX(bonus), MIN(effort)
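To make the dimension/measure mapping concrete, a query against a Kylin cube looks like ordinary SQL, with dimensions in the GROUP BY and aggregate measures in the SELECT (the table and column names here are illustrative only):
SELECT region, year, SUM(sales_amount) AS total_sales
FROM sales_fact
GROUP BY region, year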
Technical Overview
Kylin needs HBase, Hive and HDFS (nice!). Regarding HDFS, it does a lot of processing in MapReduce, creating aggregate data for each N-cuboid of a cube; these jobs output HFiles for HBase. In turn, HBase stores the cube metadata and cube aggregates, which makes sense for quick fetching of aggregate data. For the cube aggregate levels in HBase, dimensions become HBase row keys and the columns hold the measure values. Hive is used for the data modeling; data needs to be in a star-schema-like format in Hive. Also, base level data resides in Hive and not in the cube. The cube contains only aggregate data.
The Good - Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.
- ANSI SQL compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite for the query optimizer
Cautions
- MapReduce overhead with building cubes ("query yesterday's data"). Lots of shuffling; does aggregations on the reduce side
- No cell-level security; security is at the cube and project level
- Simple measures only (count, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seems like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view
Security
There is security on projects and cubes, but no cell-level security. One idea around security is to create smaller cubes (i.e. segments) to provide security for users/groups. LDAP is also an option.
What's in HBASE? Metadata and cube data. If you list the tables in HBase, you’ll see this:
KYLIN_XXXXXXXXXXX (This is the Cube)
kylin_metadata
kylin_metadata_acl
kylin_metadata_user
Other Thoughts...
- Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with live data as opposed to an import.
- Kylin only puts aggregates in HBase; base level data is still in Hive (i.e. Kylin doesn't do table scans).
- eBay (26TB / 16B rows) -> 90% of queries with <5sec latency.
- MDX adoption is very low, therefore it's not currently supported.
- You can build up a cube of cubes (daily -> weekly -> monthly, etc.); these are called segments. The more segments, the slower performance can get (more scans).
Roadmap
- Streaming cubes
- Spark: 1) thinking about using Spark to speed up cubing MR jobs, 2) source from SparkSQL instead of Hive, 3) route queries to SparkSQL