Member since
09-28-2015
48
Posts
117
Kudos Received
8
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 908 | 07-05-2017 04:37 PM |
 | 369 | 07-07-2016 03:40 AM |
 | 460 | 04-28-2016 12:54 PM |
 | 716 | 04-13-2016 02:32 AM |
 | 431 | 04-11-2016 08:41 PM |
07-05-2017
04:37 PM
1 Kudo
I've seen something similar to this before when you have a circular flow failure on a processor, especially with an InvokeHTTP processor. The tempting thought is, "Ah, the request failed, so let me re-route it back to the InvokeHTTP processor because maybe it will work the 2nd, 3rd... nth time." The failure flow files loop back onto the originating processor, and you get this type of behavior because the flow files typically get penalized, have long expiration periods, and get re-queued again. This becomes a continuous failure loop. A better practice is to route the failure to a new processor instead and pay attention to the number of retries, expirations, etc. @Matt Clarke has a great post on this: https://community.hortonworks.com/questions/77336/nifi-best-practices-for-error-handling.html
... View more
03-30-2017
07:54 PM
Try using the EvaluateJsonPath processor if you want to pull out key attribute-value pairs. You can then map those AVPs by creating additional fields in the properties of the PutSolrContentStream processor. The NiFi flow from this marquee post shows the design pattern: https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
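As a minimal sketch (the attribute names and JSON paths here are hypothetical, not taken from the post above), set Destination to flowfile-attribute in EvaluateJsonPath and add one dynamic property per attribute you want to extract:
Destination: flowfile-attribute
twitter.handle: $.user.screen_name
twitter.text: $.text
The extracted attributes can then be referenced with NiFi Expression Language (e.g. ${twitter.handle}) when you define the additional fields in the PutSolrContentStream properties.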
... View more
03-30-2017
03:33 AM
When you started Solr, did you specify localhost:2181/solr as the ZooKeeper location? i.e. bin/solr start -c -z localhost:2181/solr. If not, make sure the ZooKeeper location you used when starting Solr is the same one set in the PutSolr properties. I've also had times where I had to recreate the index to get this error message to go away.
... View more
03-29-2017
05:01 PM
1 Kudo
Did you add the collection name to the properties in the PutSolrContentStream processor? What are the other properties you have in there?
... View more
03-29-2017
04:51 PM
1 Kudo
In the PutSolrContentStream processor, did you include the Collection name in the properties? What are the other properties that you have set in the processor?
... View more
03-29-2017
03:50 PM
6 Kudos
It comes down to your comfort level and the type of ETL you're trying to do to give you a proper recommendation. The biggest difference is that you have fewer GUIs (but some good ones!) to work with for ETL in the Hortonworks stack. If you're comfortable with some SQL, scripting, and programming, our stack is great for doing ETL at scale. Here's a breakdown of the tools and where you can use them in our stack.
ETL Options in Hortonworks
Extraction / Load - Apache Sqoop, Apache NiFi, SyncSort
Transformations - Apache Hive, Apache Spark, Apache Pig, Apache NiFi
Other items to consider for ETL work:
Orchestration - Ambari Workflow Manager (Oozie UI), Apache NiFi
Data Discovery - Apache Zeppelin, Apache SOLR
Additionally, ETL takes several forms in Hadoop, and ELT is the more common pattern. In a traditional Informatica ETL pattern, you would extract from source systems, transform in PowerCenter, and land in the target. In Hadoop, you'll typically extract from the source, land in Hadoop, transform, and land in the target (i.e. Hive). For this pattern, we would typically recommend Sqoop for E/L and Hive, Spark, or Pig for T (see the sketch at the end of this reply). EtL (little t) is another pattern used with streaming ingest pipelines: you extract or capture the source, do light transformation (i.e. preparation, conversions, enrichment, etc.), and then land into Hadoop. These light transformations are not typically batch oriented, and for this pattern we would typically recommend Apache NiFi.
Things that are not in the platform that you have to account for:
Master Data Repository
Cleansing Rules
Enrichment Modules (i.e. address cleansing)
Change Data Capture
Reusable Templates (except with NiFi)
In some cases you can use external services for the items above. Or, because the beauty of open source is that it's highly extensible, build or leverage integrations into other tools that may assist with cleansing, enrichment, etc. If you go back to the days before commercial ETL tools existed, you could build all of the items mentioned above as part of your overall data management environment.
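As a minimal sketch of the ELT pattern under assumed names (a hypothetical ORDERS table in a source MySQL database, a /landing/orders HDFS directory, and hypothetical orders_raw / orders_clean Hive tables):
sqoop import --connect jdbc:mysql://dbhost/sales --username etl_user -P --table ORDERS --target-dir /landing/orders -m 4
hive -e "INSERT OVERWRITE TABLE orders_clean SELECT order_id, UPPER(region), amount FROM orders_raw WHERE amount IS NOT NULL"
Sqoop handles the extract/load into HDFS, and the Hive statement is the transform that lands the cleaned result in the target table.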
... View more
03-11-2017
10:20 PM
@Scott Shaw Thanks Scott. This helps for now in that there are other factors we have to include when sizing / estimating for concurrency.
... View more
03-08-2017
03:45 AM
5 Kudos
Outside of YARN queues, more Node Managers, and HS2, is there a rule of thumb for scaling AtScale with more concurrent users? Do the Hybrid Query Service and Cache Manager have any scaling limits?
... View more
01-04-2017
09:00 PM
3 Kudos
Not 100% sure but you can try the main.css file here: $SOLR_HOME/server/solr-webapp/webapp/banana/src/css
... View more
09-20-2016
04:27 PM
@Bryan Bende @Artem Ervits this is helpful; I think we could be onto something. For a coprocessor, would it make sense to emit a REST call to get the transaction to NiFi, as opposed to having NiFi doing constant Gets? I'm not too familiar with HBase, but coprocessors remind me of triggers, which can be useful but slippery. For the two-phase commit, I believe the NiFi processor that would receive the "triggered" data would then have to ACK with HBase before transmitting further down the flow.
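As a rough illustration of the push idea (the host, port, and payload below are hypothetical, and this assumes a ListenHTTP processor in NiFi configured on that port with its default contentListener base path), the coprocessor could POST the committed row to NiFi:
curl -X POST -H "Content-Type: application/json" -d '{"rowkey":"trade-123","amount":5000}' http://nifi-host:9999/contentListener
The HTTP 200 response from ListenHTTP would act as the acknowledgment back to the HBase side before the flow file moves downstream.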
... View more
09-20-2016
04:04 AM
4 Kudos
How would you perform a two-phase commit between HBase and NiFi? Think of a trading system in FinServ. Once a piece of data is transacted (i.e. committed) in HBase (assume Omid / Tephra here), how can a push mechanism get that data into NiFi, and then how can NiFi acknowledge that it received the data from HBase?
... View more
09-19-2016
11:06 PM
4 Kudos
This has come up a few times. You'll sometimes notice after a Banana deployment in SOLR that you can't save your dashboards in Banana. To enable this, you have to create an index that stores these dashboards. All you need to do is run the following statement, which will create a banana-int index:
sh ${SOLR_HOME}/bin/solr create_core -c banana-int -d ../server/solr-webapp/webapp/banana/resources/banana-int-solr-5.0/conf
Then restart SOLR:
sh ${SOLR_HOME}/bin/solr restart
Then you can:
1) Save your dashboard
2) Access your saved dashboard
Happy searching!
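By the way, if you want to verify the index was created (assuming Solr is running on localhost:8983), you can check the core status with:
curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=banana-int"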
... View more
09-13-2016
04:18 AM
3 Kudos
Hi Arun - This post describes how you can use preemption with a fair policy in the capacity scheduler. It should give you a similar behavior to the fair scheduler but using a YARN capacity queue. https://community.hortonworks.com/articles/44079/yarn-pre-emption-with-spark-using-a-fair-policy.html
... View more
08-04-2016
11:04 AM
7 Kudos
Joining Collections in SOLR (Part 1)
Sometimes you may want to inner join data from one Solr collection to another. There is a facility to perform this action using a join query in SOLR. The easiest way to perform the join is by linking a single attribute in one collection to an attribute in another collection. This join works very well for standalone indexes but does not work well for distributed indexes; we'll cover the distributed case in Part II of this article.
To demonstrate, let's say we have two collections. Sales contains the amount of sales by region. The other collection, People, has people categorized by their region along with a flag indicating whether they are a manager. Let's say our goal is to find all of the sales by manager. To do this, we will join the collections using region as our join key and also filter the people data by whether or not they are a manager.
Here is the filter query (fq) in solr on how to make this happen:
fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
Let's use an actual example to show the functionality...
First let's create a sales collection and populate it:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=sales&instanceDir=/opt/hostname-hdpsearch/solr/server/solr/sales&configSet=basic_configs"
We'll populate it with data using the Solr Admin UI. Select the sales core, then choose Documents. Document Type should be CSV; paste the values below into the text box and then click Submit Document. It's a very simple way to index sample data.
id,region_s,sales_i
1,east,100000
2,west,200000
3,north,300000
4,south,400000
Now create our second collection, people:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
Upload the following data into the people collection the same way:
id,name_s,region_s,salary_i,mgr_s
1,chris,east,100000,yes
2,jen,west,200000,yes
3,james,east,75000,no
4,ruby,north,50000,yes
5,charlotte,west,120000,yes
Finally let’s run our join query to produce the results we are looking for.
http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
You should see the following results:
If you would like to run the same functionality using compound join keys (i.e. 2 or more join keys), the best thing to do is concatenate those keys on ingest to create a single join key (see the example below).
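For example (the field names here are hypothetical), if you needed to join on both region and department, you could index a single combined field such as region_dept_s in both collections:
id,region_dept_s,sales_i
1,east_retail,100000
2,west_wholesale,200000
and then join on that one field:
fq={!join from=region_dept_s to=region_dept_s fromIndex=people}mgr_s:yes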
Additionally, this functionality does not work with distributed indexes, i.e. multiple shards. If you attempt this on a distributed index with multiple shards, you'll get the following error message:
"error": { "msg": "SolrCloud join: multiple shards not yet supported people", "code": 400 }
In conclusion: joins between SOLR collections are useful but should be used with caution. As you can see, this query only works with simple, non-distributed collections. Additionally, you can only display the fields from the sales collection and not the people collection, which is a total bummer. A more common practice is to pre-join the information before it's indexed. For joining collections with multiple shards, you could also attempt this with Spark. Stay tuned for how to do this in Part II of this post.
... View more
07-07-2016
11:25 PM
7 Kudos
Overview: I recently encountered a question where someone asked how you can do preemption across YARN queues when a Spark job goes beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, but wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set it up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.
Hardware: 5 nodes of AWS EC2 r3.xlarge
Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb = 20480
yarn.scheduler.minimum-allocation-mb = 2560
High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
- Child queue "test1" with a min capacity of 50% and a max of 100%
- Child queue "test2" with a min capacity of 50% and a max of 100%
- Root queue with a fair ordering policy
3. Run Spark jobs
- Run a Spark job on test1 with max size containers for as many Spark executors as possible
- Run a Spark job on test2 with max size containers using dynamic resource allocation
1) Add YARN Preemption Properties
The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Option 1: Manual. Backup /etc/hadoop/conf/yarn-site.xml, then update /etc/hadoop/conf/yarn-site.xml with the parameters above. Note: you must put these settings in XML format (see the sketch at the end of this section). Restart YARN.
Option 2: Ambari. The parameters can be added to yarn-site.xml through Ambari -> YARN -> Configs. You can turn preemption on in the Settings tab; this will set yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added in the Advanced config tab under "Custom yarn-site". Click "Add Property", then add the following:
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Restart YARN.
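For reference, here is what one of these entries looks like in XML form when editing yarn-site.xml by hand (Option 1 above); the remaining properties follow the same pattern:
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>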
2) Create Two YARN Queues with Fair Ordering
The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the Ambari View - YARN Queue Manager.
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo
Option 1: Manual. Backup the original file /etc/hadoop/conf/capacity-scheduler.xml, then update the file with the settings above. Run the following command from the /etc/hadoop/conf directory to refresh the queues:
yarn rmadmin -refreshQueues
Option 2: Ambari View. Using the YARN Queue Manager in Ambari, you can apply the same settings to capacity-scheduler.xml through the GUI. Set up the YARN queues as follows: both the test1 and test2 queues should look exactly the same and roll up to the root queue, and both should have the same configuration as shown above. The root queue should have an Ordering Policy of Fair with Enable Size Based Weight Ordering checked. Restart YARN.
3) Run Spark Jobs
sudo su - hdfs
cd /usr/hdp/current/spark-client
Run the following Spark job and make sure it runs over-capacity on the test1 queue. Notice how we specify 5 executors and large containers.
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it's running over-capacity in the test1 queue. Then run a second Spark job on the test2 queue. Notice how this job does not specify the number of executors; that's because we are using dynamic resource allocation in Spark, which is available in Spark 1.6.
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Initially you should see the following behavior in the resource manager. And then... voila! In a few seconds, YARN will preempt and the 2nd Spark job will take some containers from the first job so that you have a fair balance of resources across the root queue.
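If you prefer the command line to the Resource Manager UI, one way to watch the queues rebalance (assuming the queue names above) is to poll the queue status:
yarn queue -status test1
yarn queue -status test2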
... View more
07-07-2016
03:40 AM
1 Kudo
There's a JIRA and PR already out there for this. You might want to track it here. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-987?jql=project%20%3D%20ZEPPELIN and Pull Request: https://github.com/apache/zeppelin/pull/993
... View more
05-12-2016
12:06 PM
@drussell Thanks. I saw this list too, but I'm not sure if it's least access. I can see places where you can create conditions in the policy so that you can only work in a specific VPC, such as:
"Condition": {
  "StringEquals": {
    "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
  }
}
Or even get really specific on resources. So instead of using "Resource": "*" in the policy, you can scope it down to the instances in a certain region. For example:
"Resource": "arn:aws:ec2:us-east-1::instance/*"
The list of roles is good, but the resources and conditions on those roles are just as important, especially as we discuss with the Info Sec team and justify why.
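As a rough sketch of how the resource and condition pieces fit together in one statement (the actions are illustrative only, and the account and VPC IDs remain placeholders):
{
  "Effect": "Allow",
  "Action": ["ec2:RunInstances", "ec2:TerminateInstances"],
  "Resource": "arn:aws:ec2:us-east-1:############:instance/*",
  "Condition": {
    "StringEquals": {
      "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
    }
  }
}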
... View more
05-12-2016
04:15 AM
2 Kudos
Many financial and highly regulated institutions use least-privilege access models when rolling out software features to end users. With Cloudbreak, what is the least-privilege access model that you would have to supply in AWS to make this work? Additionally, how can you ensure that the IAM accounts can manage their own VPC deployment while still ensuring that they can't change the configuration of other VPCs using the same account?
... View more
05-03-2016
01:09 PM
Here is a good sample to work from: https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2hvcnRvbndvcmtzLWdhbGxlcnkvemVwcGVsaW4tbm90ZWJvb2tzL21hc3Rlci8yQU5UREc4Nzgvbm90ZS5qc29u
Put your Raw_data1.txt into Hadoop:
%sh
hadoop fs -put Raw_data1.txt /tmp/Raw_data1.txt
Create a data set:
val dataset = sc.textFile("/tmp/Raw_data1.txt")
Define your schema, parse the rows, and register a table:
import sqlContext.implicits._
case class MyData(Users: String, City: String, Library: String)
val mydata = dataset.map(k => k.split(",")).map(k => MyData(k(0), k(1), k(2)))
mydata.toDF().registerTempTable("raw_table")
Create your query and visualize it:
%sql
select sum(Users), City, Library
from raw_table
group by City, Library
... View more
04-29-2016
09:10 PM
2 Kudos
If so, where is this typically set up? I don't see anything under Manage Resources for requesting a machine with a max cost, etc., but maybe there is an option somewhere else? Possibly a feature down the road?
... View more
04-28-2016
01:02 PM
2 Kudos
The Ambari visuals are typically used for monitoring metrics collected by Ambari; bringing in external data is not the intent of the Ambari Metrics visuals. I would consider using the new Hive View in HDP 2.4, which you can access in Ambari, to visualize your dummy data: put the dummy data in a Hive table and then use the Ambari Hive View to visualize it. A second option would be to use Zeppelin and bring that dummy data into a Spark data frame. http://hortonworks.com/hadoop-tutorial/apache-zeppelin-hdp-2-4/
... View more
04-28-2016
12:54 PM
2 Kudos
Alex - This makes sense to me. If you're tailing files or listening to syslog, NiFi has great, easy processors that can forward this information to a downstream search tool (SOLR, ES, etc.) or even persist the information in long-term storage (HDFS). You can encrypt and compress the data as you capture it, send it over a secure wire, and do the simple event processing you need in order to route the information to the appropriate end point. There are also processors such as ScanContent or RouteContent that allow you to route messages based on patterns (think regex and whitelists) found in the message payload (i.e. route errors here, info there) or create priorities for those messages.
The other place where NiFi helps tremendously is around data conversion, for example ConvertAvroToJSON, ConvertCSVToAvro, or AttributesToJSON. These help you get messages into the proper file streams to be indexed by your search tool.
The one place I would look at closely is the amount of log parsing you need to do. For unique formats, you may need to create a custom processor in NiFi to assist you with extracting log attributes. There are processors such as EvaluateXPath or EvaluateXQuery that let you use XPath to find attribute-value pair information in XML (and EvaluateJsonPath for JSON), which is extremely helpful and may be all you need. Otherwise, it's really easy to get started and play around with your use case to see if there's a fit.
... View more
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.
1) Install NiFi on your Mac: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html
2) Set up your machine to forward syslog messages to port 1514.
Backup your current syslog configuration:
mv /etc/syslog.conf /etc/syslog.conf.bkp
Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf
Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514
Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist
Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd
3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0
Connect the ListenSyslog processor to an output port and set the relationship to "success". Start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server. If so, try typing this on the command line and then verify the data flowing in NiFi:
syslog -s test message
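If nothing shows up, a quick way to confirm that something is actually bound to the port (assuming the defaults above) is:
sudo lsof -nP -i UDP:1514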
... View more
04-18-2016
06:25 PM
1 Kudo
Sure. You're importing a table from Oracle to HDFS, correct? This looks like you're trying to import it to a local file system on C:\, which Sqoop doesn't do. Also, your path is missing a \ after C: in your attached image. More importantly, the --target-dir argument is for a folder in HDFS. I would try importing into /tmp to see if the statement initially works. Try: sqoop import --connect jdbc:oracle:thin:hud_reader/hud_reader_n1le@huiprd_nile:1527 --username <hud_reader> --password <hud_reader_n1le> --table <DATAAGGRUN> --target-dir /tmp
... View more
04-18-2016
04:29 PM
1 Kudo
Why do you need --C:\hadoop\hdp\hadoop-2.7.1.2.4.0.0-169\sqoop_out? It's not preceding any argument. Can you try running without it?
... View more
04-13-2016
02:32 AM
4 Kudos
Hi Babu - It's more of a common approach to write out a new file. HDFS is essentially an append-only system, so creating a new file that's a derivative of the original is a very common practice. You can write a MapReduce program to output a file or use a Hive query to write query results to a new file. For example: INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable. This creates new file(s) with a modified change that's like an update; in this case, we're upper-casing myColumn in the myTable table.
... View more
04-13-2016
12:06 AM
4 Kudos
I haven't found this feature to work too well, and there are several JIRAs out there to correct some issues (AMBARI-15335, 13986, 15392, 15046, etc.): https://issues.apache.org/jira/browse/AMBARI-15335?jql=project%20%3D%20AMBARI%20AND%20status%20%3D%20Resolved%20AND%20text%20~%20%22hive%20view%20upload%22 It's probably still best to upload the file using the HDFS View, hadoop fs -put, or some other means and then create the Hive table. I believe the feature was targeted for 2.1 but never made it out until 2.2: https://issues.apache.org/jira/browse/AMBARI-13747
... View more
04-11-2016
08:41 PM
2 Kudos
You should use the Kylin UI for operations and the cube build. This includes selecting tables, defining dimensions, and deploying the cube build. For querying, you should use the Kylin ODBC driver to access the cubes. You can grab that driver here: http://kylin.apache.org/docs15/tutorial/odbc.html. You can use that driver with your BI tools (Tableau, Power BI, etc.). MDX is not widely used, so I don't believe it is supported with Kylin.
... View more
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the db using root and do a:
cbd delete
Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:
export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass
This will override the default settings. Then:
cbd init
cbd start
On startup, your default settings should be displayed as being changed.
... View more