09-19-2016
11:06 PM
4 Kudos
This has come up a few times: after a Banana deployment on SOLR, you may notice that you can't save your dashboards in Banana. To enable saving, you have to create the index that stores these dashboards. All you need to do is run the following statement, which creates a banana-int core:
sh ${SOLR_HOME}/bin/solr create_core -c banana-int -d ../server/solr-webapp/webapp/banana/resources/banana-int-solr-5.0/conf
Then restart SOLR:
sh ${SOLR_HOME}/bin/solr restart
Then you can:
1) Save your dashboard
2) Access your saved dashboard
Happy searching!
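To confirm the core was created, you can hit the CoreAdmin STATUS endpoint (a quick optional check; adjust host and port to your install):
curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=banana-int"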
08-04-2016
11:04 AM
7 Kudos
Joining Collections in SOLR (Part 1)
Sometimes you may want to inner join data from one Solr collection to another. SOLR provides a facility for this via a join query. The easiest way to perform the join is to link a single attribute in one collection to an attribute in another collection. This join works very well for standalone indexes, but does not work for distributed indexes; handling the distributed case is covered in Part II of this article.
To demonstrate, let's say we have two collections: Sales, which contains the amount of sales by region, and People, which lists people by region along with a flag indicating whether they are a manager. Our goal is to find all of the sales by manager. To do this, we will join the collections using region as our join key and filter the people data on the manager flag.
Here is the filter query (fq) in solr on how to make this happen:
fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
Let's use an actual example to show the functionality...
First let's create a sales collection and populate it:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=sales&instanceDir=/opt/hostname-hdpsearch/solr/server/solr/sales&configSet=basic_configs"
We'll populate it with data using the Solr Admin UI: select the sales core, choose Documents, set Document Type to CSV, paste the values below into the text box, and click Submit Document. It's a very simple way to index sample data (a command-line alternative is sketched after the CSV).
id,region_s,sales_i
1,east,100000
2,west,200000
3,north,300000
4,south,400000
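If you prefer the command line, the same CSV can be posted to the collection's update handler (a sketch, assuming the rows above are saved locally as sales.csv; adjust host and port to your install):
curl "http://localhost:8983/solr/sales/update?commit=true" -H 'Content-type:text/csv' --data-binary @sales.csv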
Now create our second collection, people:
curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
This time, upload the following data into the people collection:
id,name_s,region_s,salary_i,mgr_s
1,chris,east,100000,yes
2,jen,west,200000,yes
3,james,east,75000,no
4,ruby,north,50000,yes
5,charlotte,west,120000,yes
Finally let’s run our join query to produce the results we are looking for.
http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
You should see the following results:
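The original screenshot isn't reproduced here, but with the sample data above the filter keeps only the sales documents whose region has at least one manager (east, west and north), so the response docs should look roughly like:
"docs": [
  { "id": "1", "region_s": "east",  "sales_i": 100000 },
  { "id": "2", "region_s": "west",  "sales_i": 200000 },
  { "id": "3", "region_s": "north", "sales_i": 300000 }
]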
If you would like to run the same functionality using compound join keys (i.e. two or more join keys), the best thing to do is concatenate those keys on ingest to create a single join key, as sketched below.
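For example (hypothetical fields), if you needed to join on both region and department, you could add a combined field to both collections at ingest time, e.g. a region_dept_s column with values like east_sales, and then join on that single field:
fq={!join from=region_dept_s to=region_dept_s fromIndex=people}mgr_s:yes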
Additionally, this functionality does not work with distributed indexes, i.e. multiple shards. If you attempt this on a distributed index with multiple shards, you'll get the following error message:
"error": { "msg": "SolrCloud join: multiple shards not yet supported people", "code": 400
In conclusion: joins between SOLR collections are useful but should be used with caution. As you can see, this query only works with simple, non-distributed collections. Additionally, you can only display fields from the sales collection and not the people collection, which is a total bummer. A more common practice is to pre-join the information before it's indexed. For joining collections with multiple shards, you could also attempt this with Spark. Stay tuned for how to do this in Part II of this post.
07-07-2016
11:25 PM
7 Kudos
Overview
I recently encountered a question where someone asked how to do preemption across YARN queues when a Spark job goes beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, and wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set that up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.
Hardware: 5 nodes of AWS EC2 r3.xlarge
Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb = 20480
yarn.scheduler.minimum-allocation-mb = 2560
High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
- Child queue "test1" with a min capacity of 50% and a max of 100%
- Child queue "test2" with a min capacity of 50% and a max of 100%
- Root queue with a fair ordering policy
3. Run Spark jobs
- Run a Spark job on test1 with max size containers, using as many Spark executors as possible
- Run a Spark job on test2 with max size containers, using dynamic resource allocation
1) Add YARN Preemption Properties
The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Option 1: Manual
Backup /etc/hadoop/conf/yarn-site.xml, then update it with the parameters above. Note: you must put these settings in XML format (see the snippet below). Restart YARN.
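For reference, each property goes into yarn-site.xml as a standard Hadoop property element; a minimal sketch for the first one (the others follow the same pattern):
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>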
Option 2: Ambari
The same parameters can be added through Ambari -> YARN -> Configs. You can turn preemption on in the Settings tab; this sets yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added on the Advanced config tab under "Custom yarn-site". Click "Add Property", then add the following properties:
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Restart YARN.
2) Create Two YARN Queues with Fair Ordering
The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the YARN Queue Manager Ambari View.
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo
Option 1: Manual
Backup the original file: /etc/hadoop/conf/capacity-scheduler.xml
Update the file with the settings above: /etc/hadoop/conf/capacity-scheduler.xml
Run the following command from the /etc/hadoop/conf directory to refresh the queues:
yarn rmadmin -refreshQueues
Option 2: Ambari View
Using the YARN Queue Manager in Ambari, you can also apply the settings above to capacity-scheduler.xml through the GUI. Set up the YARN queues as follows: both the test1 and test2 queues should look exactly the same and roll up to the root queue, and the test1 and test2 queues should have the same configuration as shown below. The root queue should have an Ordering Policy of Fair with Enable Size Based Weight Ordering checked. Restart YARN.
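After refreshing or restarting, you can sanity-check the queue setup from the command line (an optional check, assuming the YARN CLI is on the path):
yarn queue -status test1
yarn queue -status test2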
3) Run Spark Jobs
sudo su - hdfs
cd /usr/hdp/current/spark-client
Run the following Spark job and make sure it runs over capacity on the test1 queue. Notice how we specify 5 executors and large containers:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it's running over capacity in the test1 queue.
Run a second Spark job on the test2 queue. Notice how this job does not specify the number of executors; that's because we are using Dynamic Resource Allocation in Spark, which became available in Spark 1.6 (prerequisite settings are noted below):
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Initially you should see the following behavior in the Resource Manager. And then... voilà! In a few seconds, YARN will preempt containers, and the second Spark job will take some containers from the first job so that you have a fair balance of resources across the root queue.
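One caveat: the second job only scales out if dynamic resource allocation is enabled for the cluster. A minimal sketch of the Spark-side settings involved (in spark-defaults.conf, assuming the YARN shuffle service has already been configured per the HDP docs):
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true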
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.
1) Install NiFi on your Mac: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html
2) Set up your machine to forward syslog messages to port 1514.
Backup your current syslog configuration (cp rather than mv, so the existing rules stay in place):
sudo cp /etc/syslog.conf /etc/syslog.conf.bkp
Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf
Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514
Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist
Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd
3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0
Connect the ListenSyslog processor to an output port with the relationship set to "success", then start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server; if so, try typing this on the command line and then verify the data flowing in NiFi:
syslog -s test message
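If nothing shows up in NiFi, one quick check (an optional sketch, assuming tcpdump is available, which it is by default on OS X) is to confirm the datagrams are actually reaching the loopback interface:
sudo tcpdump -i lo0 -n udp port 1514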
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the database as root and run:
cbd delete
Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:
export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass
This will override the default settings. Then:
cbd init
cbd start
On startup, your default settings should be displayed as changed.
12-29-2015
09:42 PM
8 Kudos
Kylin (pronounced "KEY LIN" / "CHI LIN") brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MapReduce and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.
A good video from the committers overviewing the project: https://www.youtube.com/watch?v=7iDcF7pNhV4
Definitions
Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions)
Cuboid - A "slice" or subset of a cube
Dimensions - Think of these as the alphanumeric columns that sit in the GROUP BY clause of a SQL query, e.g. Location, Department, Time
Measure - Think of these as the metric/numerical values that sit in the SELECT clause of a SQL query, e.g. SUM(value), MAX(bonus), MIN(effort)
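To make the dimension/measure mapping concrete, a query against a Kylin cube looks like ordinary SQL, with dimensions in the GROUP BY and aggregate measures in the SELECT (the table and column names here are illustrative only):
SELECT region, year, SUM(sales_amount) AS total_sales
FROM sales_fact
GROUP BY region, year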
Technical Overview
Kylin needs HBase, Hive and HDFS (nice!). Regarding HDFS, it does a lot of processing in MapReduce, creating aggregate data for each N-cuboid of a cube; these jobs output HFiles for HBase. In turn, HBase stores the cube metadata and cube aggregates, which makes sense for quick fetching of aggregate data. For the cube aggregate levels in HBase, dimensions become HBase row keys and the columns hold the measure values. Hive is used for the data modeling; data needs to be in a star-schema-like format in Hive. Also, base level data resides in Hive and not in the cube. The cube contains only aggregate data.
The Good - Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.
- ANSI SQL compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite for the query optimizer
Cautions
- MapReduce overhead with building cubes ("query yesterday's data"). Lots of shuffling; does aggregations on the reduce side
- No cell-level security; security is at the cube and project level
- Simple measures only (count, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seems like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view
Security
There is security on projects and cubes, but no cell-level security. One idea around security is to create smaller cubes (i.e. segments) to provide security for users/groups. LDAP is also an option.
What's in HBASE? Metadata and cube data. If you list the tables in HBase, you’ll see this:
KYLIN_XXXXXXXXXXX (This is the Cube)
kylin_metadata
kylin_metadata_acl
kylin_metadata_user
Other Thoughts...
- Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with live data as opposed to an import.
- Kylin only puts aggregates in HBase; base level data is still in Hive (i.e. Kylin doesn't do table scans).
- eBay (26TB / 16B rows) -> 90% of queries with <5sec latency.
- MDX adoption is very low, therefore it's not currently supported.
- You can build up a cube of cubes (daily -> weekly -> monthly, etc.); these are called segments. The more segments, the slower performance can get (more scans).
Roadmap
- Streaming cubes
- Spark: 1) thinking about using Spark to speed up cubing MR jobs, 2) source from SparkSQL instead of Hive, 3) route queries to SparkSQL