Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3698 | 07-05-2017 04:37 PM |
| | 1384 | 07-07-2016 03:40 AM |
| | 1983 | 04-28-2016 12:54 PM |
| | 2565 | 04-13-2016 02:32 AM |
| | 1612 | 04-11-2016 08:41 PM |
07-19-2020
07:37 AM
Here we have listed a few ETL tools, both traditional and open source. Have a look at them and see for yourself which one suits your use case.

1. Panoply: Panoply is a cloud ETL provider combined with a data warehouse. With 100+ data connectors, ETL and data ingestion are quick and simple, with only a few clicks and a login between you and your newly integrated data. Under the hood, Panoply actually uses an ELT approach (rather than traditional ETL), which makes data ingestion much faster and more robust, since you don't have to wait for transformations to finish before loading your data. And since Panoply builds a managed cloud data warehouse for each customer, you won't need to set up a separate destination to store all the data you pull in using Panoply's ELT process. If you would rather use Panoply's rich set of data collectors to build ETL pipelines into an existing data warehouse, Panoply can also manage ETL processes for your Azure SQL Data Warehouse.

2. Stitch: Stitch is a self-service ETL data pipeline. The Stitch API can replicate data from any source and handle bulk and incremental data refreshes. Stitch also provides a replication engine that relies on multiple techniques to deliver data to customers. Its REST API supports JSON or Transit, which enables automatic detection and normalization of nested document structures into relational schemas. Stitch can connect to Amazon Redshift, Google BigQuery and Postgres, and it integrates with BI tools. Stitch is natively designed to collect, transform and load Google Analytics data into its own system, to automatically provide business insights on raw data.

3. Sprinkle: Sprinkle is a SaaS platform providing an ETL tool for organisations. Its easy-to-use UX and code-free mode of operation make it easy for technical and non-technical users to ingest data from multiple data sources and drive real-time insights on the data. A free trial lets users try the platform first and pay only if it fulfils their requirements.

Some of the open source tools include:

1. Heka: Heka is an open source software system for high-performance data gathering, analysis, monitoring and reporting. Its main component is a daemon program known as 'hekad' that provides the functionality for gathering, converting, evaluating, processing and delivering data. Heka is written in the Go programming language and has built-in plugins for inputting, decoding, filtering, encoding and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses the Advanced Message Queuing Protocol (AMQP) or TCP to transport data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing and anomaly detection on a data stream.

2. Logstash: Logstash is an open source data processing pipeline that ingests data from numerous sources simultaneously, transforming the source data and storing events into Elasticsearch by default. Logstash is part of the ELK stack: the E stands for Elasticsearch, a JSON-based search and analytics engine, and the K stands for Kibana, which enables data visualization. Logstash is written in Ruby and provides a JSON-like structure with a clear separation between internal objects. It has a pluggable framework featuring more than 200 plugins, enabling you to mix, match and orchestrate its facilities across different inputs, filters and outputs. This tool can be used for BI, or in data warehouses with fetch, transform and store event capabilities.

3. Singer: Singer's open source, command-line ETL tool lets users build modular ETL pipelines using its "tap" and "target" modules. Rather than building a single, static ETL pipeline, Singer provides a backbone that lets users connect data sources to storage destinations. With a large collection of pre-built taps, the scripts that collect datapoints from their original sources, and an extensive selection of pre-built targets, the scripts that transform and load data into specified destinations, Singer lets users write concise, single-line ETL processes that can be adjusted on the fly by swapping taps and targets in and out.
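To make the tap/target idea concrete, here is a minimal sketch of a Singer pipeline as it is typically run from the shell. The tap and target package names below are just common examples from the Singer ecosystem, not a recommendation; substitute whichever tap and target match your own source and destination.

# install one tap and one target (ideally in separate virtualenvs to avoid dependency clashes)
pip install tap-exchangeratesapi target-csv

# run the tap and pipe its JSON record stream straight into the target,
# which writes the incoming rows out as CSV files in the current directory
tap-exchangeratesapi | target-csv

# switching destinations is just a matter of swapping the target, for example:
# tap-exchangeratesapi | target-postgres --config postgres_config.json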
03-29-2017
05:01 PM
1 Kudo
Did you add the collection name to the properties in the PutSolrContentStream processor? What are the other properties you have in there?
03-11-2017
10:20 PM
@Scott Shaw Thanks Scott. This helps for now in that there are other factors we have to include when sizing / estimating for concurrency.
09-20-2016
07:52 PM
There was some discussion a long time ago about using HBase's replication endpoint to possibly push data to NiFi, but at the time it wasn't something that was needed. You can dig through the comment trail here for more info: https://issues.apache.org/jira/browse/NIFI-817, starting with the comment nicolas maillard added on 21/Sep/15 at 13:15.
09-19-2016
11:06 PM
4 Kudos
This has come up a few times. You'll sometimes notice that after a Banana deployment in Solr you can't save your dashboards in Banana. To enable this, you have to create an index that stores these dashboards. All you need to do is run the following statement, which will create a banana-int index:

sh ${SOLR_HOME}/bin/solr create_core -c banana-int -d ../server/solr-webapp/webapp/banana/resources/banana-int-solr-5.0/conf

Then restart Solr:

sh ${SOLR_HOME}/bin/solr restart

Then you can:

1) Save your dashboard
2) Access your saved dashboard

Happy searching!
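If you want to double-check that the new core actually exists before going back into Banana, you can query Solr's core admin API. The host and port below assume a default local Solr install; adjust them to your environment.

curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=banana-int"

If the response contains a status block for banana-int, the dashboard index is in place and your dashboards should save.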
04-10-2019
05:33 AM
Hey, it's a wonderful article. But can you share the link to part II of this article?
08-03-2016
08:31 AM
2 Kudos
Following are the changes in CentOS 7:

- New initialization system, systemd.
- New firewall control, firewalld. This adds a more dynamic and flexible way to control the firewall module in the kernel, which is still netfilter.
- New bootloader, GRUB2, which adds rich scripting support as well as support for the new hardware options offered on modern mainboards.
- New default filesystem, XFS. XFS adds support for larger single filesystems, faster format times (0 seconds), integrated snapshots, and live filesystem dumps for backup without first unmounting.
- GNOME 3. This only really applies to those who use RHEL/CentOS on the desktop, like me. As with any other distro, you aren't locked into GNOME 3. I personally like it, but KDE is readily available and others can be found on EPEL.

If you are used to previous versions you may want to stick with 6, since 7 has a lot of command changes. In our environment we are using CentOS 7 and still haven't faced issues with performance, etc.
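To give a feel for the command changes, here are a few CentOS 7 equivalents of common CentOS 6 commands. The sshd service and port 8080 are just examples; use your own service names and ports.

systemctl restart sshd # was: service sshd restart
systemctl enable sshd # was: chkconfig sshd on
firewall-cmd --permanent --add-port=8080/tcp # was: iptables -I INPUT -p tcp --dport 8080 -j ACCEPT (plus service iptables save)
firewall-cmd --reload # apply the permanent firewalld change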
03-27-2017
09:01 AM
@ccasano I set up the queues like above. Say I have 4 queues, Q1 to Q4, each with min 25% and max 100%. If I start a job on Q1 and it goes up to 100% utilization, and later I launch the same task on Q2, the new task will grow only up to 25% (its absolute configured capacity) and the old one will come back down to 75%. Is there a way I can distribute the resources equally here? i.e., the second job should grow beyond its minimum capacity until the queues are balanced equally. Thanks in advance!
05-12-2016
02:12 PM
Hi @ccasano, understood, I don't believe such a list exists right now, unless @lpapp knows differently, or could generate such a list?
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.

1) Install NiFi on your macOS machine: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html

2) Set up your machine to forward syslog messages to port 1514.

Back up your current syslog configuration:

mv /etc/syslog.conf /etc/syslog.conf.bkp

Edit your syslog.conf file to send all messages to UDP localhost port 1514:

sudo vi /etc/syslog.conf

Add the following entry to /etc/syslog.conf:

*.* @127.0.0.1:1514

Restart syslogd:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist

Confirm syslogd is running. The result should display a process id (PID) for /usr/sbin/syslogd:

ps -ef | grep syslogd

3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:

Protocol: UDP
Port: 1514
Local Network Interface: lo0

Connect the ListenSyslog processor to an output port and set the relationship to "success". Start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server. If so, try typing this on the command line and then verify the data is flowing in NiFi:

syslog -s test message