Member since: 07-07-2016
Posts: 79
Kudos Received: 17
Solutions: 13
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 645 | 08-01-2017 12:00 PM
 | 1397 | 08-01-2017 08:28 AM
 | 985 | 07-28-2017 01:43 PM
 | 1030 | 06-15-2017 11:56 AM
 | 978 | 06-01-2017 09:28 AM
07-19-2020
07:37 AM
Here are a few ETL tools, both traditional and open source; have a look and see for yourself which one suits your use case.

1. Panoply: Panoply combines a cloud ETL service with a managed data warehouse. With 100+ data connectors, ETL and data ingestion are quick and simple, with only a couple of clicks and a login between you and your newly integrated data. Under the hood, Panoply actually uses an ELT approach (rather than traditional ETL), which makes data ingestion much faster and more robust, since you don't have to wait for transformations to finish before loading your data. And since Panoply builds managed cloud data warehouses for each customer, you won't need to set up a separate destination to store all the data you pull in using Panoply's ELT process. If you would rather use Panoply's rich set of data collectors to set up ETL pipelines into an existing data warehouse, Panoply can also manage ETL processes for your Azure SQL Data Warehouse.

2. Stitch: Stitch is a self-service ETL data pipeline. The Stitch API can replicate data from any source and handle both bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to customers. Its REST API supports JSON or Transit, which enables automatic detection and normalization of nested document structures into relational schemas. Stitch can connect to Amazon Redshift, Google BigQuery, and Postgres, and integrates with BI tools. Stitch is typically used to collect, transform, and load Google Analytics data into its own system, to automatically provide business insights on raw data.

3. Sprinkle: Sprinkle is a SaaS platform providing an ETL tool for organisations. Its easy-to-use UX and code-free mode of operation make it easy for technical and non-technical users to ingest data from multiple data sources and drive real-time insights on the data. Its free trial enables users to try the platform first and pay only if it fulfils their requirements.

Some of the open source tools include:

1. Heka: Heka is an open source software system for high-performance data gathering, analysis, monitoring, and reporting. Its main component is a daemon program known as 'hekad' that provides the functionality for gathering, converting, evaluating, processing, and delivering data. Heka is written in the Go programming language and has built-in plugins for inputting, decoding, filtering, encoding, and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses the Advanced Message Queuing Protocol (AMQP) or TCP to transport data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing, and anomaly detection on a data stream.

2. Logstash: Logstash is an open source data processing pipeline that ingests data from numerous sources simultaneously, transforming the source data and storing events in Elasticsearch by default. Logstash is part of the ELK stack: the E stands for Elasticsearch, a JSON-based search and analytics engine, and the K stands for Kibana, which enables data visualization. Logstash is written in Ruby and provides a JSON-like structure with a clear separation between internal objects. It has a pluggable framework featuring more than 200 plugins, enabling you to mix, match, and orchestrate inputs, filters, and outputs. The tool can be used for BI, or in data warehouses with fetch, transform, and store event capabilities.

3. Singer: Singer's open source, command-line ETL tool allows users to build modular ETL pipelines using its "tap" and "target" plugins. Rather than building a single, static ETL pipeline, Singer provides a backbone that allows users to connect data sources to storage destinations. With a large assortment of pre-built taps (the scripts that collect datapoints from their original sources) and an extensive selection of pre-built targets (the scripts that transform and load data into pre-specified destinations), Singer lets users write concise, single-line ETL processes that can be adapted on the fly by swapping taps and targets in and out (see the sketch below).
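As a rough illustration of the tap-and-target model, here is a minimal sketch, assuming Python and pip are available and that the example tap and target packages from Singer's getting-started material are still published under these names:

```bash
# Hypothetical Singer pipeline: the tap reads from its source and emits JSON
# records on stdout; the target reads those records on stdin and loads them.
pip install tap-exchangeratesapi target-csv

# Swapping either side changes the source or destination without rewriting
# the rest of the pipeline.
tap-exchangeratesapi | target-csv
```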
04-28-2020
09:37 AM
Were you able to achieve it? I am facing the same issue.
02-20-2020
10:04 AM
Username: root / Password: hadoop
09-07-2017
02:36 AM
@Allen Niu This might be a tad late; however, if you want your team to become more experienced developers, I would certainly shoot for Spark and Hive. Both components have libraries and jars that support one another, and the Spark API makes learning how to develop in Java, Scala, and Python super easy. I personally started out coding in C# and translated those skills into Python and Scala for Spark MLlib, Spark Core, and Spark SQL. I might be a little biased, as I am a BIG SPARK junkie 😉 but the ability to clean data with Spark at scale is ABSOLUTELY brilliant. I know Hortonworks has some great development courses for Spark and Hive as well. Here is the link: https://hortonworks.com/services/training/certification/hdp-certified-spark-developer/ What was your final decision?
08-01-2017
12:00 PM
1 Kudo
@Zubair Jaleel There are many Kappa and other case studies presented at the DataWorks Summit (e.g. Ford, Yahoo, etc.). Videos and Slides are available for most sessions: https://dataworkssummit.com/san-jose-2017/agenda/
07-28-2017
01:43 PM
1 Kudo
@Kiran Kumar This should all be answered in the link below: https://hortonworks.com/agreements/support-services-policy/
07-31-2017
01:22 PM
Sorry for the late reply. I am attaching the blueprint file. It is a Jinja2 template; let me know if I should remove the template tags and just post the configuration. blueprint-multi-node-ha-1.txt
06-16-2017
06:25 AM
@Graham Martin Thanks for your reply. I think I have already defined the Tag Service and added it to the Hive policy. In the tag service I gave user "admin" permission to select all the tables/columns under the tag "Hive", and in the Hive policy I disabled user "admin"'s permission to select from all tables. So if the tag service works, "admin" should have permission to access all tables under the tag "Hive", but currently it is not working. Am I missing something here?
06-01-2017
01:41 PM
Thanks Graham and Robert. This is helpful.
06-01-2017
05:26 PM
Hi @Sharon Kirkham, I'm glad that you found the blog post helpful. The link to the Troubleshooting section is already there at the beginning of the post, but in case it doesn't stand out (I guess it does not, since you didn't notice it), I have added it again at the end.
05-25-2017
04:00 PM
@Christophe Vico I recommend you download the Sandbox: https://hortonworks.com/products/sandbox/ From Zeppelin, within a single notebook, you can run different versions of Spark (1.6.3 or 2.1) depending on your choice of interpreter: %spark.spark or %spark2.spark. You can review the settings used in the Interpreter screen. Regards,
04-13-2018
07:01 AM
@zahain @Shafi Ahmad Did you find the fix? Can you please share the solution?
03-22-2017
05:31 PM
@faraon clément The isolation is usually provided at the Logical level by Ranger. If you have multiple tenants - Project A and B, trying to manually manage data locality will get very difficult very quickly. Keep in mind that there is a Replication factor of 3 by default also (each block resides in 3x nodes in the cluster). The same is true for workloads (i.e. YARN Queue Management) - it is easier to manage logically, and assign tenants a % of resources - than try and carve up the cluster physically (though node labels can offer some flexibility). There are ways to achieve data locality (though non-trivial), and future versions may make this easier. Might be worth thinking through the requirements, and understanding the workloads, user interaction, etc., and then working out if locking down data to nodes makes sense. Locking down data locality is somewhat counter to Hadoop's core (smooth elastic scaling, etc.).
03-27-2017
09:22 PM
Hope this link helps you with HDFS user setup: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_views_guide/content/_setup_HDFS_user_directory.html
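In case that page moves, the gist of it boils down to commands along these lines; the user name is a placeholder, and they need to run as a user with HDFS superuser rights:

```bash
# Create the user's home directory in HDFS and hand ownership to that user.
sudo -u hdfs hdfs dfs -mkdir /user/maria_dev
sudo -u hdfs hdfs dfs -chown maria_dev:hdfs /user/maria_dev
```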
07-21-2017
08:42 AM
@George Meltser Unfortunately not. I have given up and I am now building the Hadoop stack manually without Ambari.
03-22-2017
02:40 AM
@Graham Martin ... Thanks a lot
11-02-2018
10:59 AM
Here are a couple of publicly available Git repos with fuzzy matching Hive UDFs: https://github.com/ychantit/fuzzymatch_hiveUDF https://github.com/rueedlinger/hive-udf
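Once built, either one follows the standard Hive UDF pattern. A rough sketch, where the jar path, class name, function name, and tables are all placeholders - check each repo's README for the actual class names:

```bash
# Register a fuzzy-match UDF and use it in a join-style comparison.
hive -e "
  ADD JAR /tmp/fuzzy-match-udf.jar;
  CREATE TEMPORARY FUNCTION fuzzy_distance AS 'com.example.LevenshteinUDF';
  SELECT a.name, b.name
  FROM customers a CROSS JOIN leads b
  WHERE fuzzy_distance(a.name, b.name) <= 2;
"
```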
03-03-2017
04:38 PM
Hi @christophe menichetti, As @Predrag Monodic mentioned, you can use Blueprints for non-UI based installs. Unfortunately, the UI wizard will not allow you to generate a Blueprint and Cluster Creation template after you have gone through all the screens. The simplest way to generate a Blueprint to start with is to try the following:
1. On a local VM cluster for testing (Vagrant, Docker, etc.), create a cluster that has the services, components, and configuration that you are interested in deploying in your production cluster.
2. Use the UI to deploy this local cluster, going through all the normal screens in the wizard.
3. You can then export the Blueprint from this running cluster. This REST call will generate a Blueprint based on the currently-running cluster you set up in Step #2.
4. Save this Blueprint and customize it as necessary.
5. Create a Cluster Creation Template that matches hostnames to the host groups from the exported Blueprint.
Please note that you may want to manually rename the host groups in the exported Blueprint, as they are generated using a "host_group_n" convention, which may not be useful for documenting your particular cluster. You can check out the following link on the Blueprints wiki to see how to make the REST call to export the Blueprint from a running cluster: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-APIResourcesandSyntax Hope this helps!
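For step 3, the export is a single GET against the Ambari REST API. A minimal sketch, assuming the default port 8080 and admin/admin credentials; replace the host, credentials, and cluster name with your own:

```bash
# Export the running cluster's blueprint as JSON.
curl -u admin:admin -H "X-Requested-By: ambari" \
  "http://ambari-host:8080/api/v1/clusters/MyCluster?format=blueprint" \
  -o exported-blueprint.json
```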
03-03-2017
09:19 PM
I would like to make sure you can write to the /tmp/tweets_staging directory. On Linux, as root: echo hello > /tmp/hello.txt, then as hdfs: hdfs dfs -put /tmp/tweets_staging/ (spelled out in the sketch below). Yes, the codec is an issue on certain versions of the Sandbox. As per the article, you can remove the string from the parameter in Ambari.
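Spelled out, the check looks roughly like this; the HDFS destination path is an assumption, so adjust it to wherever your flow expects the data:

```bash
# As root on the Linux host: confirm local writes work at all.
echo hello > /tmp/hello.txt

# As the hdfs user: copy the local staging directory into HDFS
# (the destination path below is a placeholder).
sudo -u hdfs hdfs dfs -put /tmp/tweets_staging /tmp/tweets_staging
```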
03-02-2017
08:23 PM
1 Kudo
@nedox nedox You will want to use one of the available HDFS processors to get data from your HDP HDFS file system.
1. GetHDFS <-- Use if standalone NiFi installation
2. ListHDFS --> RPG --> FetchHDFS <-- Use if NiFI cluster installation
All of the HDFS-based NiFi processors have a property that allows you to specify the path to the HDFS site.xml files. Obtain a copy of your core-site.xml and hdfs-site.xml files from your HDP cluster and place them somewhere on the HDF hosts running NiFi. Point to these files using the "Hadoop Configuration Resources" processor property, for example as sketched below. Thanks, Matt
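A rough sketch of that setup; the host names and directories are placeholders, so adapt them to your environment:

```bash
# Copy the HDP client configs onto the HDF host that runs NiFi.
scp hdp-master:/etc/hadoop/conf/core-site.xml /etc/nifi/hdfs-conf/
scp hdp-master:/etc/hadoop/conf/hdfs-site.xml /etc/nifi/hdfs-conf/

# Then set the processor's "Hadoop Configuration Resources" property to:
#   /etc/nifi/hdfs-conf/core-site.xml,/etc/nifi/hdfs-conf/hdfs-site.xml
```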
03-01-2017
02:02 PM
There are a few examples available on HCC, here's one my colleague created https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html
03-01-2017
04:05 PM
You can use the ExtractText processor and a regexp to filter what you need: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html You could also execute a script from an ExecuteScript processor - for example one that calls a shell script using standard regexps to filter (a sketch follows below). @ulung tama
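If you go the scripted route, the shell side can be as simple as a filter over stdin; this is just a sketch - the pattern is a placeholder, and how you invoke it (from ExecuteScript, or by streaming the FlowFile content through ExecuteStreamCommand) depends on your flow:

```bash
#!/usr/bin/env bash
# Hypothetical filter script: reads FlowFile content on stdin and keeps only
# the lines matching the pattern you care about.
grep -E 'ERROR|WARN'
```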
02-28-2017
04:26 AM
Thank you, Mr. Graham, for your valuable inputs.
10-11-2016
02:38 AM
How did you solve this? I have run into a similar problem.