Member since 09-22-2015
24 Posts · 24 Kudos Received · 2 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 545 | 07-26-2016 02:23 PM
 | 1586 | 07-11-2016 12:46 PM
06-07-2018
03:40 PM
1 Kudo
We are pleased to announce the immediate availability of Hortonworks Data Flow (HDF) version 3.1.2 for both x86 and IBM Power Systems. This version is a maintenance release and includes critical bug fixes for NiFi, MiNiFi, Ranger, Streaming Analytics Manager (SAM), Schema Registry, and more. HDF 3.1.2 includes the following components:
- Apache Ambari 2.6.2
- Apache Kafka 1.0.0
- Apache NiFi 1.5.0
- NiFi Registry 0.1.0
- Apache Ranger 0.7.0
- Apache Storm 1.1.1
- Apache ZooKeeper 3.4.6
- Apache MiNiFi Java Agent 0.4.0
- Apache MiNiFi C++ 0.4.0
- Hortonworks Schema Registry 0.5.0
- Hortonworks Streaming Analytics Manager 0.6.0

The release and documentation are available at:
- Hortonworks Data Flow v3.1.2 Download: (link)
- Hortonworks Data Flow v3.1.2 Documentation: (link)
Tags: Data Ingestion & Streaming, FAQ, hdf
03-01-2018
12:15 PM
Hortonworks Data Flow v3.1.1 Documentation: (link)

Thank you to the Hortonworks Data Flow Development, Product Management, Quality Engineering, Partner Certification, Documentation, and Release Engineering teams.
Tags: Data Ingestion & Streaming, FAQ, flow, hdf, ibm, PowerSystems, streaming
12-21-2017
08:35 PM
We are pleased to announce the availability of Hortonworks Data Flow (HDF) v3.0.3 for IBM Power Systems on RHEL 7.2. This is an important release, as it is the second Hortonworks product ported to IBM Power Systems, specifically Power8 processors, in 2017. This release is a win for both IBM and Hortonworks customers: HDF v3.0.3 is the next generation of our open source data-in-motion platform and enables customers to collect, curate, analyze, and act on all data in real time, across the enterprise. Combined with the Hortonworks Data Platform (HDP) currently available for IBM Power Systems, we are improving the customer experience by simplifying how streaming analytics applications are created and deployed to deliver real-time analytics, while benefiting from the flexibility, cost of operation, and performance of IBM Power8 processors. For additional information, please refer to:
Hortonworks Data Flow v3.0.3 Documentation: (link)
Hortonworks Data Platform v2.6.3 / Ambari v2.6 Documentation: (link)
Tags: FAQ, ibm, PowerSystems
11-11-2017
11:38 AM
1 Kudo
We are pleased to announce the availability of Hortonworks Data Flow (HDF) version 3.0.2. This maintenance release of HDF is the final version of our flagship 3.0 product line and includes critical patches and bug fixes. An increasing number of Hortonworks customers are using HDF to meet their needs for enterprise flow and streaming use cases. The team is very pleased to deliver a high-quality, well-tested release before moving on to the next big release. The next release of HDF, version 3.1, is currently in development. For additional information, please refer to: Hortonworks Data Flow (bits) (docs)
Tags: Data Ingestion & Streaming, FAQ, hdf
11-11-2017
11:36 AM
2 Kudos
We are pleased to announce the certification of IBM Data Science Experience (DSX) Local V1.1.2.01 with Hortonworks Data Platform (HDP) 2.6.3 / Ambari 2.6 on RHEL 7. As an important part of the agreement between the two companies, IBM and Hortonworks collaborated on an extensive set of test cases to specifically validate IBM DSX with both Hortonworks Data Platform and Ambari. This version of IBM DSX integrates with Zeppelin 0.7.3 and allows users to configure the Livy interpreter to run workloads on HDP clusters (both secure and insecure). Users also have the option to launch their DSX jobs on either Spark1 or Spark2. This certification is a win-win for both DSX and HDP customers, as it brings a production-ready data science experience to HDP customers and at the same time provides DSX customers access to information stored in HDP data lakes with an enterprise-grade compute grid. An increasing number of Hortonworks customers are using data science to get greater value from their data and support use cases ranging from churn prediction and predictive maintenance to optimizing product placement and store layout. For additional information, please refer to:
IBM Data Science Experience Local (link)
Hortonworks Documentation (link) *The Data Science portlet is at the far right of the second row
Hortonworks Data Platform 2.6.3 / Ambari 2.6 (link)
Tags: Data Science & Advanced Analytics, DSX, FAQ, ibm
06-28-2017
11:56 PM
5 Kudos
IBM Spectrum Scale 4.2.3 has been certified with the Hortonworks Data Platform (HDP) 2.6 / Ambari 2.5 on IBM Power Systems. IBM and Hortonworks collaborated on an optimized and integrated solution that was validated against a comprehensive suite of integration test cases across the full stack of HDP components and Ambari. Testing covered secure and non-secure scenarios with Accumulo, Atlas, Falcon, Flume, HBase, HDFS, Hive, HiveServer2, Kafka, Knox, Mahout, MapReduce, Oozie, Phoenix, Pig, Spark, Sqoop, Storm, Tez, YARN, Zeppelin, and ZooKeeper. This certification is for Spectrum Scale software and hence applies to all deployment models of Spectrum Scale, including Elastic Storage Server (ESS). Further, this certification includes a paper certification for Hortonworks Data Flow (HDF) V3.0 for use with IBM Spectrum Scale. IBM's Power platform is already certified to run HDP and offers 3x price performance compared with x86. IBM ESS (a pre-integrated system powered by IBM Spectrum Scale) includes a software RAID function that eliminates the need for the three-way replication for data protection that is required with other solutions. Instead, IBM ESS requires just 30% extra capacity to offer similar data protection benefits. IBM Power Systems along with IBM ESS offer the most optimized hardware stack for running analytics workloads. Clients can enjoy up to a 3x reduction of storage and compute infrastructure on Power Systems and IBM ESS compared to commodity scale-out x86 systems. IBM Spectrum Scale is scheduled to be certified with HDP running on x86 systems by the end of July. Additional references:
https://hortonworks.com/blog/hdp-ibm-spectrum-scale-brings-enterprise-class-storage-place-analytics/
https://developer.ibm.com/storage/2017/06/16/top-five-benefits-ibm-spectrum-scale-hortonworks-data-platform/ https://www-03.ibm.com/press/us/en/pressrelease/51562.wss
01-25-2017
06:00 PM
2 Kudos
This is the first of a series of short articles on using Apache Spark with Hortonworks HDP for beginners. If you're reading this, you don't need me to define what Spark is; there are numerous references on the web that speak to its API, its data-structure-centric design, and its place as, in my opinion, one of the most important open source projects. The intent of this article is to let you know what helped me get started using Spark on the Hortonworks Data Platform. This is not a tutorial. I'm assuming you have access to an HDP 2.5.3 cluster or to the Hortonworks Sandbox for HDP 2.5.3 or above. Also, I'm going to assume that you are familiar with SQL, Apache Hive, and using the Linux/Unix Bourne shell.

The problem I was having, which pushed me to use Spark, was that using Hive for data processing was limiting. There were things I was used to in my past life as an Oracle DBA that were not available. Hive is a fantastic product, but Hive SQL didn't give me all the bells and whistles to do my job, plus complex Hive SQL statements can be time consuming. In my case, I needed to summarize a lot of time series data and then store that summarized data in a Hive table so others could query it using Apache Zeppelin. For the sake of this article, I'll keep the table layout simple:

txn_detail
  txn_date string,
  txn_action string,
  txn_value double

The example below will illustrate using the Spark command line to summarize data from a Hive table. There are a couple of ways to run Spark commands, but I prefer using a command line. The command-line tool, or more precisely the Spark REPL, is called spark-shell. See http://spark.apache.org/docs/latest/quick-start.html. Another good option is Apache Zeppelin, but we will use spark-shell. Starting up spark-shell is very easy and is executed from the Linux shell prompt by typing:

$ spark-shell

The standard spark-shell output is verbose; you can turn this down from within the shell with sc.setLogLevel("WARN").
Executing spark-shell will bring you to the scala> prompt. From the scala> prompt, the first thing we'll do is create a dataframe with the contents of the txn_detail table. But before executing a piece of SQL we need to define a SQL context object:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc);

Next, the command below will execute a SQL statement to query all rows from the txn_detail table and put the result set into a Spark dataframe called dataframe_A:

scala> val dataframe_A = sqlContext.sql("""select txn_date, txn_action, txn_value from txn_detail""");
Now that we have data in a dataframe we can summarize it, grouping on either txn_action or txn_date. The aggregate functions such as sum live in org.apache.spark.sql.functions, so import them first:

scala> import org.apache.spark.sql.functions._

Summarize on txn_date:

scala> dataframe_A.groupBy($"txn_date").agg(sum("txn_value").alias("txn_value")).show()

+----------+---------+
|  txn_date|txn_value|
+----------+---------+
|2015-12-27|     22.0|
|2015-12-28|     74.0|
|2015-11-20|     59.0|
|2015-12-29|     44.0|
|2015-11-21|     98.0|
|2015-11-22|     52.0|
|2015-11-23|     35.0|
|2015-11-24|     31.0|
|2015-11-25|     62.0|
|2015-11-26|     74.0|
|2015-11-27|     14.0|
|2015-09-21|     25.0|
|2015-10-20|     17.0|
|2015-09-22|     14.0|
|2015-11-29|     14.0|
|2015-10-21|     21.0|
|2015-09-23|     54.0|
|2016-12-01|     42.0|
|2015-10-22|     52.0|
|2015-09-24|     73.0|
+----------+---------+
only showing top 20 rows
Summarize on txn_action:

scala> dataframe_A.groupBy($"txn_action").agg(sum("txn_value").alias("txn_value")).show()

+----------+---------+
|txn_action|txn_value|
+----------+---------+
|      Open|     11.0|
|     Close|     99.0|
+----------+---------+

Let's store the summarized results for txn_date into a separate dataframe and then save those results off to a Hive table.

Save the result set into a new dataframe:

scala> val dataframe_B = dataframe_A.groupBy($"txn_date").agg(sum("txn_value").alias("txn_value"));

Create a temporary table. This will allow us to query it like any other Hive table:

scala> dataframe_B.registerTempTable("txn_date_temp");

Create a Hive table and save the data:

scala> sqlContext.sql("""create table hive_txn_data as select * from txn_date_temp""");

Now that you have the data summarized in the hive_txn_data Hive table, users can query it using Apache Zeppelin or any other tool.

Summary

There are numerous ways to perform this type of work, but using Spark is a very efficient way to summarize data and execute calculations. In coming articles, I'll discuss other features of Spark.
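To see the shape of the groupBy/sum result without a cluster, here is a minimal plain-Python sketch of the same aggregation logic. The sample rows and the group_sum helper are made up for illustration; in the article the data comes from the txn_detail Hive table.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for txn_detail:
# (txn_date, txn_action, txn_value)
rows = [
    ("2015-12-27", "Open", 22.0),
    ("2015-12-27", "Close", 44.0),
    ("2015-12-28", "Open", 74.0),
]

def group_sum(rows, key_index):
    """Sum txn_value per group key, like groupBy(col).agg(sum("txn_value"))."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_date = group_sum(rows, 0)    # {'2015-12-27': 66.0, '2015-12-28': 74.0}
by_action = group_sum(rows, 1)  # {'Open': 96.0, 'Close': 44.0}
```

The Spark version distributes exactly this computation across the cluster and can persist the result as a Hive table.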
For additional Hortonworks tutorials, check out: http://hortonworks.com/tutorials/
Tags: Hadoop Core, hdp-2.5.0, How-ToTutorial, Spark, SparkSQL
12-07-2016
03:44 PM
What about registering a temp table and then creating a static table to hold onto the results? Drop/recreate as needed.
11-30-2016
04:47 PM
1 Kudo
Over the last year, Oracle has continued to update and add support for Hortonworks HDP. Below is a list of products which support Hortonworks HDP 2.5.0.0:

Big Data SQL
https://www.oracle.com/database/big-data-sql/index.html

Big Data Connector
https://www.oracle.com/database/big-data-connectors/certifications.html
Includes:
- Oracle SQL Connector for HDFS
- Oracle Loader for Hadoop
- Oracle Data Integrator
- Oracle XQuery for Hadoop
- Oracle R Advanced Analytics for Hadoop
- Oracle Datasource for Hadoop

Spatial and Graph
https://www.oracle.com/database/spatial/index.html

GoldenGate for Big Data
https://www.oracle.com/middleware/data-integration/goldengate/big-data/index.html

Oracle Data Integrator Enterprise Edition
https://www.oracle.com/middleware/data-integration/enterprise-edition-big-data/index.html

Big Data Discovery
https://www.oracle.com/big-data/big-data-discovery/index.html
Tags: FAQ, Oozie, oracle, oracle big data discovery, oracle data integration, oracle data integrator, oracle goldengate, solutions
11-16-2016
02:27 PM
My hack was to find the JSON file for the notebook and delete the paragraph from there. This worked, but your method is much cleaner.
11-16-2016
02:26 PM
Restarting the service didn't make a difference. Once it was available and I reopened the notebook, the paragraph was still trying to render the scatter graph and hung.
11-15-2016
08:06 PM
I have a very large dataset returned from SparkSQL, but only three columns. When I use the scatter plot option, Zeppelin hangs with the spinning pinwheel. It is clearly running, as my laptop CPU usage is high and drives the fan on. Is there any way to kill off this long-running process or remove the paragraph from my notebook?
Labels: Apache Hadoop, Apache Spark, Apache Zeppelin
07-26-2016
02:23 PM
1 Kudo
For production systems, you need to stay with stack components that are certified with your HDP version. Changing versions of components not tested together could cause more trouble for you, plus lead to potential supportability questions. I suggest sticking to what's tested together.
07-13-2016
12:38 PM
With HDP 2.5 on the horizon, a new version of the Sandbox will be released as well. Updates to existing versions are not an option, which means that you'll need to wait until the next version or try to upgrade it yourself.
07-11-2016
03:02 PM
One idea is to use an ExecuteProcess processor and pass the raw date into a shell command like this:

echo "2016-07-06T15:04:33.332+00:00" | awk -F"T" '{print $1}' | sed 's/-/\//g'

which produces the format "2016/07/06".
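If you would rather do the reformat in a script step than a shell pipeline, the same transformation is a one-liner in Python (the function name here is just for illustration):

```python
def to_slash_date(ts):
    """Keep the date part of an ISO-8601 timestamp and swap '-' for '/'."""
    return ts.split("T", 1)[0].replace("-", "/")

print(to_slash_date("2016-07-06T15:04:33.332+00:00"))  # prints 2016/07/06
```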
07-11-2016
12:46 PM
2 Kudos
If you are using HDF 1.2, unfortunately the ExecuteSQL processor doesn't work with Hive yet. The Hive processor is on the roadmap and may be included in the next release.
06-17-2016
12:04 AM
Another option to consider is a transactional replication tool like Oracle GoldenGate or DbVisit. Both tools capture changes made to a database by monitoring the transaction logs, so it's completely under the covers and would not require changing a schema or adding triggers. For GoldenGate, once a transaction is captured it can write the change out to Kafka. From there, NiFi could pick the transaction up and replicate it to the cluster. DbVisit has a similar capability and may integrate directly with NiFi.
10-22-2015
04:56 PM
6 Kudos
One of the most interesting facts about Oracle is that many of their products are agnostic when it comes to working with Hadoop distributions. Hortonworks and Oracle have been working closely for over two years on product certifications and building a strong partnership across both the PM and Engineering organizations. Below are four key products in the Oracle portfolio that are certified and supported with Hortonworks.
Oracle Data Integration (Oracle Data Integrator, Oracle GoldenGate)
- Oracle Data Integrator is certified on HDP 2.1 and 2.2
- Oracle GoldenGate Big Data Edition is certified with HDP 2.2
- Both are currently working on HDP 2.3 certification (as of Oct '15)
- Blogs and helpful articles: Oracle Big Data Integration Certified on HDP 2.2; Oracle Big Data Integration with Hortonworks Series

Oracle Big Data Connector (BDC)
- Certified for HDP 2.2, less R Analytics; working with the BDC PM/Dev team to complete certification
- Blogs and helpful articles: Oracle Big Data Connector Certification

Oracle Big Data Discovery (BDD)
- Certified for HDP 2.2.4 – 2.3 and utilizes Spark on YARN for data processing

This article will continue to be updated as new certifications are completed.
Tags: oracle, oracle big data discovery, oracle data integration, oracle data integrator, oracle goldengate, solutions