Member since 10-08-2015
87 Posts
143 Kudos Received
23 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1044 | 03-02-2017 03:39 PM |
 | 4744 | 02-09-2017 06:43 PM |
 | 14292 | 02-04-2017 05:38 PM |
 | 4633 | 01-10-2017 10:24 PM |
 | 3624 | 01-05-2017 06:58 PM |
12-21-2016
03:44 PM
2 Kudos
@Warren Tracey Although Kylin is an Apache project, it is not currently part of the Hortonworks Data Platform. It is worth noting that Kylin limits your BI options somewhat, as it is oriented toward Microsoft PowerPivot and Tableau. Alternatively, we at Hortonworks frequently engage AtScale with our customers in the field. AtScale provides a unified semantic layer based on "Virtual Cubes". Virtual Cubes allow the user to create models with measures and dimensions, just like OLAP, but on large volumes of data stored in Hadoop. Users can 'scale out' their BI because the cube is 'virtual': they can query multiple years, lines of business, brands, etc. all from one virtual cube, scaling out to millions or billions of rows of data. The AtScale Adaptive Cache automatically generates 'smart aggregates' that learn to anticipate user BI and OLAP queries, so you maintain scale, performance and control across your Hadoop cluster. AtScale can leverage any SQL engine under the covers (including Hive or Spark), and works with any BI tool (including those that require an MDX interface rather than a SQL interface). If AtScale is not of interest, I can dig further into Apache Kylin or Apache Lens for you, but your support options may be limited until the respective communities around those projects grow.
12-17-2016
11:28 PM
3 Kudos
@Kaliyug Antagonist "As a realistic alternative, which tool/plugin/library should be used to test the MR and other Java/Scala code 'locally' i.e on the desktop machine using a standalone version of the cluster?" Please see the hadoop-mini-clusters GitHub project. hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE, without the need for a sandbox or a full-blown development cluster, so you can debug with the full power of the IDE.
12-14-2016
04:05 PM
@Hedart - Did this answer your questions? If so, could you please accept my answer? If you have further questions, please mention @Tom McCuch when you ask them so that I get notified. Thanks. Tom
12-13-2016
03:36 PM
4 Kudos
@Hedart
Great use case!
Hadoop is very commonly added to an existing Data Architecture in this way - where you have traditional structured data, such as customer data, in an RDBMS, such as MySQL - and you want to capture new unstructured data, such as location-based information from users' phones. Answers to your questions below:
Q. What's an ideal database to store this [location-based data from user's phones] in?
This really depends on how you are going to use the data in turn. Since this use case is a real-time ad-serving application, and this location data will be analyzed in real-time, one would automatically consider Apache HBase. However, it may be very easy to parse the location out of the location-based data from the users' phones and use it to serve ads in real-time while the data is in motion. In that case, the location-based data you are collecting is used primarily for "deep thinking" once the data is at rest, such as BI / Analytics around campaigns. Apache Hive is the best database to consider for storing data for BI / Analytics purposes. This leads to your next question ...
Q. Are there any bottlenecks when receiving vast amounts of location data from millions of phones and how to mitigate it?
To mitigate the risks around the effective, efficient data movement of highly distributed, high velocity / high volume data such as in the use case above, we frequently recommend Hortonworks Data Flow, powered by @apachenifi, as part of a Connected Data Platform involving Modern Data Applications at the intersection of Data-in-Motion and Data-at-Rest.
Hortonworks Data Flow (HDF) securely moves data from wherever it is to wherever it needs to go, regardless of size, shape, or speed, dynamically adapting to the needs of the source, the connection, and the destination. HDF was designed with the real-world constraints you are concerned about in mind:
power limitations,
connectivity fluctuations,
data security and traceability,
data source diversity and
geographical distribution,
... altogether, for accurate, time-sensitive decision making - such as what you specify in your next question.
Q. What's a real time option to spot the user in a geo-fenced area, know there is an ad associated with that location and show the ad real-time. Can this hadoop technology (probably storm) handle millions of users?
HDF, powered by Apache NiFi, can parse the location-based data incoming from the users' phones and spot whether a user is in a geo-fenced area in real-time. It can then publish those events to Apache Kafka for processing in HDP - either through Apache Storm as you suggest, or alternatively through Apache Spark Streaming - depending on your team's technical skillsets / preferences and the specific requirements of the processing that needs to be performed. That processing would most likely interact with fast data from Apache Phoenix to provide context. Together, HDF and HDP can handle millions of users in an easy, horizontally scalable fashion.
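To make "spot if the user is in a geo-fenced area" concrete, here is a minimal sketch of the check itself in plain Python. It is not NiFi or Storm code. It assumes circular fences defined by a centre point and a radius, and the names `GeoFence`, `haversine_m` and `matching_fences` are made up for illustration. Real fences are often polygons, which would need a different test.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


@dataclass(frozen=True)
class GeoFence:
    """Hypothetical circular fence: centre point plus radius in metres."""
    name: str
    lat: float
    lon: float
    radius_m: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def matching_fences(lat, lon, fences):
    """Return the names of all fences that contain the given point."""
    return [f.name for f in fences
            if haversine_m(lat, lon, f.lat, f.lon) <= f.radius_m]


if __name__ == "__main__":
    fences = [GeoFence("store-1", 40.7580, -73.9855, 200.0)]
    print(matching_fences(40.7585, -73.9855, fences))  # point roughly 55 m north of the centre
```

With millions of users and many fences, a linear scan like this would normally be replaced by a spatial index or a geohash lookup. The per-event test stays the same.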
Q. How do I take data from MySQL (for example, who is the user, which ad needs to be shown and the content of the ad) within the Hadoop System when I spot a user in the Geo-fenced area? Can I keep this information synced in Hadoop in real-time as the ad is created? Which technology supports moving MySQL data to Hadoop real-time and which database would be good to store this information? What happens when ad data is changed regularly - how do we sync it in Hadoop?
As mentioned above, the contextual information used during event processing (for example, who is the user and which ad needs to be shown when a user is spotted in a geo-fenced area) would be stored in Apache Phoenix. Depending on access requirements, the actual content of the ad could easily be stored in Apache HDFS and simply referenced from Apache Phoenix.
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. It enables developers to access large datasets in real-time through a familiar SQL interface, providing:
Standard SQL and JDBC APIs with full ACID transaction capabilities
Support for late-bound, schema-on-read with existing data in HBase
Access to data stored and produced in other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce
There are multiple ways to keep this information synced with MySQL in real-time. Depending on the specifics of your use case, we can employ a combination of both batch and event-based replication solutions, using tools such as Apache Sqoop and HDF. A good example of how to employ HDF to handle incrementally streaming RDBMS data that is changed regularly can be found here.
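As a rough illustration of what "incrementally streaming RDBMS data" can mean, the sketch below polls a table for rows changed since the last poll, using a high-water mark on a modification column. This is one common technique, not necessarily what the linked example does. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the `ads` table, its `updated_at` column and the `fetch_changes` helper are all assumptions made up for the example.

```python
import sqlite3


def fetch_changes(conn, last_seen):
    """Return rows modified after `last_seen`, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, content, updated_at FROM ads "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old mark if nothing changed, otherwise advance to the newest row.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


def demo():
    """Simulate two polls against an in-memory table standing in for MySQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ads (id INTEGER PRIMARY KEY, content TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ads VALUES (?, ?, ?)",
        [(1, "coffee", 100), (2, "pizza", 110)],
    )
    first, mark = fetch_changes(conn, 0)          # initial load picks up both ads
    conn.execute(
        "UPDATE ads SET content = ?, updated_at = ? WHERE id = ?",
        ("espresso", 120, 1),
    )
    second, mark2 = fetch_changes(conn, mark)     # second poll sees only the edited ad
    conn.close()
    return first, mark, second, mark2


if __name__ == "__main__":
    print(demo())
```

This only works if every insert and update bumps `updated_at`. It does not capture deletes, and the strict `>` comparison can miss a row that shares a timestamp with the previous mark. Those gaps are the usual reason to combine polling with batch reloads or event-based replication.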
Q. Once an ad is served, how do we store our actions. Which database would you recommend?
We would recommend the same Connected Platform technologies described above - HDF (Apache NiFi) for capturing user actions and HDP (Apache Storm / Apache Spark / Apache Phoenix) for storing those actions.
Q. What technologies can we use for analytics to be performed on the data from MySQL and Hadoop database? Again, how do we reduce bottlenecks using specific technologies so that we don't waste time pulling data from one DB to another local machine and process and redo everything? What's a good way to automate things from the beginning.
Again, as I described at the beginning, Apache Hive is the de facto standard for SQL queries in Hadoop and the best Hadoop database to use for analytics. HDP provides Apache Zeppelin - a completely open web-based notebook that enables interactive data analytics. Interactive browser-based notebooks, such as Apache Zeppelin, enable data engineers, data analysts and data scientists to be more productive by developing, organizing, executing, and sharing data code and visualizing results without referring to the command line or needing the cluster details. Notebooks allow these users not only to execute long workflows but also to work with them interactively. As such, they are a great way to automate things from the beginning. Apache Zeppelin allows you to work with data across MySQL, Apache Hive, and Apache Phoenix using Apache Spark - a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
11-30-2016
02:57 PM
Thank you, Constantin!
11-29-2016
03:24 PM
@Fernando Lopez Bello Does this help? Is there more that I can help answer for you? If not, would you please accept my answer? Thanks. Tom
11-27-2016
05:54 PM
3 Kudos
@Fernando Lopez Bello You seem to have traditional reporting tools pretty well covered with the usual suspects: Pentaho, JasperReports, SpagoBI. For multi-dimensional analysis, you may want to look into Apache Druid. Airbnb Superset is a pretty popular data exploration platform for Druid, available under the Apache License. Apache Kylin is also an option. Kylin provides direct support for standard tools such as Tableau and Microsoft Excel/PowerBI. In addition to Banana, Kibana from Elastic is a pretty popular dashboard for real-time analytics available under the Apache License. Impetus also offers a free version of its StreamAnalytix product, but not under an Apache License. Hope this helps!
10-01-2016
12:01 AM
@Huahua Wei What version of Spark are you running? There is a JIRA for Spark 1.5.1 where the SparkContext stop method does not close HiveContexts.
09-24-2016
02:42 PM
2 Kudos
@Huahua Wei You need to explicitly stop the SparkContext sc by calling sc.stop(). In cluster settings, if you don't explicitly call sc.stop(), your application may hang. As with closing files, network connections, etc. when you're done with them, it's a good idea to call sc.stop(), which lets the Spark master know that your application is finished consuming resources. If you don't call sc.stop(), the event log information that is used by the history server will be incomplete, and your application will not show up in the history server's UI.
09-14-2016
12:52 AM
Thank you, Ameet! The Google browser key was not required by GCM when we first published this. I have heard from others that it is now a necessary part of the configuration.