Created 12-06-2016 04:15 AM
1) MySQL stores user data such as app ID, contact details, etc. MySQL also stores ad data, such as the type of ad to be served and its target location.
2) We would like to capture location-based information from users' phones. Q. What's an ideal database to store this information in? Are there any bottlenecks when receiving vast amounts of location data from millions of phones, and how do we mitigate them?
3) While this location data is streaming in, if an ad is available (ad details are in MySQL), then the ad needs to be shown in real time as the user enters a geo-fenced area. Q. What's a real-time option to spot the user in a geo-fenced area, know there is an ad associated with that location, and show the ad in real time? Can this Hadoop technology (probably Storm) handle millions of users? Q. How do I take data from MySQL (for example, who the user is, which ad needs to be shown, and the content of the ad) into the Hadoop system when I spot a user in the geo-fenced area? Can I keep this information synced in Hadoop in real time as the ad is created? Which technology supports moving MySQL data to Hadoop in real time, and which database would be good to store this information? What happens when ad data is changed regularly - how do we sync it in Hadoop?
4) Once an ad is served, how do we store our actions? Which database would you recommend?
5) What technologies can we use for analytics to be performed on the data from MySQL and Hadoop's database? Again, how do we reduce bottlenecks using specific technologies so that we don't waste time pulling data from one DB to a local machine, processing it, and redoing everything? What's a good way to automate things from the beginning?
Thanks for taking the time to advise me. We are a small team of app developers and a Java developer. We are just starting to think of ads as a revenue stream and have been researching good solutions. Before we hire, we wanted to know what skill sets we should look for when setting up such a team. We prefer the Hortonworks platform as it's the only truly integrated Hadoop solution.
Created on 12-13-2016 03:36 PM - edited 08-19-2019 01:35 AM
@Hedart
Great use case!
Hadoop is very commonly added to an existing data architecture in this way - where you have traditional structured data, such as customer data, in an RDBMS, such as MySQL - and you want to capture new unstructured data, such as location-based information from users' phones. Answers to your questions below:
Q. What's an ideal database to store this [location-based data from users' phones] in?
This really depends on how you are going to use the data in turn. Since this use case is a real-time ad-serving application and the location data will be analyzed in real time, one would automatically consider Apache HBase. However, it may be very easy to parse the location out of the incoming data from users' phones and use it to serve ads in real time while the data is in motion. In that case, the location-based data you are collecting is used primarily for "deep thinking" once the data is at rest, such as BI / Analytics around campaigns. Apache Hive is the best database to consider for storing data for BI / Analytics purposes.
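As a concrete sketch of that at-rest option, a Hive table for the raw location events might be created from Java over the HiveServer2 JDBC driver roughly like this - the host, table name, columns, and HDFS location are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateLocationTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint - host/port are placeholders for your cluster
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Partitioning by event date keeps BI / Analytics scans cheap;
            // ORC gives good compression for append-only event data
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS location_events (" +
                "  user_id STRING, latitude DOUBLE, longitude DOUBLE, " +
                "  event_time TIMESTAMP) " +
                "PARTITIONED BY (event_date STRING) " +
                "STORED AS ORC " +
                "LOCATION '/data/location_events'");
        }
    }
}
```

This leads to your next question ...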
Q. Are there any bottlenecks when receiving vast amounts of location data from millions of phones, and how do we mitigate them?
To mitigate the risks around the effective, efficient movement of highly distributed, high-velocity / high-volume data such as in the use case above, we frequently recommend Hortonworks Data Flow, powered by Apache NiFi, as part of a Connected Data Platform involving Modern Data Applications at the intersection of Data-in-Motion and Data-at-Rest.
Hortonworks Data Flow (HDF) securely moves data from wherever it is to wherever it needs to go - regardless of size, shape, or speed - dynamically adapting to the needs of the source, the connection, and the destination. HDF was designed with exactly the real-world constraints you are concerned about in mind, bringing all of this data together for accurate, time-sensitive decision making - such as what you specify in your next question.
Q. What's a real-time option to spot the user in a geo-fenced area, know there is an ad associated with that location, and show the ad in real time? Can this Hadoop technology (probably Storm) handle millions of users?
HDF, powered by Apache NiFi, can parse the location-based data incoming from users' phones and spot whether a user is in a geo-fenced area in real time. It can then publish those events to Apache Kafka for processing in HDP - either through Apache Storm, as you suggest, or alternatively through Apache Spark Streaming - depending on your team's technical skill sets / preferences and the specific requirements of the processing that needs to be performed. That processing would most likely interact with fast data from Apache Phoenix to provide context. Together, HDF and HDP can handle millions of users in an easy, horizontally scalable fashion.
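To make the detection step concrete, here is a minimal sketch of the geo-fence test itself - the same logic could live in a Storm bolt, a Spark Streaming job, or a NiFi processor. The coordinates, radius, and class names are invented for illustration:

```java
public class GeoFence {
    private static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Great-circle (haversine) distance between two lat/lon points, in meters. */
    static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    /** True if the user's reported position falls inside a circular fence. */
    static boolean inFence(double userLat, double userLon,
                           double fenceLat, double fenceLon, double radiusMeters) {
        return distanceMeters(userLat, userLon, fenceLat, fenceLon) <= radiusMeters;
    }

    public static void main(String[] args) {
        // Hypothetical 200 m fence around a store; a matching event would be
        // published to a Kafka topic for the ad-serving topology to act on
        System.out.println(inFence(40.7485, -73.9857, 40.7484, -73.9855, 200.0));
    }
}
```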
Q. How do I take data from MySQL (for example, who the user is, which ad needs to be shown, and the content of the ad) into the Hadoop system when I spot a user in the geo-fenced area? Can I keep this information synced in Hadoop in real time as the ad is created? Which technology supports moving MySQL data to Hadoop in real time, and which database would be good to store this information? What happens when ad data is changed regularly - how do we sync it in Hadoop?
As mentioned above, the contextual information used during event processing (for example, who the user is and which ad needs to be shown when a user is spotted in a geo-fenced area) would be stored in Apache Phoenix. Depending on access requirements, the actual content of the ad could easily be stored in HDFS and simply referenced from Apache Phoenix.
Apache Phoenix is an open source, massively parallel relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store. It enables developers to access large datasets in real time through a familiar SQL (JDBC) interface.
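Because Phoenix speaks standard JDBC, the context lookup triggered by a geo-fence event is ordinary Java SQL. A minimal sketch, assuming a hypothetical ads table keyed by fence ID, with the ad creative held in HDFS and referenced by path as described above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AdLookup {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL points at the HBase ZooKeeper quorum (placeholder host)
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT ad_id, ad_title, hdfs_path FROM ads WHERE fence_id = ?")) {
            ps.setString(1, "fence-42");  // hypothetical fence id from the Kafka event
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // hdfs_path references the ad creative stored in HDFS
                    System.out.printf("serve ad %s (%s) from %s%n",
                        rs.getString("ad_id"), rs.getString("ad_title"),
                        rs.getString("hdfs_path"));
                }
            }
        }
    }
}
```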
There are multiple ways to keep this information synced with MySQL in real time. Depending on the specifics of your use case, you can employ a combination of batch and event-based replication, using tools such as Apache Sqoop and HDF. A good example of how to employ HDF to handle incrementally streaming RDBMS data that changes regularly can be found here.
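To illustrate the batch half of that replication, here is a simplified watermark-based sketch, assuming the MySQL ads table carries an updated_at column (table layout and watermark persistence are invented for illustration; Sqoop's incremental mode or an HDF flow automates the same pattern). Because Phoenix UPSERT overwrites rows by primary key, regularly changed ad data simply replaces the previous version:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class AdSync {
    public static void main(String[] args) throws Exception {
        // Last successful sync time; in practice this watermark would be persisted
        Timestamp watermark = Timestamp.valueOf("2016-12-01 00:00:00");

        try (Connection mysql = DriverManager.getConnection(
                 "jdbc:mysql://mysql-host:3306/adsdb", "user", "pass");
             Connection phoenix = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement read = mysql.prepareStatement(
                 "SELECT ad_id, fence_id, ad_title, hdfs_path FROM ads WHERE updated_at > ?");
             PreparedStatement write = phoenix.prepareStatement(
                 "UPSERT INTO ads (ad_id, fence_id, ad_title, hdfs_path) VALUES (?, ?, ?, ?)")) {
            read.setTimestamp(1, watermark);
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setString(1, rs.getString("ad_id"));
                    write.setString(2, rs.getString("fence_id"));
                    write.setString(3, rs.getString("ad_title"));
                    write.setString(4, rs.getString("hdfs_path"));
                    write.executeUpdate();
                }
            }
            phoenix.commit(); // Phoenix batches upserts until commit
        }
    }
}
```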
Q. Once an ad is served, how do we store our actions? Which database would you recommend?
We would recommend the same Connected Data Platform technologies described above - HDF (Apache NiFi) for capturing user actions and HDP (Apache Storm / Apache Spark / Apache Phoenix) for processing and storing those actions.
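For the action log itself, a Phoenix table keyed by user and time works well: writes land in HBase immediately and remain queryable by SQL. A minimal sketch with hypothetical table and column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class ImpressionLogger {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement ps = conn.prepareStatement(
                 "UPSERT INTO ad_impressions (user_id, served_at, ad_id, action) " +
                 "VALUES (?, ?, ?, ?)")) {
            ps.setString(1, "user-123");  // hypothetical identifiers
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setString(3, "ad-42");
            ps.setString(4, "SERVED");    // later: CLICKED, DISMISSED, ...
            ps.executeUpdate();
            conn.commit();
        }
    }
}
```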
Q. What technologies can we use for analytics to be performed on the data from MySQL and the Hadoop database? Again, how do we reduce bottlenecks using specific technologies so that we don't waste time pulling data from one DB to a local machine, processing it, and redoing everything? What's a good way to automate things from the beginning?
Again, as I described at the beginning, Apache Hive is the de facto standard for SQL queries in Hadoop and the best Hadoop database to use for analytics. HDP provides Apache Zeppelin - a completely open, web-based notebook that enables interactive data analytics. Interactive browser-based notebooks such as Zeppelin let data engineers, data analysts, and data scientists be more productive by developing, organizing, executing, and sharing code and visualizing results without resorting to the command line or needing cluster details. Notebooks allow these users not only to execute but to interactively work with long workflows, so they are a great way to automate things from the beginning. Apache Zeppelin allows you to work with data across MySQL, Apache Hive, and Apache Phoenix using Apache Spark - a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets.
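As a sketch of what that cross-source analysis looks like in code - the same snippet could run as a Zeppelin notebook paragraph - Spark's Java API can read the Hive impression data and the MySQL ad data and join them in one in-cluster job. Table names and connection details are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CampaignReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("campaign-report")
            .enableHiveSupport()   // gives spark.sql() access to Hive tables
            .getOrCreate();

        // Impressions captured via HDF land in Hive (placeholder table name)
        Dataset<Row> impressions = spark.sql(
            "SELECT ad_id, user_id FROM ad_impressions");

        // Ad metadata pulled straight from MySQL over JDBC (placeholder coordinates)
        Dataset<Row> ads = spark.read().format("jdbc")
            .option("url", "jdbc:mysql://mysql-host:3306/adsdb")
            .option("dbtable", "ads")
            .option("user", "user").option("password", "pass")
            .load();

        // Join in-cluster; no copying data down to a local machine
        impressions.join(ads, "ad_id")
            .groupBy("ad_title")
            .count()
            .show();
    }
}
```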
Created 12-14-2016 04:05 PM
@Hedart - Does this help answer your questions? If so, can you please accept my answer? If there are further questions you'd like to ask, please feel free to mention @Tom McCuch when you ask them, so I get notified.
Thanks. Tom
Created 12-13-2016 04:59 PM
Imgur moved from MySQL to HBase for their notifications.
https://medium.com/imgur-engineering/imgur-notifications-from-mysql-to-hbase-9dba6fc44183#.x1xf6lbsz
HBase and Phoenix are very easy to use for Java developers. An HDF cluster running NiFi, Storm, and Kafka requires minimal administration, and all the tools are Java-based and Java-oriented.
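For reference, writing a notification-style record with the plain HBase Java client looks roughly like this (table, column family, and row-key scheme are made up for the example; Phoenix layers SQL over the same store):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NotificationWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("notifications"))) {
            // Row key: user id + reversed timestamp keeps a user's newest
            // notifications adjacent and first in a scan
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            Put put = new Put(Bytes.toBytes("user-123#" + reversedTs));
            put.addColumn(Bytes.toBytes("n"), Bytes.toBytes("message"),
                          Bytes.toBytes("Your ad campaign report is ready"));
            table.put(put);
        }
    }
}
```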
Here are some articles I wrote on accessing Phoenix and Hive data with Java.
https://community.hortonworks.com/repos/65142/linkextractor.html
https://community.hortonworks.com/articles/65239/mp3-jukebox-with-nifi-1x.html
You can use Spark to leverage your HDFS, Hive, and Phoenix/HBase data:
https://databricks.com/blog/2016/02/02/an-illustrated-guide-to-advertising-analytics.html