
Deploying hadoop cluster for my web application

New Contributor

After two months of searching Google and reading and watching articles, videos, and seminars, I am still confused. I'm here to clear things up.

I have a web application that uses a MySQL database and an Apache server. My users upload images to their accounts, the images are stored on my server in a folder under public_html, and they are later served from that folder. This is a very common practice for web applications.

Now a problem has arisen: my image storage is growing large, so I would like to adopt a big data solution. I would like to set up a Hadoop cluster to store my image files, and I would like to replace my MySQL database with Cassandra, storing all user signup details and picture upload log data in Cassandra. My question now is: how do I upload images to Hadoop?

My requirements: 1) Whenever someone signs up, the application performs a CQL statement against Cassandra and stores all the data there, as it does with the MySQL database today (see the sketch below). 2) When a user uploads an image, it is stored directly on the Hadoop server instead of public_html, and the upload and user data are written to Cassandra.
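For requirement 1, the signup insert might look like this minimal sketch using the DataStax Python driver (the keyspace 'app', the users table, and its columns are hypothetical):

```python
# Minimal sketch: store signup details in Cassandra with CQL.
# The keyspace 'app' and the users table are hypothetical examples.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['cassandra-host'])
session = cluster.connect('app')

# The Python driver uses %s placeholders for bound parameters.
session.execute(
    "INSERT INTO users (user_id, email, signup_ts) "
    "VALUES (%s, %s, toTimestamp(now()))",
    (uuid.uuid4(), 'alice@example.com'))

cluster.shutdown()
```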

My question is: 1) Is it possible for my PHP application to upload images directly to the Hadoop cluster, or should it upload them to public_html first and then move them to Hadoop later in batches with a cron job? Which is the best practice?

2) What are the best tools and practices for storing images on Hadoop?


Re: Deploying hadoop cluster for my web application

Expert Contributor

Hadoop is typically designed not for random access but for analytical processing; the difference is latency. In your case, for example, you likely want to display images to the user within (milli)seconds, not minutes or hours. OLAP (online analytical processing) use cases typically scan a bunch of (large) resources to compute aggregations for data insights.

Nevertheless, there are a number of existing use cases similar to yours. In almost all of them, HBase (http://hortonworks.com/apache/hbase/) is used to store PDFs, images, or other binary files. HBase also gives you random access to the data, which is what a web application like yours needs.

HBase has a very simple PUT/GET API for storing and retrieving key-value data. Both keys and values are binary, which makes it very convenient for binary data such as images: http://www.tutorialspoint.com/hbase/hbase_create_data.htm
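To illustrate, here is a minimal sketch of a PUT and GET for an image using the happybase Python client (this assumes an HBase Thrift server is running and that a table named 'images' with column family 'cf' already exists; both names are hypothetical):

```python
# Minimal sketch: store and fetch an image in HBase via the Thrift gateway.
# Assumes a Thrift server on localhost:9090 and a table created with
# `create 'images', 'cf'` in the HBase shell.
import happybase

connection = happybase.Connection('localhost', port=9090)
table = connection.table('images')

# PUT: both the row key and the value are plain bytes.
with open('photo.jpg', 'rb') as f:
    table.put(b'user42/photo.jpg', {b'cf:data': f.read()})

# GET: random access by row key returns the stored bytes.
row = table.row(b'user42/photo.jpg')
image_bytes = row[b'cf:data']

connection.close()
```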

HBase works much like Cassandra, so you could consider doing what you planned for Cassandra with HBase instead. Phoenix (https://phoenix.apache.org/) is the SQL layer for HBase; it is similar to CQL and should help you get started quickly.
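As a rough sketch of that SQL layer, here is what talking to Phoenix from Python could look like via the phoenixdb client (this assumes a Phoenix Query Server on its default port 8765; the users table is hypothetical):

```python
# Minimal sketch: SQL over HBase through the Phoenix Query Server.
# Assumes the query server is reachable on its default port 8765.
import phoenixdb

conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
cursor = conn.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, email VARCHAR)")
# Phoenix uses UPSERT rather than INSERT.
cursor.execute("UPSERT INTO users VALUES (?, ?)", (42, 'alice@example.com'))

cursor.execute("SELECT id, email FROM users WHERE id = ?", (42,))
print(cursor.fetchone())

conn.close()
```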

Happy hadooping!

Re: Deploying hadoop cluster for my web application

@rock one How are you currently protecting the photo files in public_html from unauthorized access by non-owners? To do so, you must intermediate access to the files. I hope you are proxying them for your users, and not actually handing off raw URIs in a redirect.
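To make that concrete, a proxying endpoint might look like the following sketch (Flask is used only for illustration; check_owner and read_blob are hypothetical stand-ins for your own permission check and storage read):

```python
# Hypothetical sketch of proxied photo access: the application verifies
# ownership and streams the bytes itself instead of exposing a raw URI.
import io
from flask import Flask, abort, send_file, session

app = Flask(__name__)
app.secret_key = 'change-me'  # required for session support

def check_owner(user_id, photo_id):
    # Stand-in: replace with a lookup in your metadata store.
    return True

def read_blob(photo_id):
    # Stand-in: replace with a read from your chosen storage backend.
    return b''

@app.route('/photos/<photo_id>')
def serve_photo(photo_id):
    user_id = session.get('user_id')
    if user_id is None or not check_owner(user_id, photo_id):
        abort(403)  # non-owners never reach the file
    data = read_blob(photo_id)
    return send_file(io.BytesIO(data), mimetype='image/jpeg')
```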

As long as you are already doing that, you can use any storage system without changing your program logic much. Just make sure you have a proper abstraction layer for storage-system access; one possible shape is sketched below. Then you can store and read files with whatever client calls are appropriate to your storage system, and you can even migrate to different storage systems over time.
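A minimal sketch of such an abstraction layer might look like this (the interface and class names are hypothetical, not from any particular library):

```python
# Hypothetical storage abstraction: the application codes against BlobStore,
# so the backend (local disk, HDFS, S3, ...) can be swapped without
# touching the application logic.
import os
from abc import ABC, abstractmethod

class BlobStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalDiskStore(BlobStore):
    """Today's behavior: files in a directory such as public_html."""
    def __init__(self, root: str):
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        with open(os.path.join(self.root, key), 'wb') as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), 'rb') as f:
            return f.read()

# An HdfsStore or S3Store implementing the same two methods could be
# dropped in later without changing any calling code.
```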

With HDFS, it would probably be simplest for you to use the WebHDFS REST API for writing and reading the photo files (see the sketch below). But before committing to HDFS, you should probably also look at Amazon's S3 + Glacier, Microsoft's Azure Storage, or Google's Cloud Storage + Nearline Storage. All of them offer effectively unlimited storage for $10-25/terabyte/month, depending on redundancy level. It is unlikely you can match this cost (including operational costs, not just the price of cheap disks!) with an on-site storage system of comparable redundancy and availability, unless your data is only terabyte-scale or you work for a big company with an expert IT group. Of course, you should run the numbers yourself for your particular use case.
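For reference, writing and reading a file over WebHDFS with plain HTTP might look like this sketch (it assumes an unsecured cluster with the NameNode's HTTP port at 50070, the Hadoop 2.x default; a secured cluster needs authentication on top of this):

```python
# Minimal sketch: write and read a photo over the WebHDFS REST API.
# Assumes an unsecured cluster; 50070 is the Hadoop 2.x NameNode HTTP port.
import requests

NAMENODE = 'http://namenode:50070'
path = '/photos/user42/photo.jpg'

# Step 1: ask the NameNode where to write. It answers with a 307 redirect
# to a DataNode, so redirects must not be followed automatically.
r = requests.put(
    f'{NAMENODE}/webhdfs/v1{path}?op=CREATE&user.name=webuser&overwrite=true',
    allow_redirects=False)
datanode_url = r.headers['Location']

# Step 2: send the actual bytes to the DataNode.
with open('photo.jpg', 'rb') as f:
    requests.put(datanode_url, data=f.read()).raise_for_status()

# Reading: op=OPEN also redirects to a DataNode; requests follows it here.
resp = requests.get(f'{NAMENODE}/webhdfs/v1{path}?op=OPEN&user.name=webuser')
image_bytes = resp.content
```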

As the number of individual photos stored in your system grows past several million, you may indeed need a big database to store the metadata. This is a separate issue from whether the blob storage is in HDFS or not. Cassandra or HBase would work fine for this. But neither of them will intrinsically solve your permissions and security needs; you still need server logic to handle that, just as you do today.

Re: Deploying hadoop cluster for my web application

Rising Star

My question is: 1) Is it possible for my PHP application to upload images directly to the Hadoop cluster, or should it upload them to public_html first and then move them to Hadoop later in batches with a cron job? Which is the best practice?

You can put your images into HBase using Phoenix or the native API.

2) What are the best tools and practices for storing images on Hadoop?

For low-latency random access, HBase is the only option. For analytical use cases, you can use Hive or HDFS for storage.