Support Questions

rendi_7936 · ‎04-20-2016

Hello everyone,

First, I want to apologize because my question may be off topic. But I need to ask it here because I can't find another forum that discusses Big Data and how to use it on your own (using the tutorials provided by Hortonworks).

My questions are:

1. Can I get data from Twitter into HDP?

Is it true that I can use Apache NiFi, or can I use other software such as Apache Kafka?

But how?

2. Once I have the data from Twitter, how do I process it?

Can I use Spark or Mahout?

In this case, I want to work out people's personalities from their Twitter data. Which is the best option?

3. Then, once I can calculate the results, how can I access them from an Android application?

I think I can use a web service, but I don't know how to build one with HDP.

Let me explain this more clearly:

a. A user wants to use my app, but the user must log in to Twitter before using it.

b. After that, I want to collect some Twitter data from the user.

c. The Twitter data is passed to HDP using web services.

d. HDP processes the data with a certain algorithm.

e. After that, HDP sends the result to the Android app.

f. Voila, the user can see his/her personality based on the Twitter data.

Can I implement this? Or am I overthinking how to use HDP?

4. Do you have any other ideas for my final college project?

Actually, my older lecturer uses HDP, but only as a single-node cluster (Sandbox). I want to make it multi-node, but I don't know how to get the most out of HDP and real Big Data.

I hope my questions don't confuse you. I am asking so many questions because I lack knowledge of how to use HDP and Big Data. I hope I can get an answer. Have a nice day :).

isoardi · ‎04-20-2016

Hi @Rendiyono Wahyu Saputro,

I'll answer point by point:

1. You can use whichever you prefer. I used NiFi in a similar project to get and model the data from Twitter.

2. If you use NiFi, you can process the data (the Twitter API returns JSON) and set attributes from the values of the JSON nodes. I suggest indexing the data in Solr and using Spark to query Solr. Afterwards, you can process the data in Spark to find, for example, the user with the most retweeted tweets.

3. On this page we used Storm to query and generate JSON, in order to minimize the traffic to SolrCloud (we don't have more CPU), but if you want, you can query Solr directly and return JSON to your app.

My idea is:

a - I register with your app

b - I enter my Twitter account

c - The app sends my personal information to NiFi so that it follows me (adding my username to the GetTwitter processor filter)

d - NiFi sends my tweets to Solr

e - Spark queries Solr for my username and processes my data

f - Spark sends the result to another Solr collection (coll_B)

g - The application requests the processed data from coll_B in Solr, in JSON format

All the components I mentioned run on HDP in cluster mode. The size of the cluster depends on several factors:

- How many people will use your app? Many requests to Solr require a lot of CPU and RAM. For example, to ingest an average of 800 tweets we have 3 NiFi workers with 3 GB of Xmx.

- How heavy is the data processing?

- Must it run in real time?

These and others are all things to assess before deploying a cluster.
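
As a rough illustration of step g, the app (or a small web service in front of it) can read the processed results from Solr over plain HTTP through the standard `/select` handler with `wt=json`. This is only a sketch: the host and port, the field name `user_s`, and the helper name `build_select_url` are assumptions made for the example, not part of the setup described above.

```python
from urllib.parse import urlencode

# Assumed Solr location; replace with the real host and port of your cluster.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, username, rows=10):
    """Return a Solr /select URL that asks for one user's documents as JSON."""
    # Escape backslashes and double quotes so the username stays one phrase.
    safe_name = username.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'user_s:"{safe_name}"',  # user_s is a hypothetical field name
        "rows": rows,                  # maximum number of documents returned
        "wt": "json",                  # ask Solr for a JSON response
    }
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL the app would request for the results collection.
    print(build_select_url("coll_B", "rendi_7936"))
```

The Android app would issue an HTTP GET to this URL and parse the `response.docs` array in the JSON that Solr returns.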

rendi_7936 · ‎04-21-2016

@Davide Isoardi Thanks for replying to my question. But I have another question.

When an app uses a web service, the sequence looks like this:

Android --[JSON Data]--> Web Service --[SQL Query]--> Database.

In this case, to process the data, we only use the web service, which sends an SQL query to the database.

From your answer, this is what I understand:

Android --[Twitter Data]--> Nifi --> Solr --> Spark --> Another Solr Collection --[JSON Format]--> Android

In this case, to process the data we need many applications passing data along: from Android to NiFi, from NiFi to Solr, and so on.

Can Solr receive data from NiFi and then send it to Spark? I ask because I have never learned how to communicate (send/receive data) from one application to another in HDP automatically, as in your idea.

I'm sorry if my question confuses you 😉

isoardi · ‎04-21-2016

Hi,

NiFi can communicate with Solr via PutSolrContentStream (documentation). If you want, NiFi also has a GetSolr processor.

Spark gets/puts information from/to Solr via API (documentation).
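
To give an idea of what "putting" data into Solr looks like at the HTTP level, here is a minimal sketch that builds a JSON update request for a collection. It is not the Spark connector referred to above; it only shows the underlying Solr `/update` call. The host and port, the field names, and the helper name `build_update_request` are assumptions for illustration.

```python
import json
from urllib.request import Request

# Assumed Solr location; replace with the real host and port of your cluster.
SOLR_BASE = "http://localhost:8983/solr"


def build_update_request(collection, docs):
    """Build (but do not send) a POST that adds documents to a Solr collection."""
    # commit=true makes the new documents searchable right after the update.
    url = f"{SOLR_BASE}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")  # Solr accepts a JSON array of docs
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical result document; the field names are made up for the example.
    request = build_update_request("coll_B", [{"id": "rendi_7936", "retweets_i": 12}])
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would actually send it to a running Solr instance.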

I have never tried it (I promised myself I would), but if you want to create a streaming process from NiFi to Spark you can try, perhaps starting with this one.

rendi_7936 · ‎04-21-2016

Hmmmm, so the key to this is Apache NiFi. In NiFi, we can build a sequence that gets the data, processes it, and saves it, according to the processors we choose.

Previously, I could load "aloha.csv" and create a table from it manually (by clicking Files View and Hive View). So with NiFi, I can load "aloha.csv" and create a table automatically using a processor. Is that correct?

Well, maybe I'll try it. I'm sorry if I ask too much, but I need to clarify this, because if I can't implement it (my final college project), I will have to step back and look for another idea.

isoardi · ‎04-21-2016

That is not entirely true:

With NiFi you can get "aloha.csv" from different sources (web, local file, HDFS, DB, Twitter, ...), enrich the data (for example, by merging it with "byebye.csv") or modify it, and save it (to Hive in different ways, to HDFS, to a local file, ...).

If you want to see the personality of a Twitter user, you cannot use NiFi to calculate it. For that you can use Spark.
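
The calculation step is separate from ingestion. As a stand-in for a real personality model, the sketch below computes the simpler example mentioned earlier in the thread: the user with the most retweeted tweets. It is plain Python rather than Spark code, and the record layout (`user`, `retweet_count`) and the function name are assumptions; the same group-and-sum logic would map onto a Spark aggregation.

```python
from collections import defaultdict


def most_retweeted_user(tweets):
    """Return (user, total_retweets) for the user with the highest total."""
    totals = defaultdict(int)
    for tweet in tweets:
        # Sum the retweet counts of every tweet per user.
        totals[tweet["user"]] += tweet["retweet_count"]
    if not totals:
        return None  # no tweets, so no winner
    # On a tie, the user seen first in the input is returned.
    return max(totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        {"user": "alice", "retweet_count": 5},
        {"user": "bob", "retweet_count": 3},
        {"user": "alice", "retweet_count": 7},
    ]
    print(most_retweeted_user(sample))
```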

rendi_7936 · ‎08-16-2016

@Davide Isoardi Good afternoon, sir. May I ask you something? Is http://www.ecubecenter.it/Hadoop-Big-Data-Twitter-Map/ built using https://github.com/disoardi/tweetsdemo_with_ambari_views?

It has been a while since I lost my motivation to work on my undergraduate project. Maybe I will start again from this. Thanks in advance.

dorio · ‎04-20-2016

Hi @Rendiyono Wahyu Saputro,

I'll add just one last thing to what @Davide Isoardi wrote: we're converting our demo to run inside the Hortonworks Sandbox and we'll push everything to GitHub. I'll ping you when we finish the job: it could be a good starting point for building your own app. Meanwhile, we're also planning a text analysis algorithm that analyzes tweets to determine their relevance to what we're searching for.

Stay tuned.

rendi_7936 · ‎04-21-2016

@Andrea D'Orio Thanks for your support 😉

Cloudera Community

Can i get twitter data to HDP, process it and show the result in Android Application using web services?