12-12-2014 02:26 AM
There is something I cannot manage to understand:
- Should we use the graphical interface (Hue) for development? How would we version that work?
- Should we use Maven and the JDBC connector for development?
- Should we write shell scripts for development?
Are there any Cloudera best practices for all this?
12-12-2014 01:04 PM - edited 12-12-2014 01:04 PM
I get the sense that you are thinking of Hadoop as if it were an "application development framework" -- it's not that.
What is your use case/objective, exactly?
12-12-2014 09:51 PM
My use case and the chosen services are:
1) taking some data from different producers/sources
* Flume => configuration scripts
=> custom source (S3) => Java
There is no interface, so we are constrained to write scripts (?)
2) applying different ETL on the data
2.1) standardize the data (different treatment per source)
2.2) enrich the data with the help of other sources
2.3) transform data (JSON) into Parquet
This part is considered batch today (no source is transferring all of its data in real time).
We do the transformation to Parquet at the end because, until then, we may need to return the data to other applications as JSON.
1? => use the graphical interface? If yes, how should we work on it as a team? Naming conventions?
visualisation OK / debugging OK
2? => use Eclipse/Maven (like in this example https://github.com/cloudera/cdh-twitter-example) => Spring on Hadoop?
=> versioning OK / debugging OK
=> the architecture of the project can be seen in the project structure + Oozie
3? => write .hql scripts, since most of the Cloudera examples show only the HiveQL part of Hive.
=> debugging? / the global view of the project is provided by Oozie
3) deliver the data to QlikView or other applications.
- make KPI calculations
- return the data in tables
* Oozie => scheduling all of this.
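To make steps 2.1 and 2.2 concrete, here is a minimal Python sketch of a per-source standardization followed by an enrichment. The source names, field mappings, and lookup table are all hypothetical, and the Parquet conversion of step 2.3 is deliberately left out, so this shows only the shape of the batch logic, not how it would run on the cluster.

```python
import json

# Hypothetical per-source field mappings: each source names its fields differently.
FIELD_MAPS = {
    "source_a": {"id": "user_id", "ts": "event_time"},
    "source_b": {"userId": "user_id", "timestamp": "event_time"},
}

# Hypothetical reference data used for enrichment (step 2.2).
USER_COUNTRY = {"u1": "FR", "u2": "US"}


def standardize(raw_line, source):
    """Step 2.1: parse one JSON line and rename its fields to a common schema."""
    record = json.loads(raw_line)
    mapping = FIELD_MAPS[source]
    # Fields without a mapping keep their original name.
    return {mapping.get(key, key): value for key, value in record.items()}


def enrich(record):
    """Step 2.2: add a country looked up from the reference data."""
    enriched = dict(record)
    enriched["country"] = USER_COUNTRY.get(record.get("user_id"), "unknown")
    return enriched


def run_batch(lines, source):
    """Standardize, then enrich, every line of one source's batch."""
    return [enrich(standardize(line, source)) for line in lines]


if __name__ == "__main__":
    batch = run_batch(['{"id": "u1", "ts": "2014-12-12T02:26:00"}'], "source_a")
    print(json.dumps(batch))
```

Keeping the records as plain JSON-serializable dictionaries until the last step matches the reason given above for converting to Parquet only at the end: other applications can still be served JSON in the meantime.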
12-16-2014 05:39 AM
From the first comment I understood that we shouldn't try to use the web interface, but I still have two choices with Cloudera:
- use Eclipse/Maven (like in this example https://github.com/cloudera/cdh-twitter-example)
- use scripts (HQL scripts, shell, ...)
What are Cloudera's best practices for development?
12-16-2014 12:52 PM
I'm not from Cloudera, so I can't comment on what they would say is best practice, but I can say that we've had success using HQL/shell scripts for development work in Apache Hive.
We still run ad-hoc queries, and we tend to use Impala for those. For ad-hoc work there is less need to keep a version-controlled copy of the script, while our production process does use version-controlled HQL scripts.
I hope this helps.
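As an illustration of the version-controlled HQL script approach, here is a minimal sketch of a runner that executes every .hql file in a checked-out directory in file-name order. It assumes the Hive CLI is on the PATH and relies only on its `-f` and `--hivevar` options; the directory layout, numbering scheme, and variable names are hypothetical and not taken from our actual setup.

```python
import subprocess
from pathlib import Path


def build_hive_command(script, variables=None):
    """Build the argument list for running one HQL file with the Hive CLI."""
    command = ["hive", "-f", str(script)]
    for name, value in sorted((variables or {}).items()):
        # --hivevar passes a substitution variable to the script.
        command += ["--hivevar", f"{name}={value}"]
    return command


def plan_run(script_dir, variables=None):
    """Return one command per .hql file, sorted by file name (01_..., 02_...)."""
    scripts = sorted(Path(script_dir).glob("*.hql"))
    return [build_hive_command(script, variables) for script in scripts]


def run_all(script_dir, variables=None):
    """Run every script in order and stop at the first failure."""
    for command in plan_run(script_dir, variables):
        # check=True raises CalledProcessError if hive exits with a non-zero status.
        subprocess.run(command, check=True)
```

Separating `plan_run` from `run_all` lets you inspect or log the exact commands before anything touches the cluster, which is handy when the same scripts are later scheduled from Oozie or a shell wrapper.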
12-18-2014 10:44 AM
This is a very broad subject, so I don't know if I can point you anywhere in particular, aside from recommending a close reading of our documentation.
Also, please continue to use these forums as a resource for any additional questions.