Member since: 10-22-2015
Posts: 5
Kudos Received: 1
Solutions: 0
04-03-2017 11:09 PM
Thanks zhoussen, I will get back to you with my experience. Also, are you familiar with any performance degradation that comes with the solution you have suggested?
03-30-2017 08:13 PM
I am looking for a solution that can let me transform many external JSON formats into my internal generic JSON and store it. I have worked with the Dozer mapper for mapping Java beans when I was working in Java EE. Are there any solutions that would let me do something similar for JSON, or something close?
Since this will run in Hadoop against large data sets, I also want to consider the performance overhead that this kind of transformation/mapping solution would add.
Some more details on where I want to go:
External JSON Type 1:
{
  "preview": false,
  "result": {
    "user_id": "1000000216",
    "service_name": "Sports Unlimited",
    "service_id": "74",
    "period_start": "2017-02-15 19:30:00",
    "period_end": "2017-02-15 20:00:00"
  }
}
External JSON Type 2:
{
  "User": {
    "user_id": "1000000216",
    "name": "test"
  },
  "Service": {
    "service_name": "Sports Unlimited",
    "service_id": "74",
    "service_start": "2017-02-15 19:30:00",
    "service_end": "2017-02-15 20:00:00"
  }
}
All of these types I want to map to, let's say, an internal type of the following format:
Generic Common Internal JSON:
{
  "user_id": "1000000216",
  "service_name": "Sports Unlimited",
  "service_id": "74",
  "start": "2017-02-15 19:30:00",
  "end": "2017-02-15 20:00:00"
}
My current data pipeline uses a Flume topology for ingestion, Spark for processing these JSONs, Hive, and so on. I want to build a unified transformation layer that can take care of these complex mappings and make the downstream processes independent of the external types. Any suggestions will be appreciated.
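Roughly, I imagine the mapping layer looking something like this with Spark's DataFrame API (just a sketch of the idea, not a final design; the input/output paths are placeholders, and I am assuming Spark's JSON schema inference handles the nested fields):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object JsonNormalizer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("json-normalizer").getOrCreate()

    // Type 1: everything we need is nested under "result"
    val type1 = spark.read.json("hdfs:///ingest/type1")      // placeholder path
      .select(
        col("result.user_id").as("user_id"),
        col("result.service_name").as("service_name"),
        col("result.service_id").as("service_id"),
        col("result.period_start").as("start"),
        col("result.period_end").as("end"))

    // Type 2: the same fields are split across "User" and "Service"
    val type2 = spark.read.json("hdfs:///ingest/type2")      // placeholder path
      .select(
        col("User.user_id").as("user_id"),
        col("Service.service_name").as("service_name"),
        col("Service.service_start").as("start"),
        col("Service.service_end").as("end"),
        col("Service.service_id").as("service_id"))
      .select("user_id", "service_name", "service_id", "start", "end")

    // Both sources now share the generic internal schema, so the
    // downstream processes only ever see one shape
    type1.union(type2)
      .write.mode("append").json("hdfs:///internal")         // placeholder path

    spark.stop()
  }
}

The idea would be one small select-and-rename per external type, so adding a new external source would not touch anything downstream.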
Labels:
- Apache Flume
- Apache Spark
03-30-2017 09:55 AM
Thanks for the update, Knut. In our environment we used Parquet files with Hive external tables over them, and the queries were issued from Tableau. I had a question: if you set spark.executor.cores = 1, will the overall ETL batch job not be slow? I mean, will it not lose the per-executor core concurrency? Also, if you are using Spark, you will see its benefit mostly on data sets that are visited often, not on one-visit data sets, right? All your updates will be greatly appreciated, as another team in my workplace is going to try Hive-on-Spark. Thanks, Sumit
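To make my concern concrete, here is the toy arithmetic behind it (all numbers made up):

// Cluster-wide task concurrency is executors * cores per executor, so
// spark.executor.cores = 1 only costs throughput if the executor count
// stays fixed. Hypothetical numbers:
val coresPerExecutor = 1           // spark.executor.cores = 1, as you set it
val executors        = 20          // hypothetical spark.executor.instances
val taskSlots        = executors * coresPerExecutor
println(s"Concurrent tasks: $taskSlots")  // 20, same as 5 executors x 4 cores

So I guess my real question is whether you compensated with more executor instances.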
03-29-2017 12:10 PM
1 Kudo
Hi Knut N, We had attempted to connect Tableau to our Cloudera cluster using Spark as the execution engine for Hive, and we faced similar problems. The Spark job/task would never finish, and to add some other observations, our Tableau queries would sometimes finish quickly, sometimes take a long time, and sometimes never finish. For these never-ending applications that kept using resources, I was using the yarn application -kill command to kill them. My experience was not impressive. On the other hand, when we switched back to the default MapReduce execution engine for Hive, our queries would always finish on time (like 30+ seconds), the results were returned to Tableau properly, and the ApplicationMaster and all the other slave (map/reduce) tasks finished successfully every time. Did you issue the SQL from a shell, or from outside the cluster? Since we were connecting via Tableau, I suspected Tableau might be part of the problem. Thanks for bringing this up; I would be curious to know more about this issue, as we are still planning to adopt Hive-on-Spark for our reporting purposes, but this experience has made us wary of it. Thanks, Sumit