Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3667 | 08-03-2016 04:44 PM |
| | 7210 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-21-2016
09:56 AM
2 Kudos
That was interesting, I learned a bit about Hive today. Yes, it's possible:
0: jdbc:hive2://sandbox:10000/default> select * from test9;
+-----------+--------------------------+--+
| test9.id | test9.name |
+-----------+--------------------------+--+
| 1 | ["ben","klaus","klaus"] |
| 2 | ["ben","klaus","klaus"] |
+-----------+--------------------------+--+
You can explode the arrays in a lateral view and then use collect_list with a distinct to merge them again:
0: jdbc:hive2://sandbox:10000/default> select id, collect_list(distinct(flatname))
from (select id,flatname from test9 lateral view explode(name) mytable as flatname) g group by id ;
+-----+------------------+--+
| id | _c1 |
+-----+------------------+--+
| 1 | ["ben","klaus"] |
| 2 | ["ben","klaus"] |
+-----+------------------+--+
You can also use a UDF, which might be faster. Brickhouse provides a set of really cool UDFs which I have used before; CombineUnique sounds like what you want: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CombineUniqueUDAF.java
04-20-2016
10:57 PM
I mean, I see the benefit of parallel computation, but you essentially pipe the full dataset over the network. If you can ascertain data correctness using a set of test queries on both clusters ( perhaps, to be really safe, using something like an aggregated MD5 hash as part of that ), you save yourself from having to do that. I think all of the approaches have advantages:

a) Copying everything over to one cluster and comparing the datasets locally: this assumes that the copy process is safe, but afterwards you can use the full power of Hadoop to join and compare the datasets locally.

b) Your approach of pointing one cluster at two HDFS instances: the computation happens in parallel, and if you have a fast network connection between the nodes it might be a good approach.

c) My approach of running a set of test queries on both HiveServers ( a mix of simple aggregations, some representative business queries, and perhaps a hash function ) and comparing the small results client-side: you can do the computations on both clusters locally without transmitting much out of the cluster network, and you also test the equivalence of Hive on both systems ( we had a question just today about Hive giving wrong results because of statistics differences ). However, you cannot do a full comparison of the datasets.

Choose your poison, I suppose. I still think my approach makes the most sense if you essentially want to compare two Hive warehouse instances that should be identical, and it requires minimal changes to the cluster setup. However, you obviously cannot do more in-depth record-level comparisons. If you have big datasets and you absolutely have to compare every single record, you need to go with a) or b).
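A minimal sketch of the aggregated-hash idea from option c), in plain Python: hash each row of a query result and combine the hashes order-independently, so two clusters can be compared by exchanging a single small value instead of the full dataset. The hardcoded row lists here are stand-ins for real query output; this is an illustration of the technique, not part of any Hive tooling.

```python
import hashlib

def aggregated_hash(rows):
    # Order-independent digest: hash each row, then sum the hash
    # integers modulo 2**128. Two result sets produce the same digest
    # iff they contain the same rows (with overwhelming probability).
    total = 0
    for row in rows:
        h = hashlib.md5("|".join(map(str, row)).encode("utf-8")).hexdigest()
        total = (total + int(h, 16)) % (1 << 128)
    return total

# Same data in a different order yields the same digest:
cluster_a = [(1, "ben"), (2, "klaus")]
cluster_b = [(2, "klaus"), (1, "ben")]
print(aggregated_hash(cluster_a) == aggregated_hash(cluster_b))  # True
```

In practice you would compute such a digest inside each cluster ( e.g. with Hive's hash functions plus a SUM aggregation ) and only compare the final numbers client-side.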
04-20-2016
10:01 PM
2 Kudos
That is one of the things that is more natural in Storm. I think your only chance is to set a pretty low base frequency and then either check for the time/trigger event yourself ( in code that gets executed, like a mapPartitions ) or use a trigger input ( for example a Kafka topic with control commands ) and join it with your main data stream. The first approach would be, in pseudocode:

inputStream.mapPartitions {
  String command = <load trigger from database, HBase, whatever ...>
  <transform your data flow based on command>
}
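The first approach can be sketched in plain Python, outside of Spark: a per-partition function fetches the current control command once per batch and transforms the records accordingly. The `control_store` dict here is a hypothetical stand-in for an external store like HBase or a database; in real Spark code this body would run inside `mapPartitions`.

```python
# Stand-in for an external control store ( HBase, a database, a Kafka
# control topic ... ) that an operator updates at runtime.
control_store = {"command": "uppercase"}

def process_partition(records):
    # Executed once per partition per micro-batch, like mapPartitions:
    # fetch the current command once, then apply it to every record.
    command = control_store["command"]
    if command == "uppercase":
        return [r.upper() for r in records]
    return list(records)

print(process_partition(["ben", "klaus"]))  # ['BEN', 'KLAUS']
```

The point of doing the lookup inside the partition function is that it happens once per partition per batch, so a changed command takes effect on the next micro-batch without restarting the streaming job.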
04-20-2016
06:57 PM
I think only Alex knows the requirements. If you want to do a sanity check that the main data propagation worked correctly, I would rather run a set of aggregation queries against Hive on both clusters and compare the results. Otherwise you essentially tell him "to check if distcp worked, do distcp and compare", which is a bit of a catch-22. Also, for big amounts of data this is not efficient. It is better to run a good mixed set of test queries that cover all the main columns etc. This way you also catch problems in Hive.
04-20-2016
05:24 PM
2 Kudos
The way to do it is to run a set of queries against both tables.

Test suite: this should include a set of basic queries ( counts and min, max, sum, avg of each numeric column, perhaps count distinct for string fields ). You normally also have a couple of more complex business queries in the mix. You could do a SELECT *, but in a big data environment you most likely have too much data for that.

Executing the queries: you can use beeline ( in contrast to what was said above, you CAN connect to a remote cluster with it; if your cluster is kerberized, your client would need to be part of both realms though ). But please use PAM/LDAP authentication for HiveServer2, it makes life so much easier. If you use beeline, you can output the data as CSV to make life simpler: https://community.hortonworks.com/questions/25789/how-to-dump-the-output-from-beeline.html You can also use other tools like sqlline or write a custom Java client.

Comparison tool: you then need to compare the outputs with each other. You can do that in Linux using diff; however, in that case you need the data sorted, either by the query using an ORDER BY or with any Linux sort tool. You can also write a little Java program that does everything in one go. The below does this and much more, but you could perhaps use it as inspiration: the HiveConnection and HiveHelper classes. But be warned, it's ugly. https://github.com/benleon/HivePerformance/
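The comparison-tool step can be sketched in a few lines of Python: load both CSV exports, sort them ( so row order from the two clusters does not matter ), and report rows that appear on only one side. The inline CSV strings are hypothetical stand-ins for beeline output files.

```python
import csv
import io

def load_sorted(csv_text):
    # Parse CSV text into a sorted list of row tuples, so differing
    # output order from the two clusters does not cause false alarms.
    return sorted(tuple(row) for row in csv.reader(io.StringIO(csv_text)))

def compare(csv_a, csv_b):
    # Return (rows only in A, rows only in B); both empty means a match.
    a, b = load_sorted(csv_a), load_sorted(csv_b)
    set_a, set_b = set(a), set(b)
    only_a = [r for r in a if r not in set_b]
    only_b = [r for r in b if r not in set_a]
    return only_a, only_b

cluster_a = "1,ben\n2,klaus\n"
cluster_b = "2,klaus\n1,ben\n3,extra\n"
only_a, only_b = compare(cluster_a, cluster_b)
print(only_a)  # []
print(only_b)  # [('3', 'extra')]
```

For genuinely big result sets you would sort with external tools and use diff as described above; this in-memory version only suits the small aggregated results the test suite produces.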
04-18-2016
09:14 PM
1 Kudo
I am actually pretty sure that most backups will still work. Sure, all the RPMs etc. will be different, but let's go through it one by one:

a) HDFS data: should really not depend on the OS, unless the data is switched from big to little endian or something.

b) Databases ( Ambari, Hive, Oozie, ... ): should also not depend on the OS. It depends on the database obviously, but if you do an export/import you should be fine. Simply copying the files over might be a different matter. You would also need to change the hostnames inside the backups; for Hive that is a single location, for the others it could be more complicated, unless you migrate the hostnames 1:1.

c) Configs: I think the easiest way here would be blueprints ( i.e. export one and set up the new cluster with it ), OR install clean and apply settings carefully, which might be safer if any modifications are needed.

d) Timeline store, Spark history, etc.: most likely not needed to keep.

But yeah, it might be safer to set up the new cluster and distcp the data over instead of copying the namenode/datanode folders. However, I really don't think the OS should affect them. ( Never did it though, fair warning. ) My tip would be to try it on a sandbox ( install a single-node SUSE, make a table, an Oozie job, and a couple of files, and then migrate everything ).
04-18-2016
08:59 PM
Will try it out, thanks. I normally only need the curl commands when something goes wrong, but it would be nice to investigate the API a bit. Much easier in a shell.
04-18-2016
08:22 PM
How cool is the Ambari Shell! Could we please bundle that out of the box?
04-18-2016
07:32 PM
Also add the DISTRIBUTE BY clause as I wrote below; otherwise each reducer will write to 1173 partitions, which guarantees OOM exceptions ( ORC keeps some memory for every open writer ). Also, I really have no idea why he distributes by the second column ( _col1 ); he shouldn't add a reducer to a simple SELECT * FROM. Which column is that, DT? Then everything is OK.
04-18-2016
07:31 PM
mapred.reduce.tasks ( like I wrote ) or mapreduce.job.reduces ( Laurent had a small error there )