Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3970 | 08-12-2016 01:02 PM
| 1868 | 08-08-2016 10:00 AM
| 2029 | 08-03-2016 04:44 PM
| 4300 | 08-03-2016 02:53 PM
| 1127 | 08-01-2016 02:38 PM
05-23-2016
12:46 PM
2 Kudos
Partitioning: Hive doesn't have range partitioning, and 2719 values is not too much (a couple thousand partitions work fine). However, you might have a bad distribution of values (does each smapiname have a similar number of rows?). If that doesn't work for you, you can do dirty tricks, i.e. hash the smapiname into a set of partitions: partition by (hash(smapiname) % 64), for example. However, in this case you need to add a where condition to your queries that does this computation again; see the sketch below.

"Given the 'INSERT INTO ORC_TABLE SELECT * FROM STAGING_TABLE', I didn't get the 'Loading data using distribution' section. Please be more specific. Couldn't find the optimize.sort.dynamic.partitioning setting, is it the same as ...?"

It's neither. I would not set it globally (since this affects all loads); rather, do it in the load script with a set param=value;.

"Is the 'DISTRIBUTE BY' clause relevant (even in my example) when it comes to querying the large tables? Isn't its role limited to the data loading (ORC table creation) phase? If not, how do I use it?"

It is relevant to loading because it defines the number of files (very much like bucketing). I used it because bucketing was buggy for me, and it gave me more control over the ingestion.

"'For PPD, use SORT BY' meaning we don't enforce a total order; also, this means a field like a date or some integer. In my case, I guess I need to use the snapshot_id (the latest entry will have the highest snapshot_id). Correct me if I am mistaken."

The good thing about sort by is that it sorts each file (or bucket), i.e. each reducer output. This is all you need for performance, since predicate pushdown only cares about sorting within one file. Try to visualize how the data is structured in each file, how many tasks will run in the end, and how many blocks of data they can skip because of the sorting. It helps to write it down.

So how would I do it in your case? Try to partition by the smapiname, or, if you have to, by a hash of it. If you partition by it directly, sorting doesn't help you much because you don't have a second where condition. If you partition by a hash, then sort by the name itself to at least skip some values. Make sure to use distribute by OR bucketing to create a couple of files (1-10 blocks is a good size), and distribute/bucket by something OTHER than the smapiname (in this case you get parallel execution) OR by the smapiname, in which case a single query will be slower but multiple queries in parallel would be faster. Tradeoff again.
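To make the hash trick concrete, here is a minimal sketch in HiveQL. The DDL, the table names and the 64-bucket count are illustrative assumptions, not the original poster's actual schema; only smapiname and snapshot_id come from the thread:

-- illustrative target table: the hash bucket is the partition column
CREATE TABLE orc_table (
  smapiname STRING,
  snapshot_id BIGINT
)
PARTITIONED BY (smapi_hash INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- compute the hash once at load time; DISTRIBUTE BY controls how rows
-- map to files, SORT BY keeps each file sorted for predicate pushdown
INSERT INTO TABLE orc_table PARTITION (smapi_hash)
SELECT smapiname, snapshot_id, pmod(hash(smapiname), 64) AS smapi_hash
FROM staging_table
DISTRIBUTE BY pmod(hash(smapiname), 64)
SORT BY smapiname;

-- every query must repeat the hash computation in the WHERE clause
SELECT * FROM orc_table
WHERE smapi_hash = pmod(hash('some_name'), 64)
  AND smapiname = 'some_name';

pmod is used instead of % so that negative hash values still land in a valid bucket.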
05-23-2016
11:50 AM
1 Kudo
I think you have different options. You need to run the server -> web API piece somewhere (Python with curl, some little Java program, whatever). You could run it on the source servers, but then all of them need to run the logic; or you can run it on the edge node, but then you potentially have a bottleneck (although I have normally not seen that, unless your data volumes are huge, assuming the data is properly compressed).

In the end there is no Hadoop "best practice"; it depends on your setup. I like to do all my data gathering and the push into Hadoop on a single node. It makes it easier to manage that piece of code, and often there is some kind of logic that needs to aggregate data from different source servers. In that case it would be an edge node with data. However, if you push directly from the source systems, you save one hop. So, a trade-off.

Another alternative is something like sshfs. This allows you to have an edge node without the additional storage or extra hop (i.e. you mount all relevant source systems as NFS or SCP filesystems).
05-23-2016
10:44 AM
3 Kudos
The best practice is to use a proper edge node and the Hadoop API. It will be significantly faster than WebHDFS (2x performance), and Knox will have some additional performance impact on top. However, if you have to use Knox because of your firewall/security settings, then you have to do it. I don't get the second option, though. What do you mean by using Knox as an edge node?
05-23-2016
10:36 AM
4 Kudos
"Probably, a silly question to ask at this stage - does the infrastructure seem enough?"

Hard to say. I had a not too dissimilar join: 25 nodes (more powerful than yours), 170b rows in one table, 7m rows in the other, a join on a key, some filters and aggregations, etc. Normally my queries were heavily filtered (on a single customer); those ran in seconds (everything heavily partitioned, however). I also did a full join to create aggregation tables, and that recreated everything in about an hour (but that is much more work, since it recreates all the ORCs). So it will most likely work; the question is how restrictive your where condition is and what your performance requirements are.

"Format and compression"

Stay with Zlib; it's normally better, and ORC is heavily optimized for it. I did a test with 170b rows joined to 50m rows once, and ORC/Zlib in general was the best (as fast as Snappy and much smaller).

"Indexes"

No. Rather, sort the data properly before inserting it (by your where condition). ORC has built-in indexes, so try sorting by smapiname_ver, for example. (And load the data into the tables with a good set of reducers so you get a couple of files, all sorted by the where condition; see the sketch below.)

"Bucketing (normal or 'List Bucketing')"

You can try that, but with 8m rows? It depends. If you can restrict the columns, you may be able to do a full map-side join, which would be best by far. With, say, 3-4 columns you would end up with 800MB, which would easily fit into memory; with 100 columns it would not work. In general I would first try it without, since the Hive optimizer is pretty good (use the ANALYZE keyword, including column statistics).

"Misc. params"

Not to toot my own horn, but I think my presentation has the most important pieces in there: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

"Should I consider this in the given simple query? If yes, what do I need to do?"

Cartesian joins (without the join condition) are the devil and should be avoided at all cost. But you don't have one; you have a join condition.

"The tables should be well sorted - does this mean that while inserting data into final tables, I need to use some 'order by/sort by' clause?"

I think my presentation covers that well. Have a look through it, and I am glad to answer any questions you might come up with.
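A minimal sketch of the sorted-load-plus-statistics idea, assuming illustrative orc_table/staging_table names and the smapiname_ver column mentioned above; the reducer count is an assumption to be tuned to your data volume:

-- a handful of reducers => a handful of output files, each sorted
SET mapred.reduce.tasks=8;

INSERT INTO TABLE orc_table
SELECT * FROM staging_table
DISTRIBUTE BY smapiname_ver
SORT BY smapiname_ver;

-- then give the optimizer table and column statistics
ANALYZE TABLE orc_table COMPUTE STATISTICS;
ANALYZE TABLE orc_table COMPUTE STATISTICS FOR COLUMNS;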
05-21-2016
04:56 PM
Ah, thanks.
05-20-2016
09:30 AM
1 Kudo
While Hive is perfect for analytical queries and is amazing for highly parallel workloads with lots of parallel queries, it is not yet as fast for small queries as traditional databases. You will not get queries faster than 2-3 seconds in total, even under perfect circumstances. This is due to the architecture. Rule of thumb:

- If Tez has to create a new session (application master), i.e. a query on a cold system, you can expect 10-15s of extra time. You can fix this by pre-creating sessions; however, that takes a bit of the cluster even when you don't need it.
- If Tez has to create task containers, you can expect 2-3s extra. Tez can reuse containers, and there is also prewarm to pre-create containers, but its tuning depends a lot on your use case. If you don't know what you are doing, you can make things worse.
- In general the HiveServer has a bit of overhead (around 1s) for plan compilation, communication with the metastore, etc.

So yes, at the moment you will not get faster than 2-3s, realistically 4-5s. If you need sub-second responses, look at Phoenix, for example. However, things will soon get better for these short queries:

- LLAP is already available as a tech preview (long-running processes that have an ORC data cache and remove the startup cost).
- Hive will get an HBase-backed metastore, which should speed up HiveServer2, and more.

In short, keep an eye on this space.
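For reference, a sketch of the knobs behind the pre-created sessions and container reuse mentioned above. The property names are the standard Hive/Tez ones, but the values are illustrative, and the hive.server2.* properties belong in hive-site.xml (server level) rather than a user session:

-- pre-create Tez application masters at HiveServer2 start
-- (avoids the 10-15s cold-session penalty)
SET hive.server2.tez.initialize.default.sessions=true;
SET hive.server2.tez.sessions.per.default.queue=2;

-- reuse task containers between queries (avoids the 2-3s container cost)
SET tez.am.container.reuse.enabled=true;

-- prewarm containers for a session; tune with care
SET hive.prewarm.enabled=true;
SET hive.prewarm.numcontainers=10;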
05-18-2016
07:06 PM
Yeah, I still don't get what you gain from it; you could just add a comment with the column names of the target table. I don't see any functional enhancement.

Regarding Eugene's link: I just tried it. Essentially the trick is that if your target table has more columns than your select and you want to set the remainder to NULL, you can now select a subset of the columns in the target table. Like:

insert into test30 (id) select * from test31;
insert into test34 (id_target) select id from test31;

In this case any additional column in the target table is NULL. However, I just tried it with partitioned tables and it doesn't seem to work for me; looks like a bug. You could open a support case if it doesn't work for you either. For your use case, where you have all columns in both tables, I really don't see the point, though.

insert into table test32(id) partition(dt=3) select * from test31;

NoViableAltException(192@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:41226)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:40421)
...
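If the partitioned variant really is a bug in your version, one hedged workaround (assuming test32 has a second, hypothetical column called name) is to pad the missing columns with explicit NULLs instead of relying on the target column list:

-- hypothetical workaround: enumerate all columns and pad with NULLs
INSERT INTO TABLE test32 PARTITION (dt=3)
SELECT id, CAST(NULL AS STRING) AS name FROM test31;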
05-18-2016
06:43 PM
1 Kudo
Haha, no worries. I still don't get the question. If you were missing some of the columns in the target table, Eugene's link would help, but you aren't: you have all of the columns of the target table, so I do not get what you are trying to achieve. Your query already works, no? If you need to change the order of columns, you can just specify them in a different order. So what exactly is better in your first query than in your second, apart from adding additional redundant information?
05-18-2016
06:07 PM
"But, now, im looking another way to insert data in target table(tb_h_teste_insert), where I show the fields\columns in the target table and the fields\columns corresponding of the source table. I didn't understand the question at all. What do you want to do exactly? Quizas mejor en espanol?
05-18-2016
05:27 PM
The columns and order just need to match. Not sure about the question. Let's assume:

Table A (daily data): col a, col b, col c, col dt
Table B (fact table with a subset of columns, reordered): col c, col a, partitioned by dt

So if you know the data is only for today, you can just select the columns you need, in the order you need them:

INSERT INTO B PARTITION (dt=1805) SELECT c, a FROM A;

Or, if you want dynamic partitioning, you need to add the partition column at the end:

INSERT INTO B PARTITION (dt) SELECT c, a, dt FROM A;
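One hedged addendum: the dynamic-partitioning variant only runs once the usual session settings are enabled (standard Hive properties; the values shown are what is typically required):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO B PARTITION (dt) SELECT c, a, dt FROM A;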