Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9545 | 03-15-2017 03:17 PM |
| | 5977 | 02-17-2017 01:03 PM |
| | 1756 | 01-02-2017 10:47 PM |
| | 2642 | 11-16-2016 07:03 PM |
| | 1056 | 10-07-2016 05:24 PM |
10-24-2016
09:17 PM
Great job @Alena Melnikova! Nice work with the data and visualization. Really helpful - it confirms some longstanding assumptions I've had.
10-19-2016
01:31 PM
Hi @Laurent Edel - nice article! I do have a question: are there performance issues when using this method (HCatalog integration) to go from Sqoop directly to an ORC-format table? In other words, does Option A give equivalent (or better) performance compared to Option B?

Option A: Sqoop -> directly to an ORC-format table via HCatalog integration
Option B: Sqoop -> text files/external Hive table -> Hive CTAS/insert into an ORC-format table

I'd like to ensure the best possible Sqoop performance (a rough sketch of Option B's Hive-side steps is below). Thanks!
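For clarity, here is a minimal HiveQL sketch of the Hive side of Option B. The table names, columns, and HDFS path are placeholders for illustration only, not from the article:

-- Option B, step 2: external table over the text files Sqoop imported (placeholder names/path)
CREATE EXTERNAL TABLE staging_customers (
  id INT,
  name STRING,
  created_dt STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/sqoop/staging/customers';

-- Option B, step 3: CTAS into an ORC-format table
CREATE TABLE customers_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM staging_customers;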
10-07-2016
05:24 PM
3 Kudos
Hi @Sunile. I suspect this is not a firm number but a bang-for-the-buck recommendation - i.e. you will get the most substantial relative performance improvements with dataset sizes that can remain in-cache.
In Carter and Nita's recent blog post, they go into testing 10TB TPC-DS datasets - much larger than the aggregate cluster LLAP cache size. They wanted to see (direct quote) "if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets."
What they found was that (another quote) "If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1."
09-07-2016
02:12 PM
Hi @Sundar Lakshmanan. The input file must be using a comma as the field delimiter. If you can change the format of the input file to use a less common character such as a pipe or a tilde, that would fix the issue. With the lazy simple serde you can specify a different field delimiter as another serde property: 'field.delim' = '|' (a sketch is below). Otherwise you could use a Pig script to pre-process the data and remove the commas from that field, or use some Hive post-processing to collapse the two fields down into one in a new table.
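As a minimal sketch only (the table and column names here are illustrative, not from your data), the delimiter can be set like this:

-- placeholder table definition showing the field.delim serde property
CREATE TABLE my_data (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '|')
STORED AS TEXTFILE;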
09-06-2016
12:39 PM
2 Kudos
Hi @sanjeevan mahajan and @Constantin Stanca. There is a teeny tiny fix needed to Constantin's last query - the inner query needs to state "Group By emp_no" instead of "Group By s.emp_no". This snippet should work:

SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender, s.min_from_date
FROM employees e,
     (SELECT emp_no, min(from_date) as min_from_date
      FROM new2_salaries
      GROUP BY emp_no) s
WHERE s.emp_no = e.emp_no;

By the way - going way back to your original query, this snippet should work as well. Note the changes: double-quotes around the orc.compress parameters, ensuring that all selected (non-aggregated) columns are in the group by statement, and using the inner join syntax to explicitly call out your join and join clause. Personal opinion: it's cleaner to keep your join clauses and the where clauses separated... instead of having the join clauses stuffed into the where clause.

create table employees2_final
stored as ORC
tblproperties ("orc.compress"="SNAPPY")
AS
SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender,
       min(s.from_date) as `hire_date_new`
from new2_salaries s inner join employees e
  on s.emp_no = e.emp_no
GROUP BY e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender;
08-31-2016
01:25 PM
1 Kudo
Without seeing the input file it's hard to say for sure, but you may need to do a pre-processing step in Pig or a second 'create table as' step in order to reformat the data correctly. It's great to use csv-serde since it does such a good job of stripping out quoted text among other things, but you may need that extra processing in order to use csv-serde effectively (handling NULLs and double-quotes the way you want). A minimal example of a csv-serde table definition is sketched below.
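As a sketch only - this assumes the OpenCSVSerde bundled with Hive 0.14 and later, and the table name, columns, and location are placeholders - a csv-serde table definition typically looks like this:

-- placeholder table over the raw CSV file; OpenCSVSerde reads every column as STRING
CREATE EXTERNAL TABLE my_csv_data (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE
LOCATION '/user/hive/staging/my_csv_data';

Because OpenCSVSerde treats every column as STRING, the second 'create table as' step is often where you cast types and normalize NULLs.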
08-31-2016
10:38 AM
1 Kudo
Hi @Simran Kaur. This looks like a very complex set of data with a wildly varying structure: pipe characters, empty strings (""""), forward slashes, quoted and unquoted text, overloaded string fields delimited internally with pipes, etc. From what you state, it appears that this data came from a source (file or RDBMS) -> was loaded into HDFS -> then a Hive table structure was placed on the file using CSVSerde. Are you using comma as the delimiter for the table? If you could paste in the table DDL it would help.

My suspicion is that the data in the source data file is not correctly formatted, so it is causing the CSVSerde to place a weird value in that column. I suspect that [in the Hive table] the final column in question actually holds this 14-character literal string: NULL ",61.8""" Please validate whether that is true via a select statement (a sketch is below). Also please confirm that the source file does not have a <CR><LF> character after the NULL and before the double-quote.

In either case, the file may require some data cleansing, and it may make sense to use a different delimiter on the source file - perhaps a tilde "~".
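As a minimal sketch of that validation query (your_table and last_col are placeholder names, since I don't have your DDL):

-- placeholder names: inspect the suspect column and its length
SELECT last_col,
       length(last_col) AS col_length
FROM your_table
WHERE last_col LIKE 'NULL%'
LIMIT 10;

If the column really does hold that literal string, col_length should come back as 14.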
08-28-2016
02:22 PM
1 Kudo
That's the interesting part - it actually wasn't working correctly. It was just hard to see the issue without doing the corresponding from_unixtime call. When you were doing the UNIX_TIMESTAMP call it was using '00' as the month, since the format string was using 'minutes' instead of 'months'. For whatever reason, the UNIX_TIMESTAMP function returns a timestamp value even when the format string causes you to pass in invalid data like month = '00'. In my opinion it should fail instead of returning invalid data like that. Check out this query - the first 2 columns use the correct date format string, while the second 2 columns use the invalid date format string (minutes instead of months):

select
  from_unixtime(1440201632, 'yyyy-MM-dd HH:mm:ss') as `good_date1`,
  from_unixtime(1421884832, 'yyyy-MM-dd HH:mm:ss') as `good_date2`,
  from_unixtime(1440201632, 'yyyy-mm-dd HH:mm:ss') as `bad_date1`,
  from_unixtime(1421884832, 'yyyy-mm-dd HH:mm:ss') as `bad_date2`
from sample_07 limit 1;

And the results:

good_date1           good_date2           bad_date1            bad_date2
2015-08-22 00:00:32  2015-01-22 00:00:32  2015-00-22 00:00:32  2015-00-22 00:00:32

Notice that the bad dates have zeros in their month field instead of 8 and 1 respectively. Hope this helps.
08-26-2016
06:00 PM
Hi @Simran Kaur. I do not know of any issues where Hive will run a stream of queries out of order. If you paste in text or run a file, it should execute in order. If you have 10 queries in a row and the 7th one fails, the Hive execution ends immediately.

If you are looking to add more flow control, error trapping, etc. to your 15 queries, you may want to look into HPL/SQL. It is an add-on to HDP 2.4, but will be included in the upcoming HDP 2.5 release. It lets you use if/then/else, stored procedures, local variables, etc. in your HQL file. A rough sketch of what that looks like is below.

Check it out at http://www.hplsql.org/start
Syntax Reference: http://www.hplsql.org/doc
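For illustration only, here is a minimal HPL/SQL-style sketch of conditional flow control around a query. The table names src_table and target_table are placeholders, and the exact syntax should be checked against the HPL/SQL reference above:

-- placeholder example: run the load only if the source table has rows
DECLARE row_cnt INT;

SELECT COUNT(*) INTO row_cnt FROM src_table;

IF row_cnt > 0 THEN
  PRINT 'src_table has data - running the load';
  INSERT INTO TABLE target_table SELECT * FROM src_table;
ELSE
  PRINT 'src_table is empty - skipping the load';
END IF;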
08-26-2016
05:42 PM
6 Kudos
Hi @Sooraj Antony. Here is the issue - your format string is incorrect. Be careful - the format string is CaSe SenSiTive. 😉 Here is the documentation: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Here is an example - note the case sensitivity in the format:

select
  UNIX_TIMESTAMP('2015-08-22 00:00:32', 'yyyy-MM-dd HH:mm:ss'),
  from_unixtime(1440201632, 'yyyy-MM-dd HH:mm:ss'),
  from_unixtime(1421884832, 'yyyy-MM-dd HH:mm:ss')
from sample_07
limit 1;

Results:

col1        col2                 col3
1440201632  2015-08-22 00:00:32  2015-01-22 00:00:32

I hope this helps.