Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9545 | 03-15-2017 03:17 PM |
| | 5977 | 02-17-2017 01:03 PM |
| | 1756 | 01-02-2017 10:47 PM |
| | 2642 | 11-16-2016 07:03 PM |
| | 1056 | 10-07-2016 05:24 PM |
10-24-2016
09:17 PM
Great job @Alena Melnikova! Nice work with the data and visualization. Really helpful - it confirms some longstanding assumptions I've had.
10-19-2016
01:31 PM
Hi @Laurent Edel - nice article! I do have a question: are there performance issues when using this method (HCatalog integration) to go from Sqoop directly to an ORC-format table? In other words, does Option A give equivalent (or better) performance compared to Option B?

Option A: Sqoop -> directly to an ORC-format table via HCatalog integration
Option B: Sqoop -> text files/external Hive table -> Hive CTAS/insert into an ORC-format table

I'd like to ensure the best possible Sqoop performance (a rough sketch of Option B's Hive-side steps is below). Thanks!
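For clarity, here is a minimal HiveQL sketch of the Hive side of Option B. The table names, columns, and HDFS path are placeholders for illustration only, not from the article:

-- Option B, step 2: external table over the text files Sqoop imported (placeholder names/path)
CREATE EXTERNAL TABLE staging_customers (
  id INT,
  name STRING,
  created_dt STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/sqoop/staging/customers';

-- Option B, step 3: CTAS into an ORC-format table
CREATE TABLE customers_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM staging_customers;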
10-07-2016
05:24 PM
3 Kudos
Hi @Sunile. I suspect this is not a firm number but a bang-for-the-buck recommendation - i.e. you will get the most substantial relative performance improvements with dataset sizes that can remain in-cache.
In Carter and Nita's recent blog post, they go into testing 10TB TPC-DS datasets - much larger than the aggregate cluster LLAP cache size. They wanted to see (direct quote) "if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets."
What they found was that (another quote) "If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1."
09-07-2016
02:12 PM
Hi @Sundar Lakshmanan. The input file must be using a comma as the field delimiter. If you can change the format of the input file to use a less common character such as a pipe or a tilde, that would fix the issue. With the lazy simple serde you can specify a different field delimiter as another serde property: 'field.delim' = '|' (a sketch is below). Otherwise you could use a Pig script to pre-process the data and remove the commas from that field, or use some Hive post-processing to collapse the two fields down into one in a new table.
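As a minimal sketch only (the table and column names here are illustrative, not from your data), the delimiter can be set like this:

-- placeholder table definition showing the field.delim serde property
CREATE TABLE my_data (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '|')
STORED AS TEXTFILE;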
09-06-2016
12:39 PM
2 Kudos
Hi @sanjeevan mahajan and @Constantin Stanca. There is a teeny tiny fix needed to Constantin's last query - the inner query needs to state "Group By emp_no" instead of "Group By s.emp_no". This snippet should work:

SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender, s.min_from_date
FROM employees e,
     (SELECT emp_no, min(from_date) as min_from_date
      FROM new2_salaries
      GROUP BY emp_no) s
WHERE s.emp_no = e.emp_no;

By the way - going way back to your original query, this snippet should work as well. Note the changes: double-quotes around the orc.compress parameters, ensuring that all selected (non-aggregated) columns are in the group by statement, and using the inner join syntax to explicitly call out your join and join clause. Personal opinion: it's cleaner to keep your join clauses and the where clauses separated... instead of having the join clauses stuffed into the where clause.

create table employees2_final
stored as ORC
tblproperties ("orc.compress"="SNAPPY")
AS
SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender,
       min(s.from_date) as `hire_date_new`
from new2_salaries s inner join employees e
  on s.emp_no = e.emp_no
GROUP BY e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender;
08-31-2016
01:25 PM
1 Kudo
Without seeing the input file it's hard to say for sure, but you may need to do a pre-processing step in Pig or a second 'create table as' step in order to reformat the data correctly. It's great to use csv-serde since it does such a good job of stripping out quoted text among other things, but you may need that extra processing in order to use csv-serde effectively (handling NULLs and double-quotes the way you want). A minimal example of a csv-serde table definition is sketched below.
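As a sketch only - this assumes the OpenCSVSerde bundled with Hive 0.14 and later, and the table name, columns, and location are placeholders - a csv-serde table definition typically looks like this:

-- placeholder table over the raw CSV file; OpenCSVSerde reads every column as STRING
CREATE EXTERNAL TABLE my_csv_data (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE
LOCATION '/user/hive/staging/my_csv_data';

Because OpenCSVSerde treats every column as STRING, the second 'create table as' step is often where you cast types and normalize NULLs.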
08-31-2016
10:38 AM
1 Kudo
Hi @Simran Kaur. This looks like a very complex set of data with a wildly varying structure: pipe characters, empty strings (""""), forward slashes, quoted and unquoted text, overloaded string fields delimited internally with pipes, etc. From what you state, it appears that this data came from a source (file or RDBMS) -> was loaded into HDFS -> then a Hive table structure was placed on the file using CSVSerde. Are you using comma as the delimiter for the table? If you could paste in the table DDL it would help.

My suspicion is that the data in the source data file is not correctly formatted, so it is causing the CSVSerde to place a weird value in that column. I suspect that [in the Hive table] the final column in question actually holds this 14-character literal string: NULL ",61.8""" Please validate whether that is true via a select statement (a sketch is below). Also please confirm that the source file does not have a <CR><LF> character after the NULL and before the double-quote.

In either case, the file may require some data cleansing, and it may make sense to use a different delimiter on the source file - perhaps a tilde "~".
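As a minimal sketch of that validation query (your_table and last_col are placeholder names, since I don't have your DDL):

-- placeholder names: inspect the suspect column and its length
SELECT last_col,
       length(last_col) AS col_length
FROM your_table
WHERE last_col LIKE 'NULL%'
LIMIT 10;

If the column really does hold that literal string, col_length should come back as 14.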
08-28-2016
02:22 PM
1 Kudo
That's the interesting part - it actually wasn't working correctly. It was just hard to see the issue without doing the corresponding from_unixtime call. When you were doing the UNIX_TIMESTAMP call it was using '00' as the month, since the format string was using 'minutes' instead of 'months'. For whatever reason, the UNIX_TIMESTAMP function returns a timestamp value even when the format string causes you to pass in invalid data like month = '00'. In my opinion it should fail instead of returning invalid data like that. Check out this query - the first 2 columns use the correct date format string, while the second 2 columns use the invalid date format string (minutes instead of months):

select
  from_unixtime(1440201632, 'yyyy-MM-dd HH:mm:ss') as `good_date1`,
  from_unixtime(1421884832, 'yyyy-MM-dd HH:mm:ss') as `good_date2`,
  from_unixtime(1440201632, 'yyyy-mm-dd HH:mm:ss') as `bad_date1`,
  from_unixtime(1421884832, 'yyyy-mm-dd HH:mm:ss') as `bad_date2`
from sample_07 limit 1;

And the results:

good_date1           good_date2           bad_date1            bad_date2
2015-08-22 00:00:32  2015-01-22 00:00:32  2015-00-22 00:00:32  2015-00-22 00:00:32

Notice that the bad dates have zeros in their month field instead of 8 and 1 respectively. Hope this helps.
08-26-2016
06:00 PM
Hi @Simran Kaur. I do not know of any issues where Hive will run a stream of queries out of order. If you paste in text or run a file, it should execute in order. If you have 10 queries in a row and the 7th one fails, the Hive execution ends immediately.

If you are looking to add more flow control, error trapping, etc. to your 15 queries, you may want to look into HPL/SQL. It is an add-on to HDP 2.4, but will be included in the upcoming HDP 2.5 release. It lets you use if/then/else, stored procedures, local variables, etc. in your HQL file. A rough sketch of what that looks like is below.

Check it out at http://www.hplsql.org/start
Syntax Reference: http://www.hplsql.org/doc
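For illustration only, here is a minimal HPL/SQL-style sketch of conditional flow control around a query. The table names src_table and target_table are placeholders, and the exact syntax should be checked against the HPL/SQL reference above:

-- placeholder example: run the load only if the source table has rows
DECLARE row_cnt INT;

SELECT COUNT(*) INTO row_cnt FROM src_table;

IF row_cnt > 0 THEN
  PRINT 'src_table has data - running the load';
  INSERT INTO TABLE target_table SELECT * FROM src_table;
ELSE
  PRINT 'src_table is empty - skipping the load';
END IF;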
08-26-2016
05:42 PM
6 Kudos
Hi @Sooraj Antony. Here is the issue - your format string is incorrect. Be careful - the format string is CaSe SenSiTive. 😉 Here is the documentation: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Here is an example - note the case sensitivity in the format:

select
  UNIX_TIMESTAMP('2015-08-22 00:00:32', 'yyyy-MM-dd HH:mm:ss'),
  from_unixtime(1440201632, 'yyyy-MM-dd HH:mm:ss'),
  from_unixtime(1421884832, 'yyyy-MM-dd HH:mm:ss')
from sample_07
limit 1;

Results:

col1        col2                 col3
1440201632  2015-08-22 00:00:32  2015-01-22 00:00:32

I hope this helps.