Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3936 | 03-15-2017 03:17 PM |
 | 2545 | 02-17-2017 01:03 PM |
 | 271 | 01-02-2017 10:47 PM |
 | 284 | 11-16-2016 07:03 PM |
 | 212 | 10-07-2016 05:24 PM |
01-04-2017
07:23 PM
Hi @Bala Vignesh N V. Lester Martin has an excellent Pig script for this type of work. It is not an external table solution, but it is a good way to accomplish the task... https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=21299205
01-04-2017
03:02 PM
Hi @Kiran Kumar. You make very good points. Virtual/cloud environments make Points #1 and #2 easy to implement. What is important is that organizations plan out and address all of the pain points that a single environment entails. To put it even more simply - make a conscious decision rather than just driving blindly. Understand the pros and cons, and make a well-informed decision involving all teams/stakeholders. Process is also key; having a change management culture is crucial. If you were to create a single environment as you describe in your answer, it would accomplish many if not all of the separation requirements that multiple environments provide. Regarding your Point #4 answer in particular - even on the same cluster you would effectively be creating multiple environments.
01-03-2017
12:03 PM
1 Kudo
Hi @Kaliyug Antagonist. This is a bit of a philosophical question. The issue above occurs because the IN statement appears to do an implicit conversion of 0552094 to a numeric datatype, and as a result the IN statement does not find the row for 0552094. This implicit conversion is not what you want the IN statement to do. By explicitly quoting the numeric value, we do not allow the IN statement to do an implicit conversion. IMHO I would recommend that you *never* allow implicit conversions to take place - whether in the RDBMS world (SQL Server, Oracle) or in Hive. By always quoting string/date values (or using the CAST function to ensure the correct datatype) you will get the correct/optimal results and you will never be affected by an implicit conversion. In the RDBMS world there are plenty of good discussions about avoiding implicit conversions. RDBMSs do a much more thorough job of handling conversions, but even they are far from perfect when doing implicit conversions. For an example of this, see the green/yellow/red chart of allowable conversions in this blog post by Jonathan Kehayias... https://www.sqlskills.com/blogs/jonathan/implicit-conversions-that-cause-index-scans/
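For illustration, here is a minimal sketch of the two explicit alternatives (the table and column names here are hypothetical):
-- if equipmentnumber is a STRING column, quote the literals so no implicit conversion happens:
select * from equipment where equipmentnumber in ('0552094', '1857547');
-- if equipmentnumber really is an INT column, make the conversion explicit with CAST instead of relying on implicit rules:
select * from equipment where equipmentnumber in (cast('0552094' as int), cast('1857547' as int));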
01-02-2017
10:47 PM
5 Kudos
Hi @Kaliyug Antagonist. I couldn't help but notice that the value in question has a leading zero. Check the datatype used for the column called equipmentNumber and adjust the IN clause accordingly. It looks like equipmentNumber is defined as a string, and thus you will have to quote the values inside the IN clause. Here are some examples and results.
1. Using INT as the datatype for equipmentNumber:
create table if not exists fact_rtc_se
(equipmentnumber int, dimensionid string, datemessage date) ;
insert into fact_rtc_se values (0552094, 36081, '2016-02-29') ;
insert into fact_rtc_se values (0552094, 18943, '2016-02-29') ;
insert into fact_rtc_se values (1857547, 27956, '2016-01-08') ;
insert into fact_rtc_se values (1857547, 749597, '2016-01-15') ;
select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01';
Results:
552094 36081 2016-02-29
552094 18943 2016-02-29
1857547 27956 2016-01-08
1857547 749597 2016-01-15
select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01';
Results:
1857547 27956 2016-01-08
1857547 749597 2016-01-15
2. Using STRING as the datatype for equipmentNumber (drop the table from example 1 first, since the create statement uses "if not exists"):
create table if not exists fact_rtc_se
(equipmentnumber string, dimensionid string, datemessage date) ;
insert into fact_rtc_se values ('0552094', 36081, '2016-02-29') ;
insert into fact_rtc_se values ('0552094', 18943, '2016-02-29') ;
insert into fact_rtc_se values ('1857547', 27956, '2016-01-08') ;
insert into fact_rtc_se values ('1857547', 749597, '2016-01-15') ;
select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01';
Results:
1857547 27956 2016-01-08
1857547 749597 2016-01-15
select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01';
Results:
0552094 36081 2016-02-29
0552094 18943 2016-02-29
1857547 27956 2016-01-08
1857547 749597 2016-01-15
I hope this helps.
12-23-2016
09:49 AM
Hi @Rama. I believe that Qlik caches data. You will want to check that the Qlik cache is cleared - or that your Qlik results are refreshed/rebuilt. I'm not sure of the specific way to do that... but ensure that your Qlik data is "fresh."
12-19-2016
12:22 AM
Thanks for the clarification @Vijayandra Mahadik. You are correct... It was implemented for Hive 0.14. My comments were out of date. I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760
12-09-2016
07:50 PM
Hi @Ryan Templeton. Some suggestions (all loosely described). I don't want to seem trivial with some of these, but other people may end up with the same issue in the future. These may all fail with the same error, but they may be worth a try. The examples below assume that the data is partitioned by YYYY/MM/DD, which makes it easier to write a sample query. I am also assuming that all the input TSV files have a similar number of lines - which may make it easier to find the outlier file(s).
1. Do rowcounts work on the table (without even trying the "insert into" code)? I assume you've tried a rowcount selecting by the partition clause and you get the same error? If it does work, is there one YYYY/MM/DD result that has zero (or far fewer) rows?
select YR, MNTH, DY, count(*)
from <tbl>
group by YR, MNTH, DY;
2. If #1 doesn't work, have you tried running the same query limited to a specific year? If a year works, then the files for that YYYY are golden. If a year fails, then go down to YYYY/MM and YYYY/MM/DD and successively eliminate the good partitions/files until you get to the bad file. (This all depends on partition elimination working for the query.)
select YR, count(*)
from <tbl>
where YR = '2012'   -- then try 2013, 2014, 2015, 2016
group by YR;
3. You could try running the #1 query using the virtual column INPUT__FILE__NAME.
select INPUT__FILE__NAME, count(*)
from <tbl>
group by INPUT__FILE__NAME;
4. Write a Pig script that counts the number of newlines in the files. The invalid file might be the one returning zero rows.
I hope this helps.
11-20-2016
04:42 AM
Hi @Sagar Shimpi. I am attaching a screenshot, but the error does not appear in the screenshot - it is in the syslog file for the pig script job. Ravi Gangappa was looking into it this afternoon, but we have not been able to find the problem yet. screen-shot-2016-11-19-at-103336-pm.png I will send you the text of our HipChat conversation via email.
11-18-2016
11:53 PM
1 Kudo
I installed the Ambari Pig View, but I am not able to run jobs via the view. The Pig view was installed on Azure IaaS with Ambari 2.4.1, and the Pig view is a 'Local Cluster' configuration. It errors out without any stdout or stderr information displaying on the Pig view screen (stdout and stderr are available in the job logs). In the job logs I found the following message:
2016-11-18 14:11:05,464 FATAL [IPC Server handler 2 on 37186]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
attempt_1479494676475_0003_m_000000_0 - exited : java.io.IOException: Cannot
run program "pig.tar.gz/pig/bin/pig": error=13, Permission denied at
java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at
org.apache.hive.hcatalog.templeton.tool.TrivialExecService.run(TrivialExecService.java:66) at
org.apache.hive.hcatalog.templeton.tool.LaunchMapper.startJob(LaunchMapper.java:177) at
org.apache.hive.hcatalog.templeton.tool.LaunchMapper.run(LaunchMapper.java:410) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) at
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)Caused by: java.io.IOException: error=13, Permission denied at
java.lang.UNIXProcess.forkAndExec(Native Method) at
java.lang.UNIXProcess.<init>(UNIXProcess.java:247) at
java.lang.ProcessImpl.start(ProcessImpl.java:134) at
java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 10 more
I have not had this error before on previous installs on different clusters.
Background work performed for the Ambari view setup:
- Pig jobs run fine from the grunt shell and from `pig -f`. No problems there.
- Ensured all proxyuser/group setup and user hdfs directory work was done according to the instructions in the Hortonworks docs: http://docs.hortonworks.com/HDPDocuments/Ambari-2.4.1.0/bk_ambari-views/content/ch_using_pig_view.html
- Confirmed the WebHCat server is running.
Additional work performed so far (kindly assisted by Ravi Gangappa):
- Double-checked the proxyusers in the Ambari hdfs and hive configs
- Confirmed that this should not be a port issue (opened up all ports in Azure)
- Gave wide-open permissions to HDFS /hdp/apps/2.5.0.0-1245/pig/pig.tar.gz
Thanks in advance for your assistance.
11-18-2016
12:11 PM
Yikes! I have never seen that happen before, but I certainly have no reason to doubt you. Does it happen with hive.execution.engine=tez? If you could grab and sanitize your query/config details and post them as a Hive bug in Jira, it would be greatly appreciated... we don't want that problem to bite anyone else.
11-16-2016
10:42 PM
Edited the post to fix syntax error. Now it runs just fine. 😉
11-16-2016
07:03 PM
Hi @Zack Riesland. Here's how I normally do this. It's not specifically a subquery but accomplishes what you're looking for.
insert into table daily_counts
select count(*), 'table_a' from table_a
UNION
select count(*), 'table_b' from table_b
UNION
select count(*), 'table_c' from table_c
...
;
11-08-2016
09:56 PM
Hi @vamsi valiveti. The example above uses the exact same source file in the exact same location for both external tables. Both the test_csv_serde_using_CSV_Serde_reader and test_csv_serde tables read the external file(s) stored in the directory called '/user/<uname>/elt/test_csvserde/'. The file I used was pipe delimited and contains 62,000,000 rows - so I didn't attach it. 😉 It would look like Option 2 above, but of course with 4 columns:
121|Hello World|4567|34345
232|Text|5678|78678
343|More Text|6789|342134
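For reference, here is a minimal sketch of what such an external table definition could look like for a pipe-delimited file. The column names, the HDFS path layout, and the choice of Hive's built-in OpenCSVSerde are my assumptions - the original example may have used a different CSV serde class:
create external table test_csv_serde_sketch (
  id string,
  description string,
  val1 string,
  val2 string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ('separatorChar' = '|')
stored as textfile
location '/user/<uname>/elt/test_csvserde/';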
11-03-2016
09:52 PM
You are correct. The answer is 'Yes' to both of your questions. You can load data into the tables by adding a file to the directory (as long as the file has the correct schema), by using LOAD DATA INPATH, or by using the standard INSERT INTO <tablename> SELECT * FROM <othertablename> syntax - just like any other Hive table.
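As a quick sketch of those three options (the table and path names here are hypothetical):
-- 1. copy a correctly-formatted file into the table's HDFS directory (outside of Hive):
--    hdfs dfs -put newdata.txt /user/<uname>/elt/my_external_table/
-- 2. load an existing HDFS file into the table:
load data inpath '/user/<uname>/staging/newdata.txt' into table my_external_table;
-- 3. insert rows selected from another table:
insert into table my_external_table select * from other_table;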
11-02-2016
07:09 PM
In that example if you do not specify a location for tables t2 and t4 - those 2 external tables would be created in /apps/hive/warehouse and then the next folder would be the database you were using (perhaps default) and then the table name. So they would be found in: /apps/hive/warehouse/default.db/t2
/apps/hive/warehouse/default.db/t4/b=1
/apps/hive/warehouse/default.db/t4/b=2 ... (where b is the partition column)
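If you want to confirm where a given table actually landed, DESCRIBE FORMATTED will show its Location (t2 and t4 being the table names from the example above):
describe formatted t2;   -- the Location field shows the HDFS directory backing the table
describe formatted t4;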
10-31-2016
10:27 AM
Jean-Philippe is correct - you can place internal and external tables in any location you wish. But IMHO it is very wise to maintain the default convention: keep your internal (managed) tables in the /apps/hive/warehouse location, and keep your external tables away from the /apps/hive/warehouse location. You do not want to have to guess whether the underlying data will or will not be deleted when you drop a table. By keeping the internal and external locations separated... you will know.
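A minimal sketch of why that separation matters (the table names and external path are hypothetical):
-- managed (internal) table: data lives under /apps/hive/warehouse by default
create table managed_t (id int) stored as orc;
-- external table: data lives wherever LOCATION points
create external table external_t (id int)
row format delimited fields terminated by '|'
location '/data/landing/external_t';
drop table managed_t;    -- removes the metadata AND the data files under the warehouse
drop table external_t;   -- removes the metadata only; the files in /data/landing/external_t remain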
10-24-2016
09:17 PM
Great job @Alena Melnikova! Nice work with the data and visualization. Really helpful, confirms some longstanding assumptions I've had.
10-19-2016
01:31 PM
Hi @Laurent Edel - nice article! I do have a question: are there performance issues when using this method (HCatalog integration) to go from Sqoop directly to ORC format? In other words, does Option A (Sqoop -> directly to an ORC format table via HCatalog integration) give equivalent or better performance compared to Option B (Sqoop -> text files/external Hive table -> Hive CTAS/INSERT into an ORC format table)? I would like to ensure the best possible Sqoop performance. Thanks!
10-07-2016
05:24 PM
3 Kudos
Hi @Sunile. I suspect this is not a firm number but a bang-for-the-buck recommendation - i.e. you will get the most substantial relative performance improvements with dataset sizes that can remain in-cache.
In Carter and Nita's recent blog post, they go into testing 10TB TPC-DS datasets - much larger than the aggregate cluster LLAP cache size. They wanted to see (direct quote) "if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets."
What they found was that (another quote) "If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1."
09-24-2016
12:12 PM
1 Kudo
Hi @Vivek S. The issue you are having with your query is that Hive does not support non-equi-joins. So to accomplish your query you need to move the non-equi-join clause out of the INNER JOIN section, and do it as part of the WHERE clause instead. So here is the correct query:
SELECT c.original_item_id, c.bought_with_item_id, count(*) as times_bought_together
FROM
(SELECT a.item_id as original_item_id, b.item_id as bought_with_item_id
FROM items a
INNER join items b
ON a.transaction_id = b.transaction_id
WHERE a.item_id in('B','C')
AND a.item_id != b.item_id
) c
GROUP BY c.original_item_id, c.bought_with_item_id;
which gives the expected results:
+---------------------+------------------------+------------------------+--+
| c.original_item_id | c.bought_with_item_id | times_bought_together |
+---------------------+------------------------+------------------------+--+
| B | A | 2 |
| B | C | 2 |
| C | A | 1 |
| C | B | 2 |
+---------------------+------------------------+------------------------+--+
4 rows selected (1.48 seconds)
Hope this helps.
09-09-2016
11:34 AM
6 Kudos
Hi @Tim David. This is as much a philosophical argument as it is a logical one. I know I won't be the only one to drop an answer, but I'll get it started. 😉 Also, this answer is not meant to be preachy - many people will read it, so it is aimed at an audience with more of a beginner IT/Hadoop background.
1. Upgrades. At a minimum, you'll need a Dev environment to test upgrades to your cluster. This could even be in the cloud. You'll confirm that the upgrade works and that it does not cause problems with existing code, and you can confirm the steps/scripts required to perform the upgrade successfully in production.
2. Tuning. YARN, Hive and Tez in particular have a baffling array of "levers and switches" that you can change to affect performance: container sizes, pre-warming, statistics, file types (ORC). Changing those settings in production can improve one workload but hurt others. Dev/QA environments let you experiment with local and global settings prior to rolling them out in production.
3. Isolation of data/security. Having one environment means that people are testing against production data... or co-mingling dev data with production data. There are often datasets that contain data the development team should never see - think PII (personally identifiable info) or PHI (personal health info). Having a dev environment means that masked/minimized data is in the dev cluster, where developers can test away to their hearts' content without the security team worrying about failing a security audit (which can have terrible regulatory/financial consequences). It is also easier to pass an audit when you KNOW that the developers have zero access to production. If you only have one environment, you have all kinds of people granted access to that one production cluster, and you have to prove that person X can only see data in section Y.
4. Human nature - "We judge others by their actions, but judge ourselves by our intentions." Change management often feels like an unnecessary roadblock or speed bump, but it's there for a reason. Small changes in one place can cause a ripple effect in others that cannot be seen until after unit tests are complete and system tests take place. Having dev/QA/prod environments in place and a change management process in effect greatly minimizes the effects of these unintended problems.
09-07-2016
02:12 PM
Hi @Sundar Lakshmanan. It sounds like the input file uses a comma as the field delimiter. If you can change the format of the input file to use a less common character such as a pipe or a tilde, that would fix the issue. With the Lazy Simple SerDe you can specify a different field delimiter as another serdeproperty: 'field.delim' = '|'. Otherwise you could use a Pig script to pre-process the data and remove the comma from that field, or use some Hive post-processing to collapse the 2 fields down into one in a new table.
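Here is a minimal sketch (hypothetical table/column names and location) of an external table declared with the Lazy Simple SerDe and a pipe field delimiter:
create external table pipe_delimited_data (
  id int,
  full_description string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ('field.delim' = '|')
stored as textfile
location '/user/<uname>/elt/pipe_delimited_data/';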
09-06-2016
06:53 PM
Also note that support for a JOIN clause in the update statement is not available. See this HCC article for more information and a link to the Hive JIRA requesting support for a MERGE statement. https://community.hortonworks.com/questions/11652/how-to-update-hive-row-with-join.html
09-06-2016
12:39 PM
2 Kudos
Hi @sanjeevan mahajan and @Constantin Stanca. There is a teeny tiny fix needed to Constantin's last query - the inner query needs to state "GROUP BY emp_no" instead of "GROUP BY s.emp_no". This snippet should work:
SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender, s.min_from_date
FROM employees e,
(SELECT emp_no, min(from_date) as min_from_date
FROM new2_salaries
GROUP BY emp_no) s
WHERE s.emp_no = e.emp_no;
By the way - going way back to your original query, this snippet should work as well. Note the changes: using double-quotes around the orc.compress parameters, ensuring that all selected (non-aggregated) columns are in the GROUP BY statement, and using the inner join syntax to explicitly call out your join and join clause. Personal opinion: it's cleaner to keep your join clauses and your where clauses separated... instead of having the join clauses stuffed into the where clause.
create table employees2_final
stored as ORC
tblproperties ("orc.compress"="SNAPPY")
AS
SELECT e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender,
min(s.from_date) as `hire_date_new`
from new2_salaries s inner join employees e
on s.emp_no = e.emp_no
GROUP BY e.emp_no, e.birth_date, e.first_name, e.last_name, e.gender;
09-02-2016
06:52 PM
That's really strange. I'm just trying to think of other things that could be causing the issue:
- Do you have the userID and password in your provider string (@provstr) in the sp_addlinkedserver call in SQL Server?
- Are there permissions on the linked server object, or on the SQL Server database, that need to be set?
- Could you try selecting from a different table in a different database on Hive?
- Are there any messages showing up in the error logs in Hive or SQL Server?
- Could there be a version issue between the ODBC driver and the HDP version?
- Do you have the correct 32-bit or 64-bit version of the ODBC driver?
I don't have any other ideas beyond that. Linked server queries are tough to debug since there are so many places where they can fail, and cross-system linked server calls like this are even tougher. I hope this helps.
09-01-2016
10:50 AM
This post is also good. Refer to the section titled "SQL Server Linked to a Hadoop Server - Setup": http://www.codeproject.com/Articles/1067605/Hadoop-For-Net-AngularJS-Developers
09-01-2016
10:44 AM
Hello @Abhijeet Barbate. In your ODBC driver I assume your connection to HiveServer2 is using a username and password. Have you confirmed that the same username/password can connect to Hive via beeline and can execute the select statement & return results? This blog post has been useful for our project in the past - it's a bit old but is well-written. It does a good job of documenting the linked server work and permissions required to get the linked server calls to work... https://blog.oraylis.de/2015/02/querying-hadoop-from-sql-server/
08-31-2016
01:25 PM
1 Kudo
Without seeing the input file you may need to do a pre-processing step in pig or do a second 'create table as' step in order to reformat the data correctly. It's great to use csv-serde since it does such a good job of stripping out quoted text among other things, but you may need that extra processing in order to use csv-serde effectively (handling NULLs and double-quotes the way you want it to).
08-31-2016
10:38 AM
1 Kudo
Hi @Simran Kaur. This looks like a very complex set of data with a wildly varying structure: pipe characters, empty strings (""""), forward slashes, quoted and unquoted text, overloaded string fields delimited internally with pipes, etc. From what you state, it appears that this data came from a source (file or RDBMS) -> was loaded into HDFS -> then a Hive table structure was placed on the file using the CSVSerde. Are you using comma as the delimiter for the table? If you could paste in the table DDL it would help. My suspicion is that the data in the source file is not correctly formatted, so it is causing the CSVSerde to produce a weird value in that column. I suspect that [in the Hive table] the final column in question actually holds this 14-character literal string: NULL ",61.8""" Please validate whether that is true via a select statement. Also please confirm that the source file does not have a <CR><LF> character after the NULL and before the double-quote. In either case the file may require some data cleansing, and it may make sense to use a different delimiter on the source file - perhaps a tilde "~".
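As a sketch of that validation (the table and column names here are hypothetical - substitute your own):
-- check what the final column actually contains, and how long the value is
select final_col, length(final_col)
from my_csv_table
where final_col like 'NULL%'
limit 10;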
08-28-2016
02:22 PM
1 Kudo
That's the interesting part - it actually wasn't working correctly. It was just hard to see the issue without doing the corresponding from_unixtime call. When you were doing the UNIX_TIMESTAMP call it was using '00' as the month, since the format string was using 'minutes' instead of 'months'. For whatever reason, the UNIX_TIMESTAMP function returns a timestamp value even when the format string causes you to pass in invalid data like month = '00'. In my opinion it should fail instead of returning invalid data like that. Check out this query - the first 2 columns use the correct date format string, while the second 2 columns use the invalid date format string (minutes instead of months):
select
from_unixtime(1440201632, 'yyyy-MM-dd HH:mm:ss') as `good_date1`,
from_unixtime(1421884832, 'yyyy-MM-dd HH:mm:ss') as `good_date2`,
from_unixtime(1440201632, 'yyyy-mm-dd HH:mm:ss') as `bad_date1`,
from_unixtime(1421884832, 'yyyy-mm-dd HH:mm:ss') as `bad_date2`
from sample_07 limit 1;
and the results:
good_date1           good_date2           bad_date1            bad_date2
2015-08-22 00:00:32  2015-01-22 00:00:32  2015-00-22 00:00:32  2015-00-22 00:00:32
Notice that the bad dates have zeros in their month field instead of 8 and 1 respectively. Hope this helps.