Member since
12-09-2015
34
Posts
12
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
3349 | 09-26-2016 06:28 PM | |
1618 | 12-11-2015 02:58 PM | |
2267 | 12-11-2015 02:50 PM |
12-19-2016
09:29 PM
WORKING WORKAROUND: Set the options in your JDBC URL. So your URL goes from this... Old: jdbc:hive2://hadoop.company.com:8443/default;transportMode=http;httpPath=gateway/default/hive
New: jdbc:hive2://hadoop.company.com:8443/default;transportMode=http;httpPath=gateway/default/hive?hive.execution.engine=tez;tez.queue.name=di;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled
... View more
12-19-2016
09:26 PM
Hive-Hcatalog 1.2.1
... View more
12-19-2016
09:14 PM
Using a set works fine, correct. However, that requires that I do that for every script I try to automate. My goal is to make these values a default part of this base command, and just change what SQL I throw at it. Global values, if you will. This should be remedied by the use of --hiveconf or --property-file, but neither results in it taking the values. Here's an example: Command: beeline -n $user -p $password -u "$jdbc_url" -f $script --verbose true --property-file "query.properties" --fastConnect
query.properties hive.execution.engine=tez;
tez.queue.name=di;
hive.exec.parallel=true;
The job shows up in the 'default' queue when it should show up in the 'di' queue. When I use a set command, it does in fact show up in the 'di' queue.
... View more
12-19-2016
08:57 PM
1 Kudo
Need to set hive configuration options from Beeline CLI command. The below code, for example, doesn't work. --hiveconf hive.execution.engine=tez --hiveconf and --property-file both don't appear to be making the session take the settings. Here it is in full usage: beeline -n $user -p $password -u "$jdbc_url" -f $script --verbose true --property-file "query.properties" --fastConnect Is the only way to make it work to include the options in the JDBC URL as arguments? Like hive.execution.engine=tez;next;next ?
... View more
Labels:
- Labels:
-
Apache Hive
09-26-2016
06:28 PM
I've resolved the issue by using the --boundary-query option in Sqoop, in tandem with my use of --query and --split-by. Will post answer shortly. Edit: Posted answer at top of my question. Here's the code again for convenience: sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR" --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"
... View more
09-26-2016
05:09 PM
I've attempted that, but it still feeds out the same error as above. Am trying now to use JDBC level connection parameters as below, but it doesn't seem to be passing it to the splitted sub queries. Initial connection is with UR, but subsequent subqueries are not. --connection-param-file /just/some/path/sqoop.properties
jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED
... View more
09-26-2016
02:51 PM
EDIT (SOLUTION): Problem: Can't include the "with ur" statement DB2 requires to keep a table from locking. i.e. a "dirty read." Solution: Make use of the --boundary-query option in Sqoop. I modified my Sqoop import statement to leverage the --boundary-query option in Sqoop, which allows me to build the boundary query for determining the splits for the mappers. Typically, since I'm using the --split-by and --query options, this is generated automatically. By adding the new option at the end of my Sqoop statement, and adjusting both it and my standard query to include the desired "with ur", I can achieve a Sqoop job whose multiple mappers fire off non-locking select queries, with ranges of data, with the "with ur" statement. Working code: sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default
--connect jdbc:db2://mybox.mycompany.com:1234/mydb2db
--username myuser
--password mypassword
--query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR"
--hcatalog-database myHiveDB
--hcatalog-table myHiveTable
--split-by column1
--fields-terminated-by \001
-m 5
--relaxed-isolation
--fetch-size=100000
--boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR" * This Sqoop command is being dynamically generated from a bash script. If you experience problems adopting the above code, try the following: - put the whole sqoop command on one line - use \$CONDITIONS not $CONDITIONS (if this is coming from a text file, not hardcoded, you don't need the escape. If you're typing it directly in the sqoop command, you do - ensure the information following both --query and --boundary-query are in double quotes as above --- ORIGINAL QUESTION: I have this Sqoop command that nearly works... up until I get a syntax exception from DB2: Sqoop Command: sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 AND $CONDITIONS with UR --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 It works great up to this point. Compares the target data with the hcat/hive table, all checks out. Then it tries to do the splits with the bounding values, and flips. Error (and log): 16/09/26 10:32:12 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(column1), MAX(column1) FROM (SELECT
column1,
column2,
column3,
current timestamp as load_time,
CHAR(current date, iso) as load_dt
FROM my_database.my_table
where column1 <= 1000 AND (1 = 1) with UR) AS t1
Caused by: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-199, SQLSTATE=42601, SQLERRMC=WITH;AND OR HAVING GROUP INTERSECT ORDER ) FETCH EXCEPT MINUS UNION, DRIVER=4.16.53 I believe it has to do with DB2 needing the "with UR" phrase last, but Sqoop always tacks on the AS t1, causing it to break. Any suggestions to fixing this? We need to do non-locking reads on a DB2 table.
... View more
Labels:
- Labels:
-
Apache HCatalog
-
Apache Hive
-
Apache Sqoop
01-28-2016
07:00 PM
2 Kudos
What's the horton-recommended way to get Hive data in Java code?
Thinking: hcat api jdbc query HDFS call We're just trying to get Hive data into a data structure in java, like a Collection, and operate on it. Typical Map/Reduce stuff. Specifically this would be used in a Crunch/Cascading setup. Thanks!
... View more
Labels:
- Labels:
-
Apache HCatalog
-
Apache Hive
01-19-2016
03:34 PM
Thanks! I gave that a try after posting this, and have updated my question with my findings accordingly. Thanks for answering!
... View more
01-18-2016
03:22 PM
We have Falcon jobs that use oozie workflows (workflow.xml) in HDFS. However, we've made some changes to the Oozie workflow, specifically argument values for actions, but don't see them reflected in Falcon. I notice that the workflow.xml is present in the /apps/falcon/clusterName/staging area for the Falcon process, but it's the older version. How can I get Falcon to refresh/rebuild this area and incorporate the new workflow it points to in HDFS? Will Falcon -update or -repair do it? Edit: Resolved! Falcon -update does the trick. falcon entity -type process -suspend -name hdp0106x-my-process
falcon entity -type process -update -name hdp0106x-my-process That provided this output: falcon/update/default/Updated successfully(process) hdp0106x-my-process/Effective Time: 2016-01-20T09:30Z. Old workflow id: 0001498-151208005457707-oozie-oozi-C. New workflow id: 0007639-151208005457707-oozie-oozi-B Explanation: In a recent project, we experienced a minor issue with a Falcon job not adapting future instances to changes we made to its Oozie workflow. In short, we made an enhancement, but Falcon didn’t recognize the change simply by modifying the Oozie workflow. Cause: When you submit & schedule a set of entities to Falcon, it
creates a home for it in /apps/falcon/clusterName/staging/falcon/workflows/process/ places the workflow.xml you specified in the process.xml in that home (including a lib folder if you have one) Uses that workflow.xml for future generated instances The only issue is when you want to make changes to that Oozie workflow — Falcon doesn’t re-ingest this workflow to its home at any frequency. It only does this at submit/schedule time. So if you make changes to an Oozie workflow, you need to tell Falcon that it has changed (because it technically runs copies of the workflow and /lib folder you give it, not the HDFS versions themselves). However... the above solution (using falcon update) does the trick!
... View more
Labels:
- Labels:
-
Apache Falcon