Member since: 12-09-2015
Posts: 34
Kudos Received: 12
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1945 | 09-26-2016 06:28 PM
 | 753 | 12-11-2015 02:58 PM
 | 1300 | 12-11-2015 02:50 PM
12-19-2016
09:29 PM
WORKING WORKAROUND: Set the options in your JDBC URL, so your URL goes from this:

Old: jdbc:hive2://hadoop.company.com:8443/default;transportMode=http;httpPath=gateway/default/hive

to this:

New: jdbc:hive2://hadoop.company.com:8443/default;transportMode=http;httpPath=gateway/default/hive?hive.execution.engine=tez;tez.queue.name=di;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true
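For anyone scripting this, here's a minimal sketch of how that new URL could be dropped into the Beeline call used elsewhere in this thread (the credential and script variables are placeholders, not from the original post):

# Hypothetical wrapper script; host, port, and credentials are placeholders.
jdbc_url='jdbc:hive2://hadoop.company.com:8443/default;transportMode=http;httpPath=gateway/default/hive?hive.execution.engine=tez;tez.queue.name=di;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true'

# Every script run through this URL should pick up the session settings without per-script set commands.
beeline -n "$user" -p "$password" -u "$jdbc_url" -f "$script"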
12-19-2016
09:26 PM
Hive-Hcatalog 1.2.1
12-19-2016
09:14 PM
Using a set works fine, correct. However, that requires that I do it for every script I try to automate. My goal is to make these values a default part of this base command and just change which SQL I throw at it. Global values, if you will. This should be possible with --hiveconf or --property-file, but neither results in the session taking the values. Here's an example:

Command:
beeline -n $user -p $password -u "$jdbc_url" -f $script --verbose true --property-file "query.properties" --fastConnect

query.properties:
hive.execution.engine=tez;
tez.queue.name=di;
hive.exec.parallel=true;

The job shows up in the 'default' queue when it should show up in the 'di' queue. When I use a set command, it does in fact show up in the 'di' queue.
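One other way we could get such "global" defaults without editing every script would be Beeline's init-file option. This is just a sketch assuming your Beeline version supports -i; the file name below is made up:

# init_settings.sql is a hypothetical shared init file holding the default settings
cat > init_settings.sql <<'EOF'
set hive.execution.engine=tez;
set tez.queue.name=di;
set hive.exec.parallel=true;
EOF

# -i runs the init file in the session before the script passed with -f
beeline -n "$user" -p "$password" -u "$jdbc_url" -i init_settings.sql -f "$script"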
12-19-2016
08:57 PM
1 Kudo
I need to set Hive configuration options from the Beeline CLI command. The below, for example, doesn't work:

--hiveconf hive.execution.engine=tez

Neither --hiveconf nor --property-file appears to make the session take the settings. Here it is in full usage:

beeline -n $user -p $password -u "$jdbc_url" -f $script --verbose true --property-file "query.properties" --fastConnect

Is the only way to make it work to include the options in the JDBC URL as arguments, like hive.execution.engine=tez;next;next?
Labels:
- Apache Hive
09-26-2016
06:28 PM
I've resolved the issue by using the --boundary-query option in Sqoop, in tandem with my use of --query and --split-by. Will post answer shortly.

Edit: Posted answer at top of my question. Here's the code again for convenience:

sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR" --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"
09-26-2016
05:09 PM
I've attempted that, but it still gives the same error as above. I'm now trying to use JDBC-level connection parameters as below, but it doesn't seem to pass them to the split sub-queries: the initial connection is with UR, but the subsequent subqueries are not.

--connection-param-file /just/some/path/sqoop.properties

sqoop.properties:
jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED
09-26-2016
02:51 PM
EDIT (SOLUTION):

Problem: Can't include the "with ur" statement DB2 requires to keep a table from locking, i.e. a "dirty read."

Solution: Make use of the --boundary-query option in Sqoop.

I modified my Sqoop import statement to leverage the --boundary-query option, which allows me to build the boundary query used to determine the splits for the mappers. Typically, since I'm using the --split-by and --query options, this query is generated automatically. By adding the new option at the end of my Sqoop statement, and adjusting both it and my standard query to include the desired "with ur", I get a Sqoop job whose multiple mappers fire off non-locking select queries, each over a range of data, with the "with ur" statement.

Working code:

sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default
--connect jdbc:db2://mybox.mycompany.com:1234/mydb2db
--username myuser
--password mypassword
--query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR"
--hcatalog-database myHiveDB
--hcatalog-table myHiveTable
--split-by column1
--fields-terminated-by \001
-m 5
--relaxed-isolation
--fetch-size=100000
--boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"

* This Sqoop command is being dynamically generated from a bash script. If you experience problems adopting the above code, try the following (see the sketch after this post):
- Put the whole sqoop command on one line.
- Use \$CONDITIONS, not $CONDITIONS (if the command comes from a text file rather than being hardcoded, you don't need the escape; if you're typing it directly into the sqoop command in a shell script, you do).
- Ensure the text following both --query and --boundary-query is in double quotes, as above.

---

ORIGINAL QUESTION:

I have this Sqoop command that nearly works... up until I get a syntax exception from DB2.

Sqoop command:

sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 AND $CONDITIONS with UR --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000

It works great up to this point: it compares the source data with the hcat/hive table, and everything checks out. Then it tries to do the splits with the bounding values, and fails.

Error (and log):

16/09/26 10:32:12 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(column1), MAX(column1) FROM (SELECT
column1,
column2,
column3,
current timestamp as load_time,
CHAR(current date, iso) as load_dt
FROM my_database.my_table
where column1 <= 1000 AND (1 = 1) with UR) AS t1
Caused by: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-199, SQLSTATE=42601, SQLERRMC=WITH;AND OR HAVING GROUP INTERSECT ORDER ) FETCH EXCEPT MINUS UNION, DRIVER=4.16.53

I believe it has to do with DB2 needing the "with UR" phrase last, but Sqoop always tacks on the "AS t1", causing it to break. Any suggestions for fixing this? We need to do non-locking reads on a DB2 table.
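To illustrate the escaping tip, here's a minimal sketch of the command as it might sit inside a bash wrapper script (trimmed to the relevant options; hosts, names, and paths are the same placeholders as above):

#!/bin/bash
# Hypothetical wrapper: note the escaped \$CONDITIONS so the shell does not expand it;
# Sqoop substitutes each mapper's split predicate there at runtime.
sqoop import --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3 FROM my_database.my_table where column1 <= 1000 )a WHERE \$CONDITIONS with UR" --split-by column1 -m 5 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"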
Labels:
- Apache HCatalog
- Apache Hive
- Apache Sqoop
01-28-2016
07:00 PM
2 Kudos
What's the horton-recommended way to get Hive data in Java code?
Thinking:
- hcat api
- jdbc query
- HDFS call

We're just trying to get Hive data into a data structure in Java, like a Collection, and operate on it. Typical Map/Reduce stuff. Specifically this would be used in a Crunch/Cascading setup. Thanks!
Labels:
- Apache HCatalog
- Apache Hive
01-19-2016
03:34 PM
Thanks! I gave that a try after posting this, and have updated my question with my findings accordingly. Thanks for answering!
01-18-2016
03:22 PM
We have Falcon jobs that use Oozie workflows (workflow.xml) in HDFS. However, we've made some changes to the Oozie workflow, specifically argument values for actions, but don't see them reflected in Falcon. I notice that the workflow.xml is present in the /apps/falcon/clusterName/staging area for the Falcon process, but it's the older version. How can I get Falcon to refresh/rebuild this area and incorporate the new workflow it points to in HDFS? Will falcon -update or -repair do it?

Edit: Resolved! falcon -update does the trick:

falcon entity -type process -suspend -name hdp0106x-my-process
falcon entity -type process -update -name hdp0106x-my-process

That provided this output:

falcon/update/default/Updated successfully(process) hdp0106x-my-process/Effective Time: 2016-01-20T09:30Z. Old workflow id: 0001498-151208005457707-oozie-oozi-C. New workflow id: 0007639-151208005457707-oozie-oozi-B

Explanation: In a recent project, we experienced a minor issue with a Falcon job not adapting future instances to changes we made to its Oozie workflow. In short, we made an enhancement, but Falcon didn't pick up the change just from our modifying the Oozie workflow.

Cause: When you submit and schedule a set of entities to Falcon, it:
- creates a home for them in /apps/falcon/clusterName/staging/falcon/workflows/process/
- places the workflow.xml you specified in the process.xml in that home (including a lib folder if you have one)
- uses that workflow.xml for future generated instances

The only issue is when you want to make changes to that Oozie workflow: Falcon doesn't re-ingest the workflow into its staging home at any frequency. It only does this at submit/schedule time. So if you make changes to an Oozie workflow, you need to tell Falcon that it has changed (because it technically runs copies of the workflow and /lib folder you give it, not the HDFS versions themselves). However, the above solution (using falcon -update) does the trick!
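For completeness, here's a sketch of the full cycle we'd run from the CLI. The process name is the one above; the final -resume step is my assumption about putting the entity back into a running state after the update:

# Suspend, update, then resume the process so future instances pick up the refreshed workflow.xml
falcon entity -type process -suspend -name hdp0106x-my-process
falcon entity -type process -update -name hdp0106x-my-process
falcon entity -type process -resume -name hdp0106x-my-process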
Labels:
- Apache Falcon
12-22-2015
08:32 PM
Hey Balu, that worked. Perfectly, actually! The only thing left I was curious about... is there a Falcon-based way to remove the _SUCCESS file after replication has completed? I know how we can do it after some time or after creating it in the process (and waiting a few minutes perhaps), but if there's a Falcon method or tag I'd love to leverage that. Please let me know! And thanks for all the great help.
12-22-2015
07:58 PM
Balu, thanks for that answer! We knew of this tag, but weren't sure how to truly use it. I'm testing this now and will let you know asap! Thank you kindly.
12-21-2015
08:08 PM
Scenario: we have data that is ingested to the cluster via a Falcon process. It leverages a Falcon feed as an output to replicate the ingested data to a backup cluster. We'd like the feed to not replicate until the process has completed. We currently use a delay to semi-accomplish this, but it's not perfect.

Question: How can we (if at all) tell Falcon to wait until the process has completed to begin replication? Currently we have process.xml code as below, which tells the feed to start "now" but with a delay of 2 hours (this delay being specified in the feed.xml). To be clear, we have replication working just fine; we're just after a more elegant way for Falcon to replicate only after a process has confirmed completion. Is there a way?

Process:
<outputs>
    <output name="hdp0001-my-feed" feed="hdp0001-my-feed" instance="now(0,0)"/>
</outputs>

Feed:
<cluster name="primary-cluster" type="source" delay="hours(1)">
    <validity start="2015-12-04T09:30Z" end="2099-12-31T23:59Z"/>
    <retention limit="months(9999)" action="archive"/>
    <locations>
        <location type="data" path="/hdfs/data/path/to/my_table/"/>
    </locations>
</cluster>
Labels:
- Apache Falcon
- Apache Hadoop
12-11-2015
02:58 PM
1 Kudo
We have solved this problem in our environment in a number of different ways. With Falcon, we have Sqoop imports on a regular frequency that run during predetermined "not busy" time intervals. We've also put logic in bash scripts run through Oozie that will execute, determine it's "not safe" to run at the moment, and sleep or terminate for that instance and try again later. If you have a window, say 3am-5am, in which you could feasibly connect and pull data, you could set up a sleep/wait loop until either a specific time has passed or the system is available (see the sketch below). Plenty of options; what you mentioned is definitely feasible!
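A rough sketch of that sleep/wait idea, purely for illustration; the 3am-5am window and the availability check below are made-up placeholders, not from our actual scripts:

#!/bin/bash
# Hypothetical guard: only proceed between 03:00 and 05:00, and only once the source host responds.
WINDOW_START=3
WINDOW_END=5
while true; do
  hour=$(date +%H)
  if [ "$hour" -ge "$WINDOW_START" ] && [ "$hour" -lt "$WINDOW_END" ] && ping -c 1 source-db.company.com > /dev/null 2>&1; then
    break
  fi
  sleep 300   # check again in 5 minutes
done
# ...kick off the Sqoop import here once the loop exits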
12-11-2015
02:50 PM
1 Kudo
Have you looked into the CompressedStorage features in Hive? You should be able to use this (for Snappy at least):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
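As a quick illustration, here's a sketch of those settings applied when writing a table from the CLI; the table names are placeholders, not from the original question:

# Hypothetical example: write Snappy-compressed, BLOCK-level output into my_compressed_table
hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE my_compressed_table SELECT * FROM my_source_table;
"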
12-09-2015
06:29 PM
I have some notes but nothing formal cobbled together. Most stuff goes on hadoopsters.net when I learn about it. I'm hoping to publish some more on Falcon here very soon. The article feature makes it easy.
12-09-2015
06:19 PM
I've tried HDP on a Pi. Works pretty well. Will add NiFi to my list of things to POC...
12-09-2015
06:11 PM
Yeah, you should be able to pass any properties to the workflow.xml from Falcon's process.xml, like this (the block goes before the <workflow> tag and after the <output> tags; the properties are then used in the workflow.xml as ${workflowName}, ${hiveDB}, ${queueName}, and so on):

<properties>
    <property name="workflowName" value="1234-my-workflow" />
    <property name="rawDataDirectoryHDFS" value="/path/to/hdfs/files/" />
    <property name="hiveDB" value="my_db" />
    <property name="jobTracker" value="hdpcluster003.company.com:8050"/>
    <property name="nameNode" value="hdfs://MYHA:8020"/>
    <property name="queueName" value="dev"/>
</properties>
12-09-2015
06:00 PM
1 Kudo
SQuirreL is quite good, and free. Built in Java, and works well with Hive/Beeline. We've overall had success with:
- Teradata Studio
- SQuirreL
- Aqua Data Studio

http://squirrel-sql.sourceforge.net/
12-09-2015
03:50 PM
You can point to it directly via its address, or you can do as @bvellanki (Balu) mentioned and list its HA nameservice. For example, if the HA nameservice for your backup cluster is called DRHA, your address would be hdfs://DRHA:8020. See below:

<interface type="readonly" endpoint="hftp://DRHA.company.com:50070" version="2.2.0"/>
<interface type="write" endpoint="hdfs://DRHA.company.com:8020" version="2.2.0"/>

You can also do this, depending on preference:

<interface type="readonly" endpoint="hftp://DRHA:50070" version="2.2.0"/>
<interface type="write" endpoint="hdfs://DRHA:8020" version="2.2.0"/>
12-09-2015
03:15 PM
We actually do this as well. Here's an example of retention/replication on a date-partitioned folder. This resolves, much like above, to a folder called /hdfs/path/to/data/in/table/file_dt=2015-12-09. Hope this helps!

<location type="data" path="/hdfs/path/to/data/in/table/file_dt=${YEAR}-${MONTH}-${DAY}"/>