Member since: 11-23-2015
Posts: 28
Kudos Received: 16
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 752 | 01-27-2016 08:32 AM |
10-17-2016
08:09 AM
1 Kudo
@Padmanabhan Vijendran
Actually I did not, since the need passed. However, my question was more about access to Hive; in the case of HDFS it should be simpler.
In your Java code you need to have something like this:
if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
    // point the job at the delegation-token file the launcher provided
    jobConf.set("mapreduce.job.credentials.binary", System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
}
to tell your Java app where to find the delegation token needed for HDFS access. Hope this helps, Pavel
05-09-2016
07:06 AM
Hi Larry, yes, the Apache HttpClient works like a charm. Thanks, Pavel
05-06-2016
06:25 PM
1 Kudo
Hi, we are trying to use WebHDFS over Knox to access HDFS on our secured cluster from Java. We are able to list files/folders there, but we are still struggling with file creation. The problem is probably in Oracle's Java library, where streaming does not seem to be supported when authentication is required. In sun.net.www.protocol.http.HttpURLConnection.getInputStream0() there is something like this:

if (j == 401) {
    if (streaming()) {
        disconnectInternal();
        throw new HttpRetryException("cannot retry due to server authentication, in streaming mode", 401);
    }
}

Streaming is not needed for some operations, like list/delete (and therefore they work), but it is required for file creation. Any suggestions how to handle this? Thanks a lot, Pavel
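One thing worth noting here: WebHDFS CREATE is actually a two-step operation. The first PUT carries no body and answers with a 307 redirect, and the file content goes in a second PUT to the redirect target, so an HTTP client that handles the redirect itself can sidestep streaming during the authentication exchange. A minimal sketch of the first-step URL, assuming a default Knox topology at gateway/default (host, port, topology name and the helper itself are placeholders, not Knox API):

```java
// Hypothetical helper: builds the first-step CREATE URL for WebHDFS
// behind Knox. Host and topology path are assumptions.
public class KnoxWebHdfsUrl {
    static String createUrl(String knoxBase, String path, boolean overwrite) {
        // step 1: PUT with no body; the response is a 307 redirect whose
        // Location header is where the file content is actually sent
        return knoxBase + "/webhdfs/v1" + path
                + "?op=CREATE&overwrite=" + overwrite;
    }

    public static void main(String[] args) {
        System.out.println(createUrl(
                "https://knox.example.com:8443/gateway/default",
                "/tmp/test.txt", true));
    }
}
```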
Labels:
- Apache Hadoop
- Apache Knox
04-26-2016
07:38 AM
Thanks, we are working on something similar. I have one question/comment about the 'compact' stage. The execution flow as presented here means that the table 'reporting_table' disappears for a significant amount of time before it is filled again. This could break queries running against this table. Is there a way to make this switch (almost) seamless? It may also require keeping the older data so as not to break already-running queries. Thanks, Pavel
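One common pattern to shorten the gap is to load into a staging table and swap with two quick RENAMEs, keeping the previous generation around until in-flight queries finish. A minimal sketch of the statement sequence, with hypothetical _staging/_old table names (not taken from the article):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a rename-swap sequence; table names are made up.
public class TableSwap {
    static List<String> swapStatements(String table) {
        return Arrays.asList(
            // the table is missing only between these two RENAMEs
            "ALTER TABLE " + table + " RENAME TO " + table + "_old",
            "ALTER TABLE " + table + "_staging RENAME TO " + table,
            // defer the drop so queries already running against the old
            // generation can finish
            "DROP TABLE IF EXISTS " + table + "_old");
    }

    public static void main(String[] args) {
        for (String s : swapStatements("reporting_table")) {
            System.out.println(s);
        }
    }
}
```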
03-16-2016
02:13 PM
1 Kudo
Hi, I am wondering if there is a general way to determine which YARN applications were started by some application and, vice versa, whether some YARN application was started by another. My use case is Oozie and Sqoop, where Oozie runs launchers that in turn start MR jobs to do the actual ingest. It is possible to browse through the logs to get the ID of a spawned application, but I keep thinking there should be a better way to do it. This kind of relation must be stored somewhere, since when the Oozie workflow is killed, all child processes are killed as well almost immediately. Thanks for any hints, Regards, Pavel
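One mechanism worth checking: Oozie tags the child MR jobs its launchers spawn (via mapreduce.job.tags), and newer YARN ResourceManager REST APIs can filter applications by tag. A sketch of the query URL, assuming an RM host and a tag value (both placeholders) and a Hadoop version whose /ws/v1/cluster/apps endpoint supports the applicationTags filter:

```java
// Hypothetical helper for querying the RM REST API by application tag;
// the host and tag value are assumptions.
public class YarnAppsByTag {
    static String appsByTagUrl(String rmBase, String tag) {
        // lists all applications carrying the given tag, e.g. those
        // spawned on behalf of one Oozie launcher
        return rmBase + "/ws/v1/cluster/apps?applicationTags=" + tag;
    }

    public static void main(String[] args) {
        System.out.println(appsByTagUrl("http://rm.example.com:8088", "oozie-abc"));
    }
}
```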
Tags:
- Hadoop Core
- YARN
Labels:
- Apache YARN
03-14-2016
04:46 PM
1 Kudo
Hi @Sowmya Ramesh, thanks for your reply. You definitely had more luck with Google, since I could not find anything useful related to this exception and Falcon in particular. I do not have the XML for the failed request yet (logging added, waiting for the issue to happen again), but in general it looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="test-2-ALL-RELATIONSHIP" xmlns="uri:falcon:feed:0.1">
    <frequency>days(1)</frequency>
    <clusters>
        <cluster name="dev-cluster" type="source">
            <validity start="2016-03-14T00:00Z" end="2016-03-29T00:00Z"/>
            <retention limit="months(12)" action="delete"/>
        </cluster>
    </clusters>
    <table uri="catalog:test_2:ALL_RELATIONSHIP#mg_version=${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}"/>
    <ACL owner="user@domain.COM"/>
    <schema location="/none" provider="none"/>
    <properties>
        <property name="queueName" value="mglauncher"/>
    </properties>
</feed>
However, I doubt the error is caused by incorrect XML, since it is generated automatically and the same operation usually succeeds and fails only sometimes. There is code in the org.apache.falcon.resource.AbstractEntityManager.deserializeEntity() method that does some logging when parsing fails:

if (LOG.isDebugEnabled() && xmlStream.markSupported()) {
    try {
        xmlStream.reset();
        String xmlData = getAsString(xmlStream);
        LOG.debug("XML DUMP for ({}): {}", entityType, xmlData, e);
    } catch (IOException ignore) {
        // ignore
    }
}
but I could not find anything like "XML DUMP for" in our Falcon log. Is this fragment in the log4j.xml Falcon conf file

<logger name="org.apache.falcon" additivity="false">
    <level value="debug"/>
    <appender-ref ref="FILE"/>
</logger>

enough to get these messages into the log? I am not familiar with the implementation, so I am not sure whether the stream supports marking or not. Regards and thanks for any input, Pavel
03-03-2016
04:54 PM
2 Kudos
Hi, we are using the org.apache.falcon.client.FalconClient API to update a Falcon process from Java:

falconClient.update(EntityType.PROCESS.name(), <some-id>, <file-name>, true, doAs);

where the local <file-name> is created like this:

...
Marshaller marshaller = entityType.getMarshaller();
final File createTempFile = File.createTempFile(entityType.name().toLowerCase() + "_" + id, ".xml");
LOGGER.debug("Generated entity: {}", entity.toString());
marshaller.marshal(entity, createTempFile);
return createTempFile.getPath();

and sometimes the update fails with this error:

javax.xml.bind.UnmarshalException:
[org.xml.sax.SAXParseException; Premature end of file.]
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:335)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:523)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:220)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:189)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:157)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:204)
at org.apache.falcon.entity.parser.EntityParser.parse(EntityParser.java:94)
... 61 more
Caused by: org.xml.sax.SAXParseException; Premature end of file.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:216)
... 65 more most of the updates pass, this error happens only sometimes, so I believe the file is created correctly on the client side and the error is caused possibly by some performance issue or race condition. Have you seen this behavior? Thanks, Pavel
Labels:
- Apache Falcon
02-22-2016
07:12 AM
@Guillermo Ortiz I would say it is an Oozie/Kerberos problem. If I wanted to call HBase from Oozie (there is probably no dedicated action for it), I would end up with the same problem.
02-22-2016
07:07 AM
1 Kudo
The problem is that our system does not have access to the user's password or keytab. It uses Kerberos authentication and then a Hadoop proxy user to access various Hadoop services. So it is not possible for us to do kinit again on a data node or to use a password (in a file or directly).
02-16-2016
08:10 PM
1 Kudo
@Guillermo Ortiz Not really; I have split the original java action into two Oozie actions: the first is a hive action where I get what I need from Hive (using a temporary external table), and the second a java action where the data are further processed. Currently I use the hive action, but it should be trivial to replace it with the hive2 action in the future when needed. And yes, as far as I know it is necessary to have a valid Kerberos ticket (the kinit does not have to happen in Java, though) or to use a delegation token to connect to Kerberized Hive from Java.
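The split described above would look roughly like this in workflow.xml. This is a trimmed sketch only: the cluster-specific <job-tracker>/<name-node> elements, credentials and configuration are omitted, and all names (actions, script, main class) are made up:

```xml
<workflow-app name="extract-and-process" xmlns="uri:oozie:workflow:0.4">
    <start to="extract"/>
    <!-- hive action: pull the data via a temporary external table -->
    <action name="extract">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <script>extract.hql</script>
        </hive>
        <ok to="process"/>
        <error to="fail"/>
    </action>
    <!-- java action: further processing of the extracted data -->
    <action name="process">
        <java>
            <main-class>com.example.Process</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```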
01-27-2016
08:32 AM
1 Kudo
Do you have the feed scheduled? The feed needs to be scheduled for retention to work.
01-07-2016
09:40 PM
1 Kudo
@Benjamin Leonhardi Regarding the LDAP/PAM scenario you mention, I am not familiar with the details, but I am afraid our users expect single sign-on, so they won't be willing to enter their intranet password again into some "custom" system.
01-07-2016
08:00 PM
1 Kudo
@Benjamin Leonhardi Thanks for the reply. However, as I tried to describe above, I cannot do kinit for the user since I do not have access to his keytab at all. Maybe in theory I could do kinit as some service user with the ability to impersonate users on Hadoop (e.g. like oozie), and use doAs() to get access to Hive or obtain a delegation token. I am not a Kerberos expert, but it still feels like a security hole to allow normal users access to such a keytab.
01-07-2016
07:47 PM
1 Kudo
@bsaini I can modify the action code, but I cannot do the kinit there since I do not have access to the user's keytab at all. My scenario is like this: the user is logged in to the company network (with Kerberos); the user accesses the REST API of some application server (authenticated using Kerberos); the application server runs an Oozie workflow that includes the java task that needs to access some tables in Hive using the original user's credentials. The only way I see is the delegation token. Even if Oozie supported kinit on data nodes, it still would not help, since the keytab/password is not available.
01-07-2016
08:34 AM
1 Kudo
@Artem Ervits Thanks for the reply. In my opinion it would not help. The shell action is the same as the java one with respect to Kerberos login, so the delegation token is still required to connect over JDBC. The only way I see is to make the initial JDBC connection within the Oozie action handler/executor that is executed under kinit and pass the delegation token to the actual java action code running on a data node. But maybe I am missing something. Thanks.
01-06-2016
09:12 PM
2 Kudos
Hi, I am trying to execute a Hive query from a java action that is part of an Oozie workflow. My preferred way is to use Beeline or JDBC rather than the old Hive CLI. However, I am struggling with this a little bit, since the connection fails due to authentication errors. When the java action code is executed on some data node in the cluster, Oozie does not do kinit for the user and thus the connection fails. Both Beeline and the JDBC connection string seem to support delegation tokens, but those tokens can only be obtained when the user is logged in via Kerberos (kinited). We are currently using hive-0.14.0 and oozie-4.1. I have found out that the new hive2 action introduced in oozie-4.2 seems to first create a JDBC connection under Oozie's Kerberos login, obtain a delegation token from this connection, and finally pass this token to Beeline. Maybe the same approach could be used here as well; it would require a new custom Oozie action (e.g. java-jdbc). Seems possible but quite complicated; is there some easier way? Thanks for any comments, Pavel
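For reference, these are the two HiveServer2 JDBC URL forms involved here: the principal-based one, usable wherever a Kerberos TGT exists, and the delegation-token one, usable inside a launched container. A minimal sketch; the host, port, database and principal are placeholders, and the builder class itself is made up:

```java
// Hypothetical URL builders; host/port/principal are placeholders.
public class HiveJdbcUrls {
    // usable where the process has a Kerberos TGT (e.g. after kinit)
    static String kerberosUrl(String host, String principal) {
        return "jdbc:hive2://" + host + ":10000/default;principal=" + principal;
    }

    // usable in a container where the launcher has passed credentials in
    // via HADOOP_TOKEN_FILE_LOCATION
    static String delegationTokenUrl(String host) {
        return "jdbc:hive2://" + host + ":10000/default;auth=delegationToken";
    }

    public static void main(String[] args) {
        System.out.println(kerberosUrl("hs2.example.com", "hive/_HOST@EXAMPLE.COM"));
        System.out.println(delegationTokenUrl("hs2.example.com"));
    }
}
```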
Labels:
- Apache Hive
- Apache Oozie
12-11-2015
02:09 PM
Hi, we are trying to set up ingest of data from databases into Hive using Sqoop. The problem is that the databases are used in production and we cannot load them too heavily during certain working hours. There are many tables and some of them are quite huge (> 2 GRows), so it is probable that we cannot ingest them all during the available time window. It is difficult to create general delta queries that would run for some given amount of time and no longer. I am thinking about the possibility of implementing such a feature directly in Sqoop. I am not very familiar with the Sqoop implementation, but I guess there is some loop where a row gets loaded from the JDBC result set, converted, and stored into the Hive table. All that would be required is to place a check in this loop and wait/sleep for some time if this is happening during the database working hours. This way the ingest would run at full speed outside the working hours and would be significantly reduced (but still running) when the database is not supposed to be overloaded. What do you think? Does this sound like a feature that could be useful to someone else as well? Thanks, Pavel
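The proposed check could be as small as a predicate evaluated in the fetch loop that decides whether to sleep between batches. A sketch with assumed window bounds (in practice the window and pause length would come from configuration; the class is made up, not Sqoop code):

```java
import java.time.LocalTime;

// Sketch of the proposed throttle predicate; the working-hours window is
// an assumption and would come from configuration in practice.
public class IngestThrottle {
    static boolean inWorkingHours(LocalTime now, LocalTime start, LocalTime end) {
        // half-open window [start, end)
        return !now.isBefore(start) && now.isBefore(end);
    }
    // In the fetch loop: if inWorkingHours(...) then Thread.sleep(pauseMs)
    // between batches, so the ingest keeps running but backs off.

    public static void main(String[] args) {
        System.out.println(inWorkingHours(LocalTime.of(10, 0),
                LocalTime.of(8, 0), LocalTime.of(18, 0)));
    }
}
```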
Labels:
- Apache Sqoop
12-03-2015
09:59 AM
@Balu Thanks for filing the issue. I understand that the immediate cause of the failure is insufficient HDFS permissions on the 'feed' folder. However, I am puzzled about what triggered this. We were using the same Falcon installation (with both the 'kefi' and 'middlegate_test1' users) for several weeks without problems. At the same time we experienced a problem with cluster/YARN overload, since there were some processes running with minute(1) frequency, but I am not sure whether this could be related.
12-02-2015
12:41 PM
Hi, our Falcon installation abruptly ceased to work and no feed could be created. It complained about file permissions:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=kefi, access=WRITE, inode="/apps/falcon-MiddleGate/staging/falcon/workflows/feed":middlegate_test1:falcon:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6515)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6497)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6449)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4251)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.ja

where 'kefi' is the user trying to create the feed and 'middlegate_test1' is another user that created some feed before. The folders on HDFS looked like this:

bash-4.1$ hadoop fs -ls /apps/falcon-MiddleGate/staging/falcon/workflows/
Found 2 items
drwxr-xr-x - middlegate_test1 falcon 0 2015-12-02 09:13 /apps/falcon-MiddleGate/staging/falcon/workflows/feed
drwxrwxrwx - middlegate_test1 falcon 0 2015-12-02 09:13 /apps/falcon-MiddleGate/staging/falcon/workflows/process
I can think of two questions related to this:
- Why is the permission for the 'feed' folder now 'drwxr-xr-x', whereas the 'process' folder has permissions 'drwxrwxrwx'? Feed creation worked before, so I guess someone or something had to change it. It is not very probable that some user did it manually; is it possible that it was Falcon itself that did it?
- It does not seem correct that such internal Falcon system folders are owned by some normal user, quite probably the first one who ever tried to create an entity in Falcon. Is this expected, or rather some misconfiguration of Falcon on our side?
Thanks for any input, Regards, Pavel
Labels:
- Apache Falcon
12-01-2015
08:51 AM
Hi Balu, thanks for your answer. My understanding of retention is that all instances older than the retention period would be deleted, no matter whether the dataset is valid or not. However, I am not saying that is the only possible interpretation; possibly the existing behavior could be useful in some use case. If so, maybe you could introduce a new 'retention action' such as 'delete-always' to handle this situation. That way the change would also be backward compatible and would not alter existing Falcon behavior.
11-30-2015
12:48 PM
Hello, I have a question regarding how Falcon implements the retention policy for feed instances. I have observed that the retention policy action (i.e. DELETE in my case) is executed only within the dataset's validity interval. It means that several instances (how many depends on the dataset's frequency) close to the end of the dataset's validity are kept forever, even when some retention is defined. Is this expected, or am I doing something wrong? Thanks for any input, Regards, Pavel
Labels:
- Apache Falcon