Member since: 11-27-2017
Posts: 52
Kudos Received: 3
Solutions: 0
04-10-2019
08:35 AM
I've come up with an initial solution, but would like to hear better alternatives in case this doesn't perform/scale well on larger inputs. Let me know how this can be expected to scale.
- FlattenJson
- Remove the outer array brackets with ReplaceText, regex-replacing ^\[(.*)]$ with \1 (a regex seems expensive for such a simple operation)
- Make each record object stand on a separate line, with no comma separation, via a literal replace of },{ with }\n{
Remember, I'm on NiFi 1.5. Thanks.
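For a quick local sanity check (outside NiFi) of what those two steps should produce, a rough shell equivalent could look like this - assuming the flattened array sits in input.json, that },{ only occurs between top-level records, and that GNU sed is available (the \n replacement relies on it):

# Collapse to one line so the anchored regex sees the whole array,
# strip the outer [ ], then break the records apart at },{
tr -d '\n' < input.json \
  | sed -E 's/^\[(.*)\]$/\1/' \
  | sed 's/},{/}\n{/g' > output.json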
04-10-2019
07:00 AM
In NiFi (1.5) I need the best way to prepare JSON so it can be parsed by the Hive JSON serde (org.apache.hive.hcatalog.data.JsonSerDe). I have an array of records (here, two) in this format (from a REST API response): [
{
"id": 1,
"call_status": "OK",
"result": 0.0239,
"explanation": [
"some_var",
"another_var"
],
"foo": "OK"
},
{
"id": 2,
"call_status": "OK",
"result": 0.0239,
"explanation": [
"some_var",
"another_var"
],
"foo": "OK"
}
] It seems it should be transformed to this format for the serde to work, which I tried doing manually with success. No array brackets and one record per line. { "id": 1, "call_status": "OK", "result": 0.0239, "explanation": [ "some_var", "another_var" ], "foo": "OK" }
{ "id": 2, "call_status": "OK", "result": 0.0239, "explanation": [ "some_var", "another_var" ], "foo": "OK" } What is the recommended / most efficient way of doing this transformation? Preferably without having to supply the schema, since a generic solution would be nice. If a schema _is_ required, say for Record processors, I'll live with that.
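For what it's worth, outside NiFi the target layout is just newline-delimited JSON (NDJSON), which jq can produce from the array in one go - shown here only to pin down the expected output, assuming the response is saved as response.json:

# Emit each array element as a compact, single-line JSON object (one record per line)
jq -c '.[]' response.json > records.json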
Labels:
- Apache NiFi
01-09-2019
09:16 AM
Any progress on getting MoveHDFS to accept attributes in Output Directory? It seems difficult not to be able to have a dynamic solution, as mentioned in this thread.
12-03-2018
12:01 PM
Looking for this info also. Sorry to bump the thread, but any news on this wish?
11-27-2018
09:54 AM
Using the ConsumeKafka processor in NiFi - can the Topic Name(s) property list be dynamically updated (without a processor restart)? I need to expose a web service through NiFi that can retrieve metadata on which topic names to consume/ingest. So when receiving a request to consume a new topic, I need to update the Topic Name(s) property list of the ConsumeKafka processor (adding the new topic to the list). How do I achieve that? Does it afterwards require a restart of the ConsumeKafka processor (through the NiFi REST API)? The topic list is planned to be persisted somewhere. It could simply be a flat file, or in an RDBMS, HBase or just about anywhere. Any recommendations?
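To frame the question a bit: my understanding is that a processor must be stopped before its properties can be changed, so a script against the NiFi REST API would look roughly like the sketch below. The host, processor id and the internal property key are assumptions on my part (the UI shows "Topic Name(s)"; I believe the underlying key is "topic", but please verify against your version), and each PUT has to echo back the current revision returned by the previous call rather than the hard-coded versions shown here:

NIFI=http://nifi-host:8080/nifi-api          # adjust to your instance
PROC=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx    # ConsumeKafka processor id (placeholder)

# Current state + revision (the version numbers below must come from this response)
curl -s "$NIFI/processors/$PROC"

# Stop the processor
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":1},"component":{"id":"'"$PROC"'","state":"STOPPED"}}' \
  "$NIFI/processors/$PROC"

# Update the topic list (property key assumed to be "topic")
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":2},"component":{"id":"'"$PROC"'","config":{"properties":{"topic":"old_topic,new_topic"}}}}' \
  "$NIFI/processors/$PROC"

# Start it again
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":3},"component":{"id":"'"$PROC"'","state":"RUNNING"}}' \
  "$NIFI/processors/$PROC"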
Labels:
- Apache Kafka
- Apache NiFi
11-15-2018
11:30 AM
1 Kudo
Follow-up regarding Records and schemas. I use InferAvroSchema. It seems to miss the possibility of null in some of my CSV data. As a test, I've set it to analyse 1,000,000 records to ensure it sees everything, but no luck. On some columns it adds a possible null to the field type, on others not. Is there a built-in (invisible) upper limit to how many records are analysed? And could an option be added to the processor to always allow null values?
11-13-2018
09:39 AM
I'm trying to gain experience with Records, specifically ValidateRecord. All FlowFiles out of ValidateRecord seem to be converted to the format set by the Record Writer property. Is there any way of not having the data parsed on output / of maintaining the original input? I have a case with CSV input, for example, where I'd like to log and report invalid lines as-is. This could be for passing back to the data supplier, where I'd rather show the original input than something transformed by the Writer. I'd also appreciate examples of when you'd want the schema you validate against to differ from the one used by the Record Reader. Using NiFi 1.5.0.3.
Labels:
- Apache NiFi
11-07-2018
11:37 AM
Still seeing these runaway tasks. Anyone with similar experience or an explanation?
10-12-2018
08:55 AM
I see GetHDFSFileInfo from 1.7 might be relevant. Running 1.5 currently though. Suggestions on that platform?
10-12-2018
08:16 AM
In PutHDFS I have set Conflict Resolution Strategy to Fail, as I don't want to overwrite existing files. But for error handling and logging/notification, I need to differentiate file-already-exists failures from other types of failures from this processor. How is that possible? In the bulletin board I can see a text message from the processor indicating when the file exists, but how do I get that info in the flow itself? Is the message / failure type available for flow control handling somehow? Suggestions?
Labels:
- Apache NiFi
10-12-2018
06:48 AM
Here are a couple of screenshots showing the case. Look at the UpdateAttribute processor in the middle. Input to it is stopped. 12 penalized flowfiles are waiting in the input queue. UpdateAttribute stopped: Next, UpdateAttribute started, and within a couple of seconds thousands of tasks are generated. Flowfiles are penalized for a full day in this case, so they don't flow through, but the task generation goes crazy while waiting for the penalized flowfiles to be released. Is this really intended behaviour? If only non-penalized flowfiles are input, then only the needed tasks are generated, and no runaway wasted tasks are made. PS. Yield duration for UpdateAttribute is the default 1s and penalty 30s.
10-11-2018
11:22 AM
1 Kudo
I'm seeing processors with very little input generate a tsunami of tasks (thousands within a couple of seconds) when Run Schedule is set to 0ms (Run Duration at 0 also). My understanding is that 0ms in Run Schedule should be interpreted as "always on" / "continuously", like an HTTP request handler or similar listener, always ready to handle requests individually and immediately when received. I have a case with an UpdateAttribute processor, also with Run Schedule at 0ms. It generates no tasks when there are no incoming flowfiles, and only generates as few as the incoming flowfiles require (in this case very few, say 10 for 10 incoming test flowfiles). But if the FlowFiles are penalized (in this instance a full day, 1d, coming from PutHDFS), then the task generation goes crazy (thousands per second) without any actual work being done (no flowfiles move through). Why the high number of task generations? It seems to affect the cluster. We notice a similar tsunami of task generation on other processors, like an HTTP request handler. It's unexpected behaviour to me and doesn't seem robust. Is it the intended/expected outcome or a bug?
Labels:
- Apache NiFi
09-14-2018
09:09 AM
Could it be that it's a Hive service user trying to write /user/XXX/.staging/job_1536655741381_0075/libjars/antlr-runtime-3.4.jar and it can't, because permissions are set as drwx------ - XXX hdfs 0 2018-09-03 10:47 /user/XXX/.staging - as in only XXX can write, not 'group' or 'other' (so not Hive or another service user that sqoop might use to do the work)? Just a thought.
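If that's the suspicion, it could be checked (and possibly worked around with an HDFS ACL granting only the hive service user access, instead of opening up 'other') along these lines - assuming ACLs are enabled on the cluster and you have the rights to change them:

# Inspect the staging dir's permissions and any existing ACLs
hdfs dfs -ls -d /user/XXX/.staging
hdfs dfs -getfacl /user/XXX/.staging

# Hypothetical fix: let the hive service user write via an ACL entry
hdfs dfs -setfacl -R -m user:hive:rwx /user/XXX/.staging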
09-14-2018
07:45 AM
I'm using hcatalog, which doesn't support target-dir, so I cannot try it out. I'm not allowed to change ownership, and I think it shouldn't help to also have the group set to me if I'm already the owner with rwx. It would rather reduce the chances of writing, as hdfs could no longer access it unless I put rwx on 'other'.
09-14-2018
06:31 AM
@Geoffrey Shelton Okot I already have a user account dir, which seems to have the right permissions (user called XXX here): drwx------ - XXX hdfs 0 2018-09-03 10:47 /user/XXX/.staging. I once saw an admin user look into the Ranger logs, and it seemed strange that I first got an allow on the file in question, then immediately after (within the same second) a deny - on the very same filepath.
09-13-2018
09:05 AM
I'm getting an error like this: 18/09/13 10:56:44 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/XXX/.staging/job_1536655741381_0075 18/09/13 10:56:44 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.security.AccessControlException: Permission denied: user=XXX, access=WRITE, inode="/user/XXX/.staging/job_1536655741381_0075/libjars/antlr-runtime-3.4.jar":XXX:hdfs:---------- It's strange, because the .staging under my user has permissions like this: drwx------ - XXX hdfs 0 2018-09-03 10:47 /user/XXX/.staging The database and tables reside on HDFS with Ranger-controlled permissions. They are initially written with these perms (by hive/beeline commands): hive:hdfs drwx------ (where my sqoop job works fine), but then (on a cron basis, to let Ranger take control) changed to hdfs:hdfs d--------- (and then my sqoop job does not work anymore). Is that because sqoop needs to be told to use Kerberos, and how is that done with sqoop 1(.4.6)?
09-13-2018
08:49 AM
Using HDP 2.6.4 and Sqoop 1.4.6 - so not Sqoop 2, which I understand has been dropped. I'm importing from a PostgreSQL database with a simple username and password (no Kerberos) to Hive/HDFS, which does use Kerberos. Besides doing a kinit first, do I need to tell sqoop somehow that the underlying import work against Hive/HDFS (initiated by the sqoop import command) needs to use Kerberos? If so, how?
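From what I've seen so far, sqoop 1 simply picks up the Kerberos ticket from the credential cache via the Hadoop client configuration, so no extra sqoop flag should be needed for the HDFS/Hive side, while the JDBC source keeps using username/password. A sketch with placeholder names:

# Get a ticket first (principal/realm are placeholders)
kinit myuser@EXAMPLE.REALM

# The import itself is unchanged; HDFS/Hive auth comes from the ticket cache,
# the PostgreSQL side still authenticates with username + password file
sqoop import \
  --connect "jdbc:postgresql://dbhost:5432/sourcedb" \
  --username myuser \
  --password-file file:///home/myuser/.pw \
  --table my_table -m 1 \
  --hcatalog-database mydb --hcatalog-table my_table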
Labels:
- Apache Sqoop
09-06-2018
12:16 PM
It works using beeline instead, with the same kinit (as me). Any idea why it works with beeline and not with the hive command?
09-04-2018
03:18 PM
As my own user (not a system user) I cannot create an external table via a .hql script with hive -f. use db;
create external table `table` (
columns...)
stored as orc
LOCATION '/external/path/to/my/db/table'
TBLPROPERTIES ("orc.compress"="ZLIB");
I get this $ hive -f ddl_create_db.table.hql
FAILED: SemanticException MetaException(message:org.apache.hadoop.security.AccessControlException: Permission denied: user=<my-user>, access=EXECUTE, inode="/apps/hive/warehouse/db.db":hdfs:hdfs:d---------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:238)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnforcer(RangerHdfsAuthorizer.java:428)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:365)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4142)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:866)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
)
Notice it's trying to access /apps/hive/warehouse/db.db, which I correctly don't have access to with this user - hence the design requirement to use external tables with another path, which I do have access to. It works fine from Ambari, but not from the command line, from where I need to script/automate it.
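One thing that may be worth trying: running the same script through beeline/HiveServer2 (where, as far as I understand, the Ranger Hive policies are evaluated) instead of the hive CLI, which talks to the metastore and HDFS directly. A sketch with a made-up JDBC URL to adapt:

# Kerberized HiveServer2 connection (host, port and principal are placeholders)
beeline -u "jdbc:hive2://hs2-host:10000/db;principal=hive/_HOST@EXAMPLE.REALM" \
        -f ddl_create_db.table.hql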
Tags:
- Data Processing
- Hive
Labels:
- Apache Hive
09-04-2018
03:18 PM
Using Hive 1.2.1000.2.6.4.0-91. When running a 'hive -e' from the command line, I get a message about -chmod and hadoop fs. See below. Any idea why I get this, and how to fix it? The -S doesn't suppress it either, as shown. [ ~] hive -S -e 'select max(entry_timestamp) from xx.yy'
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
-chmod: chmod : mode '0' does not match the expected pattern.
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Usage: hadoop fs [generic options] -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...
[ ~] 2018-08-31 12:42:14.41040
Labels:
- Apache Hive
07-24-2018
09:56 AM
Got it working now by using another driver. With net.sourceforge.jtds.jdbc.Driver it works (instead of com.sybase.jdbc4.jdbc.SybDriver). A little strange that it doesn't work with Sybase's own driver.
07-24-2018
08:49 AM
Using Sybase Adaptive Server Enterprise (ASE) 15.7, Sqoop 1.4.6.2.5.0.0-1245 and jconn4 as the driver. I have a table in Sybase ASE with some columns of type 'char' of various lengths (sometimes called precision in a metadata dump), some at 10, some at 1 - so basically char(10) and char(1). When importing with sqoop with table creation, it maps the char columns incorrectly. Doing something like this: sqoop import --driver com.sybase.jdbc4.jdbc.SybDriver --connect "jdbc:sybase:Tds:XXX/XXX" --username XXX --password-file XXX --table XXX -m 1 --create-hcatalog-table ... I notice it gets the column type readout wrong for the char columns. It says 0 for Precision, not 10, 1 or whatever is chosen in Sybase. 18/07/24 10:18:36 INFO hcat.SqoopHCatUtilities: Database column name - info map :
...
XXX : [Type : 1,Precision : 0,Scale : 0]
... Later, at create table phase, it decides to map the char columns to char(65535) no matter the actual length in Sybase, like this 18/07/24 10:18:36 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement:
create table `XXX`.`XXX` (
`XXX` char(65535),
That then fails, as char is only valid in Hive between 1 and 255. 18/07/24 10:18:42 INFO hcat.SqoopHCatUtilities: FAILED: RuntimeException Char length 65535 out of allowed range [1, 255] I can override the wrong automatic mapping manually with --map-column-hive XXX="char(10)", but I want to avoid that, as I need it to work automatically for a lot of tables. Hope someone can help fix this issue. Thanks.
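For the record, the manual override does take several columns in one comma-separated option, which makes it slightly less painful until the metadata readout is fixed - the column names below are placeholders:

# Same import as above, but with explicit Hive types for the affected char columns
sqoop import --driver com.sybase.jdbc4.jdbc.SybDriver \
  --connect "jdbc:sybase:Tds:XXX/XXX" \
  --username XXX --password-file XXX \
  --table XXX -m 1 \
  --create-hcatalog-table \
  --map-column-hive 'COL_A=char(10),COL_B=char(1)'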
Labels:
- Apache Sqoop
07-24-2018
08:35 AM
I ended up with a follow-up procedure: do a "create external table ... like ...", insert into that, and drop the managed table. I might also try your external=true approach.
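Roughly what that procedure looks like, in case it helps others - table names, location and JDBC URL are placeholders, and note that dropping the managed table also removes its data (which is the point once it has been copied):

cat > convert_to_external.hql <<'SQL'
CREATE EXTERNAL TABLE db.my_table_ext LIKE db.my_table
  LOCATION '/external/path/to/my/db/my_table_ext';
INSERT INTO db.my_table_ext SELECT * FROM db.my_table;
DROP TABLE db.my_table;
SQL

beeline -u "$HIVE_JDBC_URL" -f convert_to_external.hql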
07-20-2018
07:56 AM
Hi @Vinicius Higa Murakami. I still need to figure out control of destination dir with --hcatalog. I guess I can create the database first with a location parameter. I still think the tables inside will be managed though and not external. Any way to get the data to be external?
07-19-2018
11:44 AM
Got it to work now, by finding a suitable keytab and principal used elsewhere (NiFi) to access HDFS. By copying that keytab from the NiFi/HDF server to our HDP server and doing a kinit with it, I got the right permissions for sqoop and hive/hcatalog to do their thing. Nice.
07-19-2018
07:56 AM
Thanks. Indeed I forgot the kinit. When trying just kinit (no options) it uses my own user account. That user cannot access the Hive database dir though (hive:hdfs owns that, as shown with the access rights below). 18/07/19 09:51:13 INFO hcat.SqoopHCatUtilities: FAILED: SemanticException MetaException(message:java.security.AccessControlException: Permission denied: user=w19993, access=READ, inode="/user/w19993/hivedb":hive:hdfs:drwx------ I wonder how to get around that. Should I use kinit with a service account? I notice that on the HDP server I'm running sqoop from, there is a set of keytabs under /etc/security/keytabs/, including hdfs.headless.keytab and hive.service.keytab. I don't know how to use those.
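In case it's useful to others hitting the same wall: if one of those keytabs is readable by your account (often they aren't), using it looks roughly like this - the principal must be taken from what klist shows, the one below is just an example:

# List the principals stored in the keytab
klist -kt /etc/security/keytabs/hive.service.keytab

# Authenticate as one of them
kinit -kt /etc/security/keytabs/hive.service.keytab hive/host.example.com@EXAMPLE.REALM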
07-17-2018
07:44 AM
Letting it take its time, the command finally exited with the attached stack trace. sqoop.txt And I now see that I get a similar "Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient" when just trying to start hive from the same shell where I run sqoop, so there must be some setup issue. I don't have admin privileges, so I'll likely have to rely on others to fix it.
07-17-2018
07:09 AM
I think the --hive-home parameter is for the Hive installation path, not data placement. I'm currently encountering a problem where sqoop import hangs, running sqoop import -Dmapreduce.job.user.classpath.first=true --verbose --driver com.sybase.jdbc4.jdbc.SybDriver --connect "jdbc:sybase:Tds:xxxx:4200/xxx" --username sxxx --password-file file:///home/xxx/.pw --table xxx -m 1 --create-hcatalog-table --hcatalog-database sandbox --hcatalog-table my_table --hcatalog-storage-stanza "stored as avro" where the sandbox database was created through Ambari in either the default or a custom location. The source table columns are read fine, and the create DDL prepared: INFO hcat.SqoopHCatUtilities: HCatalog Create table statement: create table xxx stored as avro 18/07/17 09:00:20 INFO hcat.SqoopHCatUtilities: Executing external HCatalog CLI process with args :-f,/tmp/hcat-script-1531810820345
18/07/17 09:00:21 INFO hcat.SqoopHCatUtilities: 18/07/17 09:00:21 WARN conf.HiveConf: HiveConf of name hive.mapred.supports.subdirectories does not exist Then it hangs. If I do Ctrl-C I get 18/07/17 09:07:58 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: HCat exited with status 130
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.executeExternalHCatProgram(SqoopHCatUtilities.java:1196)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.launchHCatCli(SqoopHCatUtilities.java:1145)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.createHCatTable(SqoopHCatUtilities.java:679)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureHCat(SqoopHCatUtilities.java:342)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureImportOutputFormat(SqoopHCatUtilities.java:848)
at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:102)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:263)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:507)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.main(Sqoop.java:243) Tips on how to troubleshoot this?
07-16-2018
11:28 AM
Is there a way to use the --hcatalog options with sqoop import while maintaining control over the destination dir, like --warehouse-dir or --target-dir? These options appear to be incompatible, but perhaps there's a trick. I'd like to import and create table definitions for Hive in a single go, storing as avro or orc, while controlling the destination dir, so I'm not forced into Hive's default/managed directory. There are various reasons for not placing data in the default warehouse dir: one being that I don't currently have permissions to do so (so I get an error in sqoop import with the --hcatalog options), another being that we have a dir placement strategy that doesn't use the default dir.
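One workaround I'm considering (not yet verified) is to pre-create the table as EXTERNAL at the location we control and then drop --create-hcatalog-table, letting sqoop load into the existing definition - all names below are placeholders:

# 1. Create the external table at the desired location up front
cat > create_ext.hql <<'SQL'
CREATE EXTERNAL TABLE mydb.my_table (id INT, name STRING)
  STORED AS ORC
  LOCATION '/data/landing/mydb/my_table';
SQL
beeline -u "$HIVE_JDBC_URL" -f create_ext.hql

# 2. Import into the pre-created table; without --create-hcatalog-table,
#    sqoop writes into this table (and hence its location)
sqoop import \
  --connect "$JDBC_URL" --username "$DB_USER" --password-file "$PW_FILE" \
  --table SRC_TABLE -m 1 \
  --hcatalog-database mydb --hcatalog-table my_table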
Labels:
- Apache Sqoop
07-09-2018
10:25 AM
@Pierre Villard: "A common approach is something like GenerateTableFetch on the primary node and QueryDatabaseTable on all nodes. The first processor will generate SQL queries to fetch the data by "page" of specified size, and the second will actually get the data. This way, all nodes of your NiFi cluster can be used to get the data from the database.": Will I need to add a (local) RPG after the GenerateTableFetch to get them running in parallel? Any experience with the performance of making full RDBMS table dumps using this method vs. Sqoop?