Member since: 03-03-2017
Posts: 74
Kudos Received: 9
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2577 | 06-13-2018 12:02 PM
 | 4634 | 11-28-2017 10:32 AM
05-19-2017 08:17 AM
@Matt Clarke Thank you very much, your answer was very useful to me.
05-17-2017 01:12 PM
Hi, I have an ingest flow that initially ingests approximately 50 files of about 300 MB each. After the ingest I want to run some Hive commands to create various Hive tables, but I only need to do that once, not 50 times. I have been searching for some kind of trigger to start a PutHiveQL processor once, without a useful hit. How could I accomplish that?
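One pattern that might fit (my assumption, not something confirmed in this thread): let MergeContent collapse the ~50 ingested FlowFiles into a single FlowFile, then rewrite that one FlowFile's content to the HiveQL you want to run once and route it to PutHiveQL. A minimal Groovy ExecuteScript sketch of the rewrite step; the table name and DDL are placeholders:

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

// Placeholder HiveQL -- replace with the statements you want to run exactly once
def hiveQl = 'CREATE TABLE IF NOT EXISTS my_table (col1 STRING) STORED AS ORC'

// Overwrite the merged FlowFile's content so the single downstream PutHiveQL run executes it
flowFile = session.write(flowFile, { inputStream, outputStream ->
    outputStream.write(hiveQl.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)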
Labels:
- Apache NiFi
05-17-2017 07:26 AM
Hi, I am nearly there, but I am still missing something to get it completely right. My flow is built as ListHDFS -> FetchHDFS -> SplitText (on lines) -> ExecuteScript (on each line) -> MergeContent -> PutFile. When I run the flow without ExecuteScript everything looks fine; after the merge the file is identical to what it was before the split. But when I run the files through my script, data is somehow stripped off, and I cannot figure out why. These are standard UTF-8 Unicode text files with very long lines and tab-separated columns. The full script:
import org.apache.commons.io.IOUtils
import java.nio.charset.*
def flowFile = session.get()
if(!flowFile) return
def path = flowFile.getAttribute('path')
def fail = false
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def cells = recordIn.split(' ')
        def output = ""
        cells.each() { it ->
            output = output + it + "\t" // do something
        }
        output = output + path + "\n"
        outputStream.write(output.getBytes(StandardCharsets.UTF_8))
        recordOut = ''
    }
    catch(e) {
        log.error("Error during processing of validate.groovy", e)
        session.transfer(inputStream, REL_FAILURE)
        fail = true
    }
} as StreamCallback)
if (fail) {
    session.transfer(flowFile, REL_FAILURE)
    fail = false
} else {
    session.transfer(flowFile, REL_SUCCESS)
}
It seems like something is cut off in this line: outputStream.write(output.getBytes(StandardCharsets.UTF_8)). I tried changing it to outputStream.write(recordIn.getBytes(StandardCharsets.UTF_8)) with the same result.
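For what it is worth, two things in the script stand out to me (my guess, not verified here): the data is tab-separated, but recordIn.split(' ') splits on a literal space, and session.transfer(inputStream, REL_FAILURE) inside the StreamCallback hands a stream to a method that expects a FlowFile. A sketch of the write block with those two points changed, assuming one line per FlowFile after SplitText:

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    // split on the tab delimiter the file actually uses
    def cells = recordIn.split('\t')
    // re-join the cells and append the path as an extra column
    def output = cells.join('\t') + '\t' + path + '\n'
    outputStream.write(output.getBytes(StandardCharsets.UTF_8))
    // no session.transfer() inside the callback; only the streams are touched here
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)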
05-16-2017 11:36 AM
Forgive my incompetence in Groovy, but the job cuts off a lot of data. I have 147 columns in the file, so I want to iterate through the columns and add new columns at the end of the line:
def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
def cells = recordIn.split(' ')
for (column in cells) {
    recordOut += column + ' ' // do stuff on each column
    // need to detect linefeed to determine newline ?
}
// add new column
recordOut += path
outputStream.write(recordOut.getBytes(StandardCharsets.UTF_8))
recordOut = ''
I am not sure how I should understand the data from the InputStream: does it stream one character at a time, or does it stream lines? Normally I would iterate through the lines and, inside that loop, iterate through the columns (see the note after the stack trace below).
2017-05-16 11:29:34,245 ERROR [Timer-Driven Process Thread-4] o.a.nifi.processors.script.ExecuteScript
org.apache.nifi.processor.exception.ProcessException: javax.script.ScriptException: javax.script.ScriptException: groovy.lang.MissingMethodException: No signature of method: org.apache.nifi.controller.repository.StandardProcessSession.transfer() is applicable for argument types: (org.apache.nifi.controller.repository.io.FlowFileAccessInputStream, org.apache.nifi.processor.Relationship) values: [org.apache.nifi.controller.repository.io.FlowFileAccessInputStream@702d53aa, ...]
Possible solutions: transfer(java.util.Collection, org.apache.nifi.processor.Relationship), transfer(org.apache.nifi.flowfile.FlowFile, org.apache.nifi.processor.Relationship), transfer(java.util.Collection), transfer(org.apache.nifi.flowfile.FlowFile)
at org.apache.nifi.processors.script.ExecuteScript.onTrigger(ExecuteScript.java:214) ~[na:na]
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) ~[nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.0.2.1.1.0-2.jar:1.1.0.2.1.1.0-2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
Caused by: javax.script.ScriptException: javax.script.ScriptException: groovy.lang.MissingMethodException: No signature of method: org.apache.nifi.controller.repository.StandardProces
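The stack trace itself names the problem: session.transfer() is being handed the InputStream from inside the StreamCallback, but its signatures only accept a FlowFile (or a collection of FlowFiles). Inside the callback only the streams should be read and written; the FlowFile is transferred afterwards. As for how the content arrives: the callback gets one InputStream over the whole FlowFile content, and you choose how to consume it, for example line by line. A minimal sketch:

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return
flowFile = session.write(flowFile, { inputStream, outputStream ->
    // iterate over the content line by line rather than as one big string
    inputStream.eachLine('UTF-8') { line ->
        outputStream.write((line + '\n').getBytes(StandardCharsets.UTF_8))
    }
} as StreamCallback)
// transfer the FlowFile itself, never the stream
session.transfer(flowFile, REL_SUCCESS)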
05-15-2017 12:23 PM
Hi, I am running a NiFi flow which ingests tab-separated text files into HDFS. I would like to add columns to each file with FlowFile attributes like ${path}, ${filename} and so on, and to test every column for characters that I don't want to ingest. Not having experience with Groovy, I wondered whether this could be achieved with a Groovy script. I have been playing around without any good result so far, so a little help to guide me in the right direction would be nice (see the sketch after the script below).
import java.nio.charset.StandardCharsets
def flowFile = session.get()
if(!flowFile) return
flowFile = session.write(flowFile, { inputStream, outputStream ->
    inputStream.eachLine { line ->
        a = line.tokenize('\t')
        a = a + ' flowfile' // trying to add a column
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
session.commit()
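A sketch of the direction I would expect this to take (untested, my assumption): the tokenized line is never written back to the outputStream, so the content is left unchanged, and the attribute values need to be read from the FlowFile before the callback runs:

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

// read the attributes to append as extra columns before entering the callback
def path = flowFile.getAttribute('path')
def filename = flowFile.getAttribute('filename')

flowFile = session.write(flowFile, { inputStream, outputStream ->
    inputStream.eachLine('UTF-8') { line ->
        def cells = line.tokenize('\t')
        // placeholder check: strip characters we do not want to ingest
        cells = cells.collect { it.replaceAll(/[\r\n]/, '') }
        def out = (cells + [path, filename]).join('\t') + '\n'
        outputStream.write(out.getBytes(StandardCharsets.UTF_8))
    }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)
// an explicit session.commit() should not be needed here; ExecuteScript commits the session itself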
Labels:
- Apache NiFi
05-15-2017 06:24 AM
Thank you very much Matt
05-04-2017 11:58 PM
Hi, I am trying to import data from a Sybase ASE table into Hive. When running the following command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --driver net.sourceforge.jtds.jdbc.Driver --connect jdbc:jtds:sybase://11.9.179.5:4000/BAS_csrp --username user --password xxxxxx --table csrp_total_bas --split-by cpr_nr --hcatalog-database default --hcatalog-table csrp_total_baseline --hcatalog-home /tmp/simon/std/CPRUPD/BASELINE --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"
I get this error:
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/05/04 17:20:12 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
17/05/04 17:20:12 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/05/04 17:20:12 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
17/05/04 17:20:12 INFO manager.SqlManager: Using default fetchSize of 1000
17/05/04 17:20:12 INFO tool.CodeGenTool: Beginning code generation
17/05/04 17:22:19 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: Network error IOException: Connection timed out
java.sql.SQLException: Network error IOException: Connection timed out
at net.sourceforge.jtds.jdbc.JtdsConnection.<init>(JtdsConnection.java:436)
at net.sourceforge.jtds.jdbc.Driver.connect(Driver.java:184)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:904)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:763)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:786)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:289)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:260)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:246)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:328)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1853)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1653)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:488)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at net.sourceforge.jtds.jdbc.SharedSocket.createSocketForJDBC3(SharedSocket.java:288)
at net.sourceforge.jtds.jdbc.SharedSocket.<init>(SharedSocket.java:251)
at net.sourceforge.jtds.jdbc.JtdsConnection.<init>(JtdsConnection.java:331)
... 22 more
17/05/04 17:22:19 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1659)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:488)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
When running Sqoop without HCatalog I get some warnings, but it runs, against exactly the same database and table:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --driver net.sourceforge.jtds.jdbc.Driver --connect jdbc:jtds:sybase://10.9.179.5:4000/BAS_csrp --username user --password xxxxxx --table csrp_total_bas --split-by cpr_nr --target-dir /tmp/simon/std/CPRUPD/BASELINE_file/
This is the output for the Sqoop job which works:
Warning: /usr/hdp/2.5.0.0-1245/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/05/04 17:14:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.5.0.0-1245
17/05/04 17:14:47 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/05/04 17:14:47 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
17/05/04 17:14:47 INFO manager.SqlManager: Using default fetchSize of 1000
17/05/04 17:14:47 INFO tool.CodeGenTool: Beginning code generation
17/05/04 17:14:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM csrp_total_bas AS t WHERE 1=0
17/05/04 17:14:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM csrp_total_bas AS t WHERE 1=0
17/05/04 17:14:47 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.5.0.0-1245/hadoop-mapreduce
Note: /tmp/sqoop-w20960/compile/ccd3eec0b22d23dc5103e861bf57e345/csrp_total_bas.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/05/04 17:14:49 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-w20960/compile/ccd3eec0b22d23dc5103e861bf57e345/csrp_total_bas.jar
17/05/04 17:14:49 INFO mapreduce.ImportJobBase: Beginning import of csrp_total_bas
17/05/04 17:14:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM csrp_total_bas AS t WHERE 1=0
17/05/04 17:14:50 INFO impl.TimelineClientImpl: Timeline service address: http://sktudv01hdp02.ccta.dk:8188/ws/v1/timeline/
17/05/04 17:14:50 INFO client.AHSProxy: Connecting to Application History server at sktudv01hdp02.ccta.dk/172.20.242.53:10200
17/05/04 17:14:51 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 8899 for w20960 on ha-hdfs:hdpudv01
17/05/04 17:14:51 INFO security.TokenCache: Got dt for hdfs://hdpudv01; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hdpudv01, Ident: (HDFS_DELEGATION_TOKEN token 8899 for w20960)
17/05/04 17:14:51 WARN token.Token: Cannot find class for token kind kms-dt
17/05/04 17:14:51 INFO security.TokenCache: Got dt for hdfs://hdpudv01; Kind: kms-dt, Service: 172.20.242.53:9292, Ident: 00 06 57 32 30 39 36 30 04 79 61 72 6e 00 8a 01 5b d4 07 2a 3d 8a 01 5b f8 13 ae 3d 3b 09
17/05/04 17:14:51 WARN ipc.Client: Failed to connect to server: sktudv01hdp02.ccta.dk/172.20.242.53:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
at org.apache.hadoop.ipc.Client.call(Client.java:1449)
at org.apache.hadoop.ipc.Client.call(Client.java:1396)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy22.getNewApplication(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:221)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
at com.sun.proxy.$Proxy23.getNewApplication(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:225)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:233)
at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:188)
at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:231)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:153)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:200)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:173)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:270)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:507)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
17/05/04 17:14:51 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
17/05/04 17:14:51 WARN token.Token: Cannot find class for token kind kms-dt
17/05/04 17:14:51 INFO security.TokenCache: Got dt for hdfs://hdpudv01; Kind: kms-dt, Service: 172.20.242.54:9292, Ident: 00 06 57 32 30 39 36 30 04 79 61 72 6e 00 8a 01 5b d4 07 2a c5 8a 01 5b f8 13 ae c5 8e 04 f8 2b
17/05/04 17:14:52 INFO db.DBInputFormat: Using read commited transaction isolation
17/05/04 17:14:52 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(cpr_nr), MAX(cpr_nr) FROM csrp_total_bas
17/05/04 17:17:30 WARN db.TextSplitter: Generating splits for a textual index column.
17/05/04 17:17:30 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
17/05/04 17:17:30 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
17/05/04 17:17:30 INFO mapreduce.JobSubmitter: number of splits:6
17/05/04 17:17:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491967685506_0645
17/05/04 17:17:31 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hdpudv01, Ident: (HDFS_DELEGATION_TOKEN token 8899 for w20960)
17/05/04 17:17:31 WARN token.Token: Cannot find class for token kind kms-dt
17/05/04 17:17:31 WARN token.Token: Cannot find class for token kind kms-dt
Kind: kms-dt, Service: 172.20.242.53:9292, Ident: 00 06 57 32 30 39 36 30 04 79 61 72 6e 00 8a 01 5b d4 07 2a 3d 8a 01 5b f8 13 ae 3d 3b 09
17/05/04 17:17:31 WARN token.Token: Cannot find class for token kind kms-dt
17/05/04 17:17:31 WARN token.Token: Cannot find class for token kind kms-dt
Kind: kms-dt, Service: 172.20.242.54:9292, Ident: 00 06 57 32 30 39 36 30 04 79 61 72 6e 00 8a 01 5b d4 07 2a c5 8a 01 5b f8 13 ae c5 8e 04 f8 2b
17/05/04 17:17:32 INFO impl.YarnClientImpl: Submitted application application_1491967685506_0645
17/05/04 17:17:32 INFO mapreduce.Job: The url to track the job: http://sktudv01hdp01.ccta.dk:8088/proxy/application_1491967685506_0645/
17/05/04 17:17:32 INFO mapreduce.Job: Running job: job_1491967685506_0645
17/05/04 17:17:39 INFO mapreduce.Job: Job job_1491967685506_0645 running in uber mode : false
17/05/04 17:17:39 INFO mapreduce.Job: map 0% reduce 0%
17/05/04 17:20:47 INFO mapreduce.Job: map 33% reduce 0%
17/05/04 17:20:48 INFO mapreduce.Job: map 83% reduce 0%
17/05/04 17:36:50 INFO mapreduce.Job: map 100% reduce 0%
17/05/04 17:36:50 INFO mapreduce.Job: Job job_1491967685506_0645 completed successfully
17/05/04 17:36:51 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1037068
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=774
HDFS: Number of bytes written=47578984831
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
Job Counters
Launched map tasks=6
Other local map tasks=6
Total time spent by all maps in occupied slots (ms)=2083062
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2083062
Total vcore-milliseconds taken by all map tasks=2083062
Total megabyte-milliseconds taken by all map tasks=4266110976
Map-Reduce Framework
Map input records=46527615
Map output records=46527615
Input split bytes=774
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=14955
CPU time spent (ms)=1281100
Physical memory (bytes) snapshot=1798893568
Virtual memory (bytes) snapshot=22238547968
Total committed heap usage (bytes)=1817706496
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=47578984831
17/05/04 17:36:51 INFO mapreduce.ImportJobBase: Transferred 44.3114 GB in 1,320.7926 seconds (34.3543 MB/sec)
17/05/04 17:36:51 INFO mapreduce.ImportJobBase: Retrieved 46527615 records.
I really don't understand it. I guess it might be some configuration issue in Hive or in Sqoop, but I am just guessing.
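One detail worth double-checking before digging into Hive or Sqoop configuration (an observation from the two commands above, not a confirmed diagnosis): the failing HCatalog command connects to jdbc:jtds:sybase://11.9.179.5:4000/BAS_csrp, while the working one connects to 10.9.179.5, and the failure is a plain network "Connection timed out" raised before any HCatalog code is involved. If the 11.x address is a typo, the HCatalog variant with the working host (and -P instead of --password, as the warning suggests) would look like:

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
  --driver net.sourceforge.jtds.jdbc.Driver \
  --connect jdbc:jtds:sybase://10.9.179.5:4000/BAS_csrp \
  --username user -P \
  --table csrp_total_bas --split-by cpr_nr \
  --hcatalog-database default --hcatalog-table csrp_total_baseline \
  --hcatalog-home /tmp/simon/std/CPRUPD/BASELINE \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"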
Labels:
- Apache Sqoop
05-03-2017 11:36 AM
Hi, I have got this problem: I need to do some cleaning within the columns of my FlowFile. The columns are separated by tabs, and sometimes there can be \n within a column, so I cannot use splitlines to access the data. I therefore tried to use the csv library to read the InputStream, but I get an error (see the note after the script below):
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
Caused by: javax.script.ScriptException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value in <script> at line number 21
Here is the script which I am running in the ExecuteScript body:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import csv
# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        newText = ''
        Text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        reader = csv.reader(Text, delimiter=' ')
        for row in reader:
            newText += (' '.join(row)).rstrip('\n\r')
        outputStream.write(newText.encode('utf-8'))
# end class
flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end
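A guess at the cause, not verified here: Jython's Python 2 csv module works on byte strings, while IOUtils.toString returns a unicode string, which is what triggers "Cannot create PyString with non-byte value"; also, csv.reader is given the whole text as a single string, so it iterates characters rather than rows. Since the goal is just to remove stray line breaks inside tab-separated columns, one alternative is to do the cleanup in Groovy in ExecuteScript, which sidesteps the Jython string issue. A sketch that assumes real records end with \r\n and the stray breaks inside columns are bare \n (an assumption; adjust to the real data):

import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    // protect the record terminators, drop the stray in-column breaks, then restore the terminators
    def cleaned = text.replace('\r\n', '\u0000').replace('\n', ' ').replace('\u0000', '\n')
    outputStream.write(cleaned.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)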
Labels:
- Apache NiFi
04-27-2017 02:06 PM
Hello, I want to do some processing on my data while ingesting, so I created a little Python script for that. I am ingesting CSV files and want to remove carriage returns, linefeeds etc. within the columns, and also add a new column with a timestamp. My script looks like this:
import csv
import sys
flowFile = session.get()
with open(flowFile, 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        for col in row:
            col.rstrip('\n\r ')
        print ', '.join(row)
Right now I get this error when running it:
2017-04-27 12:40:13,463 ERROR [Timer-Driven Process Thread-8] o.a.nifi.processors.script.ExecuteScript ExecuteScript[id=af58777c-015b-1000-ffff-ffff972978f8] Failed to process session due to org.apache.nifi.processor.exception.ProcessException: javax.script.ScriptException: NameError: name 'flowfile' is not defined in <script> at line number 3: org.apache.nifi.processor.exception.ProcessException: javax.script.ScriptException: NameError: name 'flowfile' is not defined in <script> at line number 3
What should I add to this script in order to output the manipulated FlowFiles to the next processor? Do I need to rewrite them with something like session.write(flowFile)? I am not a Python programmer, so please be kind and supply a small example. Thank you.
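In case it helps, a sketch of the skeleton ExecuteScript expects (an illustration, not a tested answer). A FlowFile is not a filesystem path, so open(flowFile, 'rb') will not work; the content is rewritten inside session.write() and the FlowFile is then transferred. The NameError also suggests the script that actually ran referenced a lowercase flowfile, while the variable bound by session.get() is flowFile; the name is case-sensitive. Written in Groovy here; the timestamp column is illustrative:

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

def stamp = String.valueOf(System.currentTimeMillis())

flowFile = session.write(flowFile, { inputStream, outputStream ->
    inputStream.eachLine('UTF-8') { line ->
        // strip stray CR/LF characters and append a timestamp column
        def cleaned = line.replaceAll(/[\r\n]/, '')
        outputStream.write((cleaned + '\t' + stamp + '\n').getBytes(StandardCharsets.UTF_8))
    }
} as StreamCallback)

// route the rewritten FlowFile to the next processor
session.transfer(flowFile, REL_SUCCESS)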
Labels:
- Apache NiFi
04-27-2017 12:59 PM
Thank you very much