Member since 10-24-2017
101 Posts
14 Kudos Received
4 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2170 | 07-26-2017 09:57 PM |
| | 4075 | 12-13-2016 12:08 PM |
| | 1184 | 07-28-2016 08:41 PM |
| | 4304 | 06-15-2016 07:57 AM |
12-14-2018
08:39 AM
1 Kudo
How can I set up the ListFile processor so that it excludes certain folders? For example, if we have the following directories: /root/A /root/B /root/C /root/D, how can I exclude folders C and B using the Path Filter? Thanks, Ahmad
Labels: Apache NiFi
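One possible approach to the ListFile question above is a regular expression with a negative lookahead. The sketch below is only an illustration in Python, whose lookahead syntax matches Java's for this pattern. It assumes the Path Filter is applied as a full match against the subdirectory path relative to the Input Directory (so /root/B is seen as B); that assumption should be checked against the ListFile documentation for the NiFi version in use. The names PATH_FILTER and is_scanned are hypothetical.

```python
import re

# Hypothetical Path Filter value: reject any relative path whose first
# segment is exactly B or C, and accept everything else.
PATH_FILTER = r"(?!(?:B|C)(?:/|$)).*"


def is_scanned(relative_path):
    """Return True if the relative path fully matches the filter."""
    return re.fullmatch(PATH_FILTER, relative_path) is not None


for path in ["A", "B", "C", "C/sub", "D", "Backup"]:
    print(path, is_scanned(path))
```

The `(?:/|$)` part keeps the exclusion limited to folders named exactly B or C, so a folder such as Backup is still scanned.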
05-25-2018
10:38 AM
It's still not working, unfortunately.
05-17-2018
07:27 AM
@Kiran Nittala I have thousands of flowfiles queued in other processors. Will I lose them if I restart NiFi? Thanks, Ahmad
05-16-2018
09:10 AM
I am getting this error with the FetchFile or GetFile processors:
2018-05-16 12:04:15,133 WARN [Timer-Driven Process Thread-7] o.a.n.controller.tasks.ConnectableTask Administratively Yielding ListFile[id=6804e073-0163-1000-56d7-1d884aef90a1] due to uncaught Exception: java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /mnt/win/ArchivesCD/08103DOC.CFL/P1-Especifica????es.pdf
java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /mnt/win/ArchivesCD/08103DOC.CFL/P1-Especifica????es.pdf
How can I resolve this? Does the processor abandon the entire listing, or does it skip the "malformed" file? My NiFi version is 1.6.0.
Thanks, Ahmad
Labels: Apache NiFi
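One common cause of this InvalidPathException is a mismatch between the encoding of the file names on disk and the encoding the JVM uses for paths. As a rough diagnostic, and only as a sketch under that assumption, the Python snippet below walks a directory using byte paths and reports names that cannot be decoded with a given encoding. The function names and the sample file name are hypothetical and are not taken from the post.

```python
import os


def is_decodable(name, encoding="utf-8"):
    """Return True if the raw file-name bytes decode cleanly with `encoding`."""
    try:
        name.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def undecodable_paths(root, encoding="utf-8"):
    """Yield byte paths under `root` whose names do not decode with `encoding`."""
    # Walking with a bytes root makes os.walk return raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_decodable(name, encoding):
                yield os.path.join(dirpath, name)


sample = "relat\u00f3rio.pdf"  # hypothetical file name with a non-ASCII character
print(is_decodable(sample.encode("utf-8"), "utf-8"))  # decodes cleanly
print(is_decodable(sample.encode("utf-8"), "ascii"))  # fails on the non-ASCII bytes
```

Running `undecodable_paths` with the encoding the NiFi JVM is believed to use can show which files are likely to trigger the exception.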
07-26-2017
09:57 PM
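This is the code I came up with. Is there a better approach?
// Needs the session's implicits in scope (e.g. import spark.implicits._) for .as[...] and .toDF
val ds = filteredDF.as[(Integer, String, String, String, String, Double, Integer)]
// Emit one output row per comma-separated token in the third column
val df = ds.flatMap {
  case (x1, x2, x3, x4, x5, x6, x7) => x3.split(",").map((x1, x2, _, x4, x5, x6, x7))
}.toDF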
07-26-2017
06:33 PM
I am working with Scala and have a DataFrame in which one column contains several comma-delimited values. How can I turn these rows
["1", "x,y,z,", "A"]
["2", "x,y", "B"]
into
["1", "x", "A"]
["1", "y", "A"]
["1", "z", "A"]
["2", "x", "B"]
["2", "y", "B"]
Labels: Apache Spark
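The transformation asked about above can be illustrated without Spark. The plain-Python sketch below shows the intended row-splitting logic only; the function name split_rows is hypothetical, and dropping empty tokens (such as the one left by a trailing comma) is an assumption about the desired result. In Spark itself, the built-in `split` and `explode` column functions are the usual tools for this kind of reshaping.

```python
def split_rows(rows, col, sep=","):
    """Return one output row per non-empty token found in rows[i][col]."""
    out = []
    for row in rows:
        for token in row[col].split(sep):
            token = token.strip()
            if token:  # skip empty tokens left by a trailing separator
                out.append(row[:col] + [token] + row[col + 1:])
    return out


rows = [["1", "x,y,z,", "A"], ["2", "x,y", "B"]]
for row in split_rows(rows, 1):
    print(row)
```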
05-05-2017
03:00 PM
I am getting this error when I try to save a DataFrame to a file:
17/05/05 17:19:20 ERROR DefaultWriterContainer: Job job_201705051719_0000 aborted.
Traceback (most recent call last):
File "/opt/sqlscrapper.py", line 24, in <module>
df.write.format("orc").save("/tmp/orc_query_output")
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 397, in save
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:154)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, db-hdp-dn2.darbeirut.com): java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The script:
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
import pandas as pd

conf = (SparkConf()
        .setAppName("data_import")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

# Read the table over JDBC. The driver class must match the JDBC JAR in use, and
# that JAR must be on both the driver and executor classpaths (for example via
# spark-submit --jars); otherwise executors raise ClassNotFoundException.
df = (sqlctx.read.format("jdbc")
      .options(url="jdbc:sqlserver://db-sqltech:1433;database=WebUsage;user=username;password=password",
               dbtable="EmployeeMobiles",
               driver="com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

# Write the result as ORC files, then as a Hive table.
df.write.format("orc").save("/tmp/orc_query_output")
df.write.mode("overwrite").format("orc").saveAsTable("WebLog")
Labels: Apache Hive, Apache Spark
05-05-2017
02:28 PM
Hey, I am getting the following error:
17/05/05 17:19:20 ERROR DefaultWriterContainer: Job job_201705051719_0000 aborted.
Traceback (most recent call last):
File "/opt/sqlscrapper.py", line 24, in <module>
df.write.format("orc").save("/tmp/orc_query_output")
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 397, in save
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:154)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, db-hdp-dn2.darbeirut.com): java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
04-06-2017
01:36 PM
I am trying the ListHDFS processor, but for some reason it is only retrieving around 5000 files.
04-06-2017
08:12 AM
Hello, every time I hit an error in my NiFi workflow, the GetHDFS processor recrawls the HDFS directory right from the beginning. I want to keep the files where they are in HDFS (Keep Source File = true). How can I have the GetHDFS processor continue from where it stopped? Thanks
Labels: Apache NiFi
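Resuming a crawl generally requires remembering what has already been listed. NiFi's ListHDFS processor keeps such state and is commonly paired with FetchHDFS, which leaves the source files in place. The Python sketch below is not NiFi's implementation; it only illustrates the general high-water-mark idea, with hypothetical names and made-up paths and timestamps.

```python
def list_new(entries, last_seen):
    """Return paths modified after `last_seen`, plus the updated high-water mark.

    entries: iterable of (path, modification_time) pairs.
    """
    fresh = sorted((e for e in entries if e[1] > last_seen), key=lambda e: e[1])
    new_mark = fresh[-1][1] if fresh else last_seen
    return [path for path, _ in fresh], new_mark


# Hypothetical directory listing: (path, modification time)
listing = [("/data/a", 100), ("/data/b", 200), ("/data/c", 300)]

paths, mark = list_new(listing, last_seen=0)
print(paths, mark)

# A later run with the saved mark only picks up files added since then.
listing.append(("/data/d", 400))
paths, mark = list_new(listing, last_seen=mark)
print(paths, mark)
```

One limitation of this simple version is that a file whose timestamp equals the saved mark is skipped, so a real implementation needs extra care around ties.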