Member since: 02-07-2017
Posts: 23
Kudos Received: 2
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 14557 | 01-23-2018 03:02 PM |
03-30-2018
10:34 AM
Can you please share the complete log? A few lines of the log don't help much.
02-15-2018
04:22 PM
Share the error details please.
02-15-2018
04:02 PM
Can you attach sample input data and the NiFi flow (export it as XML)? That would help in understanding what you are doing and deciding where things could be changed.
02-15-2018
03:57 PM
Why don't you use the MergeContent processor to concatenate the flow-file content?
02-04-2018
05:31 AM
Please attach your nifi-app.log from when the issue happens. That would give us a bit more of the necessary detail.
02-04-2018
05:18 AM
@Matt Krueger Your table is ACID, i.e. transaction-enabled. Spark doesn't support reading Hive ACID tables. Take a look at SPARK-15348 and SPARK-16996.
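If you want to double-check, the table properties will show it (the table name below is just an example):
DESCRIBE FORMATTED my_acid_table;
-- an ACID table lists transactional=true under Table Parameters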
02-04-2018
04:26 AM
I believe when you say "on Azure", you actually mean you are using the Hive shell (or Beeline) from a node of an Azure HDInsight cluster. In that case, add the following line at the beginning of your .hql script: set hive.cli.print.header=true;
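A minimal sketch of how the script could start (the query and table name are just examples):
set hive.cli.print.header=true;
SELECT * FROM my_table LIMIT 10;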
02-04-2018
04:11 AM
Could you share a bit more about your issue, like the scenario or the process you are following? You would get more responses that way. 🙂
01-28-2018
03:41 AM
Take a look at this guide: https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-Loadingfilesintotables
You should try either
INSERT INTO TABLE ${hiveconf:inputtable} SELECT * FROM datafactory7 LIMIT 14;
or
LOAD DATA INPATH '<HDFS PATH WHERE FILES LOCATED>' INTO TABLE ${hiveconf:inputtable};
Note that the table name should not be wrapped in single quotes; ${hiveconf:inputtable} is substituted as an identifier, not a string literal.
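For reference, a sketch of how the variable would be passed in (the script and table names are my examples):
hive --hiveconf inputtable=my_table -f load_into_table.hql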
01-27-2018
10:34 AM
I could see the following in the error log:
ERROR: org.apache.hadoop.security.authorize.AuthorizationException: User: livy is not allowed to impersonate admin
Looks like your Hadoop cluster is a secure one. You need to grant livy the ability to impersonate the originating user by adding two proxy-user properties to core-site.xml. Take a look at this guide.
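Roughly, the two properties look like this ('*' is just a permissive example value; restrict the hosts and groups to what your cluster actually needs), and you will need to restart the affected services afterwards:
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>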
01-27-2018
10:28 AM
How different is it from using livy to do the same?
01-27-2018
10:22 AM
Did you take a look at the log you shared? It says it "cannot resolve" some columns:
LogType:stdout
Log Upload Time:Fri Jan 26 04:38:43 -0500 2018
LogLength:1195
Log Contents:
Traceback (most recent call last):
File "Essbaselog.py", line 57, in <module>
dfinal=sqlContext.sql("Select metaDataTemp.id,errorDataTemp.ErrorNum,errorDataTemp.ExecutionTimeStamp,errorDataTemp.ExecutionDate,errorDataTemp.ExecutionTime,errorDataTemp.Message,date_format(current_date(), 'd/M/y') from metaDataTemp Inner Join errorDataTemp on 1=1 where errorDataTemp.ErrorNum BETWEEN metaDataTemp.ErrorStart and metaDataTemp.ErrorEnd")
File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/pyspark.zip/pyspark/sql/context.py", line 583, in sql
File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve 'metaDataTemp.ErrorStart' given input columns ExecutionTime, ExecutionTimeStamp, Message, ErrorNum, id, ExecutionDate, ExecutionEpoch;"
End of LogType:stdout
01-27-2018
10:11 AM
A small correction: it was introduced in Ranger 0.7, and the policies should look like this:
// HDFS
resource: path=/home/{USER}
user: {USER}
// Hive
resource: database=db_{USER}; table=*; column=*
user: {USER}
Here {USER} is substituted with the user id of the currently logged-in user.
01-24-2018
07:08 AM
Spark by default looks for files in HDFS, but if for some reason you want to load a file from the local filesystem, you need to prepend "file://" to the file path. So your code would be:
Dataset<Row> jsonTest = spark.read().json("file:///tmp/testJSON.json");
However, this becomes a problem when you submit in cluster mode, since the job executes on the worker nodes: every worker node would be expected to have that file at that exact path, so it will fail. To get around this, you can pass the file path in the --files parameter of spark-submit, which ships the file to the working directory of each container, so you can refer to it by the file name alone. For example, if you submit the following way:
> spark-submit --master <your_master> --files /tmp/testJSON.json --deploy-mode cluster --class <main_class> <application_jar>
then you can simply read the file like this:
Dataset<Row> jsonTest = spark.read().json("testJSON.json");
01-23-2018
03:02 PM
1 Kudo
Cloudera has a nice two-part tuning guide. Attaching the links:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
01-23-2018
02:28 PM
Just google "Install Python 3 in Linux".
01-23-2018
02:14 PM
Looks like you have some security/access policies set up, e.g. with Ranger. Create or modify a policy to grant the WRITE permission.
01-23-2018
02:13 PM
Apart from the JVM limits, which you can increase, there are no definite limitations on the size or number of flowfile records as such. I would say design your flow, and if you feel you're being throttled, follow some good design practices to tweak it. Take a look at this: https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
01-23-2018
01:49 PM
The attached logs don't have the error stack trace. Attach the correct nifi-app.log file; as for the port info, you can get that from Ambari.
01-23-2018
01:03 PM
As @Matt Burgess said, when you use the ReplaceText processor the way he described, your flowfile content would be changed from
[flowfile name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"][end of file]
to
name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"
You can then connect the success relationship from ReplaceText to ExtractText and use (.*) as the regular expression to capture the content and assign it to an attribute. You can skip that extra step if all you are trying to do is extract the content into an attribute and then use ReplaceText to write the attribute back to the flowfile content, since the ReplaceText in the approach above already leaves the desired content in the flowfile.
04-05-2017
03:36 PM
1 Kudo
I'm looking for a way to transfer flowfiles from one process group to another process group's input port using NiFi's REST API. The REST API doc is kinda vague; I assume the "Data Transfer" section is the one I should be interested in. I was going through it but couldn't get it to work. I wanted to try this endpoint:
http://localhost/nifi-api/data-transfer/input-ports/{portId}/transactions/{transactionId}/flow-files
I know I can get the portId from the NiFi UI, but what about the transactionId?
Note: I thought we might have to create a transaction ourselves using "/data-transfer/{portType}/{portId}/transactions", so I tried
http://localhost/nifi-api/data-transfer/input-ports/xxxxx-xx-xxx-xx/transactions
but it says "The requested port with id "xxxxxx" is not found in root level". My port is not at root level but inside a process group, so how should I specify it here?
Labels:
- Apache NiFi
02-07-2017
06:26 AM
I went with the LOAD DATA command because the INSERT command takes time. Especially when we are talking about thousands or tens of thousands of records coming in, it can take hours to insert them. LOAD is simple and fast since it just copies the files from HDFS into Hive's warehouse directory.
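For reference, this is roughly the kind of statement my ReplaceText step builds (the path, table, and partition column are placeholders):
LOAD DATA INPATH '/landing/avro/split-0001.avro' INTO TABLE my_table PARTITION (load_date='2017-02-07');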
02-07-2017
05:06 AM
I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file, then I use SplitText to split the incoming flow-file into multiple flow-files (split record by record). Then I use ConvertToAvro to convert each split CSV flow-file into an Avro file. After that, I put the Avro files into a directory in HDFS and trigger the "LOAD DATA" command using the ReplaceText + PutHiveQL processors.
I'm splitting the file record by record in order to get the partition value (since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count: 1, header line count: 1) --- ExtractText (use regex to get the partition fields' values and assign them to attributes) --- ConvertToAvro (specifying the schema) --- PutHDFS (writing to an HDFS location) --- ReplaceText (LOAD DATA command with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file one record at a time, it generates too many Avro files. For example, if the CSV file has 100 records, it creates 100 Avro files. Since I want to get the partition values, I have to split them one record at a time. I want to know whether there is any way to achieve this without splitting record by record, i.e. by batching it. I'm quite new to this, so I'm unable to crack it yet. Help me with this.
PS: Do suggest an alternate approach to achieving this use case if there is one.
Labels:
- Apache Hadoop
- Apache NiFi