Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 11171 | 04-15-2020 05:01 PM |
|  | 7073 | 10-15-2019 08:12 PM |
|  | 3081 | 10-12-2019 08:29 PM |
|  | 11349 | 09-21-2019 10:04 AM |
|  | 4251 | 09-19-2019 07:11 AM |
12-17-2017
10:06 PM
@deepak rathod
The Hadoop YARN Cluster Applications API supports filtering for jobs that failed during the last 24 hours.
* http://<rm http address:port>/ws/v1/cluster/apps
**Please refer to the attached .txt file for the commands, because the Community web page replaces some characters with odd symbols.**
Example: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778&states=FAILED" (see Query 1 in the attached txt file).
Use the above command as a reference, replacing the address and port with those of your Resource Manager.
Explanation:
The limit=10 parameter caps the results at 10 applications, the started and finished times each have begin and end parameters that let you specify ranges, and states=FAILED restricts the results to failed jobs.
Start time (startedTimeBegin): GMT Monday, November 13, 2017 12:35:13.778 AM
End time (startedTimeEnd): GMT Sunday, December 17, 2017 5:55:13.778 PM
The whole REST API call returns the first 10 applications (limit=10) whose state is FAILED and that started in the period Nov 13 12:35:13.778 AM - Dec 17 5:55:13.778 PM.
You can change the begin and end times according to your requirements.
Note: startedTimeBegin and startedTimeEnd are specified in milliseconds since the epoch.
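As a minimal shell sketch of the "failed jobs in the last 24 hours" query (resourcemanager.example.com:8088 is a placeholder, and GNU date is assumed for millisecond timestamps):
# end of the window = now, in milliseconds since the epoch (GNU date)
END=$(date +%s%3N)
# begin of the window = 24 hours earlier
BEGIN=$(( END - 24 * 60 * 60 * 1000 ))
# fetch up to 10 FAILED applications that started inside the window
curl -s "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin=${BEGIN}&startedTimeEnd=${END}&states=FAILED"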
2. Query to get all the apps with states FINISHED or KILLED for a specific user and time period: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778" (see Query 2 in the attached txt file).
Below is the list of application states allowed in the query; you can use one or more of them:
NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
In addition, the following query parameters are supported:
* states - applications matching the given application states, specified as a comma-separated list.
* finalStatus - the final status of the application - reported by the application itself
* user - user name
* queue - queue name
* limit - total number of app objects to be returned
* startedTimeBegin - applications with start time beginning with this time, specified in ms since epoch
* startedTimeEnd - applications with start time ending with this time, specified in ms since epoch
* finishedTimeBegin - applications with finish time beginning with this time, specified in ms since epoch
* finishedTimeEnd - applications with finish time ending with this time, specified in ms since epoch
* applicationTypes - applications matching the given application types, specified as a comma-separated list.
* applicationTags - applications matching any of the given application tags, specified as a comma-separated list.
* deSelects - generic fields to be skipped in the result.
To get failed jobs for a specific user and time period: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778&states=FAILED&user=<user-id>" (see Query 3 in the attached txt file).
To get finished jobs for a specific user:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&user=<user-id>"
(see Query 4 in the attached txt file).
To get finished jobs based on the application type. The query below returns finished jobs of application type tez and limits the results to 1:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=tez"
(see Query 5 in the attached txt file).
To get finished jobs of application type spark:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=spark"
(see Query 6 in the attached txt file).
You can use any of the above parameters, alone or in combination, in your REST API queries. cluster-applications-api.txt
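If jq is installed on the host (an assumption; it is not part of Hadoop), the JSON response can be post-processed on the command line, for example to list only the id and name of each failed application (the Resource Manager address is a placeholder):
# list application id and name of FAILED apps, tab-separated
curl -s "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?states=FAILED" | jq -r '.apps.app[] | "\(.id)\t\(.name)"'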
For Reference
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
12-17-2017
04:00 AM
@Jonathan Bell We also faced these dropping-connection issues; once we added a validation query to the connection pool the issue was resolved, but we haven't tried that with Oracle. Consider opening a JIRA about the loss of connection to Oracle.
As a workaround, you can use REST API commands to stop the PutSQL processor and disable the controller service, and, after that process has completed, enable the DBCP Connection Pool again and start the PutSQL processor. I tried it with the example below: I have a DBCPConnectionPool that is referenced by a PutSQL processor, so when you are pushing records to Oracle, schedule another script that performs the following steps.
1. Stop the PutSQL processor. The REST API command to stop the PutSQL processor is as follows:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"component": {"state": "STOPPED","id": "61fe0748-0160-1000-3bda-82285b30a012"},"revision": {"version": 1,"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382"}}' http://localhost:8080/nifi-api/processors/61fe0748-0160-1000-3bda-82285b30a012
Explanation: we use the PUT HTTP method in curl; my PutSQL processor id is 61fe0748-0160-1000-3bda-82285b30a012, and since I need to stop the processor I used STOPPED as the state. If you want to start the processor, use RUNNING as the state. To find the clientId and version number, open the developer tools in your browser (Chrome, Firefox, etc.), perform any action (start, stop, etc.) in the NiFi UI, and look at the calls made for the processor id. For reference, take a look at the screenshot below: 1. Click on Network. 2. Filter by your processor id. 3. Click on Response, where you can find the clientId and version. Once you have all the values, prepare your curl command to stop the PutSQL processor.
2. Start the PutSQL processor. Just change the state to RUNNING and the processor will go from the Stopped state to the Running state:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"component": {"state": "RUNNING","id": "61fe0748-0160-1000-3bda-82285b30a012"},"revision": {"version": 1,"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382"}}' http://localhost:8080/nifi-api/processors/61fe0748-0160-1000-3bda-82285b30a012
3. Stop (disable) the DBCPConnectionPool Controller Service. To disable the connection pool, set the state element to DISABLED:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"revision":{"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382","version":1},"component":{"id":"61fc97d3-0160-1000-49e3-46201fe71092","state":"DISABLED"}}' http://localhost:8080/nifi-api/controller-services/61fc97d3-0160-1000-49e3-46201fe71092
Explanation: the DBCPConnectionPool Controller Service id is 61fc97d3-0160-1000-49e3-46201fe71092, and we set the state to DISABLED, i.e. we are disabling the service.
4. Start (enable) the DBCPConnectionPool Controller Service. Just change the state to ENABLED and the controller service will be enabled:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"revision":{"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382","version":1},"component":{"id":"61fc97d3-0160-1000-49e3-46201fe71092","state":"ENABLED"}}' http://localhost:8080/nifi-api/controller-services/61fc97d3-0160-1000-49e3-46201fe71092
To find the version number, use the same developer-tools method mentioned above.
In addition, if you want to stop and start Process Groups, use the curl commands below.
To stop a Process Group (my Process Group id is 5d325978-0160-1000-f734-cbbf324b3ec3; the state needs to be STOPPED):
curl -i -X PUT -H 'Content-Type: application/json' -d '{"id": "5d325978-0160-1000-f734-cbbf324b3ec3","state": "STOPPED"}' http://localhost:8080/nifi-api/flow/process-groups/5d325978-0160-1000-f734-cbbf324b3ec3
To start a Process Group, change the state to RUNNING:
curl -i -X PUT -H 'Content-Type: application/json' -d '{"id": "5d325978-0160-1000-f734-cbbf324b3ec3","state": "RUNNING"}' http://localhost:8080/nifi-api/flow/process-groups/5d325978-0160-1000-f734-cbbf324b3ec3
Using these REST API commands, prepare a script that triggers before the PutSQL processor runs. Script: 1. Stop the PutSQL processor. 2. Disable the DBCP Connection Pool. 3. Enable the DBCP Connection Pool. 4. Start the PutSQL processor. restapi-cmnds.txt
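Putting those four calls together, here is a minimal shell sketch of such a script (the ids, clientId and revision versions are the example values from this post; in a real script you would look up the current revision of each component before every call, since NiFi increments it on each change):
#!/bin/bash
# Sketch only: replace the ids, clientId and revision versions with your own values.
NIFI=http://localhost:8080/nifi-api
PUTSQL_ID=61fe0748-0160-1000-3bda-82285b30a012
DBCP_ID=61fc97d3-0160-1000-49e3-46201fe71092
CLIENT_ID=6082f82f-0160-1000-c7de-88e9e0df0382
# 1. Stop the PutSQL processor
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"component\":{\"state\":\"STOPPED\",\"id\":\"$PUTSQL_ID\"},\"revision\":{\"version\":1,\"clientId\":\"$CLIENT_ID\"}}" "$NIFI/processors/$PUTSQL_ID"
# 2. Disable the DBCP Connection Pool controller service
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"revision\":{\"clientId\":\"$CLIENT_ID\",\"version\":1},\"component\":{\"id\":\"$DBCP_ID\",\"state\":\"DISABLED\"}}" "$NIFI/controller-services/$DBCP_ID"
# ... whatever needs to happen while the pool is disabled goes here ...
# 3. Enable the DBCP Connection Pool controller service (revision has advanced)
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"revision\":{\"clientId\":\"$CLIENT_ID\",\"version\":2},\"component\":{\"id\":\"$DBCP_ID\",\"state\":\"ENABLED\"}}" "$NIFI/controller-services/$DBCP_ID"
# 4. Start the PutSQL processor (revision has advanced)
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"component\":{\"state\":\"RUNNING\",\"id\":\"$PUTSQL_ID\"},\"revision\":{\"version\":2,\"clientId\":\"$CLIENT_ID\"}}" "$NIFI/processors/$PUTSQL_ID"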
12-15-2017
04:35 AM
6 Kudos
@zkfs Block Size: the physical unit in which the data is stored; the default HDFS block size is 128 MB, and we can configure it as per our requirements. All blocks of a file are of the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system: in HDFS each file is divided into blocks based on the configured block size, and Hadoop distributes those blocks across the cluster.
The main aim of splitting a file and storing the pieces across the cluster is to get more parallelism. The replication factor is what provides fault tolerance, but it also helps run your map tasks close to the data, avoiding extra load on the network.
Input Split: a logical representation of a block, which may also be larger or smaller than a block. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, only a reference to the data. During MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each InputSplit to an individual mapper for processing. The split acts as a broker between the block and the mapper.
Let's say we have a 1.2 GB file divided into 10 blocks, i.e. each block is roughly 128 MB. InputFormat.getSplits() is responsible for generating the input splits, and each split is used as the input for one mapper. By default, this class creates one input split per HDFS block.
If no split size is specified and the start and end positions of the records fall within the same block, the HDFS block size becomes the split size, so 10 mappers are initialized to load the file, each mapper loading one block.
If the start and end positions of a record are not in the same block, that is exactly the problem input splits solve: the input split provides the start and end positions (offsets) of the records, so that each split delivers complete records as key/value pairs to its mapper, and the mapper then loads the block data according to those start and end offsets.
If the input is made non-splittable, the whole file forms one input split and is processed by a single map task, which takes more time when the file is big.
If your resources are limited and you want to limit the number of maps, you can set the split size to 256 MB; a logical grouping of 256 MB is then formed, and only 5 maps are executed, each with a size of 256 MB.
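One way this could be done from the command line for a streaming job is sketched below (not from the original post; the jar path and input/output paths are placeholders, and the property names assume Hadoop 2.x):
# request ~256 MB input splits; values are in bytes, paths are placeholders
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapreduce.input.fileinputformat.split.minsize=268435456 \
-Dmapreduce.input.fileinputformat.split.maxsize=268435456 \
-input "<path-to-1.2GB-input>" \
-output "<path-to-output-directory>" \
-mapper cat \
-reducer cat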
12-14-2017
03:58 PM
2 Kudos
@balalaika For that case you need to set the Demarcator property to a newline (press Shift+Enter in the property editor). Configs: for a MergeContent reference, see https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html
12-14-2017
02:46 PM
3 Kudos
@balalaika I suspect the duplicates come from the ReplaceText processor: you have configured the Evaluation Mode as Line-by-Line. That means that if your JSON has more than one line, the ReplaceText processor replaces every line with the Replacement Value ${attribute1}${attribute2}${attribute3}.
Example input:
{
"features": [{
"feature": {
"paths": [[[214985.27600000054,
427573.33100000024],
[215011.98900000006,
427568.84200000018],
[215035.35300000012,
427565.00499999896],
[215128.48900000006,
427549.4290000014],
[215134.43699999899,
427548.65599999949],
[215150.86800000072,
427546.87900000066],
[215179.33199999854,
427544.19799999893]]]
},
"attributes": {
"attribute1": "value",
"attribute2": "value",
"attribute3": "value",
"attribute4": "value"
}
}]
}
In this input JSON message we have 27 lines, and my EvaluateJsonPath configs are the same as you mentioned in the comments. ReplaceText configs and output: as output we got 27 lines, because the evaluation mode is Line-by-Line. If you change the Evaluation Mode to Entire text, the output contains the replacement only once. And if your JSON message is on one line, i.e. {"features":[{"feature":{"paths":[[[214985.27600000054,427573.33100000024],[215011.98900000006,427568.84200000018],[215035.35300000012,427565.00499999896],[215128.48900000006,427549.4290000014],[215134.43699999899,427548.65599999949],[215150.86800000072,427546.87900000066],[215179.33199999854,427544.19799999893]]]},"attributes":{"attribute1":"value","attribute2":"value","attribute3":"value","attribute4":"value"}}]}, then it doesn't matter whether the ReplaceText evaluation mode is Line-by-Line or Entire text, because the processor receives just one line as input and produces a single replacement as the result. Try changing the configs to match your input JSON message and run the processor again. Let us know if the processor still produces duplicate data.
12-13-2017
05:05 PM
1 Kudo
@sunil kumar We can handle your case in a few different ways. Before the PutElasticsearch processor, we need to make the contents of the flowfile not contain the city field. Let's consider
[{ "id": "1", "name": "Michael", "city": "orlando" }, { "id": "2", "name": "John", "city": "miami" }]
You have the above JSON array with 2 records, and you expect
[{"id":"1","name":"Michael"},{"id":"2","name":"John"}]
i.e. without city in the JSON message.
Method 1: if you are using NiFi 1.2+, the ConvertRecord processor will help you get this done very easily. I am attaching the XML template here, tested with the above JSON as input. nifi12.xml
Method 2: if you are using a version prior to NiFi 1.2, then use a SplitJson processor after ConvertAvroToJSON, then an EvaluateJsonPath processor with Destination set to flowfile-attribute, and then an AttributesToJSON processor to get only the required attributes into the resulting content. The PutElasticsearch processor then receives only the required JSON message elements.
Method 3: after SplitJson, use a ReplaceText processor and put in your logic to capture the JSON message except the city element (a rough sketch is shown below).
I have attached all the flow XMLs; you can download them and check which method best fits your case. prior-nifi-12.xml
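For Method 3, an untested sketch of the ReplaceText configuration (these property values are assumptions based on the sample JSON above, not taken from your flow):
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Search Value: ,\s*"city"\s*:\s*"[^"]*"
Replacement Value: (set to an empty string)
Applied to each split record, this removes the city element, e.g. { "id": "1", "name": "Michael", "city": "orlando" } becomes { "id": "1", "name": "Michael" }.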
12-12-2017
07:03 PM
1 Kudo
@sunil kumar, In the UpdateAttribute processor, the Delete Attributes Expression property expects the attributes as a regular expression.
Delete Attributes Expression: a regular expression for attributes to be deleted from flowfiles. Supports Expression Language: true.
If you want to delete the id and CompanyName attributes from the flowfile, then set Delete Attributes Expression to id|CompanyName. In the same way, if you want to delete more attributes, keep them pipe-separated, or include a regex that matches all the attribute names you want to delete. Configs:
id|Company.* //matches the id attribute and all attribute names starting with Company
id|Company //matches only the id and Company attributes
Taking this as a reference, you can change and configure the UpdateAttribute processor as per your needs. If the answer helped resolve your issue, click the Accept button below to accept the answer; that helps Community users find solutions to these kinds of issues quickly.
12-12-2017
02:32 PM
@Rakesh AN
Yes, it needs to update the metadata. Let's assume your existing file in HDFS is 127 MB and you append a 3 MB file to it, i.e. it becomes 130 MB. The 130 MB file is now stored as 2 blocks (128 MB + 2 MB), and all the replicas also have to be updated with the new data. Example:
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 21 2017-12-11 15:42 /user/yashu/test4/sam.txt
$ hadoop fs -appendToFile sam.txt /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 30 2017-12-12 09:19 /user/yashu/test4/sam.txt
$ echo "hi"|hdfs dfs -appendToFile - /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 33 2017-12-12 09:20 /user/yashu/test4/sam.txt
In the above example you can see that my HDFS file initially has size 21 and date 2017-12-11 15:42; after I appended to the file, the size and date changed. The NameNode needs to update the file's metadata and the replicated blocks as well (HDFS MetaData). It won't reduce performance even if you have big files. Append new data to the existing file. https://community.hortonworks.com/questions/16278/best-practises-beetwen-size-block-size-file-and-re.html
12-12-2017
01:40 PM
2 Kudos
@Ranith Ranawaka
After the AttributesToJSON processor, use a ReplaceText processor with:
Search Value: "clientid"\s+:\s+"(.*)"
Replacement Value: "clientid" : $1
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Configs: Input:
{
"clientid" : "2",
"id":"1",
"name":"HCC"
}
Output:
{
"clientid" : 2,
"id":"1",
"name":"HCC"
}
So we are searching for the value of clientid and replacing it without the quotes using the ReplaceText processor.
12-11-2017
09:32 PM
@Rakesh AN Yes, you can append rows to an existing text file in HDFS.
appendToFile
Usage: hdfs dfs -appendToFile <localsrc> ... <dst>
Appends a single src, or multiple srcs, from the local file system to the destination file system. It also reads input from stdin and appends to the destination file system.
hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
echo "hi" | hdfs dfs -appendToFile - /user/hadoop/hadoopfile
Pros: a small file is one that is significantly smaller than the HDFS block size. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, so HDFS can't handle lots of small files; it is better to have large files in HDFS instead of many small files (more info).
Cons: when we want to append to an HDFS file, we must obtain a lease, which is essentially a lock, to ensure single-writer semantics (more info).
In addition, if you have n part files in an HDFS directory and want to merge them into 1 file, then:
hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
-Dmapred.reduce.tasks=1 \
-input "<path-to-input-directory>" \
-output "<path-to-output-directory>" \
-mapper cat \
-reducer cat
Check which version of the hadoop-streaming jar you are using by looking under /usr/hdp, give the input path, and make sure the output directory does not already exist, as this job will merge the files and create the output directory for you. Here is what I tried:
#hdfs dfs -ls /user/yashu/folder2/
Found 2 items
-rw-r--r-- 3 hdfs hdfs 150 2017-09-26 17:55 /user/yashu/folder2/part1.txt
-rw-r--r-- 3 hdfs hdfs 20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
#hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/yashu/folder2/" \
-output "/user/yashu/folder1/" \
-mapper cat \
-reducer cat
folder2 has 2 files; after running the above command, the merged output is stored in the folder1 directory, and the 2 files are merged into 1 file, as you can see below.
#hdfs dfs -ls /user/yashu/folder1/
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 174 2017-10-09 16:00 /user/yashu/folder1/part-00000
If the answer helped resolve your issue, click the Accept button below to accept the answer; that helps Community users find solutions quickly for these kinds of errors.