Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 11171 | 04-15-2020 05:01 PM |
|  | 7073 | 10-15-2019 08:12 PM |
|  | 3081 | 10-12-2019 08:29 PM |
|  | 11349 | 09-21-2019 10:04 AM |
|  | 4251 | 09-19-2019 07:11 AM |
12-17-2017
10:06 PM
@deepak rathod
The Hadoop YARN Cluster Applications API supports filtering for jobs that failed during the last 24 hours.
* http://<rm http address:port>/ws/v1/cluster/apps
**Please refer to the attached .txt file for the commands, because the Community web page replaces some characters with odd symbols.**
Example: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778&states=FAILED" (see Query 1 in the attached txt file).
Use the above command as a reference, replacing the address and port with those of your Resource Manager.
Explanation:
The limit=10 parameter caps the results at 10 applications, the started and finished times each have begin and end parameters that let you specify ranges, and states=FAILED restricts the results to failed jobs.
Start time (startedTimeBegin): GMT Monday, November 13, 2017 12:35:13.778 AM
End time (startedTimeEnd): GMT Sunday, December 17, 2017 5:55:13.778 PM
The whole REST API call returns the first 10 applications (limit=10) whose state is FAILED and that started in the period Nov 13 12:35:13.778 AM - Dec 17 5:55:13.778 PM.
You can change the begin and end times according to your requirements.
Note: startedTimeBegin and startedTimeEnd are specified in milliseconds since the epoch.
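As a minimal shell sketch of the "failed jobs in the last 24 hours" query (resourcemanager.example.com:8088 is a placeholder, and GNU date is assumed for millisecond timestamps):
# end of the window = now, in milliseconds since the epoch (GNU date)
END=$(date +%s%3N)
# begin of the window = 24 hours earlier
BEGIN=$(( END - 24 * 60 * 60 * 1000 ))
# fetch up to 10 FAILED applications that started inside the window
curl -s "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin=${BEGIN}&startedTimeEnd=${END}&states=FAILED"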
2. Query to get all the apps with states FINISHED or KILLED for a specific user and time period: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778" (see Query 2 in the attached txt file).
Below is the list of application states allowed in the query; you can use one or more of them:
NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
In addition, the following query parameters are supported:
* states - applications matching the given application states, specified as a comma-separated list.
* finalStatus - the final status of the application - reported by the application itself
* user - user name
* queue - queue name
* limit - total number of app objects to be returned
* startedTimeBegin - applications with start time beginning with this time, specified in ms since epoch
* startedTimeEnd - applications with start time ending with this time, specified in ms since epoch
* finishedTimeBegin - applications with finish time beginning with this time, specified in ms since epoch
* finishedTimeEnd - applications with finish time ending with this time, specified in ms since epoch
* applicationTypes - applications matching the given application types, specified as a comma-separated list.
* applicationTags - applications matching any of the given application tags, specified as a comma-separated list.
* deSelects - generic fields to be skipped in the result.
To get failed jobs for a specific user and time period: GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&startedTimeBegin=1510533313778&startedTimeEnd=1513533313778&states=FAILED&user=<user-id>" (see Query 3 in the attached txt file).
To get finished jobs for a specific user:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&user=<user-id>"
(see Query 4 in the attached txt file).
To get finished jobs based on the application type. The query below returns finished jobs of application type tez and limits the results to 1:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=tez"
(see Query 5 in the attached txt file).
To get finished jobs of application type spark:
GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=spark"
(see Query 6 in the attached txt file).
You can use any of the above parameters, alone or in combination, in your REST API queries. cluster-applications-api.txt
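If jq is installed on the host (an assumption; it is not part of Hadoop), the JSON response can be post-processed on the command line, for example to list only the id and name of each failed application (the Resource Manager address is a placeholder):
# list application id and name of FAILED apps, tab-separated
curl -s "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?states=FAILED" | jq -r '.apps.app[] | "\(.id)\t\(.name)"'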
For Reference
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
12-17-2017
04:00 AM
@Jonathan Bell We also faced these dropping-connection issues; once we added a validation query to the connection pool the issue was resolved, but we haven't tried that with Oracle. Consider opening a JIRA about the loss of connection to Oracle.
As a workaround, you can use REST API commands to stop the PutSQL processor and disable the controller service, and, after that process has completed, enable the DBCP Connection Pool again and start the PutSQL processor. I tried it with the example below: I have a DBCPConnectionPool that is referenced by a PutSQL processor, so when you are pushing records to Oracle, schedule another script that performs the following steps.
1. Stop the PutSQL processor. The REST API command to stop the PutSQL processor is as follows:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"component": {"state": "STOPPED","id": "61fe0748-0160-1000-3bda-82285b30a012"},"revision": {"version": 1,"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382"}}' http://localhost:8080/nifi-api/processors/61fe0748-0160-1000-3bda-82285b30a012
Explanation: we use the PUT HTTP method in curl; my PutSQL processor id is 61fe0748-0160-1000-3bda-82285b30a012, and since I need to stop the processor I used STOPPED as the state. If you want to start the processor, use RUNNING as the state. To find the clientId and version number, open the developer tools in your browser (Chrome, Firefox, etc.), perform any action (start, stop, etc.) in the NiFi UI, and look at the calls made for the processor id. For reference, take a look at the screenshot below: 1. Click on Network. 2. Filter by your processor id. 3. Click on Response, where you can find the clientId and version. Once you have all the values, prepare your curl command to stop the PutSQL processor.
2. Start the PutSQL processor. Just change the state to RUNNING and the processor will go from the Stopped state to the Running state:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"component": {"state": "RUNNING","id": "61fe0748-0160-1000-3bda-82285b30a012"},"revision": {"version": 1,"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382"}}' http://localhost:8080/nifi-api/processors/61fe0748-0160-1000-3bda-82285b30a012
3. Stop (disable) the DBCPConnectionPool Controller Service. To disable the connection pool, set the state element to DISABLED:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"revision":{"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382","version":1},"component":{"id":"61fc97d3-0160-1000-49e3-46201fe71092","state":"DISABLED"}}' http://localhost:8080/nifi-api/controller-services/61fc97d3-0160-1000-49e3-46201fe71092
Explanation: the DBCPConnectionPool Controller Service id is 61fc97d3-0160-1000-49e3-46201fe71092, and we set the state to DISABLED, i.e. we are disabling the service.
4. Start (enable) the DBCPConnectionPool Controller Service. Just change the state to ENABLED and the controller service will be enabled:
curl -i -X PUT -H 'Content-Type:application/json' -d '{"revision":{"clientId":"6082f82f-0160-1000-c7de-88e9e0df0382","version":1},"component":{"id":"61fc97d3-0160-1000-49e3-46201fe71092","state":"ENABLED"}}' http://localhost:8080/nifi-api/controller-services/61fc97d3-0160-1000-49e3-46201fe71092
To find the version number, use the same developer-tools method mentioned above.
In addition, if you want to stop and start Process Groups, use the curl commands below.
To stop a Process Group (my Process Group id is 5d325978-0160-1000-f734-cbbf324b3ec3; the state needs to be STOPPED):
curl -i -X PUT -H 'Content-Type: application/json' -d '{"id": "5d325978-0160-1000-f734-cbbf324b3ec3","state": "STOPPED"}' http://localhost:8080/nifi-api/flow/process-groups/5d325978-0160-1000-f734-cbbf324b3ec3
To start a Process Group, change the state to RUNNING:
curl -i -X PUT -H 'Content-Type: application/json' -d '{"id": "5d325978-0160-1000-f734-cbbf324b3ec3","state": "RUNNING"}' http://localhost:8080/nifi-api/flow/process-groups/5d325978-0160-1000-f734-cbbf324b3ec3
Using these REST API commands, prepare a script that triggers before the PutSQL processor runs. Script: 1. Stop the PutSQL processor. 2. Disable the DBCP Connection Pool. 3. Enable the DBCP Connection Pool. 4. Start the PutSQL processor. restapi-cmnds.txt
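Putting those four calls together, here is a minimal shell sketch of such a script (the ids, clientId and revision versions are the example values from this post; in a real script you would look up the current revision of each component before every call, since NiFi increments it on each change):
#!/bin/bash
# Sketch only: replace the ids, clientId and revision versions with your own values.
NIFI=http://localhost:8080/nifi-api
PUTSQL_ID=61fe0748-0160-1000-3bda-82285b30a012
DBCP_ID=61fc97d3-0160-1000-49e3-46201fe71092
CLIENT_ID=6082f82f-0160-1000-c7de-88e9e0df0382
# 1. Stop the PutSQL processor
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"component\":{\"state\":\"STOPPED\",\"id\":\"$PUTSQL_ID\"},\"revision\":{\"version\":1,\"clientId\":\"$CLIENT_ID\"}}" "$NIFI/processors/$PUTSQL_ID"
# 2. Disable the DBCP Connection Pool controller service
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"revision\":{\"clientId\":\"$CLIENT_ID\",\"version\":1},\"component\":{\"id\":\"$DBCP_ID\",\"state\":\"DISABLED\"}}" "$NIFI/controller-services/$DBCP_ID"
# ... whatever needs to happen while the pool is disabled goes here ...
# 3. Enable the DBCP Connection Pool controller service (revision has advanced)
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"revision\":{\"clientId\":\"$CLIENT_ID\",\"version\":2},\"component\":{\"id\":\"$DBCP_ID\",\"state\":\"ENABLED\"}}" "$NIFI/controller-services/$DBCP_ID"
# 4. Start the PutSQL processor (revision has advanced)
curl -s -X PUT -H 'Content-Type: application/json' -d "{\"component\":{\"state\":\"RUNNING\",\"id\":\"$PUTSQL_ID\"},\"revision\":{\"version\":2,\"clientId\":\"$CLIENT_ID\"}}" "$NIFI/processors/$PUTSQL_ID"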
12-15-2017
04:35 AM
6 Kudos
@zkfs Block Size: the physical unit in which the data is stored; the default HDFS block size is 128 MB, and we can configure it as per our requirements. All blocks of a file are of the same size except the last block, which can be the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system: in HDFS each file is divided into blocks based on the configured block size, and Hadoop distributes those blocks across the cluster.
The main aim of splitting a file and storing the pieces across the cluster is to get more parallelism. The replication factor is what provides fault tolerance, but it also helps run your map tasks close to the data, avoiding extra load on the network.
Input Split: a logical representation of a block, which may also be larger or smaller than a block. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, only a reference to the data. During MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each InputSplit to an individual mapper for processing. The split acts as a broker between the block and the mapper.
Let's say we have a 1.2 GB file divided into 10 blocks, i.e. each block is roughly 128 MB. InputFormat.getSplits() is responsible for generating the input splits, and each split is used as the input for one mapper. By default, this class creates one input split per HDFS block.
If no split size is specified and the start and end positions of the records fall within the same block, the HDFS block size becomes the split size, so 10 mappers are initialized to load the file, each mapper loading one block.
If the start and end positions of a record are not in the same block, that is exactly the problem input splits solve: the input split provides the start and end positions (offsets) of the records, so that each split delivers complete records as key/value pairs to its mapper, and the mapper then loads the block data according to those start and end offsets.
If the input is made non-splittable, the whole file forms one input split and is processed by a single map task, which takes more time when the file is big.
If your resources are limited and you want to limit the number of maps, you can set the split size to 256 MB; a logical grouping of 256 MB is then formed, and only 5 maps are executed, each with a size of 256 MB.
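One way this could be done from the command line for a streaming job is sketched below (not from the original post; the jar path and input/output paths are placeholders, and the property names assume Hadoop 2.x):
# request ~256 MB input splits; values are in bytes, paths are placeholders
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapreduce.input.fileinputformat.split.minsize=268435456 \
-Dmapreduce.input.fileinputformat.split.maxsize=268435456 \
-input "<path-to-1.2GB-input>" \
-output "<path-to-output-directory>" \
-mapper cat \
-reducer cat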
12-14-2017
03:58 PM
2 Kudos
@balalaika For that case you need to set the Demarcator property to a newline (press Shift+Enter in the property editor). Configs: for a MergeContent reference, see https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html
12-14-2017
02:46 PM
3 Kudos
@balalaika I suspect the duplicates come from the ReplaceText processor: you have configured the Evaluation Mode as Line-by-Line. That means that if your JSON has more than one line, the ReplaceText processor replaces every line with the Replacement Value ${attribute1}${attribute2}${attribute3}.
Example input:
{
"features": [{
"feature": {
"paths": [[[214985.27600000054,
427573.33100000024],
[215011.98900000006,
427568.84200000018],
[215035.35300000012,
427565.00499999896],
[215128.48900000006,
427549.4290000014],
[215134.43699999899,
427548.65599999949],
[215150.86800000072,
427546.87900000066],
[215179.33199999854,
427544.19799999893]]]
},
"attributes": {
"attribute1": "value",
"attribute2": "value",
"attribute3": "value",
"attribute4": "value"
}
}]
}
In this input JSON message we have 27 lines, and my EvaluateJsonPath configs are the same as you mentioned in the comments. ReplaceText configs and output: as output we got 27 lines, because the evaluation mode is Line-by-Line. If you change the Evaluation Mode to Entire text, the output contains the replacement only once. And if your JSON message is on one line, i.e. {"features":[{"feature":{"paths":[[[214985.27600000054,427573.33100000024],[215011.98900000006,427568.84200000018],[215035.35300000012,427565.00499999896],[215128.48900000006,427549.4290000014],[215134.43699999899,427548.65599999949],[215150.86800000072,427546.87900000066],[215179.33199999854,427544.19799999893]]]},"attributes":{"attribute1":"value","attribute2":"value","attribute3":"value","attribute4":"value"}}]}, then it doesn't matter whether the ReplaceText evaluation mode is Line-by-Line or Entire text, because the processor receives just one line as input and produces a single replacement as the result. Try changing the configs to match your input JSON message and run the processor again. Let us know if the processor still produces duplicate data.
12-13-2017
05:05 PM
1 Kudo
@sunil kumar We can handle your case in a few different ways. Before the PutElasticsearch processor, we need to make the contents of the flowfile not contain the city field. Let's consider
[{ "id": "1", "name": "Michael", "city": "orlando" }, { "id": "2", "name": "John", "city": "miami" }]
You have the above JSON array with 2 records, and you expect
[{"id":"1","name":"Michael"},{"id":"2","name":"John"}]
i.e. without city in the JSON message.
Method 1: if you are using NiFi 1.2+, the ConvertRecord processor will help you get this done very easily. I am attaching the XML template here, tested with the above JSON as input. nifi12.xml
Method 2: if you are using a version prior to NiFi 1.2, then use a SplitJson processor after ConvertAvroToJSON, then an EvaluateJsonPath processor with Destination set to flowfile-attribute, and then an AttributesToJSON processor to get only the required attributes into the resulting content. The PutElasticsearch processor then receives only the required JSON message elements.
Method 3: after SplitJson, use a ReplaceText processor and put in your logic to capture the JSON message except the city element (a rough sketch is shown below).
I have attached all the flow XMLs; you can download them and check which method best fits your case. prior-nifi-12.xml
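For Method 3, an untested sketch of the ReplaceText configuration (these property values are assumptions based on the sample JSON above, not taken from your flow):
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Search Value: ,\s*"city"\s*:\s*"[^"]*"
Replacement Value: (set to an empty string)
Applied to each split record, this removes the city element, e.g. { "id": "1", "name": "Michael", "city": "orlando" } becomes { "id": "1", "name": "Michael" }.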
12-12-2017
07:03 PM
1 Kudo
@sunil kumar, In the UpdateAttribute processor, the Delete Attributes Expression property expects the attributes as a regular expression.
Delete Attributes Expression: a regular expression for attributes to be deleted from flowfiles. Supports Expression Language: true.
If you want to delete the id and CompanyName attributes from the flowfile, then set Delete Attributes Expression to id|CompanyName. In the same way, if you want to delete more attributes, keep them pipe-separated, or include a regex that matches all the attribute names you want to delete. Configs:
id|Company.* //matches the id attribute and all attribute names starting with Company
id|Company //matches only the id and Company attributes
Taking this as a reference, you can change and configure the UpdateAttribute processor as per your needs. If the answer helped resolve your issue, click the Accept button below to accept the answer; that helps Community users find solutions to these kinds of issues quickly.
12-12-2017
02:32 PM
@Rakesh AN
Yes, it needs to update the metadata. Let's assume your existing file in HDFS is 127 MB and you append a 3 MB file to it, i.e. it becomes 130 MB. The 130 MB file is now stored as 2 blocks (128 MB + 2 MB), and all the replicas also have to be updated with the new data. Example:
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 21 2017-12-11 15:42 /user/yashu/test4/sam.txt
$ hadoop fs -appendToFile sam.txt /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 30 2017-12-12 09:19 /user/yashu/test4/sam.txt
$ echo "hi"|hdfs dfs -appendToFile - /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 33 2017-12-12 09:20 /user/yashu/test4/sam.txt
In the above example you can see that my HDFS file initially has size 21 and date 2017-12-11 15:42; after I appended to the file, the size and date changed. The NameNode needs to update the file's metadata and the replicated blocks as well (HDFS MetaData). It won't reduce performance even if you have big files. Append new data to the existing file. https://community.hortonworks.com/questions/16278/best-practises-beetwen-size-block-size-file-and-re.html
12-12-2017
01:40 PM
2 Kudos
@Ranith Ranawaka
After the AttributesToJSON processor, use a ReplaceText processor with:
Search Value: "clientid"\s+:\s+"(.*)"
Replacement Value: "clientid" : $1
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Configs: Input:
{
"clientid" : "2",
"id":"1",
"name":"HCC"
}
Output:
{
"clientid" : 2,
"id":"1",
"name":"HCC"
}
So we are searching for the value of clientid and replacing it without the quotes using the ReplaceText processor.
12-11-2017
09:32 PM
@Rakesh AN Yes, you can append rows to an existing text file in HDFS.
appendToFile
Usage: hdfs dfs -appendToFile <localsrc> ... <dst>
Appends a single src, or multiple srcs, from the local file system to the destination file system. It also reads input from stdin and appends to the destination file system.
hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
echo "hi" | hdfs dfs -appendToFile - /user/hadoop/hadoopfile
Pros: a small file is one that is significantly smaller than the HDFS block size. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, so HDFS can't handle lots of small files; it is better to have large files in HDFS instead of many small files (more info).
Cons: when we want to append to an HDFS file, we must obtain a lease, which is essentially a lock, to ensure single-writer semantics (more info).
In addition, if you have n part files in an HDFS directory and want to merge them into 1 file, then:
hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
-Dmapred.reduce.tasks=1 \
-input "<path-to-input-directory>" \
-output "<path-to-output-directory>" \
-mapper cat \
-reducer cat
Check which version of the hadoop-streaming jar you are using by looking under /usr/hdp, give the input path, and make sure the output directory does not already exist, as this job will merge the files and create the output directory for you. Here is what I tried:
#hdfs dfs -ls /user/yashu/folder2/
Found 2 items
-rw-r--r-- 3 hdfs hdfs 150 2017-09-26 17:55 /user/yashu/folder2/part1.txt
-rw-r--r-- 3 hdfs hdfs 20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
#hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/yashu/folder2/" \
-output "/user/yashu/folder1/" \
-mapper cat \
-reducer cat
folder2 has 2 files; after running the above command, the merged output is stored in the folder1 directory, and the 2 files are merged into 1 file, as you can see below.
#hdfs dfs -ls /user/yashu/folder1/
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 174 2017-10-09 16:00 /user/yashu/folder1/part-00000
If the answer helped resolve your issue, click the Accept button below to accept the answer; that helps Community users find solutions quickly for these kinds of errors.