Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1913 | 04-25-2018 08:37 PM |
 | 5838 | 04-01-2018 09:37 PM |
 | 1550 | 03-29-2018 05:15 PM |
 | 6668 | 03-27-2018 07:22 PM |
 | 1957 | 03-27-2018 06:14 PM |
03-23-2018
06:18 AM
So the issue is with the "PK" column used to distribute the data across multiple mappers. It has always been recommended to use an integral column as the "split-by" column, but your import is using the column "CustID", which is a String. Have a look at how your splits are calculated during the import:

8020 [main] WARN org.apache.sqoop.mapreduce.db.TextSplitter - You are strongly encouraged to choose an integral split column.
8025 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '`CustID` >= '1'' and upper bound '`CustID` < '3?????''
8025 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '`CustID` >= '3?????'' and upper bound '`CustID` < '5?????''
8025 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '`CustID` >= '5?????'' and upper bound '`CustID` < '7*?????''
8025 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '`CustID` >= '7*?????'' and upper bound '`CustID` <= '999999''
8068 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:4

The "?" indicates some foreign characters that were probably not parsed properly, which is what caused your tasks to fail. However, when you have only a single mapper, no such split calculation is needed on the CustID column; the data is simply "copied and pasted" to HDFS and your job finishes OK.
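To make this concrete, here is a rough sketch of the two usual fixes. The connection details, table, column, and directory names below are placeholders, not your actual values, so adapt them to your setup:

```bash
# Fix 1: split on an integral column (e.g. a numeric surrogate key) instead of the String CustID
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table customers \
  --split-by cust_num \
  --num-mappers 4 \
  --target-dir /data/customers

# Fix 2: keep CustID but run a single mapper, so no split calculation happens at all
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table customers \
  --num-mappers 1 \
  --target-dir /data/customers
```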
03-23-2018
06:08 AM
@Shantanu kumar Your question can mean two things:

1. Based on the value, redirect the flow files accordingly.
2. Split the flow file based on a custom value used as a separator.

I am answering both of them.

Solution 1

Here's a sample flow. What am I doing in this flow?

1. GenerateFlowFile: in this processor, I am generating a sample flow file with the content:

This is success.
This is failure.

2. SplitText: in this processor, I am splitting every individual row into its own flow file.

3. RouteOnContent: this is the processor that sends the flow files to their respective relationships based on the content. A snapshot of the processor config follows. In this processor I am checking the content of each flow file:

- if it contains "failure", redirect it to a similarly named relationship;
- if it contains "success", redirect it accordingly;
- otherwise the flow file goes to the "unmatched" relationship.

This way you can have your file split based on its content. PS: if you have structured/semi-structured data, e.g. CSV or JSON, you can change the logic to check the value of that specific column and then redirect the flow files accordingly.

Solution 2

Split the content based on a custom value. In this flow, I am using the SplitContent processor. It can take either of the following two options as the splitting value:

- a hexadecimal byte stream, or
- text.

My input flow file from the GenerateFlowFile processor has the following content:

This is success. # This is failure

The SplitContent processor is using # as the split value; a snapshot of the processor config follows. With this, I get 2 flow files based on the splits made by my custom value. A rough sketch of both processor configurations follows below. Hope this helps!
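Since the config screenshots may not come through, here is a minimal sketch of how the two processors could be configured. The exact property values are my assumptions based on the flow described above, not a copy of the actual config:

```
RouteOnContent (Solution 1)
  Match Requirement : content must contain match
  failure (dynamic property) = failure    # flow files containing "failure" go to the "failure" relationship
  success (dynamic property) = success    # flow files containing "success" go to the "success" relationship
  # everything else goes to the built-in "unmatched" relationship

SplitContent (Solution 2)
  Byte Sequence Format = Text
  Byte Sequence        = #
  Keep Byte Sequence   = false
```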
03-23-2018
05:24 AM
@ANKIT PATEL Here's a sample flow. For simplicity, instead of reading from an FTP location, I am reading from a local path. I have 3 processors in my flow:

- ListFile: lists the files in the directory I configured in the Configuration tab. Similar to ListFTP, just local.
- FetchFile: fetches the files I tell it to. Again, similar to FetchFTP, but local.
- PutFile: writes the data out.

If you pay attention, the ListFile processor gives me the list of files in the folder, and since the downstream processor is stopped, the flow files queue up. So I went ahead and did a "List Queue" to see the queued flow files, and saw something like this. These are called "flow files" in NiFi. If you click the "i" button on the leftmost side, you will see the Attributes tab as shown below. You can see many attributes, but the two we need in this example are:

- absolute.path: the location the file came from
- filename: the name of the file

These are the properties of the data we are about to fetch. Now FetchFile can read these files from the given directory, and I can tell my FetchFile processor to read them by using these attributes, as shown in the sketch below. Since I have the directory information available as an attribute, I can use it while storing the data as well, thereby mimicking the exact directory structure from the source. Hope that helps!
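For reference, a minimal sketch of the properties involved; the directory paths are placeholders, and the FetchFile value shown is its usual default built from the attributes ListFile writes:

```
ListFile
  Input Directory = /data/incoming                  # placeholder local path

FetchFile
  File to Fetch   = ${absolute.path}/${filename}    # built from the attributes written by ListFile

PutFile
  Directory       = /data/output/${path}            # reuse the relative "path" attribute to mirror the source layout
```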
03-23-2018
04:34 AM
Also, can you please share the "actual" MR job logs that you can see when you are running your job with multiple mappers?
03-23-2018
04:22 AM
Are you using a "split-by" column without setting the number of mappers to 1?
03-23-2018
03:45 AM
@Mark Lin Something like this will help you. In this case, I am trying to write the data to S3 and, if that fails, I redirect the flow file through an UpdateAttribute processor back to the parent processor again (sketched below). For your scenario, you can fit in the e-mail logic and of course stop the processor for some later action 🙂 Hope that helps!
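A rough sketch of such a retry loop, using an assumed retry-counter attribute; the attribute name, limit, and the alerting branch are illustrative, not part of the original flow:

```
PutS3Object --(failure)--> UpdateAttribute --> back into PutS3Object's input queue

UpdateAttribute
  retry.count = ${retry.count:replaceNull(0):plus(1)}   # dynamic property: count the attempts

RouteOnAttribute (optional, placed inside the loop)
  give_up = ${retry.count:gt(3)}    # after a few attempts, route to your PutEmail / alerting branch instead of retrying
```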
03-23-2018
03:36 AM
@Christian Lunesa Can you please share your sqoop command? Are you using --direct by any chance?
03-23-2018
02:06 AM
@vishal dutt NiFi Registry is not supported on Windows at this point in time. Please have a look at the NiFi Registry Admin Guide for more details.
03-22-2018
08:00 PM
@heta desai Have a look at this link. You can use the Hive table structure(s) given there, adapted to your log file format, and process them as needed. The key is using a regex to parse the records into individual columns (a sketch follows below). The tutorial talks about using HBase, but you can skip that part if you don't want to use it at this point in time. Let me know if you need any help.
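As an illustration only (not the exact table from the linked tutorial), here is a Hive DDL that parses a plain Apache-style access log with RegexSerDe; adapt the table name, columns, regex, and location to your own log format:

```sql
-- Each capture group in input.regex becomes one column, in order; RegexSerDe expects STRING columns.
CREATE EXTERNAL TABLE access_log (
  client_ip  STRING,
  request_ts STRING,
  method     STRING,
  uri        STRING,
  status     STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}).*$"
)
STORED AS TEXTFILE
LOCATION '/data/logs/access';
```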
03-22-2018
07:41 PM
@Vivek Singh did the suggestion work for you?