Member since: 06-09-2016
Posts: 48
Kudos Received: 10
Solutions: 0
11-23-2017
06:21 PM
@Sindhu We are trying to execute the query below using HPLSQL:

CREATE TABLE1 AS SELECT * FROM TABLE2

The query executes without errors through the HPLSQL script, but the table is not created. We are planning to execute both DDL/DML queries and procedural language against Hive through HPLSQL. Is there anything wrong with this approach?
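For reference, a minimal sketch of how such a statement might be run through the HPL/SQL command-line tool. The script file name and table names below are placeholders for illustration, not taken from the original post:

# run a script file containing the CTAS statement
hplsql -f ctas.sql
# or run the statement inline
hplsql -e "CREATE TABLE target_table AS SELECT * FROM source_table"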
11-14-2016
12:44 PM
But as per the documentation, Teradata Interval/Period data types are supported. Also, we are using Teradata Connector for Hadoop 1.5.0, which is a higher version than the one listed in the dependency. So ideally the Interval/Period data types should be supported?
11-14-2016
07:11 AM
Hi, We are using Apache Sqoop to import data from Teradata to Hadoop. Sqoop connects to Teradata through the hortonworks-teradata-connector, which in turn uses the Teradata Connector for Hadoop. We are using hortonworks-teradata-connector 1.4.1, and the Teradata Connector for Hadoop version is 1.5.0. The process fails when there is an Interval/Period data type in the source table. Does the Hortonworks Connector for Teradata support the Interval/Period data types? I could not find this addressed either way in the documentation: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_HortonworksConnectorForTeradata/content/ch_HortonworksConnectorForTeradata.html

I am trying to create a table in Hive corresponding to the table in Teradata using the Sqoop create-hive-table command with the --map-column-hive option, mapping the Interval and Period types to the Hive STRING type. This is not working; I am getting the error log given below (see the command sketch after the stack trace).

ERROR LOG:
java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 15.10.00.26] [Error 1006] [SQLState HY000] Unrecognized data type: 845
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
at com.teradata.jdbc.jdbc_4.parcel.PrepInfoParcel.createPrepInfoItem(PrepInfoParcel.java:227)
at com.teradata.jdbc.jdbc_4.parcel.StatementInfoParcel.translateMetadataItems(StatementInfoParcel.java:169)
at com.teradata.jdbc.jdbc_4.parcel.StatementInfoParcel.translateSIPintoPrepInfoItem(StatementInfoParcel.java:154)
at com.teradata.jdbc.jdbc_4.parcel.StatementInfoParcel.<init>(StatementInfoParcel.java:123)
at com.teradata.jdbc.jdbc_4.parcel.ParcelFactory.nextParcel(ParcelFactory.java:348)
at com.teradata.jdbc.jdbc_4.io.TDPacket.nextParcel(TDPacket.java:135)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.getNextParcel(StatementReceiveState.java:299)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveSuccessSubState.action(ReceiveSuccessSubState.java:72)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:311)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:200)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:137)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:128)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:389)
at com.teradata.jdbc.jdbc_4.TDStatement.prepareRequest(TDStatement.java:576)
at com.teradata.jdbc.jdbc_4.TDPreparedStatement.<init>(TDPreparedStatement.java:127)
at com.teradata.jdbc.jdk6.JDK6_SQL_PreparedStatement.<init>(JDK6_SQL_PreparedStatement.java:30)
at com.teradata.jdbc.jdk6.JDK6_SQL_Connection.constructPreparedStatement(JDK6_SQL_Connection.java:81)
at com.teradata.jdbc.jdbc_4.TDSession.prepareStatement(TDSession.java:1330)
at com.teradata.jdbc.jdbc_4.TDSession.prepareStatement(TDSession.java:1374)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:227)
at org.apache.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:126)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:188)
at org.apache.sqoop.tool.CreateHiveTableTool.run(CreateHiveTableTool.java:58)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
16/11/14 10:57:14 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:175)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:188)
at org.apache.sqoop.tool.CreateHiveTableTool.run(CreateHiveTableTool.java:58)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
Can anyone throw some light on the above problem? Let me know if you need any further information. Regards, Indranil Roy
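As mentioned above, a minimal sketch of the kind of create-hive-table invocation being described. The host, database, table, and column names are placeholders (not from the original post), and the column names passed to --map-column-hive would have to match the actual Teradata table:

sqoop create-hive-table \
  --connect "jdbc:teradata://<td-host>/DATABASE=<db>" \
  --username <user> --password '<password>' \
  --table SOURCE_TABLE \
  --hive-table target_table \
  --map-column-hive PERIOD_COL=STRING,INTERVAL_COL=STRING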
11-03-2016
01:54 PM
2 Kudos
Background: We are using an Apache NIFI data flow to move data from the local file system to Hadoop-based file systems. We execute the NIFI processors by calling the NIFI REST API from a Groovy script, where we use JSON builders in Groovy to generate the JSON and then pass it to put methods to execute the processors. NIFI version: 0.6.0. While planning to migrate to NIFI 1.0.0 and reusing the same Groovy script, we are facing a few errors in the latest version of NIFI (1.0.0):

1) Having controller/revision in the nifi.get path does not return the response JSON; instead it throws a 404 (verified in a browser too). This works fine in the 0.6.1 NIFI version.
Reference: resp = nifi.get(path: 'controller/revision')

2) The call below does not work either, since having "controller" in the path as a prefix to process-groups is no longer valid. It also returns a 404/bad request error. This works fine in the 0.6.1 NIFI version.
Reference:
resp = nifi.put(
    path: "controller/process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)
PS: While trying to verify this in a browser, GET only responds with structures like /process-groups/{id} or /process-groups/{id}/processors, i.e. without the "controller" string. This works fine in the 0.6.1 NIFI version.
Reference: host://port/nifi-api/process-groups/root

3) The syntax below does not work in the script either. This works fine in the 0.6.1 NIFI version.
resp = nifi.put(
    path: "process-groups/$processGroup/processors/$processorId",
    body: builder.toPrettyString(),
    requestContentType: JSON
)

Since the syntax provided above works perfectly fine in 0.6.0, I would like to know whether any changes were made in NIFI 1.0.0 to the REST API or to the way the various HTTP requests are passed to methods like 'get' and 'put'. I could not find any changes in the release notes or in the NIFI REST API documentation provided in the link below: https://nifi.apache.org/docs/nifi-docs/rest-api/ Please let me know if you need any other information. Regards, Indranil Roy
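For illustration, a hedged sketch (using curl, as elsewhere in these posts) of how a processor update is typically issued against the NiFi 1.0.x REST API, where revisions are tracked per component and processors are addressed directly under /processors/{id}. The host, IDs, and property name are placeholders, and the exact payload should be verified against the 1.0.0 REST API documentation:

# fetch the processor entity first to obtain its current revision (clientId/version)
curl -s "http://<host>:<port>/nifi-api/processors/<processorId>"

# update the processor using that revision; note there is no "controller" segment in the path
curl -i -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"clientId":"<clientId>","version":<version>},"component":{"id":"<processorId>","config":{"properties":{"Some Property":"some value"}}}}' \
  "http://<host>:<port>/nifi-api/processors/<processorId>"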
09-09-2016
11:27 AM
@Bryan Bende Thanks for the input, it really helped a lot in our case. Say I have 2 rows in my table:

1|Indranil|ETL
2|Joy|Reporting

I want to convert them to JSON so that I am able to insert each row as multiple cells in a single HBase row. This is my converted JSON:

{
  "Personal": [
    {
      "id": "1",
      "name": "Indranil",
      "Skill": "ETL"
    },
    {
      "id": "2",
      "name": "Joy",
      "Skill": "Reporting"
    }
  ]
}

Is this JSON in the correct format to be consumed by PutHBaseJSON? My end goal is to insert all the values of a row into different cells. "Personal" refers to the column family and "id" refers to the "Row Identifier Field Name".
09-08-2016
05:37 PM
@Bryan Bende In such a scenario, since we want to store the rows with different row ids, is there a workaround possible? If I understand correctly, using PutHBaseJSON might help. So are there any processors available to convert the pipe-delimited source file into a JSON file that can be consumed by the PutHBaseJSON processor to insert multiple values?
09-08-2016
04:55 PM
@Bryan Bende As you mentioned, PutHBaseCell is used to write a single cell to HBase, and it uses the content of the FlowFile as the value of the cell. Now if my input flowfile has, say, 50 lines of pipe-separated values, will it insert those rows into 50 cells with 50 different row ids, or will it put all the rows into the same row?
09-08-2016
01:34 PM
2 Kudos
I want to insert data into HBase from a flowfile using NIFI. Does PutHBaseCell support HBase tables with multiple column families? Say I have created an HBase table with 2 column families, cf1 (column1, column2, column3) and cf2 (column4, column5). How do I specify the "Column Family" and "Column Qualifier" properties in the PutHBaseCell configuration? Where do I specify the mapping between the flowfile (a text file with pipe-separated values) and the HBase table? The flowfile will have pipe-separated columns, and I want to store a subset of the columns into each column family. Regards, Indranil Roy
09-07-2016
04:20 PM
@mclark Thanks for your inputs. The above solution worked perfectly fine in my case, both in terms of the error and performance. But, as you already mentioned above, in this situation we have a large number of files in HDFS. Even if I use a MergeContent processor in the flow I am getting more than one file. From what I can understand by looking at the provenance, the MergeContent processor is merging files in blocks. Say we have 100 flow files coming to the MergeContent processor in batches of 30, 30, 20, 20. It will not wait for all 100 files and generates 4 output files by merging each group. Is there a way by which we can control this behavior and enforce it to produce only 1 output file for each output path? The configuration of the MergeContent processor is attached (mergecontent.png). Any inputs will be very helpful. Regards, Indranil Roy
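For reference, a hedged sketch of the MergeContent properties that usually govern this bin-packing behavior; the values are illustrative only and would need to be tuned to the actual flow:

Merge Strategy             : Bin-Packing Algorithm
Minimum Number of Entries  : 100     (a bin is not merged until it holds at least this many flow files)
Maximum Number of Entries  : 100
Max Bin Age                : 10 min  (safety valve so a bin that never fills is eventually flushed)
Correlation Attribute Name : <attribute identifying the output path, so each path gets its own bin>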
09-06-2016
04:14 PM
@jwitt According to you the flow should look like GetFile->SplitText->RouteText->PutHDFS. Since we are using only a standalone cluster, if we split the file into 5000 splits do we need an UpdateAttribute/MergeContent after the RouteText processor, or is the flow shown above fine? Also, do we need to set the number of concurrent tasks on all the processors (GetFile, PutHDFS, SplitText) or only on the RouteText processor? Regards, Indranil Roy
09-06-2016
04:05 PM
@mclark Sure, I will try that option. I understand that I need to increase my split count in order to achieve better load balancing. But to go back to the main issue I pointed out in the thread above: apart from the performance aspect, we were getting only a subset of the records in HDFS. It seems the process was trying to create the same file and overwrite it multiple times, hence giving the error shown below. When I use a SplitText processor, send the splits to the RPG and then merge them, I am getting the error shown in the attachment (7300-afiuy.png). Just to be on the same page, my flow looks like this: in the NCM of the cluster, flow.png; in the standalone cluster, flow1.png. Does increasing the split count solve this problem? Also, is it necessary to use MergeContent/UpdateAttribute if we use a SplitText? Can't we achieve this flow without using the MergeContent/UpdateAttribute processors in the RPG? Regards, Indranil Roy
09-06-2016
01:36 PM
We have a data flow as shown above, wherein we have a single pipe-delimited source file of around 15 GB with 50 million records. We are routing the rows into two different paths in HDFS based on the routing condition shown above in the RouteText configuration window. The flow is taking around 20 minutes to complete on a standalone server. The number of concurrent tasks is set to 10 for all the processors. Is this as good as the performance gets, or is there any way to improve it further on this standalone server, considering the server has 4 cores and 16 GB RAM? Also, as far as I can observe, most of the processing time is consumed by the RouteText processor. Is this design suitable for this kind of use case, i.e. sending the records of a pipe-delimited file to different outputs based on conditions, given that RouteText processes records line by line? We are using NIFI 0.6.1.
09-02-2016
05:21 PM
@mclark Our RouteText processor is configured as shown below (7231-condition.png). As it shows, records whose first field (the record number) is <= 5000000 go in one direction and records with number >= 5100000 go in another. The SplitText processor is configured with the properties shown in 7234-split-config-final.png. Just to give an overview of our requirement:
1) We have a single file as source coming to a standalone server.
2) We fetch the file, split it into multiple files and then send them to the cluster in order to distribute the processing across all the nodes of the cluster.
3) In the cluster we route the records based on the condition, so that records whose first field (the record number) is <= 5000000 go to one output directory in HDFS and records with number >= 5100000 go to another output directory in HDFS, as configured in the two PutHDFS processors.
4) But after executing the process we have around 1000000 records in each output directory, whereas ideally we should have approximately 5000000 records in each HDFS directory.
Also, we are getting the error below in the PutHDFS processors (7300-afiuy.png). Please let me know if you need any further information. Just to add: the above setup works perfectly fine when we use a standalone node with PutFile instead of PutHDFS, writing the files to a local path instead of Hadoop. Regards, Indranil Roy
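For readers without access to the screenshots, a hedged sketch of what such a RouteText configuration typically looks like when routing on a numeric first field of a pipe-delimited line. The Satisfies Expression matching strategy and the getDelimitedField expression-language function are assumptions to verify against the NiFi version in use; the property names "low" and "high" are placeholders:

Routing Strategy  : Route to each matching Property Name
Matching Strategy : Satisfies Expression
low   (dynamic property) : ${line:getDelimitedField(1, '|'):trim():toNumber():le(5000000)}
high  (dynamic property) : ${line:getDelimitedField(1, '|'):trim():toNumber():ge(5100000)}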
09-02-2016
01:19 PM
Thanks @mclark for your input. As suggested, we implemented the data flow as shown below:
1) The standalone flow (this is where the single source file arrives). The RPG in this flow refers to the cluster, i.e. the NCM URL.
2) The NCM of the cluster has the flow shown below.
3) In this approach we were facing the error shown below and getting only a subset of the records in HDFS.
4) So to avoid that situation we used a MergeContent processor to merge the flowfiles, since we were splitting them before loading them into HDFS.
5) We configured the MergeContent processor as shown below.
But even after this implementation we are not getting all the records in HDFS. The source file has 10000000 records, and approximately 5000000 records should go to each HDFS directory. But we are getting around 1000000 records in each target, and the error shown below in the PutHDFS processors. We are getting the same error as in the snapshot attached to point 3 above. Are we missing something very basic here? Is there something wrong with the design? We are using a 3-node cluster with an NCM and 2 slave nodes, and the source file is coming to a standalone server. Let me know if you need any other information. Any inputs would be appreciated. Regards, Indranil Roy
08-31-2016
12:56 PM
@mclark We have a single large (in the TB range) flowfile coming to a standalone node, and we want to distribute the processing. Is it a good approach to split the file into multiple smaller files using the SplitText processor so that the processing is distributed to the other nodes of the cluster? In such a case we are considering the flow given below:
In the NCM of the cluster: input->RouteText->PutHDFS
In the standalone instance that has the incoming flow file: ListFile->FetchFile->SplitText->UpdateAttribute->RPG(NCM url)
Does this setup ensure that the processing is distributed?
08-29-2016
03:42 PM
Hi @Pierre Villard We are using NIFI 0.6.0. To answer your second question: a subset of the whole file comes to the ReplaceText processor based on the condition in the RouteText processor. Say we have 100 records and some 50 records satisfy the condition, then 50 records will come to the processor. The regular expression we are using is (.+)\|(.+)\|(.+)...., where (.+) is repeated n times based on the number of columns in the flowfile. So as per your observation we should be using ^(.+)\|(.+)\|(.+)....$? Any other suggestions to improve the performance?
08-29-2016
01:23 PM
2 Kudos
We have a source file with pipe-delimited rows and we need to fetch specific columns from the flow file. We are using a regular expression in a ReplaceText processor to extract the columns. The flow we are using is: ListFile->FetchFile->RouteText->ReplaceText->PutFile. The source file has some 21 columns and around 100000 records, and the file size is around 25 MB. As soon as I start the processor, the records get queued before the ReplaceText processor and the job runs indefinitely. In fact, even after stopping the job we are unable to empty the queue or even delete any processor, for that matter. The ReplaceText processor is configured as shown below. I have increased the Maximum Buffer Size to 10 MB (1 MB default) but it is of no use. Considering there are only 100000 records in the file (25 MB), this should not take so long? Is there anything wrong with the configuration or with the way we are using the flow? Any inputs would be very helpful. The system we are using has 16 GB RAM and 4 cores. Regards, Indranil Roy
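As an illustration only, a hedged sketch of the kind of ReplaceText configuration often used to pull a few fields out of delimited lines. The property names follow the ReplaceText processor as documented in recent NiFi releases, and the column positions here are placeholders rather than the actual 21-column layout:

Evaluation Mode      : Line-by-Line
Replacement Strategy : Regex Replace
Search Value         : ^([^|]*)\|([^|]*)\|([^|]*)\|.*$   (anchored; [^|]* avoids the heavy backtracking that repeated (.+) groups can cause)
Replacement Value    : $2|$3   (keeps only the 2nd and 3rd columns, as an example)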
08-25-2016
04:02 PM
@mclark
1) We are talking about a single file in the TB range.
2) There is a single file and the processing should be distributed.
3) The file is in the local directory.
So is it a good idea?
08-25-2016
12:39 PM
Hi @mclark Thanks for the alternate approach that you suggested; it could be helpful in my case. Say in the scenario mentioned above we have a single input file with a size in the order of TBs, and we use ListSFTP/FetchSFTP processors in the way you mentioned to distribute the fetching of data:
Do we need to establish an SFTP channel between every slave node of the cluster and the remote server that houses the source file for this approach to work?
Is it a good idea to use SFTP to fetch the file, considering the size of the file will be in TBs?
What are the parameters on which the performance of the fetch using ListSFTP/FetchSFTP will depend?
08-25-2016
06:03 AM
Thanks @Bryan Bende for your input. As per your suggestion I have included two different flows, as given below:
Flow1
=====
In the standalone instance: ListFile->FetchFile->OutputPort
In the NCM of the cluster: RPG(standalone NIFI instance)-->RouteText->PutHDFS (this is the processing done in the main cluster)
Flow2
=====
In the standalone instance: ListFile->FetchFile->Input port of RPG(NIFI cluster URL of the NCM)
In the NCM of the cluster: Input port->RouteText->OutputPort
Which, according to you, is the correct flow? I understand that the fetching of the source file cannot be distributed if the file is not shared, but which of the flows will be apt to distribute the part of the processing done inside the cluster? Regards, Indranil Roy
08-24-2016
01:23 PM
We have a NIFI setup wherein a NIFI cluster is installed on a Hadoop cluster and a standalone NIFI instance runs on another server. The input file will be generated in the file system of the server hosting the standalone instance. We fetch the file using the ListFile/FetchFile processors in the standalone instance. Then, in the main cluster, we connect to the standalone instance using an RPG (Remote Process Group) and send the output to the NIFI cluster (RPG) using site-to-site. As per my understanding, the part of the processing done inside the cluster will be distributed. Is this understanding correct? I would also like to know if there is a way to distribute the fetching of the source file that we are doing in the standalone NIFI instance. The flow we are using:
In the standalone instance: ListFile->FetchFile->OutputPort
In the NCM of the cluster: RPG(standalone NIFI instance)-->RPG(NIFI cluster); InputPort->RouteText->PutHDFS (this is the processing done in the main cluster)
Let me know if you need any other information. Any inputs will be appreciated. Regards, Indranil Roy
08-18-2016
10:56 AM
Thanks for the info
08-17-2016
04:17 PM
Thanks @Simon Elliston Ball for the input. I have a couple of queries regarding the second model (using ListFile + FetchFile) that you suggested:
How is the load balancing done when the file is pulled by the FetchFile processor? Is it done automatically, or do we have to use some load balancer to do it?
Will both the ListFile and FetchFile processors be created inside the GUI of the NCM, since we only have access to the GUI of the NCM of the cluster?
08-17-2016
12:05 PM
We have set up a NIFI cluster with an NCM and two slave nodes. We have created a simple flow in the NCM GUI to pull a file from the local file system of the NCM and, after updating some attributes, load it into the local file system of one of the slave nodes. The GetFile processor is unable to recognize the input directory, which refers to a path in the file system of the NCM. Is it possible to set the 'Input Directory' property in GetFile so that it points to a specific path in the local file system of either the NCM or the slave nodes?
08-11-2016
05:45 AM
Thanks a lot. It works.
08-10-2016
04:30 PM
It was in a stopped state. As I already pointed out, I was able to update other properties. This is specific to the "Search Value" property of the ReplaceText processor.
08-10-2016
04:20 PM
I am trying to update the properties of a ReplaceText processor using the REST API with the syntax given below:

curl -i -X PUT -H 'Content-Type: application/json' -d '{"revision":{"clientId":"7a1f42ec-f805-4869-a47c-27306a38490a"},"processor":{"id":"5a93362a-482a-42d0-9ef6-f965a08202eb","config":{"properties":{"Search Value":"abcd"}}}}' http://localhost:8080/nifi-api/controller/process-groups/root/processors/5a93362a-482a-42d0-9ef6-f965a08202eb

Although I am able to update properties like "Replacement Value" using the syntax given above, I am unable to update the "Search Value" property. When I execute it, it tries to create a dynamic property named "Search Value" instead of updating the original "Search Value" property provided by the ReplaceText processor. Is there any problem specific to this property? Has anyone faced this problem before?
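One hedged observation to illustrate a common cause of this behavior: the REST API matches properties by the processor's internal descriptor name, which can differ from the display name shown in the UI, and an unrecognized key is treated as a new dynamic property. Assuming (to be verified, for example by inspecting the property descriptors returned by a GET on the same processor URL) that the internal name behind ReplaceText's "Search Value" is "Regular Expression", the update would look like:

curl -i -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"clientId":"7a1f42ec-f805-4869-a47c-27306a38490a"},"processor":{"id":"5a93362a-482a-42d0-9ef6-f965a08202eb","config":{"properties":{"Regular Expression":"abcd"}}}}' \
  http://localhost:8080/nifi-api/controller/process-groups/root/processors/5a93362a-482a-42d0-9ef6-f965a08202eb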
08-10-2016
11:02 AM
But that is not possible in my scenario because I won't know the conditions and the number of PutFile processors in advance. So how can I add multiple PutFile processors? Can you please post an image of your flow? That will help me understand whether your case is the same.
08-09-2016
05:24 PM
I have a test file containing comma-separated values, e.g.
1,ab,cd
2,ef,gh
I want to select only the 2nd and 3rd columns and put them into HDFS. How do I select only the 2nd and 3rd columns from each line? The output should be:
ab,cd
ef,gh
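Conceptually (outside NiFi), the transformation being asked for is equivalent to the shell one-liner below, shown only to pin down the expected output; inside NiFi the same line-by-line column selection is usually done with a regex-based ReplaceText (or a similar processor) before PutHDFS. The file name is a placeholder:

# keep only the 2nd and 3rd comma-separated fields of each line
cut -d',' -f2,3 test.csv
# 1,ab,cd -> ab,cd
# 2,ef,gh -> ef,gh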
08-09-2016
02:30 PM
I created a flow as shown below, and configured the RouteText processor as shown below. Finally, in the PutFile processor I added the expression "/user/tmp/${RouteText.Route:substringBefore('.')" but I cannot find any output file in the paths /user/tmp/TEST1 and /user/tmp/TEST2. However, when I create the same flow with 2 PutFile processors and hardcode the directories, it works fine. Am I missing something here?