Member since: 04-27-2016
Posts: 218
Kudos Received: 133
Solutions: 25
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3609 | 08-31-2017 03:34 PM |
| | 7470 | 02-08-2017 03:17 AM |
| | 3295 | 01-24-2017 03:37 AM |
| | 10593 | 01-19-2017 03:57 AM |
| | 6024 | 01-17-2017 09:51 PM |
10-21-2016
02:21 PM
It seems I fixed this by using the ConvertCharacterSet processor. I will test more.
10-17-2016
02:22 PM
I didn't copy the flow.xml.gz but imported some templates, which must have updated the flow.xml. What's the fix to restart the NiFi instance?
09-19-2016
09:06 PM
1 Kudo
You can use the JMSConnectionFactoryProvider controller service to specify your vendor-specific details: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.jms.cf.JMSConnectionFactoryProvider/index.html
08-25-2016
08:16 PM
1 Kudo
@milind pandit That's a loaded question. First you have to define what the unique entity is. Once that's settled, you can use tools like Pig to parse through the data and produce a single record. This can also be done in Hive by using a GROUP BY on your natural key to return a single record from the source. Lastly, you can use tools like Informatica or Talend to do the same. A sketch of the Hive approach is below.
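To make the Hive option concrete, here is a minimal sketch. The table and column names (raw_events, events_deduped, event_id, payload, load_ts) are hypothetical placeholders, and it uses ROW_NUMBER() over the natural key rather than a plain GROUP BY so that the whole latest record is kept intact:

```bash
# Hypothetical table and column names; the window function keeps the most
# recent record per natural key (event_id) based on a load timestamp.
hive -e "
INSERT OVERWRITE TABLE events_deduped
SELECT event_id, payload, load_ts
FROM (
  SELECT event_id, payload, load_ts,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY load_ts DESC) AS rn
  FROM raw_events
) ranked
WHERE rn = 1;
"
```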
08-25-2016
08:05 PM
Another option would be to pre-convert XML to JSON.
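For example, a minimal pre-conversion sketch run outside NiFi before ingestion; the file names are placeholders and it relies on the third-party xmltodict package (pip install xmltodict):

```bash
python3 - <<'EOF'
# Convert one XML file to JSON before handing it to the flow.
import json
import xmltodict  # third-party package: pip install xmltodict

with open("input.xml", "rb") as src:      # placeholder input path
    doc = xmltodict.parse(src)

with open("output.json", "w") as dst:     # placeholder output path
    json.dump(doc, dst, indent=2)
EOF
```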
09-01-2016
10:13 PM
2 Kudos
We're going through this process now, migrating a non-trivial amount of data from an older cluster onto a new cluster and environment. We have a couple of requirements and constraints that limited some of the options:
- The datanodes on the two clusters don't have network connectivity. Each cluster resides in its own private firewalled network. (As an added complication, we also use the same hostnames in each of the two private environments.) distcp scaling requires the datanodes in the two clusters to be able to communicate directly.
- We have different security models in the two clusters. The old cluster uses simple authentication; the new cluster uses Kerberos. I've found that getting some of the tools to work with two different authentication models can be difficult.
- I want to preserve the file metadata from the old cluster on the new cluster - e.g. file create time, ownership, file system permissions. Some of the options can move the data from the source cluster, but they write 'new' files on the target cluster. The old cluster has been running for around two years, so there's a lot of useful information in those file timestamps.
- I need to perform a near-live migration. I have to keep the old cluster running in parallel while migrating data and users to the new cluster; I can't just cut access to the old cluster.
After trying a number of tools and combinations, including WebHDFS and Knox, we've settled on the following:
Export the old cluster via NFS gateways. We lock the NFS access controls down so that only the edge servers on the new cluster can mount the HDFS NFS volume. The edge servers in our target cluster are Airflow workers running as a grid, and we've created a source NFS gateway for each target Airflow worker, enabling a degree of scale-out - not as good as distcp scale-out, but better than a single pipe. A sketch of the gateway export and mount is below.
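A rough sketch of that setup; the hostnames, mount point, and allowed-hosts list are placeholders rather than our actual configuration:

```bash
# On the old cluster's NFS gateway hosts: restrict the HDFS NFS exports to the
# new cluster's edge servers (hdfs-site.xml), e.g.
#   <property>
#     <name>nfs.exports.allowed.hosts</name>
#     <value>edge1.newcluster.example rw;edge2.newcluster.example rw</value>
#   </property>

# On each new-cluster edge server (Airflow worker): mount its assigned gateway
# using the mount options from the HDFS NFS gateway documentation.
sudo mkdir -p /mnt/old_hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync \
    nfsgw1.oldcluster.example:/ /mnt/old_hdfs
```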
Run good old-fashioned hdfs dfs -copyFromLocal -p <old_cluster_nfs_dir> <new_cluster_hdfs_dir>. This enables us to preserve the file timestamps as well as ownerships.
As part of managing the migration process, we're also making use of HDFS snapshots on both source and target to enable consistency management. Our migration jobs take snapshots at the beginning and end of each run and issue delta/difference reports to identify whether data was modified and possibly missed during the migration. I'm expecting that some of our larger data sets will take hours to complete - for the largest few, possibly more than 24 hours. To perform the snapshot management we also added some wrapper code; WebHDFS can be used to create and list snapshots, but it doesn't yet have an operation for returning a snapshot difference report. A sketch of the copy-plus-snapshot loop is below.
For the Hive metadata, the majority of our Hive DDL exists in git/source code control, and we're using this migration as an opportunity to enforce that for our production objects. For end-user objects, e.g. analysts' data labs, we're exporting the DDL on the old cluster and replaying it on the new cluster, with tweaks for any reserved-word collisions.
We don't have HBase operating on our old cluster, so I didn't have to come up with a solution for that problem.
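A minimal sketch of one migration job built around that copy plus the snapshot bookkeeping, assuming snapshots have already been allowed on the source path (hdfs dfsadmin -allowSnapshot); all paths and snapshot names are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder paths: the dataset as seen through the NFS mount, the same path
# on the old cluster's HDFS, and the destination path on the new cluster.
SRC_NFS=/mnt/old_hdfs/data/some_dataset
SRC_HDFS=/data/some_dataset
DEST=/data/some_dataset
RUN=$(date +%Y%m%d%H%M%S)

# Snapshot the source before copying so changes made during the copy can be
# detected afterwards. In our setup these source-side calls actually go through
# the WebHDFS wrapper mentioned above; plain hdfs commands are shown here for
# readability, run with a client configuration pointing at the old cluster.
hdfs dfs -createSnapshot "$SRC_HDFS" "pre_migration_$RUN"

# Copy through the NFS mount, preserving timestamps, ownership and permissions.
hdfs dfs -copyFromLocal -p "$SRC_NFS" "$DEST"

# Snapshot again and report what changed while the copy ran; a non-empty diff
# means those paths need another pass before cut-over.
hdfs dfs -createSnapshot "$SRC_HDFS" "post_migration_$RUN"
hdfs snapshotDiff "$SRC_HDFS" "pre_migration_$RUN" "post_migration_$RUN"
```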
09-22-2016
10:35 AM
Awesome! Thanks Dominika!
08-16-2016
04:04 PM
Yes, I did this, but my output directory is d:/abc/${path}
02-07-2017
02:13 AM
The bucket was of course created, and I could access it via the S3 browser as well as the S3 command line.
11-30-2016
01:44 PM
2016-11-29 14:50:59,544 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@6dc5e857 checkpointed with 3 Records and 0 Swap Files in 25 milliseconds (Stop-the-world time = 11 milliseconds, Clear Edit Logs time = 9 millis), max Transaction ID 8
2016-11-29 14:51:06,659 WARN [Timer-Driven Process Thread-7] o.apache.hadoop.hdfs.BlockReaderFactory I/O error constructing remote block reader.
java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[na:1.8.0_111]
2016-11-29 14:51:06,659 WARN [Timer-Driven Process Thread-7] org.apache.hadoop.hdfs.DFSClient Failed to connect to sandbox.hortonworks.com/127.0.0.1:50010 for block, add to deadNodes and continue. java.io.IOException: An existing connection was forcibly closed by the remote host
java.io.IOException: An existing connection was forcibly closed by the remote host
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
2016-11-29 14:51:06,660 WARN [Timer-Driven Process Thread-7] org.apache.hadoop.hdfs.DFSClient Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv No live nodes contain current block Block locations: 172.17.0.2:50010 Dead nodes: 172.17.0.2:50010. Throwing a BlockMissingException
2016-11-29 14:51:06,660 WARN [Timer-Driven Process Thread-7] org.apache.hadoop.hdfs.DFSClient Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv No live nodes contain current block Block locations: 172.17.0.2:50010 Dead nodes: 172.17.0.2:50010. Throwing a BlockMissingException
2016-11-29 14:51:06,660 WARN [Timer-Driven Process Thread-7] org.apache.hadoop.hdfs.DFSClient DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:889) [hadoop-hdfs-2.6.2.jar:na]
2016-11-29 14:51:06,660 ERROR [Timer-Driven Process Thread-7] o.apache.nifi.processors.hadoop.GetHDFS GetHDFS[id=abb1f7a5-0158-1000-f1d4-ef83203b4aa1] Error retrieving file hdfs://sandbox.hortonworks.com:8020/user/admin/Data/trucks.csv from HDFS due to org.apache.nifi.processor.exception.FlowFileAccessException: Failed to import data from org.apache.hadoop.hdfs.client.HdfsDataInputStream@7bea77c5 for StandardFlowFileRecord[uuid=34551c53-72ad-40fa-927d-5ac60fe6d83e,claim=,offset=0,name=712611918461157,size=0] due to org.apache.nifi.processor.exception.FlowFileAccessException: Unable to create ContentClaim due to org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv: org.apache.nifi.processor.exception.FlowFileAccessException: Failed to import data from org.apache.hadoop.hdfs.client.HdfsDataInputStream@7bea77c5 for StandardFlowFileRecord[uuid=34551c53-72ad-40fa-927d-5ac60fe6d83e,claim=,offset=0,name=712611918461157,size=0] due to org.apache.nifi.processor.exception.FlowFileAccessException: Unable to create ContentClaim due to org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv
2016-11-29 14:51:06,661 ERROR [Timer-Driven Process Thread-7] o.apache.nifi.processors.hadoop.GetHDFS
org.apache.nifi.processor.exception.FlowFileAccessException: Failed to import data from org.apache.hadoop.hdfs.client.HdfsDataInputStream@7bea77c5 for StandardFlowFileRecord[uuid=34551c53-72ad-40fa-927d-5ac60fe6d83e,claim=,offset=0,name=712611918461157,size=0] due to org.apache.nifi.processor.exception.FlowFileAccessException: Unable to create ContentClaim due to org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv
    at org.apache.nifi.controller.repository.StandardProcessSession.importFrom(StandardProcessSession.java:2479) ~[na:na]
Caused by: org.apache.nifi.processor.exception.FlowFileAccessException: Unable to create ContentClaim due to org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742577_1761 file=/user/admin/Data/trucks.csv
    at org.apache.nifi.controller.repository.StandardProcessSession.importFrom(StandardProcessSession.java:2472) ~[na:na]
    ... 14 common frames omitted