Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3137 | 10-11-2017 09:33 PM
 | 2675 | 10-11-2017 07:46 PM
 | 2028 | 08-04-2017 01:37 PM
 | 1791 | 08-03-2017 03:36 PM
 | 1679 | 08-03-2017 12:52 PM
03-30-2016
03:04 PM
4 Kudos
@nejm hadj First I’ll answer your question and then I’ll make my recommendation.
Answer: The name of the file does not matter. When setting up a Hive external table, just specify the data source as the folder that will contain all the files (regardless of their names).
Details on setting up an external table: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
Details on reading/parsing JSON files into Hive: http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
(Alternatively, you can convert JSON to CSV within NiFi. To do so, follow the NiFi portion of this example: https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html)
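To make the answer concrete, here is a minimal sketch of an external table pointed at a folder of JSON files. The HDFS path, table name, columns, and the use of the hive-hcatalog JsonSerDe (which needs hive-hcatalog-core on the classpath) are assumptions for illustration only:
# Hypothetical folder that NiFi lands the tweet files into
hadoop fs -mkdir -p /user/nifi/tweets
# External table over the whole folder; the individual file names do not matter
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw (
  id BIGINT,
  created_at STRING,
  tweet_text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/nifi/tweets';"
Any file dropped into that folder afterwards is picked up automatically the next time the table is queried.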
Recommendation: HDFS prefers large files with many entries as opposed to many files with small entries. The main reason is that for each file landed on HDFS, file information is saved in the NameNode (in memory). If you’re putting each Twitter message in a separate file you will quickly fill up your NameNode’s memory and overload the server. I suggest you aggregate multiple messages into one file before writing to HDFS. This can be done with the MergeContent processor in NiFi. Take a look at the below screenshots showing how it would be set up. Also, take a look at the NiFi Twitter_Dashboard.xml example template (https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/Twitter_Dashboard.xml). You can import this into your NiFi by clicking on Templates (third icon from right), which will launch the 'NiFi Flow templates' popup, and selecting the file.
03-30-2016
02:09 PM
@mike pal The link below should cover your requirements. It describes a strategy for incremental updates/ingest and also covers the scenario where base data may change: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
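As a rough sketch of the pattern described in that post (the connection string, table, key, and timestamp column names below are all hypothetical):
# 1) Ingest: pull only new/changed rows from the source into an incremental dir/table
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental lastmodified --check-column last_update_ts \
  --last-value '2016-03-01 00:00:00' \
  --target-dir /user/hive/external/orders_incremental
# 2) Reconcile: keep only the newest version of each key across base + incremental data
#    (the blog's remaining steps then compact this view into a new base table and purge)
hive -e "
CREATE VIEW IF NOT EXISTS orders_reconciled AS
SELECT t1.*
FROM (SELECT * FROM orders_base UNION ALL SELECT * FROM orders_incremental) t1
JOIN (SELECT order_id, MAX(last_update_ts) AS max_ts
      FROM (SELECT * FROM orders_base UNION ALL SELECT * FROM orders_incremental) t2
      GROUP BY order_id) s
  ON t1.order_id = s.order_id AND t1.last_update_ts = s.max_ts;"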
03-25-2016
06:42 AM
1 Kudo
This should do it:

/**
 * If this is an incremental import then we save the user's state back to the metastore.
 */
private void saveIncrementalState(SqoopOptions options) throws IOException {
  if (!isIncremental(options)) {
    return; // Nothing to persist for non-incremental imports.
  }
  Map<String, String> descriptor = options.getStorageDescriptor();
  String jobName = options.getJobName();
  if (null != jobName && null != descriptor) {
    // Only saved jobs (those with a name and a storage descriptor) carry state.
    LOG.info("Saving incremental import state to the metastore");
    JobStorageFactory ssf = new JobStorageFactory(options.getConf());
    JobStorage storage = ssf.getJobStorage(descriptor);
    storage.open(descriptor);
    try {
      JobData data = new JobData(options.getParent(), this);
      storage.update(jobName, data);
      LOG.info("Updated data for job: " + jobName);
    } finally {
      storage.close();
    }
  }
}
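For context, this is the state that gets written when you run a saved job with an incremental import, for example (connection string, table, and column names below are hypothetical):
# Create a saved job; its incremental state lives in the Sqoop metastore
sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental append --check-column order_id --last-value 0 \
  --target-dir /user/hadoop/orders
# Each successful execution updates the stored --last-value via saveIncrementalState()
sqoop job --exec orders_incremental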
03-25-2016
05:49 AM
2 Kudos
It is mandatory for now. Quoting @ebergenholtz: "...in HDP 2.3x, Atlas is required to be running in order for the Hive post execution hook for Atlas to fire properly since it is using a REST API to propagate the metadata. This limitation is being addressed by introducing a messaging layer which decouples Atlas from Hive such that the state of the Atlas process does not affect the execution of Hive queries." Source: https://community.hortonworks.com/questions/22396/ranger-dependency-on-atlas.html
The above also applies to HDP 2.4 and will be addressed in an upcoming release of HDP.
03-25-2016
12:32 AM
2 Kudos
Yes, Kafka supports SSL/HMAC/Kerberos.
Support for Kerberos was introduced with Kafka 0.8.2 (HDP 2.3.2). See instructions on how to enable it in the link below:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_secure-kafka-ambari/content/ch_secure-kafka-overview.html
Kafka SSL communication was introduced with Kafka 0.9 (HDP 2.3.4). See instructions on how to enable it in the link below:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Security_Guide/content/ch_wire-kafka.html
Also, find below a few more interesting Kafka security links and articles that are worth reading:
http://henning.kropponline.de/2016/02/21/secure-kafka-java-producer-with-kerberos/
http://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption
http://henning.kropponline.de/2015/11/15/kafka-security-with-kerberos/
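As a quick illustration of the SSL side on Kafka 0.9, a client config plus a console-producer test might look roughly like the following; the broker host/port, truststore path, and password are placeholders, and exact property names can differ between Kafka/HDP versions:
# Hypothetical client security settings
cat > client-ssl.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=/etc/security/clientkeys/kafka.client.truststore.jks
ssl.truststore.password=changeit
EOF
# Point the console producer at the broker's SSL listener to verify the setup
bin/kafka-console-producer.sh --broker-list broker1.example.com:9093 \
  --topic test --producer.config client-ssl.properties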
03-24-2016
05:21 AM
2 Kudos
@Viraj Vekaria Option 2: I noticed you mentioned WinSCP, so I’m assuming the import job is for the initial data load only (or may run occasionally) and will be a manual process. If that is the case, the easiest thing to do is copy the files over to the cluster’s local file system and then use the command line to put them into HDFS.
1) Copy the files from your Windows machine to the cluster’s Linux file system using WinSCP.
2) Create a directory in HDFS using the “hadoop fs -mkdir” command. It takes one or more path URIs as arguments and creates a directory at each one.
# hadoop fs -mkdir <paths>
# Example:
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2 /user/hadoop/dir3
3) Copy the files from the local file system to HDFS using the “hadoop fs -put” command. It copies one or more source files from the local file system to the Hadoop Distributed File System.
# hadoop fs -put <local-src> ... <HDFS_dest_path>
# Example:
hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
For more command-line operations (delete files, list files, etc.), take a look at the links below:
http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
03-23-2016
08:17 PM
1 Kudo
One way would be to use NiFi/HDF. You would create a ListFile processor to read the list of files in a folder and then pass it on to a GetFile processor (if you want to delete the original file) or a FetchFile processor (if you want to keep the original). You would then use a PutHDFS processor to land the files in HDFS.
GetFile: Streams the contents of a file from a local disk (or network-attached disk) into NiFi and then deletes the original file. This processor is expected to move the file from one location to another and is not to be used for copying the data.
FetchFile: Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.
ListHDFS: Monitors a user-specified directory in HDFS and emits a FlowFile containing the filename for each file that it encounters. It then persists this state across the entire NiFi cluster by way of a Distributed Cache. These FlowFiles can then be fanned out across the cluster and sent to the FetchHDFS/GetFile processor, which is responsible for fetching the actual content of those files and emitting FlowFiles that contain the content fetched from HDFS.
PutHDFS: Writes FlowFile data to the Hadoop Distributed File System (HDFS).
The flow you create in NiFi would continuously monitor your folder for new files and move them over. If it's only a one-time ingest that you're interested in, then you can just disable NiFi after you're done.
Resources:
https://nifi.apache.org/docs.html
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
03-23-2016
07:29 PM
3 Kudos
@Mahesh Deshmukh Here you go: https://cwiki.apache.org/confluence/display/Hive/MultiDelimitSerDe
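For quick reference, a minimal sketch of what that looks like in practice; the table name and the '||' delimiter are made up for illustration, and the SerDe ships in hive-contrib, which may need to be added to the classpath (e.g. via ADD JAR) on older Hive releases:
# Table whose fields are separated by a multi-character delimiter
hive -e "
CREATE TABLE multi_delim_test (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='||')
STORED AS TEXTFILE;"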
03-23-2016
12:28 PM
5 Kudos
@justin zhang Currently, NiFi does not have dedicated processors for different types of databases (Oracle, MySQL, etc.), and hence no notion of CDC (change data capture). To copy from one database to another you would use the ExecuteSQL and PutSQL processors. You would also configure the ExecuteSQL processor to run at appropriate time intervals (depending on your requirements).
ExecuteSQL: Executes a user-defined SQL SELECT command, writing the results to a FlowFile in Avro format.
PutSQL: Updates a database by executing the SQL DML statement defined by the FlowFile’s content.
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#database-access
Documentation for Processors: https://nifi.apache.org/docs.html