Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3137 | 10-11-2017 09:33 PM
 | 2675 | 10-11-2017 07:46 PM
 | 2028 | 08-04-2017 01:37 PM
 | 1791 | 08-03-2017 03:36 PM
 | 1679 | 08-03-2017 12:52 PM
03-30-2016
03:04 PM
4 Kudos
@nejm hadj First I’ll answer your question and then I’ll make my recommendation.
Answer: The name of the file does not matter. When setting up a Hive external table, just specify the data source as the folder that will contain all the files (regardless of their names).
Details on setting up an external table: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
Details on reading/parsing JSON files into Hive: http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
(Alternatively, you can convert JSON to CSV within NiFi. To do so, follow the NiFi portion of this example: https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html)
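To make the answer concrete, here is a minimal sketch of an external table pointed at a folder of JSON files. The HDFS path, table name, columns, and the use of the hive-hcatalog JsonSerDe (which needs hive-hcatalog-core on the classpath) are assumptions for illustration only:
# Hypothetical folder that NiFi lands the tweet files into
hadoop fs -mkdir -p /user/nifi/tweets
# External table over the whole folder; the individual file names do not matter
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw (
  id BIGINT,
  created_at STRING,
  tweet_text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/nifi/tweets';"
Any file dropped into that folder afterwards is picked up automatically the next time the table is queried.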
Recommendation: HDFS prefers large files with many entries as opposed to many files with small entries. The main reason is that for each file landed on HDFS, file information is saved in the NameNode (in memory). If you’re putting each Twitter message in a separate file you will quickly fill up your NameNode’s memory and overload the server. I suggest you aggregate multiple messages into one file before writing to HDFS. This can be done with the MergeContent processor in NiFi. Take a look at the below screenshots showing how it would be set up. Also, take a look at the NiFi Twitter_Dashboard.xml example template (https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/Twitter_Dashboard.xml). You can import this into your NiFi by clicking on Templates (third icon from right), which will launch the 'NiFi Flow templates' popup, and selecting the file.
03-30-2016
02:09 PM
@mike pal The link below should cover your requirements. It describes a strategy for incremental updates/ingest and also covers the scenario where base data may change: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
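As a rough sketch of the pattern described in that post (the connection string, table, key, and timestamp column names below are all hypothetical):
# 1) Ingest: pull only new/changed rows from the source into an incremental dir/table
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental lastmodified --check-column last_update_ts \
  --last-value '2016-03-01 00:00:00' \
  --target-dir /user/hive/external/orders_incremental
# 2) Reconcile: keep only the newest version of each key across base + incremental data
#    (the blog's remaining steps then compact this view into a new base table and purge)
hive -e "
CREATE VIEW IF NOT EXISTS orders_reconciled AS
SELECT t1.*
FROM (SELECT * FROM orders_base UNION ALL SELECT * FROM orders_incremental) t1
JOIN (SELECT order_id, MAX(last_update_ts) AS max_ts
      FROM (SELECT * FROM orders_base UNION ALL SELECT * FROM orders_incremental) t2
      GROUP BY order_id) s
  ON t1.order_id = s.order_id AND t1.last_update_ts = s.max_ts;"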
03-25-2016
06:42 AM
1 Kudo
This should do it:

/**
 * If this is an incremental import then we save the user's state back to the metastore.
 */
private void saveIncrementalState(SqoopOptions options) throws IOException {
  if (!isIncremental(options)) {
    return; // Nothing to persist for non-incremental imports.
  }
  Map<String, String> descriptor = options.getStorageDescriptor();
  String jobName = options.getJobName();
  if (null != jobName && null != descriptor) {
    // Only saved jobs (those with a name and a storage descriptor) carry state.
    LOG.info("Saving incremental import state to the metastore");
    JobStorageFactory ssf = new JobStorageFactory(options.getConf());
    JobStorage storage = ssf.getJobStorage(descriptor);
    storage.open(descriptor);
    try {
      JobData data = new JobData(options.getParent(), this);
      storage.update(jobName, data);
      LOG.info("Updated data for job: " + jobName);
    } finally {
      storage.close();
    }
  }
}
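For context, this is the state that gets written when you run a saved job with an incremental import, for example (connection string, table, and column names below are hypothetical):
# Create a saved job; its incremental state lives in the Sqoop metastore
sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental append --check-column order_id --last-value 0 \
  --target-dir /user/hadoop/orders
# Each successful execution updates the stored --last-value via saveIncrementalState()
sqoop job --exec orders_incremental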
03-25-2016
05:49 AM
2 Kudos
It is mandatory for now. Quoting @ebergenholtz: "...in HDP 2.3x, Atlas is required to be running in order for the Hive post execution hook for Atlas to fire properly since it is using a REST API to propagate the metadata. This limitation is being addressed by introducing a messaging layer which decouples Atlas from Hive such that the state of the Atlas process does not affect the execution of Hive queries." Source: https://community.hortonworks.com/questions/22396/ranger-dependency-on-atlas.html
The above also applies to HDP 2.4 and will be addressed in an upcoming release of HDP.
03-25-2016
12:32 AM
2 Kudos
Yes, Kafka supports SSL/HMAC/Kerberos.
Support for Kerberos was introduced with Kafka 0.8.2 (HDP 2.3.2). See instructions on how to enable it in the link below:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_secure-kafka-ambari/content/ch_secure-kafka-overview.html
Kafka SSL communication was introduced with Kafka 0.9 (HDP 2.3.4). See instructions on how to enable it in the link below:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Security_Guide/content/ch_wire-kafka.html
Also, find below a few more interesting Kafka security links and articles that are worth reading:
http://henning.kropponline.de/2016/02/21/secure-kafka-java-producer-with-kerberos/
http://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption
http://henning.kropponline.de/2015/11/15/kafka-security-with-kerberos/
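As a quick illustration of the SSL side on Kafka 0.9, a client config plus a console-producer test might look roughly like the following; the broker host/port, truststore path, and password are placeholders, and exact property names can differ between Kafka/HDP versions:
# Hypothetical client security settings
cat > client-ssl.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=/etc/security/clientkeys/kafka.client.truststore.jks
ssl.truststore.password=changeit
EOF
# Point the console producer at the broker's SSL listener to verify the setup
bin/kafka-console-producer.sh --broker-list broker1.example.com:9093 \
  --topic test --producer.config client-ssl.properties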
03-24-2016
05:21 AM
2 Kudos
@Viraj Vekaria Option 2: I noticed you mentioned WinSCP, so I’m assuming the import job is for the initial data load only (or may run occasionally) and will be a manual process. If that is the case, the easiest thing to do is copy the files over to the cluster’s local file system and then use the command line to put them into HDFS.
1) Copy the files from your Windows machine to the cluster’s Linux file system using WinSCP.
2) Create a directory in HDFS using the “hadoop fs -mkdir” command. It takes one or more path URIs as arguments and creates a directory at each one.
# hadoop fs -mkdir <paths>
# Example:
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2 /user/hadoop/dir3
3) Copy the files from the local file system to HDFS using the “hadoop fs -put” command. It copies one or more source files from the local file system to the Hadoop Distributed File System.
# hadoop fs -put <local-src> ... <HDFS_dest_path>
# Example:
hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
For more command-line operations (delete files, list files, etc.), take a look at the links below:
http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
03-23-2016
08:17 PM
1 Kudo
One way would be to use NiFi/HDF. You would create a ListFile processor to read the list of files in a folder and then pass it on to a GetFile processor (if you want to delete the original file) or a FetchFile processor (if you want to keep the original). You would then use a PutHDFS processor to land the files in HDFS.
GetFile: Streams the contents of a file from a local disk (or network-attached disk) into NiFi and then deletes the original file. This processor is expected to move the file from one location to another and is not to be used for copying the data.
FetchFile: Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.
ListHDFS: Monitors a user-specified directory in HDFS and emits a FlowFile containing the filename for each file that it encounters. It then persists this state across the entire NiFi cluster by way of a Distributed Cache. These FlowFiles can then be fanned out across the cluster and sent to the FetchHDFS/GetFile processor, which is responsible for fetching the actual content of those files and emitting FlowFiles that contain the content fetched from HDFS.
PutHDFS: Writes FlowFile data to the Hadoop Distributed File System (HDFS).
The flow you create in NiFi would continuously monitor your folder for new files and move them over. If it's only a one-time ingest that you're interested in, then you can just disable NiFi after you're done.
Resources:
https://nifi.apache.org/docs.html
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
03-23-2016
07:29 PM
3 Kudos
@Mahesh Deshmukh Here you go: https://cwiki.apache.org/confluence/display/Hive/MultiDelimitSerDe
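For quick reference, a minimal sketch of what that looks like in practice; the table name and the '||' delimiter are made up for illustration, and the SerDe ships in hive-contrib, which may need to be added to the classpath (e.g. via ADD JAR) on older Hive releases:
# Table whose fields are separated by a multi-character delimiter
hive -e "
CREATE TABLE multi_delim_test (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='||')
STORED AS TEXTFILE;"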
03-23-2016
12:28 PM
5 Kudos
@justin zhang Currently, NiFi does not have dedicated processors for different types of databases (Oracle, MySQL, etc.), and hence no notion of CDC (change data capture). To copy from one database to another you would use the ExecuteSQL and PutSQL processors. You would also configure the ExecuteSQL processor to run at appropriate time intervals (depending on your requirements).
ExecuteSQL: Executes a user-defined SQL SELECT command, writing the results to a FlowFile in Avro format.
PutSQL: Updates a database by executing the SQL DML statement defined by the FlowFile’s content.
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#database-access
Documentation for Processors: https://nifi.apache.org/docs.html