Member since
11-19-2015
158
Posts
25
Kudos Received
21
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 11720 | 09-01-2018 01:27 AM |
| 1096 | 09-01-2018 01:18 AM |
| 3663 | 08-20-2018 09:39 PM |
| 484 | 07-20-2018 04:51 PM |
| 1461 | 07-16-2018 09:41 PM |
11-21-2017
09:45 PM
Without knowing how you are executing the whole process, it sounds like you ran spark-submit from a Docker container, so only the initial spark-submit process happened inside Docker. If you have mounted the HADOOP_CONF directory into the container, then this is no different from running it outside the container. Additionally, if you submitted in cluster mode to YARN, then the Spark application master / driver and executors are no different from regular YARN processes; whereas if you submitted in client mode, the Spark driver remains inside the Docker container until the Spark application ends.
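For reference, a hedged sketch of a cluster-mode submission from inside a container (the config mount path and application file name are assumptions, not from the original post):
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
With --deploy-mode client instead, the same command keeps the driver inside the container for the lifetime of the job.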
... View more
11-15-2017
07:29 PM
Confluent is the company founded by Kafka's original creators, and it provides commercial support for Kafka. I personally would trust their code more than someone else's.
... View more
11-14-2017
07:31 PM
1 Kudo
@Swaapnika Guntaka You could use Spark Streaming in PySpark to consume the topic and write the data to HDFS. You could also use HDF with NiFi and skip Python entirely. Note that the library linked below is a Python Kafka client from Confluent; it is not related to Kafka Connect. https://github.com/confluentinc/confluent-kafka-python
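A minimal PySpark Streaming sketch of that approach (not from the original post; it assumes the spark-streaming-kafka-0-8 package is on the classpath, and the broker, topic, and HDFS path are placeholders):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-hdfs")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

# Direct stream from Kafka; broker address and topic name are assumptions
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["my_topic"],
    kafkaParams={"metadata.broker.list": "broker1:6667"})

# Write each batch's message values as text files under the given HDFS prefix
stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/kafka/my_topic")

ssc.start()
ssc.awaitTermination()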
... View more
11-03-2017
09:39 PM
Yes. This feature exists in many forms:
- Flume
- MapReduce using Camus or Apache Gobblin
- Spark Streaming
- NiFi
- StreamSets
- Kafka Connect
Depending on what tools you have available, it's up to you to decide which makes the most sense.
... View more
11-02-2017
02:42 AM
Unless I am mistaken, Ambari only checks that the DataNode / NodeManager process is running, not that a network connection from the DataNode to the ResourceManager is possible. SSH to the DataNodes, try to telnet to the ResourceManager port, and report back.
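For example (the ResourceManager hostname and port 8050 are assumptions for an HDP cluster; adjust to yours):
telnet resourcemanager.fqdn 8050
nc -vz resourcemanager.fqdn 8050   # alternative if telnet is not installed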
... View more
11-02-2017
02:34 AM
@Divya Sodha I think you may be confusing the purpose of RAM and a hard drive. Even a "RAMDisk" is the reverse of what you are asking - it puts files into your RAM. If you need lighter resource usage, you are welcome to create your own base VM, install Ambari following the HDP documentation, then install the minimal set of components you need for your learning purposes. The only reason the sandbox needs 8+ GB is to run the majority of the HDP components. Also, many of the Hadoop processes run in Java, which uses a configurable heap size set in the Ambari configurations. I have been able to run a single-node Hadoop cluster within 4-6 GB of RAM, depending on what other services I had. Keep in mind, your OS itself needs 1-2 GB of RAM.
... View more
10-30-2017
04:18 AM
1 Kudo
Yes. MirrorMaker does not impose a limitation on remote vs. local clusters; it is designed for remote clusters because there is almost no need to mirror locally. If you are mirroring a topic locally, you must rename it, and if you are going to rename it, then you would have consumers/producers using data in both topics. You would be replicating data within the same cluster for little gain, while your consumers/producers could easily be configured to use the correct topic(s) instead.
... View more
10-30-2017
04:15 AM
Do you need auditing in your system? If so, then no. If not, then yes, but then why did you have it enabled in the first place?
... View more
10-27-2017
09:06 PM
That's the Ranger Kafka plugin writing its audit logs to HDFS. https://github.com/apache/ranger/blob/master/plugin-kafka/scripts/install.properties#L65 Log into the Ranger Admin UI if you want to disable it.
... View more
10-27-2017
08:57 PM
At its most basic, you would write a consumer that reads from one topic and a producer that writes to another. MirrorMaker is what you are looking for. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_kafka-component-guide/content/ch_kafka_mirrormaker.html
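A hedged example invocation (the property file names and topic are hypothetical; the consumer config points at the source cluster and the producer config at the target cluster):
kafka-mirror-maker.sh \
  --consumer.config source-consumer.properties \
  --producer.config target-producer.properties \
  --whitelist "my_topic"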
... View more
10-26-2017
03:50 PM
Do you have a working sqoop command?
With that information, you can create an hourly oozie job.
Start with a one-off workflow.xml file - find the documentation here:
https://oozie.apache.org/docs/4.2.0/DG_SqoopActionExtension.html
Make sure you can run the workflow before working on the coordinator.
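As a hedged illustration (not part of the original answer), a minimal workflow.xml with a single Sqoop action might look like the following; the DB2 connection string, table, and target directory are placeholders:
<workflow-app name="DB2-Export-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Placeholder DB2 connection string, table, and target directory -->
            <command>import --connect jdbc:db2://db2-host:50000/MYDB --table MYTABLE --target-dir /data/db2/mytable -m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>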
Then you can make an hourly coordinator like this and put the "jobStart" and "jobEnd" properties in the oozie config file.
<coordinator-app name="DB2-Export"
frequency="${coord:hours(1)}"
start="${jobStart}" end="${jobEnd}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.2">
<controls>
<concurrency>1</concurrency>
<execution>FIFO</execution>
<throttle>1</throttle>
</controls>
<action>
<workflow>
<app-path>${wf_application_path}</app-path>
</workflow>
</action>
</coordinator-app>
You would execute this like:
oozie job -config db2-export-coord.properties -run
where that properties file might contain:
jobTracker=namenode.fqdn:8050
nameNode=hdfs://hadoop_cluster
wf_application_path=hdfs://path/to/db2-export/
oozie.coord.application.path=${wf_application_path}
jobStart=2017-11-01T09:00Z
jobEnd=2099-11-09T09:00Z
... View more
10-25-2017
10:00 PM
2 Kudos
Yes, Camus is deprecated in favor of Gobblin. If you don't have NiFi, Confluent has packaged Kafka Connect specifically for transferring data between various sources and sinks, such as HDFS. https://www.confluent.io/product/connectors/ https://docs.confluent.io/current/connect/connect-hdfs/docs/hdfs_connector.html
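A hedged sketch of an HDFS sink connector configuration (assuming the Confluent HDFS connector is installed; the topic name and NameNode URL are placeholders):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my_topic
hdfs.url=hdfs://namenode:8020
flush.size=1000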
... View more
10-25-2017
04:56 AM
How did you set up your local repo? You need access to that server to delete all "hue*" packages.
... View more
10-25-2017
04:52 AM
1 Kudo
As of HDP 2.6, Hue is deprecated in favor of Ambari Views. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_release-notes/content/deprecated_items.html You're welcome to download, compile, and set up Hue on your own, though. http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/ You may also try this custom Ambari service. https://github.com/EsharEditor/ambari-hue-service
... View more
10-20-2017
02:43 PM
If you want to use multiple partitions, in my experience you would handle that by embedding a message production time in each message, then extracting it at the consumer level. For example, you could dump the data into some time-series-capable database and query it ordered by timestamp.
... View more
10-17-2017
07:58 PM
The answer here depends heavily on what services you need, what hardware is available, and how frequently you will use them. Flume agents are minimal and mostly collect logs. Livy is just a web API for Spark, but it does maintain SparkContexts and starts with a 2 GB heap by default. Supervisor is a Storm process (I don't know much about Storm). Spark, Phoenix Query, and Accumulo Thrift Servers should ideally be separated for their respective query processing; install multiple of each to provide failover. If you are limited by servers, use your best judgement about what is the most critical piece of your architecture, then set explicitly dedicated hardware pools for that. For the rest, as long as you have the available CPU/memory/disk to run additional processing with little overhead, you can combine services with minimal impact.
... View more
10-17-2017
07:26 PM
Is data guaranteed to be produced chronologically? Can you afford to embed a timestamp into the message and sort client-side? Kafka guarantees order within a single partition, and the partition can be chosen based on a hash of some key, so, for example, all events for user_id X will be ordered within one partition. Refer: https://stackoverflow.com/questions/29820384/apache-kafka-order-of-messages-with-multiple-partitions
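As a hedged sketch using the confluent-kafka-python client linked elsewhere in this thread (the broker address, topic, and event structure are assumptions), keying each message by user_id keeps that user's events in one partition:
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:6667"})  # assumed broker

events = [{"user_id": 42, "action": "login"}, {"user_id": 42, "action": "click"}]
for event in events:
    # Same key -> same partition, so per-user ordering is preserved
    producer.produce("user_events",
                     key=str(event["user_id"]),
                     value=json.dumps(event))
producer.flush()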
... View more
10-17-2017
07:19 PM
First off, to prevent data loss, you should ideally use more than one replica; for better throughput, use more than one partition. When you describe the topic, it will tell you the leader for each partition as a broker ID. You will need to note which IDs belong to which machines, as well as each broker's data directory, to know where the data is stored on those servers. As for how the leader is determined, there is a leader election algorithm coordinated through ZooKeeper... it is probably worth reading over the Kafka documentation / wiki if you are really curious about that. Forcing leaders is also possible: http://blog.erdemagaoglu.com/post/128624804243/forcing-kafka-partition-leaders
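For example (the ZooKeeper address and topic name are assumptions):
kafka-topics.sh --zookeeper zk1:2181 --describe --topic my_topic
The Leader field in the output gives the broker ID for each partition.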
... View more
10-10-2017
08:52 PM
@CaselChen Again, Spark connects directly to the Hive metastore - using JDBC requires you to go through HiveServer2.
... View more
09-26-2017
05:49 PM
1 Kudo
I have Hue 4 deployed using Puppet against an HDP 2.5 cluster. It works fine, at least for Spark, Oozie, and Hive. It is also integrated with LDAP, so I am not sure what issues @Shashant Panwar is having. Just point the relevant hue.ini properties - fs.defaultFS / WebHDFS, the ResourceManager, HiveServer2, etc. - at the right hosts and it should work. Add authentication after you get the other pieces working. The Hue Users Google Group has been fairly helpful with support (in other words, you probably won't get much Hortonworks support for Hue).
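A hedged sketch of the relevant hue.ini entries (all hostnames and ports are placeholders; double-check the property names against your Hue version):
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://namenode:8020
      webhdfs_url=http://namenode:50070/webhdfs/v1
  [[yarn_clusters]]
    [[[default]]]
      resourcemanager_host=resourcemanager.fqdn
      resourcemanager_api_url=http://resourcemanager.fqdn:8088
[beeswax]
  hive_server_host=hiveserver2.fqdn
  hive_server_port=10000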
... View more
09-14-2017
07:04 PM
You should be clearer about your reasoning, but yes, Ambari is not tied to the HDP stack. You can define your own stack or use another such as Apache BigTop. There are some Ambari stack definitions that don't include Hadoop (HDFS or YARN) at all.
... View more
08-29-2017
10:41 PM
You just write the DStream using saveAsTextFiles. http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams I wouldn't suggest PySpark for Spark Streaming, simply because the streaming API methods for writing anything but text don't exist there.
... View more
08-24-2017
06:59 PM
Can you explain your use case for why you think you need to append to files? HDFS is intended for a write-once, read-many architecture. The Hadoop InputFormats support reading many files from an HDFS directory, and all contained files will be read; MapReduce, Spark, Pig, and Hive all support reading and writing files that way. If you really want this feature, fetch the file, append to it, then overwrite the HDFS file, for example as shown below.
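A hedged sketch of that fetch / append / overwrite approach (the paths and file names are hypothetical):
hdfs dfs -get /data/input/part-0000.txt local.txt
cat new_records.txt >> local.txt
hdfs dfs -put -f local.txt /data/input/part-0000.txt
Recent Hadoop versions also provide hdfs dfs -appendToFile for a direct append.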
... View more
08-24-2017
06:29 PM
Spark connects to the Hive metastore directly via a HiveContext. It does not (nor should it, in my opinion) use JDBC. First, Spark must be compiled with Hive support, then you need to explicitly call enableHiveSupport() on the SparkSession builder. Additionally, Spark 2 needs either 1. a hive-site.xml file on the classpath, or 2. hive.metastore.uris set. Refer: https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql Additional resources: - https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-hive-integration.html
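A minimal PySpark sketch of that setup (not from the original answer; the metastore host is a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")
         # Either ship hive-site.xml on the classpath or set the metastore URI explicitly
         .config("hive.metastore.uris", "thrift://metastore.fqdn:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()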
... View more
08-22-2017
06:53 PM
You can get the JSON response.
https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/hosts.md
http://ambari-server:8080/api/v1/clusters/:clusterName/hosts
To extract the hostnames more easily, you could try JSONPath:
$.items[*].Hosts.host_name
Or Python with the Requests library:
import requests
r = requests.get('...')  # the hosts URL above, with credentials as needed
hosts = ','.join(x['Hosts']['host_name'] for x in r.json()['items'])
... View more
08-18-2017
06:46 PM
When I took the certification two years ago, it was almost exactly the same as the AWS practice exam, so there should be no surprises. The questions aren't graded on the tool you use, just the validity of the end data. If you would like additional practice, see if you can address a problem using a completely different set of tools. For example, you could write Spark or even MapReduce instead of Pig or HiveQL to solve certain problems; for others, you may find that there is only one available tool that provides the features you need. The important thing I would suggest is getting a good "mental map" of each of the documentation pages, since you won't have access to the internet or a search engine. Know the keywords to "Ctrl+F" for when you are stuck, and have a good grasp of the commonly used functions/syntax of the HDFS CLI, Pig, Hive, Sqoop, etc. In my opinion, the Flume documentation is very searchable because it is all on a single page, but for Hive and Pig it takes a few clicks to get where you need. Good luck!
... View more
08-18-2017
06:18 PM
Again, the API is versioned. If there are any major breaking changes, one should expect there to be a v2. If you just look at the Github trunk, then you'll see that the API spec has not changed in years, and the latest commits have been typo and hyperlink fixes.
... View more
Re: Need Suggestion: How Can i Get Free HDPCD Exam...
08-16-2017
07:00 PM
You can actually start the AWS practice exam, extract all the information and scripts, then use the sandbox to perform the entire practice exam. Depending on how quickly you do this, it would be cheaper in AWS charges than whatever that "pdf dump" site is charging you. The exam is completely hands-on scripting, not written Q&A, so I highly doubt the site you found is anything but a scam. Good luck studying!
... View more
08-16-2017
06:44 PM
The API is versioned, and I am not aware of many changes in the structure of the data model - only in the amount of data returned as new services are added with each release of HDP, for example. Depending on the size of your cluster, http://ambari-server:8080/api/v1/clusters/:cluster_name/ returns a lot of information. You can find all the documentation for the API here: https://github.com/apache/ambari/tree/trunk/ambari-server/docs/api/v1 There is a ticket to add SwaggerUI documentation so the API is easier to navigate: https://issues.apache.org/jira/browse/AMBARI-20435
... View more
08-16-2017
06:34 PM
I have built and deployed Hue 4.0.0 on CentOS just fine. What exact problems are you referring to? Granted, I have done this on an actual production system, not the sandbox, but I don't see why it would be any different. If you are using the Dockerized sandbox, I would recommend setting up Hue within docker-compose; if you go that route, just use the Hue Docker container and configure the hue.ini file accordingly. https://github.com/cloudera/hue/tree/master/tools/docker If you want to try to install it via Ambari, there is a service for that too. https://github.com/EsharEditor/ambari-hue-service
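A hedged docker-compose sketch for that route (the image tag, port, and config mount path inside the container are assumptions - check the Hue Docker documentation for the exact location):
version: "2"
services:
  hue:
    image: gethue/hue:latest
    ports:
      - "8888:8888"                                        # Hue's default web port
    volumes:
      - ./hue.ini:/usr/share/hue/desktop/conf/z-hue.ini    # assumed config location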
... View more