Member since: 11-19-2015
Posts: 158
Kudos Received: 25
Solutions: 21

My Accepted Solutions
Title | Views | Posted
---|---|---
| 15145 | 09-01-2018 01:27 AM
| 1917 | 09-01-2018 01:18 AM
| 5684 | 08-20-2018 09:39 PM
| 960 | 07-20-2018 04:51 PM
| 2532 | 07-16-2018 09:41 PM
01-16-2018
02:29 AM
@Tu Nguyen Where are you reading that you need to use JDBC from Spark to communicate with Hive? It isn't in the Spark SQL documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

1. Try an alternative JDBC client and see if you get similar results.
2. What happens when you simply use the following?

import java.io.File
import org.apache.spark.sql.SparkSession

// Directory for Hive-managed databases and tables, e.g. the default "spark-warehouse"
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Transactional Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

spark.table("tnguy.table_transactional_test").count()
01-15-2018
04:54 PM
@Tu Nguyen - I'm afraid I don't understand your question. Spark does not use JDBC to communicate with Hive, but it can load Hive with any data that can be represented as a Spark Dataset. You may want to try a "MSCK REPAIR TABLE <tablename>;" in Hive, though.
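If it helps, here is a minimal sketch of running that repair through Spark's Hive support; the table name is reused from the earlier example and stands in for your own partitioned table:

import org.apache.spark.sql.SparkSession

// Hive-enabled session, as in the earlier example
val spark = SparkSession.builder()
  .appName("Repair Hive Partitions")
  .enableHiveSupport()
  .getOrCreate()

// Re-registers partition directories that exist on HDFS but are missing from the metastore
spark.sql("MSCK REPAIR TABLE tnguy.table_transactional_test")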
01-13-2018
02:57 AM
You can coalesce with Spark, or use MergeContent in NiFi, to "compact" the small files without needing to resort to -getmerge. You should ideally avoid ZIP files on HDFS. They are not a common format there because they are not splittable, so a large ZIP file can only be processed by a single mapper. Querying multiple part files of uncompressed CSV will be faster. If you need these files compressed in HDFS for archival while still being able to query them via Hive and other engines, use a different compressed binary format, such as ORC with Snappy. If you just want a CSV, use Beeline's output format argument and write the results to a file, which you can then ZIP.
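If you go the Spark route, a rough sketch of the compaction step; the input and output paths here are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Compact CSV part files")
  .getOrCreate()

// Read the many small CSV part files, collapse them into a few partitions,
// and rewrite them as Snappy-compressed ORC (splittable and still queryable from Hive)
spark.read
  .option("header", "true")
  .csv("/data/raw/csv_parts")
  .coalesce(4)
  .write
  .option("compression", "snappy")
  .orc("/data/compact/orc")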
01-02-2018
10:26 PM
Ambari itself doesn't know those disks are mounted until you edit the host configurations for HDFS/YARN and update the data directory settings (dfs.datanode.data.dir and yarn.nodemanager.local-dirs). The Ambari alert check runs periodically to see whether those configured disks are mounted, and the agent then updates the dashboard.
12-14-2017
04:32 PM
@Rakesh AN
I have not used Flume in a distributed fashion, but whichever agent you choose, it tails the logs on its own server and then ships them to the configured sink destinations; running one agent per server is how you collect from different servers. Flume is near real-time rather than truly real-time, since it ships events in configured batch sizes. It's not clear what doubt you have. Can you please explain how you've configured your Flume agents and the issues you are experiencing? The Flume documentation is fairly straightforward.
12-12-2017
07:26 PM
You should read the warning in the Exec Source docs against using tail -f: https://flume.apache.org/FlumeUserGuide.html#exec-source It even lists the other sources to consider instead, namely the "Spooling Directory Source, Taildir Source or direct integration with Flume via the SDK." Personally, I like tools such as Filebeat or Fluentd for real-time collection of logs, sending them to either Elasticsearch or Solr, since those provide better tooling around log inspection.
12-06-2017
08:25 PM
1 Kudo
@Ravikiran Dasari, You can of course install NiFi as an extra service, just like anything else; you are not locked into only the packages HDP provides. You just lose the advantage of using Ambari to monitor and configure it. Feel free to read over the NiFi installation documentation if you want to go that route. Alternatively, you can install HDF services (such as NiFi) into your existing HDP cluster. If you want to use Flume, there appears to be an external FTP source, but I personally don't know how to install or configure it. Also see https://community.hortonworks.com/questions/150882/ftp-files-to-hdfs.html
12-05-2017
11:26 PM
NiFi is not your only option. You could install a Flume agent on the SFTP server to read this folder as a spooling directory, use Spark to read from the FTP directory and write to HDFS (it's just a filesystem), or add an FTP Java client to your own code and read from the folder. Whatever route you choose, you need either:

1. additional software installed on the SFTP server itself (a rough sketch of this route follows below),
2. a process "upstream" of the SFTP server that also sends the files to HDFS, for example via WebHDFS, HttpFS, or an NFS Gateway, or
3. some software that HDP does not provide out of the box sitting between that server and HDFS. This includes NiFi, but StreamSets is another option.

The official documentation for those tools will tell you more than I can here. If you want to use HDF, see whether this documentation suits your needs: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf-and-hdp/content/ch_install-ambari.html
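For the first route, a rough sketch of pushing files from a local folder on the SFTP server into HDFS with the Hadoop FileSystem API; the directory paths are placeholders, and it assumes the Hadoop client libraries and configs (core-site.xml, hdfs-site.xml) are available on that server:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SftpFolderToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()            // picks up core-site.xml / hdfs-site.xml from the classpath
    val hdfs = FileSystem.get(conf)

    val localDir = new Path("file:///data/sftp/incoming")  // placeholder: folder the SFTP server writes into
    val hdfsDir  = new Path("/landing/sftp")                // placeholder: target directory in HDFS

    val localFs = localDir.getFileSystem(conf)
    localFs.listStatus(localDir).filter(_.isFile).foreach { status =>
      // copyFromLocalFile(delSrc = false, overwrite = true, src, dst)
      hdfs.copyFromLocalFile(false, true, status.getPath, hdfsDir)
    }
  }
}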
12-04-2017
10:09 PM
2 Kudos
NiFi has GetFTP and PutHDFS processors. Are you using an HDF cluster?
12-01-2017
11:22 PM
Is topic deletion enabled at the broker level (delete.topic.enable=true) across the entire cluster, and did you restart the brokers if you did enable it? Perhaps, since the disk is full, Kafka and related services are refusing to start. Have you verified those processes are actually running on the machine? The nuclear, manual option would be to delete the topic's data from the broker disks, but you must also purge the ZooKeeper records for this topic.
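For the normal (non-manual) deletion path, a minimal sketch using Kafka's AdminClient; the broker address and topic name are placeholders, and this assumes your brokers run a version that ships the AdminClient API:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object DeleteTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // placeholder broker address

    val admin = AdminClient.create(props)
    try {
      // Only succeeds when delete.topic.enable=true on every broker
      admin.deleteTopics(Collections.singletonList("my-topic")).all().get()  // placeholder topic name
    } finally {
      admin.close()
    }
  }
}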