Member since
11-19-2015
158
Posts
25
Kudos Received
21
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 11720 | 09-01-2018 01:27 AM |
| 1096 | 09-01-2018 01:18 AM |
| 3663 | 08-20-2018 09:39 PM |
| 484 | 07-20-2018 04:51 PM |
| 1461 | 07-16-2018 09:41 PM |
11-21-2017
09:45 PM
Without knowing how you are executing the whole process, it sounds like you ran spark-submit from a Docker container, so only the initial spark-submit process happened inside Docker. If you have mounted the HADOOP_CONF directory into the container, then this is no different from running it outside the container. Additionally, if you submitted in cluster mode to YARN, then the Spark application master / driver and executors are no different from regular YARN processes; whereas if you submitted in client mode, the Spark driver remains inside the Docker container until the Spark application ends.
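For reference, a hedged sketch of a cluster-mode submission from inside a container (the config mount path and application file name are assumptions, not from the original post):
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
With --deploy-mode client instead, the same command keeps the driver inside the container for the lifetime of the job.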
... View more
11-15-2017
07:29 PM
Confluent is the company founded by Kafka's original creators, and it provides commercial support for Kafka. I personally would trust their code more than someone else's.
... View more
11-14-2017
07:31 PM
1 Kudo
@Swaapnika Guntaka You could use Spark Streaming in PySpark to consume the topic and write the data to HDFS. You could also use HDF with NiFi and skip Python entirely. Note that the library linked below is a Python Kafka client from Confluent; it is not related to Kafka Connect. https://github.com/confluentinc/confluent-kafka-python
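A minimal PySpark Streaming sketch of that approach (not from the original post; it assumes the spark-streaming-kafka-0-8 package is on the classpath, and the broker, topic, and HDFS path are placeholders):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-hdfs")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

# Direct stream from Kafka; broker address and topic name are assumptions
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["my_topic"],
    kafkaParams={"metadata.broker.list": "broker1:6667"})

# Write each batch's message values as text files under the given HDFS prefix
stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/kafka/my_topic")

ssc.start()
ssc.awaitTermination()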
... View more
11-03-2017
09:39 PM
Yes. This feature exists in many forms:
- Flume
- MapReduce using Camus or Apache Gobblin
- Spark Streaming
- NiFi
- StreamSets
- Kafka Connect
Depending on what tools you have available, it's up to you to decide which makes the most sense.
... View more
11-02-2017
02:42 AM
Unless I am mistaken, Ambari only checks that the DataNode / NodeManager process is running, not that a network connection from the DataNode to the ResourceManager is possible. SSH to the DataNodes, try to telnet to the ResourceManager port, and report back.
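For example (the ResourceManager hostname and port 8050 are assumptions for an HDP cluster; adjust to yours):
telnet resourcemanager.fqdn 8050
nc -vz resourcemanager.fqdn 8050   # alternative if telnet is not installed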
... View more
11-02-2017
02:34 AM
@Divya Sodha I think you may be confusing the purpose of RAM and a hard drive. Even a "RAMDisk" is the reverse of what you are asking - it puts files into your RAM. If you need lighter resource usage, you are welcome to create your own base VM, install Ambari following the HDP documentation, then install the minimal set of components you need for your learning purposes. The only reason the sandbox needs 8+ GB is to run the majority of the HDP components. Also, many of the Hadoop processes run in Java, which uses a configurable heap size set in the Ambari configurations. I have been able to run a single-node Hadoop cluster within 4-6 GB of RAM, depending on what other services I had. Keep in mind, your OS itself needs 1-2 GB of RAM.
... View more
10-30-2017
04:18 AM
1 Kudo
Yes. MirrorMaker does not impose a limitation on remote vs. local clusters; it is designed for remote clusters because there is almost no need to mirror locally. If you are mirroring a topic locally, you must rename it, and if you are going to rename it, then you would have consumers/producers using data in both topics. You would be replicating data within the same cluster for little gain, while your consumers/producers could easily be configured to use the correct topic(s) instead.
... View more
10-30-2017
04:15 AM
Do you need auditing in your system? If so, then no. If not, then yes, but then why did you have it enabled in the first place?
... View more
10-27-2017
09:06 PM
That's the Ranger Kafka plugin writing its audit logs to HDFS. https://github.com/apache/ranger/blob/master/plugin-kafka/scripts/install.properties#L65 Log into the Ranger Admin UI if you want to disable it.
... View more
10-27-2017
08:57 PM
At its most basic, you would write a consumer that reads from one topic and a producer that writes to another. MirrorMaker is what you are looking for. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_kafka-component-guide/content/ch_kafka_mirrormaker.html
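A hedged example invocation (the property file names and topic are hypothetical; the consumer config points at the source cluster and the producer config at the target cluster):
kafka-mirror-maker.sh \
  --consumer.config source-consumer.properties \
  --producer.config target-producer.properties \
  --whitelist "my_topic"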
... View more
10-26-2017
03:50 PM
Do you have a working sqoop command?
With that information, you can create an hourly oozie job.
Start with a one-off workflow.xml file - find the documentation here:
https://oozie.apache.org/docs/4.2.0/DG_SqoopActionExtension.html
Make sure you can run the workflow before working on the coordinator.
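As a hedged illustration (not part of the original answer), a minimal workflow.xml with a single Sqoop action might look like the following; the DB2 connection string, table, and target directory are placeholders:
<workflow-app name="DB2-Export-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Placeholder DB2 connection string, table, and target directory -->
            <command>import --connect jdbc:db2://db2-host:50000/MYDB --table MYTABLE --target-dir /data/db2/mytable -m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>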
Then you can make an hourly coordinator like this and put the "jobStart" and "jobEnd" properties in the oozie config file.
<coordinator-app name="DB2-Export"
frequency="${coord:hours(1)}"
start="${jobStart}" end="${jobEnd}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.2">
<controls>
<concurrency>1</concurrency>
<execution>FIFO</execution>
<throttle>1</throttle>
</controls>
<action>
<workflow>
<app-path>${wf_application_path}</app-path>
</workflow>
</action>
</coordinator-app>
You would execute this like:
oozie job -config db2-export-coord.properties -run
where that properties file might contain:
jobTracker=namenode.fqdn:8050
nameNode=hdfs://hadoop_cluster
wf_application_path=hdfs://path/to/db2-export/
oozie.coord.application.path=${wf_application_path}
jobStart=2017-11-01T09:00Z
jobEnd=2099-11-09T09:00Z
... View more
10-25-2017
10:00 PM
2 Kudos
Yes, Camus is deprecated in favor of Gobblin. If you don't have NiFi, Confluent has packaged Kafka Connect specifically for transferring data between various sources and sinks, such as HDFS. https://www.confluent.io/product/connectors/ https://docs.confluent.io/current/connect/connect-hdfs/docs/hdfs_connector.html
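A hedged sketch of an HDFS sink connector configuration (assuming the Confluent HDFS connector is installed; the topic name and NameNode URL are placeholders):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my_topic
hdfs.url=hdfs://namenode:8020
flush.size=1000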
... View more
10-25-2017
04:56 AM
How did you set up your local repo? You need access to that server to delete all "hue*" packages.
... View more
10-25-2017
04:52 AM
1 Kudo
As of HDP 2.6, Hue is deprecated in favor of Ambari Views. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_release-notes/content/deprecated_items.html You're welcome to download, compile, and set up Hue on your own, though. http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/ You may also try this custom Ambari service. https://github.com/EsharEditor/ambari-hue-service
... View more
10-20-2017
02:43 PM
If you want to use multiple partitions, in my experience you would handle that by embedding a message production time in each message, then extracting it at the consumer level. For example, you could dump the data into some time-series-capable database and query it ordered by timestamp.
... View more
10-17-2017
07:58 PM
The answer here depends heavily on what services you need, what hardware is available, and how frequently you will use them. Flume agents are minimal and mostly collect logs. Livy is just a web API for Spark, but it does maintain SparkContexts and starts with a 2 GB heap by default. Supervisor is a Storm process (I don't know much about Storm). Spark, Phoenix Query, and Accumulo Thrift Servers should ideally be separated for their respective query processing; install multiple of each to provide failover. If you are limited by servers, use your best judgement about what is the most critical piece of your architecture, then set explicitly dedicated hardware pools for that. For the rest, as long as you have the available CPU/memory/disk to run additional processing with little overhead, you can combine services with minimal impact.
... View more
10-17-2017
07:26 PM
Is data guaranteed to be produced chronologically? Can you afford to embed a timestamp into the message and sort client-side? Kafka guarantees order within a single partition, and the partition can be chosen based on a hash of some key, so, for example, all events for user_id X will be ordered within one partition. Refer: https://stackoverflow.com/questions/29820384/apache-kafka-order-of-messages-with-multiple-partitions
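As a hedged sketch using the confluent-kafka-python client linked elsewhere in this thread (the broker address, topic, and event structure are assumptions), keying each message by user_id keeps that user's events in one partition:
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:6667"})  # assumed broker

events = [{"user_id": 42, "action": "login"}, {"user_id": 42, "action": "click"}]
for event in events:
    # Same key -> same partition, so per-user ordering is preserved
    producer.produce("user_events",
                     key=str(event["user_id"]),
                     value=json.dumps(event))
producer.flush()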
... View more
10-17-2017
07:19 PM
First off, to prevent data loss, you should ideally use more than one replica; for better throughput, use more than one partition. When you describe the topic, it will tell you the leader for each partition as a broker ID. You will need to note which IDs belong to which machines, as well as each broker's data directory, to know where the data is stored on those servers. As for how the leader is determined, there is a leader election algorithm coordinated through ZooKeeper... it is probably worth reading over the Kafka documentation / wiki if you are really curious about that. Forcing leaders is also possible: http://blog.erdemagaoglu.com/post/128624804243/forcing-kafka-partition-leaders
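For example (the ZooKeeper address and topic name are assumptions):
kafka-topics.sh --zookeeper zk1:2181 --describe --topic my_topic
The Leader field in the output gives the broker ID for each partition.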
... View more
10-10-2017
08:52 PM
@CaselChen Again, Spark connects directly to the Hive metastore - using JDBC requires you to go through HiveServer2.
... View more
09-26-2017
05:49 PM
1 Kudo
I have Hue 4 deployed using Puppet against an HDP 2.5 cluster. It works fine, at least for Spark, Oozie, and Hive. It is also integrated with LDAP, so I am not sure what issues @Shashant Panwar is having. Just point the relevant hue.ini properties - fs.defaultFS / WebHDFS, the ResourceManager, HiveServer2, etc. - at the right hosts and it should work. Add authentication after you get the other pieces working. The Hue Users Google Group has been fairly helpful with support (in other words, you probably won't get much Hortonworks support for Hue).
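A hedged sketch of the relevant hue.ini entries (all hostnames and ports are placeholders; double-check the property names against your Hue version):
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://namenode:8020
      webhdfs_url=http://namenode:50070/webhdfs/v1
  [[yarn_clusters]]
    [[[default]]]
      resourcemanager_host=resourcemanager.fqdn
      resourcemanager_api_url=http://resourcemanager.fqdn:8088
[beeswax]
  hive_server_host=hiveserver2.fqdn
  hive_server_port=10000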
... View more
09-14-2017
07:04 PM
You should be clearer about your reasoning, but yes, Ambari is not tied to the HDP stack. You can define your own stack or use another such as Apache BigTop. There are some Ambari stack definitions that don't include Hadoop (HDFS or YARN) at all.
... View more
08-29-2017
10:41 PM
You just write the DStream using saveAsTextFiles. http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams I wouldn't suggest PySpark for Spark Streaming, simply because the streaming API methods for writing anything but text don't exist there.
... View more
08-24-2017
06:59 PM
Can you explain your use case for why you think you need to append to files? HDFS is intended for a write-once, read-many architecture. The Hadoop InputFormats support reading many files from an HDFS directory, and all contained files will be read; MapReduce, Spark, Pig, and Hive all support reading and writing files that way. If you really want this feature, fetch the file, append to it, then overwrite the HDFS file, for example as shown below.
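A hedged sketch of that fetch / append / overwrite approach (the paths and file names are hypothetical):
hdfs dfs -get /data/input/part-0000.txt local.txt
cat new_records.txt >> local.txt
hdfs dfs -put -f local.txt /data/input/part-0000.txt
Recent Hadoop versions also provide hdfs dfs -appendToFile for a direct append.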
... View more
08-24-2017
06:29 PM
Spark connects to the Hive metastore directly via a HiveContext. It does not (nor should it, in my opinion) use JDBC. First, Spark must be compiled with Hive support, then you need to explicitly call enableHiveSupport() on the SparkSession builder. Additionally, Spark 2 needs either 1. a hive-site.xml file on the classpath, or 2. hive.metastore.uris set. Refer: https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql Additional resources: - https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-hive-integration.html
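A minimal PySpark sketch of that setup (not from the original answer; the metastore host is a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")
         # Either ship hive-site.xml on the classpath or set the metastore URI explicitly
         .config("hive.metastore.uris", "thrift://metastore.fqdn:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()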
... View more
08-22-2017
06:53 PM
You can get the JSON response.
https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/hosts.md
http://ambari-server:8080/api/v1/clusters/:clusterName/hosts
To extract the hostnames more easily, you could try JSONPath:
$.items[*].Hosts.host_name
Or Python with the Requests library:
import requests
r = requests.get('...')  # the hosts URL above, with credentials as needed
hosts = ','.join(x['Hosts']['host_name'] for x in r.json()['items'])
... View more
08-18-2017
06:46 PM
When I took the certification two years ago, it was almost exactly the same as the AWS practice exam, so there should be no surprises. The questions aren't graded on the tool you use, just the validity of the end data. If you would like additional practice, see if you can address a problem using a completely different set of tools. For example, you could write Spark or even MapReduce instead of Pig or HiveQL to solve certain problems; for others, you may find that there is only one available tool that provides the features you need. The important thing I would suggest is getting a good "mental map" of each of the documentation pages, since you won't have access to the internet or a search engine. Know the keywords to "Ctrl+F" for when you are stuck, and have a good grasp of the commonly used functions/syntax of the HDFS CLI, Pig, Hive, Sqoop, etc. In my opinion, the Flume documentation is very searchable because it is all on a single page, but for Hive and Pig it takes a few clicks to get where you need. Good luck!
... View more
08-18-2017
06:18 PM
Again, the API is versioned. If there are any major breaking changes, one should expect there to be a v2. If you just look at the Github trunk, then you'll see that the API spec has not changed in years, and the latest commits have been typo and hyperlink fixes.
... View more
Re: Need Suggestion: How Can i Get Free HDPCD Exam...
08-16-2017
07:00 PM
You can actually start the AWS practice exam, extract all the information and scripts, then use the sandbox to perform the entire practice exam. Depending on how quickly you do this, it would be cheaper in AWS charges than whatever that "pdf dump" site is charging you. The exam is completely hands-on scripting, not written Q&A, so I highly doubt the site you found is anything but a scam. Good luck studying!
... View more
08-16-2017
06:44 PM
The API is versioned, and I am not aware of many changes in the structure of the data model - only in the amount of data returned as new services are added with each release of HDP, for example. Depending on the size of your cluster, http://ambari-server:8080/api/v1/clusters/:cluster_name/ returns a lot of information. You can find all the documentation for the API here: https://github.com/apache/ambari/tree/trunk/ambari-server/docs/api/v1 There is a ticket to add SwaggerUI documentation so the API is easier to navigate: https://issues.apache.org/jira/browse/AMBARI-20435
... View more
08-16-2017
06:34 PM
I have built and deployed Hue 4.0.0 on CentOS just fine. What exact problems are you referring to? Granted, I have done this on an actual production system, not the sandbox, but I don't see why it would be any different. If you are using the Dockerized sandbox, I would recommend setting up Hue within docker-compose; if you go that route, just use the Hue Docker container and configure the hue.ini file accordingly. https://github.com/cloudera/hue/tree/master/tools/docker If you want to try to install it via Ambari, there is a service for that too. https://github.com/EsharEditor/ambari-hue-service
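A hedged docker-compose sketch for that route (the image tag, port, and config mount path inside the container are assumptions - check the Hue Docker documentation for the exact location):
version: "2"
services:
  hue:
    image: gethue/hue:latest
    ports:
      - "8888:8888"                                        # Hue's default web port
    volumes:
      - ./hue.ini:/usr/share/hue/desktop/conf/z-hue.ini    # assumed config location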
... View more