Member since: 02-10-2016
Posts: 50
Kudos Received: 14
Solutions: 5
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1371 | 02-08-2017 05:53 AM |
 | 1068 | 02-02-2017 11:39 AM |
 | 3269 | 01-27-2017 06:17 PM |
 | 1417 | 01-27-2017 04:43 PM |
 | 1924 | 01-27-2017 01:57 PM |
02-12-2017
01:30 PM
Repo Description
Superset is a data exploration platform designed to be visual, intuitive and interactive. [This project used to be named Caravel, and Panoramix in the past.]
Screenshots & Gifs
View Dashboards
View/Edit a Slice
Query and Visualize with SQL Lab
Superset
Superset's main goal is to make it easy to slice, dice and visualize data. It empowers users to perform analytics at the speed of thought. Superset provides:
- A quick way to intuitively visualize datasets by allowing users to create and share interactive dashboards
- A rich set of visualizations to analyze your data, as well as a flexible way to extend the capabilities
- An extensible, high-granularity security model allowing intricate rules on who can access which features, and integration with major authentication providers (database, OpenID, LDAP, OAuth & REMOTE_USER through Flask AppBuilder)
- A simple semantic layer, allowing you to control how data sources are displayed in the UI by defining which fields should show up in which dropdown and which aggregations and functions (metrics) are made available to the user
- Deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, real-time datasets
- Fast-loading dashboards with configurable caching

Repo Info
Github Repo URL: https://github.com/airbnb/superset
Github account name: airbnb
Repo name: superset
02-12-2017
10:46 AM
Repo Description
Thrill is an EXPERIMENTAL C++ framework for algorithmic distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing. For more information on its goals and mission, see http://project-thrill.org. For easy steps on Getting Started, refer to the Live Documentation.

Repo Info
Github Repo URL: https://github.com/thrill/thrill/
Github account name: thrill
Repo name: thrill
02-11-2017
01:21 PM
1 Kudo
Repo Description
The Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies. It is designed to deliver uncompromised performance for a wide set of in-memory computing use cases, from high-performance computing to the industry's most advanced data grid, highly available service grid, and streaming.

Advanced Clustering
Ignite nodes can automatically discover each other. This helps to scale the cluster when needed, without having to restart the whole cluster. Developers can also leverage Ignite's hybrid cloud support, which allows establishing a connection between a private cloud and public clouds such as Amazon Web Services, providing them with the best of both worlds.

Data Grid (JCache)
The Ignite data grid is an in-memory distributed key-value store which can be viewed as a distributed partitioned hash map, with every cluster node owning a portion of the overall data. This way, the more cluster nodes we add, the more data we can cache. Unlike other key-value stores, Ignite determines data locality using a pluggable hashing algorithm. Every client can determine which node a key belongs to by plugging it into a hashing function, without the need for any special mapping servers or name nodes (a minimal usage sketch follows this repo description). The Ignite data grid supports local, replicated, and partitioned data sets and allows you to freely cross-query between these data sets using standard SQL syntax. Ignite supports standard SQL for querying in-memory data, including support for distributed SQL joins. Our data grid offers many features, some of which are:
- Primary & backup copies
- Near caches
- Cache queries and SQL queries
- Continuous queries
- Transactions
- Off-heap memory
- Affinity collocation
- Persistent store
- Automatic persistence
- Data loading
- Eviction and expiry policies
- Data rebalancing
- Web session clustering
- Hibernate L2 cache
- JDBC driver
- Spring caching
- Topology validation

Streaming & CEP
Ignite streaming allows you to process continuous, never-ending streams of data in a scalable and fault-tolerant fashion. The rates at which data can be injected into Ignite can be very high and easily exceed millions of events per second on a moderately sized cluster. Real-time data is ingested via data streamers. We already offer streamers for JMS 1.1, Apache Kafka, MQTT, Twitter, Apache Flume and Apache Camel, and we keep adding new ones every release. The data can then be queried within sliding windows, if needed.

Compute Grid
Distributed computations are performed in a parallel fashion to gain high performance, low latency, and linear scalability. The Ignite compute grid provides a set of simple APIs that allow users to distribute computations and data processing across multiple computers in the cluster. Distributed parallel processing is based on the ability to take any computation, execute it on any set of cluster nodes and return the results. We support these features, amongst others:
- Distributed closure execution
- MapReduce & ForkJoin processing
- Clustered executor service
- Collocation of compute and data
- Load balancing
- Fault tolerance
- Job state checkpointing
- Job scheduling

Service Grid
The Service Grid allows for deployments of arbitrary user-defined services on the cluster. You can implement and deploy any service, such as custom counters, ID generators, hierarchical maps, etc. Ignite allows you to control how many instances of your service should be deployed on each cluster node and will automatically ensure proper deployment and fault tolerance of all the services.

Ignite File System
The Ignite File System (IGFS) is an in-memory file system that allows working with files and directories over the existing cache infrastructure. IGFS can either work as a purely in-memory file system or delegate to another file system (e.g. various Hadoop file system implementations), acting as a caching layer. In addition, IGFS provides an API to execute map-reduce tasks over file system data.

Distributed Data Structures
Ignite supports complex data structures in a distributed fashion:
- Queues and sets: ordinary, bounded, collocated, non-collocated
- Atomic types: AtomicLong and AtomicReference
- CountDownLatch
- ID generators

Distributed Messaging
Distributed messaging allows for topic-based cluster-wide communication between all nodes. Messages with a specified message topic can be distributed to all or a sub-group of nodes that have subscribed to that topic. Ignite messaging is based on the publish-subscribe paradigm, where publishers and subscribers are connected by a common topic. When one of the nodes sends a message A for topic T, it is published on all nodes that have subscribed to T.

Distributed Events
Distributed events allow applications to receive notifications when a variety of events occur in the distributed grid environment. You can automatically get notified of task executions, read, write or query operations occurring on local or remote nodes within the cluster.

Hadoop Accelerator
Our Hadoop Accelerator provides a set of components allowing for in-memory Hadoop job execution and file system operations.
- MapReduce: an alternate high-performance implementation of the job tracker which replaces standard Hadoop MapReduce. Use it to boost your Hadoop MapReduce job execution performance.
- IGFS - In-Memory File System: a Hadoop-compliant IGFS file system implementation over which Hadoop can run in a plug-n-play fashion, significantly reducing I/O and improving both latency and throughput.
- Secondary File System: an implementation of SecondaryFileSystem. This implementation can be injected into an existing IGFS, allowing for read-through and write-through behavior over any other Hadoop FileSystem implementation (e.g. HDFS). Use it if you want your IGFS to become an in-memory caching layer over disk-based HDFS or any other Hadoop-compliant file system.

Supported Hadoop distributions: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Apache BigTop.

Spark Shared RDDs
Apache Ignite provides an implementation of the Spark RDD abstraction which makes it easy to share state in memory across Spark jobs. The main difference between a native Spark RDD and an IgniteRDD is that the IgniteRDD provides a shared in-memory view of data across different Spark jobs, workers, or applications, while a native Spark RDD cannot be seen by other Spark jobs or applications.

Repo Info
Github Repo URL: https://github.com/apache/ignite
Github account name: apache
Repo name: ignite
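To make the data grid section concrete, here is a minimal Java sketch of the JCache-style key-value usage, assuming a node started with default configuration; the cache name and sample keys are made up for illustration:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteCacheSketch {
    public static void main(String[] args) {
        // Start (or join) an Ignite node; with no explicit config it uses
        // defaults and discovers peers automatically.
        try (Ignite ignite = Ignition.start()) {
            // getOrCreateCache returns a distributed key-value cache;
            // "profileCache" is an arbitrary name chosen for this sketch.
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("profileCache");

            // Keys are hashed to cluster nodes by Ignite's affinity function,
            // so puts and gets work the same regardless of cluster size.
            cache.put(1, "alpha");
            cache.put(2, "beta");

            System.out.println("key 1 -> " + cache.get(1));
        }
    }
}
```

Because keys are mapped to nodes by the affinity function, the same code works unchanged whether you run one node or a whole cluster.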
02-11-2017
01:19 PM
1 Kudo
Repo Description
Apache JMeter features include:

Ability to load and performance test many different server/protocol types:
- Web - HTTP, HTTPS
- SOAP / REST
- FTP
- Database via JDBC
- LDAP
- Message-oriented Middleware (MOM) via JMS
- Mail - SMTP(S), POP3(S) and IMAP(S)
- Native commands or shell scripts
- TCP

Full multi-threading framework allows concurrent sampling by many threads and simultaneous sampling of different functions by separate thread groups.

Careful GUI design allows faster Test Plan building and debugging.

Caching and offline analysis/replaying of test results.

Highly extensible core:
- Pluggable Samplers allow unlimited testing capabilities.
- Several load statistics may be chosen with pluggable timers.
- Data analysis and visualization plugins allow great extensibility and personalization.
- Functions can be used to provide dynamic input to a test or provide data manipulation.
- Scriptable Samplers (Groovy, BeanShell, BSF- and JSR223-compatible languages)

Repo Info
Github Repo URL: https://github.com/apache/jmeter
Github account name: apache
Repo name: jmeter
02-10-2017
06:33 PM
2 Kudos
Repo Description
Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flink at http://flink.apache.org/

Features
- A streaming-first runtime that supports both batch processing and data streaming programs
- Elegant and fluent APIs in Java and Scala
- A runtime that supports very high throughput and low event latency at the same time
- Support for event time and out-of-order processing in the DataStream API, based on the Dataflow Model
- Flexible windowing (time, count, sessions, custom triggers) across different time semantics (event time, processing time); a small example follows this repo description
- Fault tolerance with exactly-once processing guarantees
- Natural back-pressure in streaming programs
- Libraries for graph processing (batch), machine learning (batch), and complex event processing (streaming)
- Built-in support for iterative programs (BSP) in the DataSet (batch) API
- Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
- Compatibility layers for Apache Hadoop MapReduce and Apache Storm
- Integration with YARN, HDFS, HBase, and other components of the Apache Hadoop ecosystem

Repo Info
Github Repo URL: https://github.com/apache/flink
Github account name: apache
Repo name: flink
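To give a feel for the DataStream API and its windowing, here is a minimal Java sketch. It assumes a local socket source (`nc -lk 9999`) purely for illustration, and the window size is arbitrary:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Socket source assumed only for this sketch: run `nc -lk 9999`
        // locally and type lines of text.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)                        // group by the word field
            .timeWindow(Time.seconds(5))     // tumbling 5-second windows
            .sum(1);                         // sum the counts per window

        counts.print();
        env.execute("Windowed word count");
    }
}
```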
02-09-2017
08:28 PM
No, it is not possible: "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns" Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
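For illustration, a minimal Java sketch of the pivot-as-aggregation behavior described in the quote; the input file and the column names (year, product, amount) are made up:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pivot-sketch")
                .master("local[*]")
                .getOrCreate();

        // "sales.json" and its columns are placeholders for this sketch.
        Dataset<Row> sales = spark.read().json("sales.json");

        // pivot() is an aggregation: the distinct values of the pivoted
        // column ("product") become individual output columns.
        Dataset<Row> byYear = sales
                .groupBy("year")
                .pivot("product")
                .sum("amount");

        byYear.show();
        spark.stop();
    }
}
```

Because the distinct values become columns in the aggregated result, the original row-level data cannot be recovered from the pivoted output.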
02-08-2017
05:53 AM
1 Kudo
Very good question! Let's dig into Hadoop's source to find this out. The audit log uses java.net.InetAddress's toString() method to obtain a text format of the address: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L7049 InetAddress's toString() returns the information in "hostname/ip" format. If the hostname is not resolvable (reverse lookup is not working), then you get a leading slash: http://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#toString()
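You can reproduce both formats in isolation with a small sketch (the raw address below is just an example; the exact resolved output depends on your DNS setup):

```java
import java.net.InetAddress;

public class InetAddressToStringDemo {
    public static void main(String[] args) throws Exception {
        // Created with a hostname, so toString() prints "hostname/ip".
        InetAddress resolved = InetAddress.getByName("localhost");
        System.out.println(resolved);        // e.g. localhost/127.0.0.1

        // Created from raw bytes with no hostname; without a reverse lookup
        // filling in the name, toString() starts with "/".
        InetAddress unresolved = InetAddress.getByAddress(new byte[] {10, 0, 0, 1});
        System.out.println(unresolved);       // /10.0.0.1
    }
}
```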
02-03-2017
01:56 PM
It really depends on your use case and latency requirements. If you need to store Storm's results in HDFS, then you can use a Storm HDFS Bolt. If you only need to store the source data, I'd suggest storing it directly from Kafka or Flume. That'll result in lower latency on the Storm topology and better decoupling.
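As a rough sketch of what wiring up the storm-hdfs bolt looks like (the HDFS URL, output path, delimiter and rotation size are placeholders, not recommendations):

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltSketch {
    public static HdfsBolt buildHdfsBolt() {
        // Write tuple fields separated by "|" into the output files.
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

        // Sync the filesystem after every 1000 tuples.
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // Rotate files once they reach 128 MB.
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(128.0f, Units.MB);

        // "/storm/output/" is a made-up HDFS path for this sketch.
        FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/storm/output/");

        // "hdfs://namenode:8020" is a placeholder for your NameNode URL.
        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}
```

The bolt is then attached to the topology with TopologyBuilder.setBolt like any other bolt.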
02-02-2017
12:15 PM
In Storm's nomenclature, 'nimbus' is the cluster manager: http://storm.apache.org/releases/1.0.1/Setting-up-a-Storm-cluster.html
Spark calls the cluster manager the 'master': http://spark.apache.org/docs/latest/spark-standalone.html
02-02-2017
11:39 AM
Hello,

Both Storm & Spark support local mode. In Storm you need to create a LocalCluster instance and then submit your topology to it. You can find a description and an example at these links:
http://storm.apache.org/releases/1.0.2/Local-mode.html
https://github.com/apache/storm/blob/1.0.x-branch/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java#L98

Spark's approach to local mode is somewhat different. The allocation is controlled through the Spark master setting, which can be set to local (or local[*], or local[N] where N is a number). If local is specified, the executors are started on your machine.

Both Storm and Spark have monitoring capabilities through a web interface. You can find details about them here:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_storm-component-guide/content/using-storm-ui.html
http://spark.apache.org/docs/latest/monitoring.html

YARN is not a requirement but an option for distributed mode; both Spark & Storm are able to function on their own.
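For the Storm side, a minimal sketch of local-mode submission; it uses the TestWordSpout that ships with storm-core just to have something to run, and the topology name and run duration are arbitrary. The Spark equivalent is simply setting the master to local[*] when building the SparkConf or SparkSession.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class LocalModeSketch {
    public static void main(String[] args) throws Exception {
        // Build a trivial topology; TestWordSpout just emits random words,
        // which is enough to see local mode running.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);

        Config conf = new Config();
        conf.setDebug(true);

        // LocalCluster runs Nimbus, Supervisor and workers inside this JVM,
        // so no Storm daemons need to be installed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-demo", conf, builder.createTopology());

        Thread.sleep(10000);       // let the topology run for a while
        cluster.killTopology("local-demo");
        cluster.shutdown();
    }
}
```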