Member since: 01-14-2019
Posts: 144
Kudos Received: 48
Solutions: 17
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1305 | 10-05-2018 01:28 PM
 | 1123 | 07-23-2018 12:16 PM
 | 1461 | 07-23-2018 12:13 PM
 | 7306 | 06-25-2018 03:01 PM
 | 4944 | 06-20-2018 12:15 PM
06-18-2018
01:42 PM
1 Kudo
You may not be accounting for the driver RAM. Spark creates a driver process that acts as the "parent" from which the executor processes are spawned as separate YARN containers. You are specifying 2GB of executor memory, but you did not specify the driver's memory limit; by default the driver is allocated 1GB of RAM, which explains your calculations. https://spark.apache.org/docs/latest/configuration.html
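For example, a spark-submit invocation that sizes the driver explicitly might look like this (a minimal sketch; the memory values, executor count, and jar name are placeholders):
# Size the driver explicitly; without --driver-memory it defaults to 1g,
# which is the extra gigabyte showing up in the memory math.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  --num-executors 4 \
  my-app.jar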
06-12-2018
01:59 PM
According to the relevant Phoenix JIRA ticket, this is still in the Unresolved state, so you will have to use the HBase APIs directly. https://issues.apache.org/jira/browse/PHOENIX-590
05-23-2018
01:49 PM
1 Kudo
@Bhanu Pamu Please refer to the following section of our documentation to enable HDFS storage for Zeppelin HA: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/ch_zeppelin_upgrade_hdfs_storage.htm
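In essence, that procedure points Zeppelin's notebook storage at HDFS so both HA instances share the same notebooks. As a rough sketch of the resulting settings (shown as key = value; the class and path are assumptions for the Zeppelin 0.7.x line, so follow the linked doc for the exact steps):
# zeppelin-site settings (edited via Ambari > Zeppelin > Configs) -- illustrative values
zeppelin.notebook.storage = org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo
zeppelin.notebook.dir = /user/zeppelin/notebook    # HDFS path visible to both instances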
05-16-2018
12:06 PM
I couldn't find any documentation on this specific calculation, but you can understand it through testing as you have already. If you'd like to verify, insert a 2GB file into HDFS and get measurements before and after the insert. You should see the numbers change by the respective amounts.
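For example, a quick before/after check could look like this (a sketch; the file name and target path are placeholders):
# Snapshot the numbers before the insert
hdfs dfsadmin -report | grep "DFS Used"
hdfs dfs -du -s -h /
# Load a ~2GB test file into HDFS
hdfs dfs -put big-2gb-file.bin /tmp/
# Snapshot again and compare the deltas
hdfs dfsadmin -report | grep "DFS Used"
hdfs dfs -du -s -h /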
05-16-2018
11:56 AM
You actually should not have to 'source Profile' before running cbd start. If you run cbd start in the same directory as Profile, then it will pull in those variables for you without having to 'source' them from the file.
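For example (the deployment directory path is an assumption; use whichever directory holds your Profile):
# cbd reads Profile from the current working directory,
# so run it from the directory that contains the file.
cd /var/lib/cloudbreak-deployment
cbd start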
05-15-2018
04:46 PM
@Vinuraj M This is referring to the replication factor of HDFS, which defaults to 3. This means that files you place on HDFS are stored 3 times on disks across the cluster for redundancy and node-failure tolerance. Therefore your 'du -h' will give you the sum of the file sizes you have placed on HDFS, whereas the HDFS disk usage will give you the total disk space consumed. 6.XX GB * 3 replication factor = ~19 GB
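To see the two numbers side by side (a sketch):
# Logical size of the files you placed on HDFS (a single copy)
hdfs dfs -du -s -h /
# Raw disk consumed across the cluster (all replicas),
# roughly 3x the logical size at the default replication factor
hdfs dfsadmin -report | grep "DFS Used"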
05-15-2018
04:34 PM
@Pankaj Singh Complete instructions for setting up Cloudbreak on Google Cloud can be found here: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.4.0/content/gcp-launch/index.html
UAA_DEFAULT_USER_PW needs to be set in a file named 'Profile', from which you initialize Cloudbreak.
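For example, the relevant line in Profile looks something like this (the value is a placeholder):
# Profile -- picked up by cbd when run from this directory
export UAA_DEFAULT_USER_PW=MySecurePassword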
05-11-2018
12:31 PM
1 Kudo
@Jorge Florencio We currently use Hive 2 for LLAP functionality and Hive 1 for Hive on Tez. If you enable "Interactive Query" in the Configs tab, you'll see the following output when you connect via Beeline. Let me know if there is a particular reason you want to use Hive 2.
Connected to: Apache Hive (version 2.1.0.2.6.4.0-91)
Driver: Hive JDBC (version 1.2.1000.2.6.4.0-91)
Transaction isolation: TRANSACTION_REPEATABLE_READ
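For reference, connecting to the interactive (LLAP) HiveServer2 with Beeline looks something like this (the host and port are assumptions; copy the exact JDBC URL from the Hive summary page in Ambari):
beeline -u "jdbc:hive2://<llap-host>:10500/default" -n hive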
05-10-2018
02:44 PM
2 Kudos
This article is based on the following Kaggle competition:
https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services
It is a Scala-based implementation of the data science exploration, which was originally written in Python. In addition to training a model, we can also batch-evaluate a set of data stored in a file against the trained model.
Full configuration, build, and installation instructions can be found at the GitHub repo:
https://github.com/anarasimham/anomaly-detection
When you execute the model training, you'll get various lines of output as the data is cleaned and the model is built. To view this output, use the tracking URL provided in the Spark job console output, which will look like the following:
18/05/10 14:33:58 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: <HOST_IP>
ApplicationMaster RPC port: 0
queue: default
start time: 1525962717635
final status: SUCCEEDED
tracking URL: http://<SPARK_SERVER_HOST_NAME>:8088/proxy/application_1525369563028_0053/
user: root
The last few lines, which show the trained model's predictions on the held-out test data, look like this:
+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|label|features |probabilities |prediction|
+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|0 |(10,[0,2,5,8,9],[1.0,1950.77,106511.31,-1950.77,104560.54]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |(10,[0,2,5,8,9],[1.0,3942.44,25716.56,-3942.44,21774.120000000003]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,7276.69,93.0,0.0,1463.0,0.0,0.0,-7183.69,-5813.69] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |(10,[0,2,5,8,9],[1.0,13614.91,30195.0,-13614.91,16580.09]) |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,17488.56,14180.0,0.0,182385.22,199873.79,0.0,-3308.5600000000013,-34977.130000000005] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,19772.53,0.0,0.0,44486.99,64259.52,0.0,-19772.53,-39545.06] |[0.9777384996414185,0.02226151153445244] |0.0 |
|1 |(10,[0,2,3,7,9],[1.0,20128.0,20128.0,1.0,-20128.0]) |[0.022419333457946777,0.9775806665420532]|1.0 |
|0 |[1.0,0.0,33782.98,0.0,0.0,39134.79,16896.7,0.0,-33782.98,-11544.890000000003] |[0.9777384996414185,0.02226151153445244] |0.0 |
|0 |[1.0,0.0,34115.82,32043.0,0.0,245.56,34361.39,0.0,-2072.8199999999997,-68231.65] |[0.9777384996414185,0.02226151153445244] |0.0 |
The original data is split into training data and test data, and the table above shows the results of running the test data through the model.
The "label" column denotes which label (0=legitimate, 1=fraudulent) the row of data truly falls into
The "features" column is all the data that went into training the model, in vectorized format because that is the way the model understands the data
The "probabilities" column denotes how likely the model thinks each of the labels is (first number being 0, second number being 1), and the "prediction" column is what the model thinks the data falls into. You can add additional print statements and re-run the training to explore When you execute the evaluation portion of this project (instructions in the GitHub repo), you will re-load the model from disk and use test data from a file to see if the model is predicting correctly. Note that it is a bad practice to use test data from the training set (like I have) but for simplicity I have done that. You can go to the Spark UI as above to view the output. And there you have it, a straightforward approach to building a Gradient Boosted Decision Tree Machine Learning model based off of financial data. This approach can be applied not only to Finance but can be used to train a whole variety of use cases in other industries.
03-27-2018
06:18 PM
Could you share more details? It sounds like you are not running as a privileged user if you're getting an 'operation not permitted' error, but it is hard to know without context. Are any of your other services starting? Have you ever been able to start the Spark2 History Server, or is this a recent issue? What does your cluster look like (number of nodes and specs for each node)? Are there any other issues you've noticed, or other details you can provide?