Member since: 09-29-2015
Posts: 67
Kudos Received: 115
Solutions: 7
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1234 | 01-05-2016 05:03 PM
 | 1870 | 12-31-2015 07:02 PM
 | 1749 | 11-04-2015 03:38 PM
 | 2156 | 10-19-2015 01:42 AM
 | 1212 | 10-15-2015 02:22 PM
03-26-2017 03:51 PM
MergeContent is the next processor in the flow. Is there sample code showing what saving a reference to the latest FlowFile version looks like?
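For context, here is a minimal sketch of the pattern being asked about, in a hypothetical custom processor (the class name, attribute names, and relationship below are illustrative assumptions, not taken from the MergeContent source):
import java.util.Collections;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class LatestReferenceExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Every mutating session call returns a NEW FlowFile version, so the
        // returned reference must be kept and re-assigned. Handing a stale
        // reference back to the session is what produces "is not the most
        // recent version of this FlowFile within this session".
        flowFile = session.putAttribute(flowFile, "example.stage", "pre-merge");
        flowFile = session.putAllAttributes(flowFile,
                Collections.singletonMap("example.note", "keep the returned reference"));
        session.transfer(flowFile, REL_SUCCESS);
    }
}
The key point is that session.putAttribute() and similar calls return a new FlowFile version, and only that returned reference is valid for later session calls.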
03-25-2017 10:34 PM
1 Kudo
The NiFi UI shows that there are ~47k FlowFiles pending, but when I try to list the files in the queue, I get the message "The queue has no FlowFiles". Looking at the logs, I see the below. Is there any way to fix the issue without clearing the repository files?
Error:
2017-03-15 15:43:44,165 WARN [Timer-Driven Process Thread-10] o.a.n.processors.standard.MergeContent MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] Processor Administratively Yielded for 1 sec due to processing failure
2017-03-15 15:43:44,165 WARN [Timer-Driven Process Thread-10] o.a.n.c.t.ContinuallyRunProcessorTask Administratively Yielding MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] due to uncaught Exception: org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=eb725d22-3e02-4283-a0ed-9b2d4c92cbb9,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489605322168-114661, container=default, section=997], offset=237240, length=217],offset=0,name=1379541586731942,size=217] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125718])
2017-03-15 15:43:44,170 WARN [Timer-Driven Process Thread-10] o.a.n.c.t.ContinuallyRunProcessorTask
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=eb725d22-3e02-4283-a0ed-9b2d4c92cbb9,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489605322168-114661, container=default, section=997], offset=237240, length=217],offset=0,name=1379541586731942,size=217] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125718])
at org.apache.nifi.controller.repository.StandardProcessSession.migrate(StandardProcessSession.java:1121) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.repository.StandardProcessSession.migrate(StandardProcessSession.java:1102) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.processor.util.bin.Bin.offer(Bin.java:142) ~[na:na]
at org.apache.nifi.processor.util.bin.BinManager.offer(BinManager.java:194) ~[na:na]
at org.apache.nifi.processor.util.bin.BinFiles.binFlowFiles(BinFiles.java:279) ~[na:na]
at org.apache.nifi.processor.util.bin.BinFiles.onTrigger(BinFiles.java:178) ~[na:na]
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-03-15 15:43:45,185 ERROR [Timer-Driven Process Thread-9] o.a.n.processors.standard.MergeContent MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] failed to process session due to org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=8ffd891d-baa5-46d2-8ddd-733518c2aa94,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489606333097-5, container=default, section=5], offset=19170, length=216],offset=0,name=844025400925,size=216] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125721]): org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=8ffd891d-baa5-46d2-8ddd-733518c2aa94,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489606333097-5, container=default, section=5], offset=19170, length=216],offset=0,name=844025400925,size=216] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125721])
Labels:
- Apache NiFi
03-16-2017 01:44 AM
Here is what I wrote:
cat <<EOF | /usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure &> /dev/null
!brief
!set outputformat csv
!set showwarnings false
!set silent true
!set headerinterval 0
!set showelapsedtime false
!set incremental true
!set showheader true
!record /shared/backup.csv
select * from d.test_table;
!record
!quit
EOF
Working well so far (13M rows) after implementing suggestions here: https://community.hortonworks.com/content/supportkb/49037/phoenix-sqlline-query-on-larger-data-set-fails-wit.html
06-09-2016 02:45 AM
In the case of an Oozie SSH action, where does the Oozie workflow initiate the SSH action from? Does it execute from any of the data nodes?
It executes from the current Oozie server, which usually runs on a master node. Deploy your SSH private keys there. See this article on how to go from start to finish.
06-03-2016 04:11 PM
Here is an example: https://community.hortonworks.com/articles/36321/predicting-stock-portfolio-losses-using-monte-carl.html
05-30-2016 02:36 PM
4 Kudos
Predicting stock portfolio losses using Monte Carlo simulation in Spark
Summary
Have you ever asked yourself: what is the most money my stock holdings could lose in a single day? If you own stock through a 401k, a personal trading account, or employer-provided stock options, then you should absolutely ask yourself this question. Now think about how to answer it. Your first guess may be to pick a random number, say 20%, and assume that is the worst-case scenario. While simple, this is likely to be wildly inaccurate, and it certainly doesn't take into account the positive impact of a diversified portfolio. Surprisingly, a good estimate is hard to calculate. Luckily, financial institutions have to do this for their stock portfolios (a measure called Value at Risk, or VaR), and we can apply their methods to individual portfolios. In this article we will run a Monte Carlo simulation using real trading data to try to quantify what can happen to your portfolio. You should now go to your broker's website (Fidelity, E*Trade, etc.) and get the list of stocks that you own and the percentage that each holding represents of the total portfolio.
How it works
The Monte Carlo method is one that uses repeated sampling to predict a result. As a real-world example, think about how you might predict where your friend is aiming while throwing a dart at a dartboard. If you were following the Monte Carlo method, you'd ask your friend to throw 100 darts with the same aim, and then you'd make a prediction based on the largest cluster of darts. To predict stock returns, we are going to pick 1,000,000 previous trading dates at random and see what happened on those dates. The end result is going to be some aggregation of those results.
We will download historical stock trading data from Yahoo Finance and store it in HDFS. Then we will create a table in Spark like the one below and pick a million random dates from it.
Date | GS | AAPL | GE | OIL
---|---|---|---|---
2015-01-05 | -3.12% | -2.81% | -1.83% | -6.06%
2015-01-06 | -2.02% | -0.01% | -2.16% | -4.27%
2015-01-07 | +1.48% | +1.40% | +0.04% | +1.91%
2015-01-08 | +1.59% | +3.83% | +1.21% | +1.07%
Table 1: percent change per day by stock symbol
We combine the column values in the same proportions as your trading account. For example, if on Jan 5th, 2015 you had equally invested all of your money in GS, AAPL, GE, and OIL, then you would have lost:
% loss on 2015-01-05 = -3.12*(1/4) - 2.81*(1/4) - 1.83*(1/4) - 6.06*(1/4) = -3.46%
At the end of a Monte Carlo simulation we have 1,000,000 values that represent the possible gains and losses. We sort the results and take the 5th, 50th, and 95th percentiles to represent the worst-case, average-case, and best-case scenarios.
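To make this concrete, here is a minimal, self-contained sketch of that loop in plain Java (this is not the Spark code from the GitHub repo; the class name, the hard-coded Table 1 sample data, and the equal weights are illustrative assumptions):
import java.util.Arrays;
import java.util.Random;

public class MonteCarloVarSketch {
    public static void main(String[] args) {
        // Daily % changes per stock (columns: GS, AAPL, GE, OIL), taken from Table 1.
        // A real run would use the full downloaded trading history, not 4 days.
        double[][] dailyChanges = {
            {-3.12, -2.81, -1.83, -6.06},
            {-2.02, -0.01, -2.16, -4.27},
            { 1.48,  1.40,  0.04,  1.91},
            { 1.59,  3.83,  1.21,  1.07},
        };
        double[] weights = {0.25, 0.25, 0.25, 0.25}; // equal-weight portfolio

        int trials = 1_000_000;
        double[] results = new double[trials];
        Random random = new Random();
        for (int i = 0; i < trials; i++) {
            // Sample one historical trading day at random...
            double[] day = dailyChanges[random.nextInt(dailyChanges.length)];
            // ...and combine the per-stock changes using the portfolio weights.
            double portfolioChange = 0.0;
            for (int s = 0; s < day.length; s++) {
                portfolioChange += day[s] * weights[s];
            }
            results[i] = portfolioChange;
        }

        // Sort and read the percentiles off the simulated distribution.
        Arrays.sort(results);
        System.out.printf("worst case (5th percentile):   %.2f%%%n", results[(int) (trials * 0.05)]);
        System.out.printf("most likely (50th percentile): %.2f%%%n", results[trials / 2]);
        System.out.printf("best case (95th percentile):   %.2f%%%n", results[(int) (trials * 0.95)]);
    }
}
The actual code applies the same idea with Spark RDDs over the full downloaded history, so the sampling and aggregation are distributed rather than run in a single loop.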
When you run the steps below, you'll see this in the output:
In a single day, this is what could happen to your stock holdings if you have $1000 invested
$ %
worst case -33 -3.33%
most likely scenario -1 -0.14%
best case 23 2.28%
The code on GitHub also has examples of:
- How to use Java 8 Lambda Expressions
- Executing Hive SQL with Spark RDD objects
- Unit testing Spark code with hadoop-mini-clusters
Detailed Step-by-step guide
1. Download and install the HDP Sandbox
Download the latest (2.4 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance's IP. On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\hosts as administrator. Add a line similar to the one below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download code and prerequisites
Log into the Sandbox and execute:
useradd guest
su - hdfs -c "hdfs dfs -mkdir /user/guest; hdfs dfs -chown guest:hdfs /user/guest; "
yum install -y java-1.8.0-openjdk-devel.x86_64
#update-alternatives --install /usr/lib/jvm/java java_sdk /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el6_7.x86_64 100
cd /tmp
git clone https://github.com/vzlatkin/MonteCarloVarUsingRealData.git
3. Update list of stocks that you own
Update companies_list.txt with the list of companies that you own in your stock portfolio and either the portfolio weight (as %/100) or the dollar amount. You should be able to get this information from your broker's website (Fidelity, Scottrade, etc.). Take out any extra commas (,) if you are copying and pasting from the web. The provided sample looks like this:
Symbol,Weight or dollar amount (must include $)
GE,$250
AAPL,$250
GS,$250
OIL,$250
4. Download historical trading data for the stocks you own
Execute:
cd /tmp/MonteCarloVarUsingRealData/
/bin/bash downloadHistoricalData.sh
# Downloading historical data for GE
# Downloading historical data for AAPL
# Downloading historical data for GS
# Downloading historical data for OIL
# Saved to /tmp/stockData/
5. Run the MonteCarlo simulation
Execute:
su - guest -c " /usr/hdp/current/spark-client/bin/spark-submit --class com.hortonworks.example.Main --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 --queue default /tmp/MonteCarloVarUsingRealData/target/monte-carlo-var-1.0-SNAPSHOT.jar hdfs:///tmp/stockData/companies_list.txt hdfs:///tmp/stockData/*.csv"
Interpreting the Results
Below is the result for a sample portfolio that has $1,000 invested equally between Apple, GE, Goldman Sachs, and an ETF that holds crude oil. It says that, with 95% certainty, the most the portfolio can lose in a single day is $33. Likewise, there is a 5% chance that the portfolio will gain at least $23 in a single day. Most of the time (the median outcome), the portfolio will lose about $1 per day.
In a single day, this is what could happen to your stock holdings if you have $1000 invested
$ %
worst case -33 -3.33%
most likely scenario -1 -0.14%
best case 23 2.28%
05-27-2016 09:01 PM
@Constantin Stanca Would you share some of the specific optimizations mentioned in your article? "performance could be improved by ... using the operating system side optimization to take advantage of the most recent hardware NUMA capable."
05-24-2016 11:21 PM
I ended up storing the file in HDFS and reading it through sc.textFile(args[0])
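For anyone landing here later, a minimal sketch of that approach with the Spark Java API (the app name and example path are placeholders):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("read-from-hdfs");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // args[0] is an HDFS path, e.g. hdfs:///tmp/stockData/companies_list.txt
        JavaRDD<String> lines = sc.textFile(args[0]);
        System.out.println("line count: " + lines.count());
        sc.stop();
    }
}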
04-16-2016 10:23 PM
Source article no longer exists. I used this: http://www.r-bloggers.com/interactive-data-science-with-r-in-apache-zeppelin-notebook/
03-14-2016 07:55 PM
1 Kudo
I used grep '/var/log' /var/lib/ambari-agent/cache/cluster_configuration/* to identify all locations that needed to be changed, and then Ambari's configs.sh to make the adjustments.