Member since: 09-29-2015
Posts: 67
Kudos Received: 115
Solutions: 7
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 659 | 01-05-2016 05:03 PM
 | 907 | 12-31-2015 07:02 PM
 | 915 | 11-04-2015 03:38 PM
 | 1051 | 10-19-2015 01:42 AM
 | 556 | 10-15-2015 02:22 PM
03-26-2017
03:51 PM
MergeContent is the next processor in the flow. Is there sample code of what saving a reference to the latest flow file version looks like?
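Not the MergeContent source itself, but a minimal sketch of the pattern the framework is enforcing, assuming a custom processor written against the standard ProcessSession API (the processor class, relationship, and attribute name below are made up): every mutating call returns a new FlowFile version, and that returned reference is the one that must be used from then on.
import java.util.Collections;
import java.util.Set;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class KeepLatestVersionExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // WRONG: ignoring the return value leaves 'flowFile' pointing at a stale version, and a
        // later session.transfer()/migrate() throws FlowFileHandlingException
        // ("... is not the most recent version of this FlowFile within this session").
        // session.putAttribute(flowFile, "example.attr", "value");

        // RIGHT: re-assign the reference returned by every call that modifies the FlowFile.
        flowFile = session.putAttribute(flowFile, "example.attr", "value");

        session.transfer(flowFile, REL_SUCCESS);
    }
}
Stock processors such as MergeContent already follow this rule internally, so the sketch is only meant to illustrate what the exception means by "most recent version", not to suggest a fix inside MergeContent itself.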
03-25-2017
10:34 PM
1 Kudo
The NiFi UI shows that there are ~47k FlowFiles pending, but when I try to list the files in the queue, I get the message "The queue has no FlowFiles". Looking at the logs, I see the errors below. Is there any way to fix the issue without clearing the repository files?
2017-03-15 15:43:44,165 WARN [Timer-Driven Process Thread-10] o.a.n.processors.standard.MergeContent MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] Processor Administratively Yielded for 1 sec due to processing failure
2017-03-15 15:43:44,165 WARN [Timer-Driven Process Thread-10] o.a.n.c.t.ContinuallyRunProcessorTask Administratively Yielding MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] due to uncaught Exception: org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=eb725d22-3e02-4283-a0ed-9b2d4c92cbb9,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489605322168-114661, container=default, section=997], offset=237240, length=217],offset=0,name=1379541586731942,size=217] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125718])
2017-03-15 15:43:44,170 WARN [Timer-Driven Process Thread-10] o.a.n.c.t.ContinuallyRunProcessorTask
org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=eb725d22-3e02-4283-a0ed-9b2d4c92cbb9,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489605322168-114661, container=default, section=997], offset=237240, length=217],offset=0,name=1379541586731942,size=217] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125718])
at org.apache.nifi.controller.repository.StandardProcessSession.migrate(StandardProcessSession.java:1121) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.repository.StandardProcessSession.migrate(StandardProcessSession.java:1102) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.processor.util.bin.Bin.offer(Bin.java:142) ~[na:na]
at org.apache.nifi.processor.util.bin.BinManager.offer(BinManager.java:194) ~[na:na]
at org.apache.nifi.processor.util.bin.BinFiles.binFlowFiles(BinFiles.java:279) ~[na:na]
at org.apache.nifi.processor.util.bin.BinFiles.onTrigger(BinFiles.java:178) ~[na:na]
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) ~[nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.2.jar:1.1.2]
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.2.jar:1.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-03-15 15:43:45,185 ERROR [Timer-Driven Process Thread-9] o.a.n.processors.standard.MergeContent MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] MergeContent[id=015a1000-e3f5-15e4-c526-439d8b4f2216] failed to process session due to org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=8ffd891d-baa5-46d2-8ddd-733518c2aa94,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489606333097-5, container=default, section=5], offset=19170, length=216],offset=0,name=844025400925,size=216] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125721]): org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=8ffd891d-baa5-46d2-8ddd-733518c2aa94,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489606333097-5, container=default, section=5], offset=19170, length=216],offset=0,name=844025400925,size=216] is not the most recent version of this FlowFile within this session (StandardProcessSession[id=125721])
Labels: Apache NiFi
03-25-2017
05:41 PM
I'm looking for a generic guide on tuning Phoenix queries, as well as specific answers to the questions below:
- Is there a table showing how to compare different keywords in explain plans? For example, if everything else were the same, which would be faster: "PARALLEL 1-WAY ROUND ROBIN FULL SCAN" or "PARALLEL 9-WAY FULL SCAN"?
- Is there a way to determine whether all of my RegionServers are being used during the execution of a particular query?
- What does the 9-CHUNK in "CLIENT 9-CHUNK 4352857 ROWS 943718447 BYTES" mean? Are more CHUNKs better or worse?
- What is the impact of using functions, such as HOUR(ts), on query execution time? Is the impact 1%, 10%, 50%, etc.?
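Not a full tuning guide, but one way to gather the plans you want to compare is to run EXPLAIN for each candidate query and diff the output. Below is a rough sketch using the Phoenix JDBC client; the connection string matches the sandbox sqlline example elsewhere on this page, and the queries are placeholders (the table name is borrowed from the CSV-export example below).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanDump {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum and znode; adjust for your cluster.
        String url = "jdbc:phoenix:localhost:2181:/hbase-unsecure";
        String[] candidates = {
                "SELECT COUNT(*) FROM D.TEST_TABLE",                    // hypothetical baseline query
                "SELECT COUNT(*) FROM D.TEST_TABLE WHERE HOUR(TS) = 9"  // same query using a function on a column
        };
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            for (String sql : candidates) {
                System.out.println("-- " + sql);
                // EXPLAIN returns one row per plan step; print them for side-by-side comparison.
                try (ResultSet rs = stmt.executeQuery("EXPLAIN " + sql)) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
                System.out.println();
            }
        }
    }
}
For the HOUR(ts) question, timing the same query with and without the function against the same data set is usually the most direct way to measure the impact on your own cluster.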
03-16-2017
01:44 AM
Here is what I wrote:
cat <<EOF | /usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure &> /dev/null
!brief
!set outputformat csv
!set showwarnings false
!set silent true
!set headerinterval 0
!set showelapsedtime false
!set incremental true
!set showheader true
!record /shared/backup.csv
select * from d.test_table;
!record
!quit
EOF
Working well so far (13M rows) after implementing the suggestions here: https://community.hortonworks.com/content/supportkb/49037/phoenix-sqlline-query-on-larger-data-set-fails-wit.html
08-18-2016
03:11 PM
Here is more sample code that uses this project: https://github.com/vzlatkin/MonteCarloVarUsingRealData
06-09-2016
02:45 AM
"In the case of an Oozie SSH action, where does the Oozie workflow initiate the SSH action from? Does it execute from any of the data nodes?" The action executes from the current Oozie server, which usually runs on a master node, so deploy your SSH private keys there. See this article on how to go from start to finish.
06-03-2016
04:11 PM
Here is an example: https://community.hortonworks.com/articles/36321/predicting-stock-portfolio-losses-using-monte-carl.html
05-30-2016
02:36 PM
4 Kudos
Predicting stock portfolio losses using Monte Carlo simulation in Spark
Summary
Have you ever asked yourself: what is the most money my stock holdings could lose in a single day? If you own stock through a 401k, a personal trading account, or employer-provided stock options, then you should absolutely ask yourself this question. Now think about how to answer it. Your first guess may be to pick a random number, say 20%, and assume that is the worst-case scenario. While simple, this is likely to be wildly inaccurate and certainly doesn't take into account the positive effects of a diversified portfolio. Surprisingly, a good estimate is hard to calculate. Luckily, financial institutions have to do this for their stock portfolios (it is called Value at Risk (VaR)), and we can apply their methods to individual portfolios. In this article we will run a Monte Carlo simulation using real trading data to quantify what can happen to your portfolio. Before you start, go to your broker's website (Fidelity, E*Trade, etc.) and get a list of the stocks that you own and the percentage each holding represents of the total portfolio.
How it works
The Monte Carlo method is one that uses repeated sampling to predict a result. As a real-world example, think about how you might predict where your friend is aiming while throwing a dart at a dartboard. If you were following the Monte Carlo method, you'd ask your friend to throw 100 darts with the same aim, and then you'd make a prediction based on the largest cluster of darts. To predict stock returns, we are going to pick 1,000,000 previous trading dates at random and see what happened on those dates. The end result is an aggregation of those outcomes.
We will download historical stock trading data from Yahoo Finance and store it in HDFS. Then we will create a table in Spark like the one below and pick a million random dates from it.
Date | GS | AAPL | GE | OIL
---|---|---|---|---
2015-01-05 | -3.12% | -2.81% | -1.83% | -6.06%
2015-01-06 | -2.02% | -0.01% | -2.16% | -4.27%
2015-01-07 | +1.48% | +1.40% | +0.04% | +1.91%
2015-01-08 | +1.59% | +3.83% | +1.21% | +1.07%
Table 1: percent change per day by stock symbol
We combine the column values with the same proportions as your trading account. For example, if on Jan 5th 2015 you had invested all of your money equally in GS, AAPL, GE, and OIL, then you would have lost:
% loss on 2015-01-05 = -3.12*(1/4) - 2.81*(1/4) - 1.83*(1/4) - 6.06*(1/4) = -3.455%
At the end of a Monte Carlo simulation we have 1,000,000 values that represent the possible gains and losses. We sort the results and take the 5th percentile, 50th percentile, and 95th percentile to represent the worst-case, average case, and best case scenarios.
When you run the below, you'll see this in the output
In a single day, this is what could happen to your stock holdings if you have $1000 invested
$ %
worst case -33 -3.33%
most likely scenario -1 -0.14%
best case 23 2.28%
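The project does this with Spark and Java 8 lambdas (see the GitHub repository referenced below), but the core sampling-and-percentile idea fits in a short plain-Java sketch; the return table and weights here are hard-coded stand-ins for the data that the real job loads from HDFS.
import java.util.Arrays;
import java.util.Random;

public class MonteCarloVarSketch {
    public static void main(String[] args) {
        // Daily % changes per symbol (GS, AAPL, GE, OIL), one row per trading day -- stand-in data.
        double[][] dailyReturns = {
                {-3.12, -2.81, -1.83, -6.06},
                {-2.02, -0.01, -2.16, -4.27},
                { 1.48,  1.40,  0.04,  1.91},
                { 1.59,  3.83,  1.21,  1.07},
        };
        double[] weights = {0.25, 0.25, 0.25, 0.25};   // equal-weighted portfolio
        int trials = 1_000_000;
        double investment = 1000.0;

        Random rnd = new Random();
        double[] outcomes = new double[trials];
        for (int i = 0; i < trials; i++) {
            // Pick a random historical trading day and apply its returns with the portfolio weights.
            double[] day = dailyReturns[rnd.nextInt(dailyReturns.length)];
            double pctChange = 0.0;
            for (int s = 0; s < weights.length; s++) {
                pctChange += day[s] * weights[s];
            }
            outcomes[i] = pctChange;
        }

        // Sort and read off the 5th, 50th, and 95th percentiles.
        Arrays.sort(outcomes);
        System.out.printf("worst case (5th pct):  %.2f%% (%.0f USD)%n",
                outcomes[trials / 20], investment * outcomes[trials / 20] / 100);
        System.out.printf("most likely (median):  %.2f%% (%.0f USD)%n",
                outcomes[trials / 2], investment * outcomes[trials / 2] / 100);
        System.out.printf("best case (95th pct):  %.2f%% (%.0f USD)%n",
                outcomes[trials - trials / 20], investment * outcomes[trials - trials / 20] / 100);
    }
}
With only four stand-in trading days the numbers will not match the output above; the real job samples from the full downloaded price history.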
The code on GitHub also has examples of:
How to use Java 8 Lambda Expressions
Executing Hive SQL with Spark RDD objects
Unit testing Spark code with hadoop-mini-clusters
Detailed Step-by-step guide
1. Download and install the HDP Sandbox
Download the latest (2.4 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance's IP. On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\hosts as administrator, and add a line similar to the one below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download code and prerequisites
Log into the Sandbox and execute:
useradd guest
su - hdfs -c "hdfs dfs -mkdir /user/guest; hdfs dfs -chown guest:hdfs /user/guest; "
yum install -y java-1.8.0-openjdk-devel.x86_64
#update-alternatives --install /usr/lib/jvm/java java_sdk /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el6_7.x86_64 100
cd /tmp
git clone https://github.com/vzlatkin/MonteCarloVarUsingRealData.git
3. Update list of stocks that you own
Update companies_list.txt with the list of companies that you own in your stock portfolio and either the portfolio weight (as %/100) or the dollar amount. You should be able to get this information from your broker's website (Fidelity, Scottrade, etc.). Remove any extra commas (,) if you are copying and pasting from the web. The provided sample looks like this:
Symbol,Weight or dollar amount (must include $)
GE,$250
AAPL,$250
GS,$250
OIL,$250
4. Download historical trading data for the stocks you own
Execute:
cd /tmp/MonteCarloVarUsingRealData/
/bin/bash downloadHistoricalData.sh
# Downloading historical data for GE
# Downloading historical data for AAPL
# Downloading historical data for GS
# Downloading historical data for OIL
# Saved to /tmp/stockData/
5. Run the MonteCarlo simulation
Execute:
su - guest -c " /usr/hdp/current/spark-client/bin/spark-submit --class com.hortonworks.example.Main --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 --queue default /tmp/MonteCarloVarUsingRealData/target/monte-carlo-var-1.0-SNAPSHOT.jar hdfs:///tmp/stockData/companies_list.txt hdfs:///tmp/stockData/*.csv"
Interpreting the Results
Below is the result for a sample portfolio that has $1,000 invested equally between Apple, GE, Goldman Sachs, and an ETF that holds crude oil. It says that, with 95% certainty, the most the portfolio should lose in a single day is $33. Likewise, there is a 5% chance that the portfolio will gain more than $23 in a single day. The median outcome is a loss of about $1 per day.
In a single day, this is what could happen to your stock holdings if you have $1000 invested
$ %
worst case -33 -3.33%
most likely scenario -1 -0.14%
best case 23 2.28%
- Find more articles tagged with:
- Data Science & Advanced Analytics
- hdp-2.4.0
- How-ToTutorial
- spark-1.6.0
- spark-sql
05-27-2016
09:01 PM
@Constantin Stanca Would you share some of the specific optimizations mentioned in your article? "performance could be improved by ... using the operating system side optimization to take advantage of the most recent hardware NUMA capable."
05-24-2016
11:21 PM
I ended up storing the file in HDFS and reading it through sc.textFile(args[0])
04-16-2016
10:23 PM
Source article no longer exists. I used this: http://www.r-bloggers.com/interactive-data-science-with-r-in-apache-zeppelin-notebook/
03-14-2016
07:55 PM
1 Kudo
I used: grep '/var/log' /var/lib/ambari-agent/cache/cluster_configuration/* to identify all locations that need to be changed, and then Ambari's configs.sh to make the adjustments.
03-14-2016
05:54 AM
11 Kudos
Summary
Enabling SSL encryption for the Web UIs that make up Hadoop is a tedious process that requires planning, learning to use security tools, and lots of mouse clicks through Ambari's UI. This article aims to simplify the process by presenting a semi-automated, start-to-finish example that enables SSL for the below Web UIs in the Hortonworks Sandbox:
- Ambari
- HBase
- Oozie
- Ranger
- HDFS
Planning
There is no substitute for reading the documentation. If you plan on enabling SSL in a production cluster, then make sure you are familiar with SSL concepts and the communication paths between each HDP component. In addition, plan on cluster downtime. Here are some concepts that you should know well:
Certificate Authority (CA): A Certificate Authority is a trusted company that signs certificates for a fee. On a Mac you can view the list of CAs that your computer trusts by opening the "Keychain Access" application and clicking on "System Roots". If you don't want to pay one of these companies to sign your certificates, you can generate your own CA; just beware that Google Chrome and other browsers will present a privacy warning.
Server SSL certificate: These files prove the identity of something, in our case HDP services. Usually there is one certificate per hostname, and it is signed by a CA. A certificate has two pieces: a private key and a public key. The public key, distributed in the certificate, is used to encrypt messages and verify signatures; the private key, which stays on the server, is used to decrypt those messages and produce the signatures.
Java private keystore: When Java HDP services need to encrypt messages, they need a place to look for the private key part of a server's SSL certificate. This keystore holds those private keys. It should be kept secure so that attackers cannot impersonate the service. For this reason, each HDP component in this article has its own private keystore.
Java trust keystore: Just like my Mac has a list of CAs that it trusts, a Java process on a Linux machine needs the same. This keystore will usually hold the public CA's certificate and any intermediate CA certificates. If a certificate was signed by a CA that you created yourself, then also add the public part of the server's SSL certificate to this keystore.
Ranger plugins: Ranger plugins communicate with the Ranger Admin server over SSL. What is important to understand is where each plugin executes and thus where server SSL certificates are needed: for HDFS, execution is on the NameNodes; for HBase, on the RegionServers; for YARN, on the ResourceManagers. When you create server SSL certificates, use the hostnames where the plugins execute.
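To make the private and trust keystore concepts above concrete, here is a small sketch of how a Java process opens a JKS file and lists what it holds; the default path and password are simply the ones used by the keytool validation tip later in this article, so treat them as placeholders.
import java.io.FileInputStream;
import java.security.KeyStore;
import java.util.Enumeration;

public class ListKeystoreEntries {
    public static void main(String[] args) throws Exception {
        // Placeholder path/password; substitute the keystore your own setup generates.
        String path = args.length > 0 ? args[0] : "/etc/hadoop/conf/hadoop-private-keystore.jks";
        char[] password = (args.length > 1 ? args[1] : "password").toCharArray();

        KeyStore ks = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream(path)) {
            ks.load(in, password);
        }

        Enumeration<String> aliases = ks.aliases();
        while (aliases.hasMoreElements()) {
            String alias = aliases.nextElement();
            // Private-key entries hold a server certificate's private key;
            // trusted-certificate entries are what a trust store is made of.
            String type = ks.isKeyEntry(alias) ? "private key" : "trusted certificate";
            System.out.println(alias + " : " + type);
        }
    }
}
This is effectively the programmatic version of the keytool -list command shown in the validation tips below.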
Enable SSL on HDP Sandbox
This part is rather easy. Install the HDP 2.4 Sandbox and follow the steps below. If you use an older version of the Sandbox, note that you'll need to change the Ambari password used in the script.
1. Download my script:
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
2. Stop all services via Ambari (manually stop HDFS or turn off Maintenance Mode).
3. Execute:
/bin/bash enable-ssl.sh --all
4. Start all services via Ambari, which is now running on port 8443.
5. Go to the Ranger Admin UI and edit the HDFS and HBase services to set the Common Name for Certificate to sandbox.hortonworks.com.
Enable SSL in production
There are two big reasons why enabling SSL in production can be more difficult than in a sandbox:
1. Hadoop components run in Highly Available (HA) mode. For most components the solution is to create a single server SSL certificate and copy it to all HA servers. However, for Oozie you'll need a special server SSL certificate with CN=*.domainname.com.
2. Public CAs are used to sign server SSL certificates. Besides the time needed for the CA to sign your certificates, you may also need additional steps to add intermediate CA certificates to the various Java trust stores and to find a CA that can sign non-FQDN server SSL certificates for Oozie HA.
If you are using Ranger to secure anything besides HBase and HDFS, then you will need to make changes to the script to enable the extra plugins.
The steps are similar to enabling SSL in the Sandbox:
1. Download my script:
wget "https://raw.githubusercontent.com/vzlatkin/EnableSSLinHDP/master/enable-ssl.sh"
2. Make changes to these variables inside of the script to reflect your cluster layout. The script uses these variables to generate certificates and copy them to all machines where they are needed. Below is an example for my three-node cluster.
server1="example1.hortonworks.com"
server2="example2.hortonworks.com"
server3="example3.hortonworks.com"
OOZIE_SERVER_ONE=$server2
NAMENODE_SERVER_ONE=$server1
RESOURCE_MANAGER_SERVER_ONE=$server3
HISTORY_SERVER=$server1
HBASE_MASTER_SERVER_ONE=$server2
RANGER_ADMIN_SERVER=$server1
ALL_NAMENODE_SERVERS="${NAMENODE_SERVER_ONE} $server2"
ALL_OOZIE_SERVERS="${OOZIE_SERVER_ONE} $server3"
ALL_HBASE_MASTER_SERVERS="${HBASE_MASTER_SERVER_ONE} $server3"
ALL_HBASE_REGION_SERVERS="$server1 $server2 $server3"
ALL_REAL_SERVERS="$server1 $server2 $server3"
ALL_HADOOP_SERVERS="$server1 $server2 $server3"
export AMBARI_SERVER=$server1
AMBARI_PASS=xxxx
CLUSTER_NAME=cluster1
3. If you are going to pay a Public CA to sign your server SSL certificates, then copy them to /tmp/security and name them as follows:
ca.crt
example1.hortonworks.com.crt
example1.hortonworks.com.key
example2.hortonworks.com.crt
example2.hortonworks.com.key
example3.hortonworks.com.crt
example3.hortonworks.com.key
hortonworks.com.crt
hortonworks.com.key
The last certificate is needed for Oozie if you have Oozie HA enabled; the CN of that certificate should be CN=*.domainname.com, as described here.
4. If you are NOT going to use a Public CA to sign your certificates, then change these lines in the script to be relevant to your organization:
/C=US/ST=New York/L=New York City/O=Hortonworks/OU=Consulting/CN=HortonworksCA
5. Stop all services via Ambari.
6. Execute:
/bin/bash enable-ssl.sh --all
7. Start all services via Ambari, which is now running on port 8443.
8. Go to the Ranger Admin UI and edit the HDFS and HBase services to set the Common Name for Certificate to the $NAMENODE_SERVER_ONE and $HBASE_MASTER_SERVER_ONE values that you specified in the script above.
If you chose not to enable SSL for some components, or decide to modify the script to include others (please send me a patch), then be aware of these dependencies:
- Setting up the Ambari trust store is required before enabling SSL encryption for any other component.
- Before you enable HBase SSL encryption, enable Hadoop SSL encryption.
Validation tips
View and verify SSL certificate being used by a server
openssl s_client -connect ${OOZIE_SERVER_ONE}:11443 -showcerts < /dev/null
View Oozie jobs through command-line
oozie jobs -oozie https://${OOZIE_SERVER_ONE}:11443/oozie
View certificates stored in a Java keystore
keytool -list -storepass password -keystore /etc/hadoop/conf/hadoop-private-keystore.jks
View Ranger policies for HDFS
cat example1.hortonworks.com.key example1.hortonworks.com.crt >> example1.hortonworks.com.pem
curl --cacert /tmp/security/ca.crt --cert /tmp/security/example1.hortonworks.com.pem "https://example1.hortonworks.com:6182/service/plugins/policies/download/cluster1_hadoop?lastKnownVersion=3&pluginId=hdfs@example1.hortonworks.com-cluster1_hadoop"
Validate that Ranger plugins can connect to Ranger admin server by searching for util.PolicyRefresher in HDFS NameNode and HBase RegionServer log files
References
- GitHub repo
- Documentation to enable SSL for Ambari
- Oozie: HDP documentation and Oozie documentation on apache.org
- Enable SSL encryption for Hadoop components
- Documentation for Ranger
- Find more articles tagged with:
- Encryption
- hdp-2.4
- How-ToTutorial
- Sandbox
- Security
- ssl
03-06-2016
11:35 PM
1 Kudo
Thanks. I'll watch this JIRA for progress: https://issues.apache.org/jira/browse/HIVE-10924
03-02-2016
03:59 AM
Problems fixed. There is no longer a step to chroot Solr directory in Zookeeper.
03-01-2016
05:46 PM
@Artem Ervits Thanks for giving this tutorial a try. If you are getting the errors on an HDP Sandbox, would you send me the .vmdk file? I'll take a look and see what needs to change in the tutorial.
03-01-2016
12:52 AM
Yes, I should have added a link to GitHub: https://github.com/vzlatkin/DoctorsNotes
03-01-2016
12:22 AM
12 Kudos
Summary
Because patients visit many doctors, trends in their ailments and complaints may be difficult to identify. The steps in this article will help you address exactly this problem by creating a TagCloud of the most frequent complaints per patient. Below is a sample:
We will generate random HL7 MDM^T02 (v2.3) messages that contain a doctor's note about a fake patient and that patient's fake complaint to their doctor. Apache NiFi will be used to parse these messages and send them to Apache Solr. Finally Banana is used to create the visual dashboard.
In the middle of the dashboard is a TagCloud where the more frequently mentioned symptoms for a selected patient appear larger than others. Because this project relies on randomly generated data, some interesting results are possible. In this case, I got lucky and all the symptoms seem related to the patient's most frequent complaint: Morning Alcohol Drinking. The list of all possible symptoms comes from Google searches.
Summary of steps
Download and install the HDP Sandbox
Download and install the latest NiFi release
Download the HL7 message generator
Create a Solr dashboard to visualize the results
Create and execute a new NiFi flow
Detailed Step-by-step guide
1. Download and install the HDP Sandbox
Download the latest (2.3 as of this writing) HDP Sandbox here. Import it into VMware or VirtualBox, start the instance, and update the DNS entry on your host machine to point to the new instance's IP. On Mac, edit /etc/hosts; on Windows, edit %systemroot%\system32\drivers\etc\hosts as administrator and add a line similar to the one below:
192.168.56.102 sandbox sandbox.hortonworks.com
2. Download and install the latest NiFi release
Follow the directions here. These were the steps that I executed for 0.5.1
wget http://apache.cs.utah.edu/nifi/0.5.1/nifi-0.5.1-bin.zip -O /tmp/nifi-0.5.1-bin.zip
cd /opt/
unzip /tmp/nifi-0.5.1-bin.zip
useradd nifi
chown -R nifi:nifi /opt/nifi-0.5.1/
perl -pe 's/run.as=.*/run.as=nifi/' -i /opt/nifi-0.5.1/conf/bootstrap.conf
perl -pe 's/nifi.web.http.port=8080/nifi.web.http.port=9090/' -i /opt/nifi-0.5.1/conf/nifi.properties
/opt/nifi-0.5.1/bin/nifi.sh start
3. Download the HL7 message generator
A big thank you to HAPI for their excellent library to parse and create HL7 messages on which my code relies. The generator creates a very simple MDM^T02 that includes an in-line note from a doctor. MDM stands for Medical Document Management, and T02 specifies that this is a message for a new document. For more details about this message type read this document. Here is a sample message for Beatrice Cunningham:
MSH|^~\&|||||20160229002413.415-0500||MDM^T02|7|P|2.3
EVN|T02|201602290024
PID|1||599992601||cunningham^beatrice^||19290611|F
PV1|1|O|Burn center^60^71
TXA|1|CN|TX|20150211002413||||||||DOC-ID-10001|||||AU||AV
OBX|1|TX|1001^Reason For Visit: |1|Evaluated patient for skin_scaling. ||||||F
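If you are curious what parsing one of these messages looks like in code, below is a rough sketch using HAPI (the library the generator relies on); it assumes hapi-base and hapi-structures-v23 are on the classpath, uses the sample message above, and omits error handling.
import ca.uhn.hl7v2.DefaultHapiContext;
import ca.uhn.hl7v2.HapiContext;
import ca.uhn.hl7v2.model.Message;
import ca.uhn.hl7v2.parser.PipeParser;
import ca.uhn.hl7v2.util.Terser;

public class ParseMdmT02 {
    public static void main(String[] args) throws Exception {
        // The sample MDM^T02 from above; HL7 v2 segments are separated by carriage returns.
        String hl7 = "MSH|^~\\&|||||20160229002413.415-0500||MDM^T02|7|P|2.3\r"
                + "EVN|T02|201602290024\r"
                + "PID|1||599992601||cunningham^beatrice^||19290611|F\r"
                + "PV1|1|O|Burn center^60^71\r"
                + "TXA|1|CN|TX|20150211002413||||||||DOC-ID-10001|||||AU||AV\r"
                + "OBX|1|TX|1001^Reason For Visit: |1|Evaluated patient for skin_scaling. ||||||F\r";

        HapiContext context = new DefaultHapiContext();
        PipeParser parser = context.getPipeParser();
        Message message = parser.parse(hl7);

        // Terser pulls fields by segment/field position; "/." searches for the segment anywhere.
        Terser terser = new Terser(message);
        System.out.println("Patient: " + terser.get("/.PID-5-2") + " " + terser.get("/.PID-5-1"));
        System.out.println("Note:    " + terser.get("/.OBX-5-1"));

        context.close();
    }
}
In the NiFi flow below this parsing step is handled by the flow itself, so a snippet like this is only useful for exploring the generated messages by hand.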
As a pre-requisite to executing the code, we need to install Java 8. Execute this on the Sandbox:
yum -y install java-1.8.0-openjdk.x86_64
Now, download the pre-built jar file that contains the HL7 generator and execute it to create a single message in /tmp/hl7-messages. I chose to store the jar file in /var/ftp/pub because my IDE uploads files there during code development. If you change this directory, also change it in the NiFi flow.
mkdir -p /var/ftp/pub
cd /var/ftp/pub
wget https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/target/hl7-generator-1.0-SNAPSHOT-shaded.jar
mkdir -p /tmp/hl7-messages/
/usr/lib/jvm/jre-1.8.0/bin/java -cp hl7-generator-1.0-SNAPSHOT-shaded.jar com.hortonworks.example.Main 1 /tmp/hl7-messages
chown -R nifi:nifi /tmp/hl7-messages/
4. Create a Solr dashboard to visualize the results
Now we need to configure Solr to ignore some words that don't add value. We do this by modifying stopwords.txt
cat <<EOF > /opt/hostname-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/stopwords.txt
adjustments
Admitted
because
blood
changes
complained
Discharged
Discussed
Drew
Evaluated
for
hospital
me
medication
of
patient
Performed
Prescribed
Reason
Recommended
Started
tests
The
to
treatment
visit
Visited
was
EOF
Next, we download the custom dashboard and start Solr in cloud mode
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
wget "https://raw.githubusercontent.com/vzlatkin/DoctorsNotes/master/other/Chronic%20Symptoms%20(Solr).json" -O /opt/hostname-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/default.json
/opt/hostname-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/hostname-hdpsearch/solr/bin/solr create -c hl7_messages -d data_driven_schema_configs -s 1 -rf 1
5. Create and execute a new NiFi flow
Start by downloading this NiFi template to your host machine.
To import the template, open the NiFi UI
Next, open Templates manager:
Click "Browse", then find the template on your local machine, click "Import", and close the Template Window.
Drag and drop to instantiate a new template:
Double click the new process group called HL7, and start all of the processes. To do so, hold down the Shift-key, and select all of the processes on the screen. Then click the "Start" button:
Here is a quick walk-through of the processors, starting in the top-left corner. First, we use the ListFile processor to get a directory listing from /tmp/hl7-messages. Second, the FetchFile processor reads each file one by one, passes the contents to the next step, and deletes the file if successful. Third, the text file is parsed as an HL7-formatted message. Next, the UpdateAttribute and AttributesToJSON processors get the contents ready for insertion into Solr. Finally, we use the PutSolrContentStream processor to add new documents via the Solr REST API. The remaining two processors at the very bottom are for spawning the custom Java code and logging details for troubleshooting.
Conclusion
Now open the Banana UI. You should see a dashboard that looks similar to the screenshot in the beginning of this article. You can see how many messages have been processed by clicking the link in the top-right panel called "Filter By".
Troubleshooting
If you are not seeing any data in Solr/Banana, then reload the page. Also perform a search via this page to validate that results are being indexed via Solr correctly.
Full source code is located in GitHub.
- Find more articles tagged with:
- Data Processing
- hl7
- How-ToTutorial
- NiFi
- solr
02-14-2016
05:26 PM
11 Kudos
For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script:
#!/usr/bin/env bash
# Print an indented, tree-like view of HDFS disk usage, starting from the largest top-level directories.
max_depth=5
# Root-level directories sorted by size, largest first
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "')
printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
printf "%15.0f %s\n" $(hdfs dfs -du -s $ld| cut -d' ' -f1) $ld
# Recursively list sub-directories of this root, keeping only those within $max_depth path levels
all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )
for d in $all_dirs; do
line=$(hdfs dfs -du -s $d)
size=$(echo $line | cut -d' ' -f1)
parent_dir=${d%/*}
child=${d##*/}
if [ -n "$parent_dir" ]; then
leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
d=${leading_dirs}/$child
fi
printf "%15.0f %s\n" $size $d
done
done
Sample output:
bytes directory
480376973 /hdp
480376973 |---/apps
480376973 |--------/2.3.4.0-3485
98340772 |---------------------/hive
210320342 |---------------------/mapreduce
97380893 |---------------------/pig
15830286 |---------------------/sqoop
58504680 |---------------------/tez
24453973 /user
0 |----/admin
3629715 |----/ambari-qa
3440200 |--------------/.staging
653010 |-----------------------/job_1454293069490_0001
- Find more articles tagged with:
- Cloud & Operations
- disk
- HDFS
- How-ToTutorial
- Space
- usage
02-10-2016
12:46 AM
1 Kudo
I found the documentation on how to do this without downtime: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#DataNode_Hot_Swap_Drive The only challenge I encountered was the :port: in the command; it is the dfs.datanode.ipc.address parameter from hdfs-site.xml. My full command looked like this:
su - hdfs -c "hdfs dfsadmin -reconfig datanode sandbox.hortonworks.com:8010 start"
02-09-2016
02:12 AM
@Ancil McBarnett is there any way to do this without downtime? Could you add a disk drive into a hot-swappable bay, add it to the DataNode's list of directories, force a rebalance, and then remove one of the old drives?
02-07-2016
11:08 PM
1 Kudo
I'm looking to generate reports on workflow performance in Oozie: failures, durations, users, etc... What is the best way to do so if Oozie is currently using DerbyDB and not MySQL? (Prior to moving Oozie to MySQL.)
I stopped the Oozie service and used the sqlline tool that ships with Phoenix (pointed at the embedded Derby driver):
su - oozie -c "/usr/hdp/current/oozie-server/bin/oozie-stop.sh"
java -cp .:/usr/hdp/2.3.2.0-2950/oozie/libserver/derby-10.10.1.1.jar:/usr/hdp/2.3.2.0-2950/phoenix/bin/../phoenix-4.4.0.2.3.2.0-2950-thin-client.jar sqlline.SqlLine -d org.apache.derby.jdbc.EmbeddedDriver -u jdbc:derby:/hadoop/oozie/data/oozie-db -n none -p none --color=true --fastConnect=false --verbose=true --isolation=TRANSACTION_READ_COMMITTED
0: jdbc:derby:/hadoop/oozie/data/oozie-db> CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE ('OOZIE','WF_ACTIONS','WF_ACTIONS.del',null,null,null);
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !outputformat vertical
0: jdbc:derby:/hadoop/oozie/data/oozie-db> !tables
0: jdbc:derby:/hadoop/oozie/data/oozie-db> SELECT STATUS, WF_ID, TYPE, NAME, EXECUTION_PATH, ERROR_MESSAGE FROM OOZIE.WF_ACTIONS;
02-07-2016
08:03 PM
I also ran into this problem and it was painful to troubleshoot. Is there a JIRA to improve the error message?
01-25-2016
08:51 PM
1 Kudo
I want to port the following SQL statement from Sybase to Hive. What is the best approach to get the statement below to work in Hive? The Hive DML UPDATE syntax doesn't mention support for JOINs or nested SELECT statements.
UPDATE table1
SET table1.column1 = table2.column1,
table1.column10 = table2.column10
FROM table1, table2
WHERE table1.columnID = table2.columnID
After enabling transactional=true on table1 the above produces an error: Error while compiling statement: FAILED: ParseException ... missing EOF at 'FROM' ...
01-22-2016
01:39 AM
2 Kudos
Below is a sample script without batching. It would be really nice if someone could figure out how to get Ambari to accept a bulk request for a Service Check, as described here.
#!/usr/bin/env bash
AMBARI_HOST=${1:-sandbox.hortonworks.com}
LOGIN=admin
PASSWORD=admin
# Tilde does not expand inside double quotes, so use $HOME when checking for an override file
if [ -e "$HOME/.ambari_login" ]; then
. ~/.ambari_login
fi
cluster_name=$(curl -s -u $LOGIN:$PASSWORD "http://$AMBARI_HOST:8080/api/v1/clusters" | python -mjson.tool | perl -ne '/"cluster_name":.*?"(.*?)"/ && print "$1\n"')
if [ -z "$cluster_name" ]; then
exit
fi
echo "Got cluster name: $cluster_name"
running_services=$(curl -s -u $LOGIN:$PASSWORD "http://$AMBARI_HOST:8080/api/v1/clusters/$cluster_name/services?fields=ServiceInfo/service_name&ServiceInfo/maintenance_state=OFF" | python -mjson.tool | perl -ne '/"service_name":.*?"(.*?)"/ && print "$1\n"')
if [ -z "$running_services" ]; then
exit
fi
echo "Got running services:
$running_services"
post_body=
for s in $running_services; do
if [ "$s" == "ZOOKEEPER" ]; then
post_body="{\"RequestInfo\":{\"context\":\"$s Service Check\",\"command\":\"${s}_QUORUM_SERVICE_CHECK\"},\"Requests/resource_filters\":[{\"service_name\":\"$s\"}]}"
else
post_body="{\"RequestInfo\":{\"context\":\"$s Service Check\",\"command\":\"${s}_SERVICE_CHECK\"},\"Requests/resource_filters\":[{\"service_name\":\"$s\"}]}"
fi
curl -s -u $LOGIN:$PASSWORD -H "X-Requested-By:X-Requested-By" -X POST --data "$post_body" "http://$AMBARI_HOST:8080/api/v1/clusters/$cluster_name/requests"
done
01-21-2016
05:37 PM
2 Kudos
I'd like Ambari to execute a service check for all installed components that are not in maintenance mode.
I couldn't find such an option in the UI so I tried the REST API. I ran the below command and got back an "Accepted" status, but when I look in Ambari UI for a list of executed background operations I only see a single service check when I expected two service checks.
curl -v -u $LOGIN:$PASSWORD -H "X-Requested-By:X-Requested-By" -X POST "http://$AMBARI_HOST:8080/api/v1/clusters/$cluster_name/requests" --data '[ {"RequestInfo":{"context":"HIVE Service Check","command":"HIVE_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"HIVE"}]}, {"RequestInfo":{"context":"MAPREDUCE2 Service Check","command":"MAPREDUCE2_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"MAPREDUCE2"}]} ]'
01-20-2016
07:58 PM
1 Kudo
I stopped a running processor in NiFi 0.4.1 to make a configuration change, but I am unable to edit or restart it. It seems stuck waiting for a thread to finish. How do I debug this situation? I've looked at nifi-app.log and searched for the processor name, but did not see anything of value. I ran ps -eLf hoping to see a thread with Solr in the name, but I didn't find that either. Here is a picture showing the lack of "start" or "edit configuration" options. What else can I do to troubleshoot this problem before restarting the NiFi process on this machine?
Labels: Apache NiFi, Apache Solr
01-20-2016
06:34 PM
5 Kudos
You have to decide how many clusters you need for the tasks below, which apply to Hadoop applications the same way as they apply to typical enterprise software:
- Test upgrade procedures for new versions of existing components
- Execute performance tests of custom-built applications
- Allow end users to perform user acceptance testing
- Execute integration tests where custom-built applications communicate with third-party software
- Experiment with new software that is beta quality and may not be ready for use at all
- Execute security penetration tests (typically done by an external company)
- Let application developers modify configuration parameters and restart services on short notice
- Maintain a mirror image of the production environment to be activated in case of natural disaster or unforeseen events
- Execute regression tests that compare the outputs of new application code with existing code running in production
I believe DEV -> QA -> PROD is a minimum, and I have seen larger organizations deploy LAB -> DEV -> QA -> PROD -> DR as separate clusters.
01-13-2016
07:11 PM
That worked!
01-13-2016
06:00 PM
Yes, I changed the identities as specified here. The realm was filled in during the first step of the Enable Kerberos Wizard. The output from the URL is very long, so I won't post it here. There is no mention of ambari-qa, and realm is a filled-in property. Is there anything specific that I should investigate?