06-02-2016
01:34 PM
Can you post a screenshot of the main Ambari screen? Also, make sure you are running the Spark job from the command line on the sandbox.
06-02-2016
01:25 PM
Is anything else running in the Spark cluster (like Zeppelin)? Or do you not have enough YARN resources?
06-01-2016
01:54 PM
2 Kudos
I will add the NiFi flow information here tomorrow. For a rough draft, I wanted to show what it could do, as it's pretty cool.

cat all.txt | jq --raw-output '.["text"]' | syntaxnet/demo.sh

From NiFi I collect a stream of Twitter data and send that to a file as JSON (all.txt). There are many ways to parse that, but I am a fan of jq, a simple and awesome tool for parsing JSON from the command line that is available for Mac OS X and Linux. So from the Twitter feed I just grab the tweet text to parse with Parsey.

Initially I was going to install TensorFlow SyntaxNet (Parsey McParseface) on the HDP 2.4 Sandbox, but CentOS 6 and TensorFlow do not play well together. So for now the easiest route is to install HDF on a Mac and build SyntaxNet on the same Mac. The install instructions are very detailed, but the build is very particular and very machine intensive. It's best to let the build run and go off and do something else with everything else shut down (no Chrome, VM, editors, ...).

After running McParseface, here are some results:

Input: RT @ Data_Tactical : Scary and fascinating : The future of big data https : //t.co/uwHoV8E49N # bigdata # datanews # datascience # datacenter https : ...
Parse:
Data_Tactical JJ ROOT
+-- RT NNP nn
+-- @ IN nn
+-- : : punct
+-- Scary JJ dep
| +-- and CC cc
| +-- fascinating JJ conj
+-- future NN dep
| +-- The DT det
| +-- of IN prep
| +-- data NNS pobj
| +-- big JJ amod
+-- https ADD dep
+-- # $ dep
| +-- //t.co/uwHoV8E49N CD num
| +-- datanews NNS dep
| | +-- bigdata NNP nn
| | +-- # $ nn
| +-- # $ dep
| +-- datacenter NN dep
| | +-- # NN nn
| | +-- datascience NN nn
| +-- https ADD dep
+-- ... . punct
INFO:tensorflow:Read 4 documents
Input: u_t=11x^2u_xx+ -LRB- 11x+2t -RRB- u_x+-1u https : //t.co/NHXcebT9XC # trading # bigdata https : //t.co/vOM8S5Ewwq
Parse:
u_t=11x^2u_xx+ LS ROOT
+-- 11x+2t LS dep
| +-- -LRB- -LRB- punct
| +-- -RRB- -RRB- punct
+-- u_x+-1u CD dep
+-- https ADD dep
+-- : : punct
+-- //t.co/vOM8S5Ewwq CD dep
Input: RT @ weloveknowles : When Beyoncé thinks the song is over but the hive has other ideas https : //t.co/0noxKaYveO
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- weloveknowles NNS pobj
+-- : : punct
+-- thinks VBZ dep
| +-- When WRB advmod
| +-- Beyoncé NNP nsubj
| +-- is VBZ ccomp
| | +-- song NN nsubj
| | | +-- the DT det
| | +-- over RB advmod
| +-- but CC cc
| +-- has VBZ conj
| +-- hive NN nsubj
| | +-- the DT det
| +-- ideas NNS dobj
| | +-- other JJ amod
| +-- https ADD advmod
+-- //t.co/0noxKaYveO ADD dep
Input: RT @ KirkDBorne : Enabling the # BigData Revolution -- An International # OpenData Roadmap : https : //t.co/e89xNNNkUe # Data4Good HT @ Devbd https : / ...
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- KirkDBorne NNP pobj
+-- : : punct
+-- Enabling VBG dep
| +-- Revolution NNP dobj
| +-- the DT det
| +-- # $ nn
| +-- BigData NNP nn
| +-- -- : punct
| +-- Roadmap NNP dep
| | +-- An DT det
| | +-- International NNP nn
| | +-- OpenData NNP nn
| | +-- # NN nn
| +-- : : punct
| +-- https ADD dep
| +-- //t.co/e89xNNNkUe LS dep
| +-- @ NN dep
| +-- Data4Good CD nn
| | +-- # $ nn
| +-- HT FW nn
| +-- Devbd NNP dep
| +-- https ADD dep
| +-- : : punct
+-- / NFP punct
+-- ... . punct
Input: RT @ DanielleAlberti : It 's like 10 , 000 bees when all you need is a hive. https : //t.co/ElGLLbykN8
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- DanielleAlberti NNP pobj
+-- : : punct
+-- 's VBZ dep
| +-- It PRP nsubj
| +-- like IN prep
| | +-- 10 CD pobj
| +-- , , punct
| +-- bees NNS appos
| +-- 000 CD num
| +-- https ADD rcmod
| +-- when WRB advmod
| +-- all DT nsubj
| | +-- need VBP rcmod
| | +-- you PRP nsubj
| +-- is VBZ cop
| +-- a DT det
| +-- hive. NN nn
+-- //t.co/ElGLLbykN8 ADD dep
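For anyone who wants to load these parses into a table for later analysis, here is a minimal Python sketch that turns each tree line into a (token, POS tag, dependency label) record. It assumes the output looks like the results above: three whitespace-separated fields per node, optionally preceded by "|" and "+--" tree markers. The names ParsedToken, parse_tree_line and parse_output are made up for this example and are not part of SyntaxNet, and the sketch does not try to recover how deeply each node is nested.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedToken(NamedTuple):
    token: str      # the word or symbol from the tweet
    pos_tag: str    # part-of-speech tag, e.g. NNP, VBZ
    relation: str   # dependency label, e.g. nsubj, ROOT


# Optional tree-drawing prefix ("|" and spaces, then "+--"), followed by
# exactly three whitespace-separated fields.
_TREE_LINE = re.compile(r"^[|\s]*(?:\+--\s+)?(\S+)\s+(\S+)\s+(\S+)\s*$")

# Lines in the demo output that are not tree nodes.
_SKIP_PREFIXES = ("Input:", "Parse:", "INFO:")


def parse_tree_line(line: str) -> Optional[ParsedToken]:
    """Return the node on this line, or None if the line is not a tree node."""
    stripped = line.strip()
    if not stripped or stripped.startswith(_SKIP_PREFIXES):
        return None
    match = _TREE_LINE.match(line)
    return ParsedToken(*match.groups()) if match else None


def parse_output(text: str) -> List[ParsedToken]:
    """Collect every tree node found in a block of demo.sh output."""
    nodes = (parse_tree_line(line) for line in text.splitlines())
    return [node for node in nodes if node is not None]


if __name__ == "__main__":
    sample = "Parse:\nRT NNP ROOT\n+-- @ IN prep\n| +-- weloveknowles NNS pobj\n"
    for node in parse_output(sample):
        print(node.token, node.pos_tag, node.relation)
```

The resulting records can be written out as CSV or JSON, which is easier to query later than the ASCII tree.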
I am going to wire this up to NiFi to drop these results into HDFS for further data analysis in Zeppelin.

The main problem is that you need very specific versions of Python (2.7), Bazel (0.2.0 - 0.2.2b), Numpy, Protobuf, ASCIITree and others. Some of these don't play well with older versions of CentOS. If you are on a clean Mac or Ubuntu, things should go smoothly. My CentOS was missing a bunch of libraries, so I tried to install them:

sudo yum -y install swig
pip install -U protobuf==3.0.0b2
pip install asciitree
pip install numpy
pip install nose
wget https://github.com/bazelbuild/bazel/releases/download/0.2.2b/bazel-0.2.2b-installer-linux-x86_64.sh
sudo yum -y install libstdc++
./configure
sudo yum -y install pkg-config zip g++ zlib1g-dev unzip
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
syntaxnet/... util/utf8/...
cat /etc/redhat-release
CentOS release 6.7 (Final)
sudo yum -y install glibc
sudo yum -y install epel_release
sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel
pip install --upgrade virtualenv
virtualenv --system-site-packages ~/venvs/tensorflow
source ~/venvs/tensorflow/bin/activate
pip install --upgrade numpy scipy wheel cryptography #optional
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
# or below if you want GPU support, but CUDA and cuDNN are required; see the docs for more install instructions
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
sudo yum -y install python-numpy swig python-dev
sudo yum -y upgrade
yum install python27

It's worth a try for the patient or for people with a newer CentOS. Your mileage may vary!

References:
http://googleresearch.blogspot.com/2016/05/announcing-syntaxnet-worlds-most.html
http://arxiv.org/abs/1603.06042
https://www.cis.upenn.edu/~treebank/
http://googleresearch.blogspot.com/2011/03/building-resources-to-syntactically.html
https://github.com/tensorflow/models/tree/master/syntaxnet
https://github.com/tensorflow/models/tree/master/syntaxnet#getting-started
http://hoolihan.net/blog-tim/2016/03/02/installing-tensorflow-on-centos/
https://dmngaya.com/2015/10/25/installing-python-2-7-on-centos-6-7/
05-31-2016
08:46 PM
Did you check the docs? https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/validating-phoenix-installation.html
05-31-2016
08:21 PM
@Jitendra Yadav That worked for me in Zeppelin and the data looks good.
05-31-2016
07:50 PM
I've done so with a sqlContext.sql("create table...") followed by a sqlContext.sql("insert into"), but dataframe.write.orc produces an ORC file that Hive cannot see as a table. What are all the ways to work with ORC from Spark?
Labels: Apache Hive, Apache Spark
05-26-2016
09:35 PM
2 Kudos
Alluxio is available to install on HDP and works with your existing HDFS.
http://www.alluxio.org/documentation/en/Configuring-Alluxio-with-HDFS.html
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
To use it on HDP, edit /opt/alluxio-1.0.1/conf/alluxio-env.sh after you create it and change the HDFS port from 9000 to 8020.
You can also override it at the command line:
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:8020
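If you would rather script the port change than edit the file by hand, here is a rough Python sketch. It assumes alluxio-env.sh sets ALLUXIO_UNDERFS_ADDRESS on a single line with an hdfs://host:port value; the exact layout of that file in your install is an assumption, and set_underfs_port is a made-up helper name, not an Alluxio tool.

```python
import re

# Matches the ALLUXIO_UNDERFS_ADDRESS assignment up to and including the
# colon before the port, so only the port digits get replaced.
_UNDERFS_PORT = re.compile(r"(ALLUXIO_UNDERFS_ADDRESS=\S*?hdfs://[^:/\s]+:)\d+")


def set_underfs_port(env_text: str, port: int = 8020) -> str:
    """Return env_text with the HDFS port in ALLUXIO_UNDERFS_ADDRESS replaced."""
    return _UNDERFS_PORT.sub(lambda m: m.group(1) + str(port), env_text)


if __name__ == "__main__":
    before = "export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000\n"
    print(set_underfs_port(before), end="")
```

Read the file, pass its contents through the function, and write the result back; lines that do not set ALLUXIO_UNDERFS_ADDRESS are left unchanged.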
To set up Alluxio, first download it. Then format it and start it locally:
bin/alluxio format
[root@sandbox alluxio-1.0.1]# bin/alluxio format
Connecting to localhost as root...
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Formatting Alluxio Worker @ sandbox.hortonworks.com
Connection to localhost closed.
Formatting Alluxio Master @ localhost
[root@sandbox alluxio-1.0.1]# bin/alluxio-start.sh local
Killed 0 processes on sandbox.hortonworks.com
Killed 0 processes on sandbox.hortonworks.com
Connecting to localhost as root...
Killed 0 processes on sandbox.hortonworks.com
Connection to localhost closed.
Formatting RamFS: /mnt/ramdisk (1gb)
Starting master @ localhost. Logging to /opt/alluxio-1.0.1/logs
Starting worker @ sandbox.hortonworks.com. Logging to /opt/alluxio-1.0.1/logs
==> /opt/alluxio-1.0.1/logs/user.log <==
2016-05-25 17:22:45,622 INFO logger.type (Format.java:formatFolder) - Formatting JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/
2016-05-25 17:22:45,656 INFO logger.type (Format.java:formatFolder) - Formatting BlockMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/BlockMaster
2016-05-25 17:22:45,657 INFO logger.type (Format.java:formatFolder) - Formatting FileSystemMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/FileSystemMaster
2016-05-25 17:22:45,659 INFO logger.type (Format.java:formatFolder) - Formatting LineageMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/LineageMaster
==> /opt/alluxio-1.0.1/logs/worker.log <==
2016-05-26 20:46:19,189 INFO server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:30000
2016-05-26 20:46:19,189 INFO logger.type (UIWebServer.java:startWebServer) - Alluxio Worker Web service started @ 0.0.0.0/0.0.0.0:30000
2016-05-26 20:46:19,191 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with BlockMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:19,199 INFO logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:19,312 INFO logger.type (AlluxioWorker.java:start) - Started worker with id 1
2016-05-26 20:46:19,312 INFO logger.type (AlluxioWorker.java:start) - Alluxio Worker version 1.0.1 started @ sandbox.hortonworks.com/10.0.2.15:29998
2016-05-26 20:46:20,311 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,311 INFO logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,313 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,314 INFO logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
To validate your install, run the tests:
./bin/alluxio runTests
All tests passed
You can view the test files from the command line ( http://www.alluxio.org/documentation/en/Command-Line-Interface.html ):
[root@sandbox alluxio-1.0.1]# ./bin/alluxio fs ls /default_tests_files
80.00B 05-26-2016 20:49:01:243 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_MUST_CACHE
84.00B 05-26-2016 20:49:02:877 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_MUST_CACHE
80.00B 05-26-2016 20:49:04:432 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_CACHE_THROUGH
84.00B 05-26-2016 20:49:08:236 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_CACHE_THROUGH
80.00B 05-26-2016 20:49:12:342 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_THROUGH
84.00B 05-26-2016 20:49:16:392 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_THROUGH
80.00B 05-26-2016 20:49:20:851 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_ASYNC_THROUGH
84.00B 05-26-2016 20:49:23:190 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_ASYNC_THROUGH
80.00B 05-26-2016 20:49:25:152 In Memory /default_tests_files/BasicFile_CACHE_MUST_CACHE
84.00B 05-26-2016 20:49:26:975 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_MUST_CACHE
80.00B 05-26-2016 20:49:28:595 In Memory /default_tests_files/BasicFile_CACHE_CACHE_THROUGH
84.00B 05-26-2016 20:49:32:375 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_CACHE_THROUGH
80.00B 05-26-2016 20:49:36:505 In Memory /default_tests_files/BasicFile_CACHE_THROUGH
84.00B 05-26-2016 20:49:40:823 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_THROUGH
80.00B 05-26-2016 20:49:44:827 In Memory /default_tests_files/BasicFile_CACHE_ASYNC_THROUGH
84.00B 05-26-2016 20:49:47:248 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_ASYNC_THROUGH
80.00B 05-26-2016 20:49:49:614 In Memory /default_tests_files/BasicFile_NO_CACHE_MUST_CACHE
84.00B 05-26-2016 20:49:52:384 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_MUST_CACHE
80.00B 05-26-2016 20:49:55:107 In Memory /default_tests_files/BasicFile_NO_CACHE_CACHE_THROUGH
84.00B 05-26-2016 20:49:59:675 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_CACHE_THROUGH
80.00B 05-26-2016 20:50:03:639 Not In Memory /default_tests_files/BasicFile_NO_CACHE_THROUGH
84.00B 05-26-2016 20:50:07:425 Not In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_THROUGH
80.00B 05-26-2016 20:50:11:384 In Memory /default_tests_files/BasicFile_NO_CACHE_ASYNC_THROUGH
84.00B 05-26-2016 20:50:13:310 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_ASYNC_THROUGH
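If you want to check a listing like the one above programmatically, for example to see which files ended up in memory, here is a small Python sketch. It assumes each entry has the fields shown above (size, date, time, "In Memory" or "Not In Memory", path) and tolerates a line break between the status and the path. Entry and parse_listing are names invented for this example, not part of the Alluxio client.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    size: str        # e.g. "80.00B", kept as printed
    date: str        # e.g. "05-26-2016"
    time: str        # e.g. "20:49:01:243"
    in_memory: bool  # True for "In Memory", False for "Not In Memory"
    path: str


# One listing entry; \s+ between fields also matches a newline, so a path
# wrapped onto the next line is still picked up.
_ENTRY = re.compile(
    r"(\d+(?:\.\d+)?[A-Z]+)\s+"
    r"(\d{2}-\d{2}-\d{4})\s+"
    r"(\d{2}:\d{2}:\d{2}:\d{3})\s+"
    r"(Not In Memory|In Memory)\s+"
    r"(/\S+)"
)


def parse_listing(text: str) -> List[Entry]:
    """Extract every entry from the text output of 'alluxio fs ls'."""
    return [
        Entry(m.group(1), m.group(2), m.group(3), m.group(4) == "In Memory", m.group(5))
        for m in _ENTRY.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "80.00B 05-26-2016 20:49:55:107 In Memory /default_tests_files/BasicFile_NO_CACHE_CACHE_THROUGH\n"
        "80.00B 05-26-2016 20:50:03:639 Not In Memory /default_tests_files/BasicFile_NO_CACHE_THROUGH\n"
    )
    for entry in parse_listing(sample):
        print(entry.path, "in memory" if entry.in_memory else "not in memory")
```

In the listing above, only the two NO_CACHE_THROUGH files show up as "Not In Memory".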
You can access these files from Spark and Flink. Alluxio has configurable storage tiers (memory, HDD, SSD) and can sit on top of HDFS.
To browse the Alluxio file system and view metrics, go to:
http://localhost:19999/home
References:
Presentation on Alluxio (formerly Tachyon)
http://www.slideshare.net/TachyonNexus/tachyon-presentation-at-ampcamp-6-november-2015
Unified Name Space
http://www.alluxio.com/2016/04/unified-namespace-allowing-applications-to-access-data-anywhere/
Getting Started with Alluxio and Spark
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
05-25-2016
08:55 PM
1 Kudo
Spark and Hadoop go together like peanut butter and jelly. Check out my slides:
https://community.hortonworks.com/content/idea/28342/apache-zeppelin-with-scala-spark-introduction-to-r.html
https://community.hortonworks.com/content/kbentry/34784/data-ingest-with-apache-zeppelin-apache-spark-16-h.html
I worked at a few places that used Spark and Spark Streaming to ingest data into HDFS and HBase, then Spark with Spark MLlib and H2O to run machine learning on the data, then Hive and Spark SQL for queries, with reporting through the Hive Thrift server to Tableau.
Spark without Hadoop is really missing out on a lot. With Spark 1.6 on HDP you get all the benefits of running YARN applications, common security and locality of data access. I wouldn't run Spark without Hadoop unless you are running Spark standalone for development. Even there, Zeppelin + Spark 1.6 on HDP is an awesome development environment.
05-25-2016
07:02 PM
Can Sqoop transfer data to HDFS as Snappy-compressed ORC? And can the data then go directly into a Hive table?
Labels: Apache Sqoop