06-02-2016
01:34 PM
Can you post a screenshot of the main Ambari screen? Make sure you are running the Spark job from the command line on the sandbox.
06-02-2016
01:25 PM
Is anything else running in the Spark cluster (like Zeppelin)? Or do you not have enough YARN resources?
06-01-2016
01:54 PM
2 Kudos
I will add the NiFi flow information here tomorrow. For a rough draft, I wanted to show what it could do, as it's pretty cool.
cat all.txt | jq --raw-output '.["text"]' | syntaxnet/demo.sh
From NiFi I collect a stream of Twitter data and send it to a file as JSON (all.txt). There are many ways to parse that, but I am a fan of jq, an awesome little command-line tool for parsing JSON that is available for Mac OS X and Linux. So from the Twitter feed I just grab the tweet text to parse with Parsey. Initially I was going to install TensorFlow SyntaxNet (Parsey McParseface) on the HDP 2.4 Sandbox, but CentOS 6 and TensorFlow do not play well together. So for now the easiest route is to install HDF on a Mac and build SyntaxNet on your Mac. The install instructions are very detailed, but the build is very particular and very machine intensive. It's best to let the build run and go do something else, with everything else shut down (no Chrome, VMs, editors, ...). After running McParseface, here are some results:
Input: RT @ Data_Tactical : Scary and fascinating : The future of big data https : //t.co/uwHoV8E49N # bigdata # datanews # datascience # datacenter https : ...
Parse:
Data_Tactical JJ ROOT
+-- RT NNP nn
+-- @ IN nn
+-- : : punct
+-- Scary JJ dep
| +-- and CC cc
| +-- fascinating JJ conj
+-- future NN dep
| +-- The DT det
| +-- of IN prep
| +-- data NNS pobj
| +-- big JJ amod
+-- https ADD dep
+-- # $ dep
| +-- //t.co/uwHoV8E49N CD num
| +-- datanews NNS dep
| | +-- bigdata NNP nn
| | +-- # $ nn
| +-- # $ dep
| +-- datacenter NN dep
| | +-- # NN nn
| | +-- datascience NN nn
| +-- https ADD dep
+-- ... . punct
INFO:tensorflow:Read 4 documents
Input: u_t=11x^2u_xx+ -LRB- 11x+2t -RRB- u_x+-1u https : //t.co/NHXcebT9XC # trading # bigdata https : //t.co/vOM8S5Ewwq
Parse:
u_t=11x^2u_xx+ LS ROOT
+-- 11x+2t LS dep
| +-- -LRB- -LRB- punct
| +-- -RRB- -RRB- punct
+-- u_x+-1u CD dep
+-- https ADD dep
+-- : : punct
+-- //t.co/vOM8S5Ewwq CD dep
Input: RT @ weloveknowles : When Beyoncé thinks the song is over but the hive has other ideas https : //t.co/0noxKaYveO
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- weloveknowles NNS pobj
+-- : : punct
+-- thinks VBZ dep
| +-- When WRB advmod
| +-- Beyoncé NNP nsubj
| +-- is VBZ ccomp
| | +-- song NN nsubj
| | | +-- the DT det
| | +-- over RB advmod
| +-- but CC cc
| +-- has VBZ conj
| +-- hive NN nsubj
| | +-- the DT det
| +-- ideas NNS dobj
| | +-- other JJ amod
| +-- https ADD advmod
+-- //t.co/0noxKaYveO ADD dep
Input: RT @ KirkDBorne : Enabling the # BigData Revolution -- An International # OpenData Roadmap : https : //t.co/e89xNNNkUe # Data4Good HT @ Devbd https : / ...
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- KirkDBorne NNP pobj
+-- : : punct
+-- Enabling VBG dep
| +-- Revolution NNP dobj
| +-- the DT det
| +-- # $ nn
| +-- BigData NNP nn
| +-- -- : punct
| +-- Roadmap NNP dep
| | +-- An DT det
| | +-- International NNP nn
| | +-- OpenData NNP nn
| | +-- # NN nn
| +-- : : punct
| +-- https ADD dep
| +-- //t.co/e89xNNNkUe LS dep
| +-- @ NN dep
| +-- Data4Good CD nn
| | +-- # $ nn
| +-- HT FW nn
| +-- Devbd NNP dep
| +-- https ADD dep
| +-- : : punct
+-- / NFP punct
+-- ... . punct
Input: RT @ DanielleAlberti : It 's like 10 , 000 bees when all you need is a hive. https : //t.co/ElGLLbykN8
Parse:
RT NNP ROOT
+-- @ IN prep
| +-- DanielleAlberti NNP pobj
+-- : : punct
+-- 's VBZ dep
| +-- It PRP nsubj
| +-- like IN prep
| | +-- 10 CD pobj
| +-- , , punct
| +-- bees NNS appos
| +-- 000 CD num
| +-- https ADD rcmod
| +-- when WRB advmod
| +-- all DT nsubj
| | +-- need VBP rcmod
| | +-- you PRP nsubj
| +-- is VBZ cop
| +-- a DT det
| +-- hive. NN nn
+-- //t.co/ElGLLbykN8 ADD dep
I am going to wire this up to NiFi to drop these results into HDFS for further data analysis in Zeppelin. The main problems are that you need very specific versions of Python (2.7), Bazel (0.2.0 - 0.2.2b), NumPy, Protobuf, asciitree, and others. Some of these don't play well with older versions of CentOS. If you are on a clean Mac or Ubuntu, things should go smoothly. My CentOS was missing a bunch of libraries, so I tried to install them:
sudo yum -y install swig
pip install -U protobuf==3.0.0b2
pip install asciitree
pip install numpy
pip install nose
wget https://github.com/bazelbuild/bazel/releases/download/0.2.2b/bazel-0.2.2b-installer-linux-x86_64.sh
sudo yum -y install libstdc++
./configure
sudo yum -y install pkg-config zip g++ zlib1g-dev unzip
cd ..
bazel test syntaxnet/... util/utf8/...
# On Mac, run the following:
bazel test --linkopt=-headerpad_max_install_names \
  syntaxnet/... util/utf8/...
cat /etc/redhat-release
CentOS release 6.7 (Final)
sudo yum -y install glibc
sudo yum -y install epel-release
sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel
pip install --upgrade virtualenv
virtualenv --system-site-packages ~/venvs/tensorflow
source ~/venvs/tensorflow/bin/activate
pip install --upgrade numpy scipy wheel cryptography  # optional
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
# or the GPU build below, but CUDA and cuDNN are required; see the docs for more install instructions
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
sudo yum -y install python-numpy swig python-dev
sudo yum -y upgrade
yum install python27
It's worth a try for the patient or for people with a newer CentOS. Your mileage may vary!
References:
http://googleresearch.blogspot.com/2016/05/announcing-syntaxnet-worlds-most.html
http://arxiv.org/abs/1603.06042
https://www.cis.upenn.edu/~treebank/
http://googleresearch.blogspot.com/2011/03/building-resources-to-syntactically.html
https://github.com/tensorflow/models/tree/master/syntaxnet
http://hoolihan.net/blog-tim/2016/03/02/installing-tensorflow-on-centos/
https://github.com/tensorflow/models/tree/master/syntaxnet#getting-started
https://dmngaya.com/2015/10/25/installing-python-2-7-on-centos-6-7/
05-31-2016
08:46 PM
Did you check the docs: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/validating-phoenix-installation.html
05-31-2016
08:21 PM
@Jitendra Yadav That worked for me in Zeppelin and the data looks good.
05-31-2016
07:50 PM
I've done so with a sqlContext.sql("create table ...") followed by a sqlContext.sql("insert into ..."), but dataframe.write.orc produces an ORC file that cannot be seen as a Hive table. What are all the ways to work with ORC from Spark?
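Roughly what I mean, as a minimal sketch (assuming Spark 1.6 with a HiveContext on the sandbox; the table names and the HDFS path are just illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("orc-example"))
val sqlContext = new HiveContext(sc)

// Approach 1: create the table in the Hive metastore, then insert into it -- Hive sees it.
sqlContext.sql("CREATE TABLE IF NOT EXISTS people_orc (name STRING, age INT) STORED AS ORC")
sqlContext.sql("INSERT INTO TABLE people_orc SELECT name, age FROM staging_people")

// Approach 2: write ORC files straight to a path -- the files land in HDFS,
// but there is no metastore entry, so Hive does not see a table.
val df = sqlContext.table("staging_people")
df.write.orc("/tmp/people_orc_files")

// One way I assume the files could be exposed afterwards (not something I have verified):
// point an external table at the directory.
sqlContext.sql("CREATE EXTERNAL TABLE people_orc_ext (name STRING, age INT) STORED AS ORC LOCATION '/tmp/people_orc_files'")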
Labels:
- Apache Hive
- Apache Spark
05-26-2016
09:35 PM
2 Kudos
Alluxio is available to install on HDP and works with your existing HDFS.
http://www.alluxio.org/documentation/en/Configuring-Alluxio-with-HDFS.html
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
To use it on HDP, you need to edit /opt/alluxio-1.0.1/conf/alluxio-env.sh after you create it and change the HDFS port from 9000 to 8020.
You can also override it at the command line:
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:8020
To set up Alluxio, first download it, then format the file system and start it locally:
[root@sandbox alluxio-1.0.1]# bin/alluxio format
Connecting to localhost as root...
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Formatting Alluxio Worker @ sandbox.hortonworks.com
Connection to localhost closed.
Formatting Alluxio Master @ localhost
[root@sandbox alluxio-1.0.1]# bin/alluxio-start.sh local
Killed 0 processes on sandbox.hortonworks.com
Killed 0 processes on sandbox.hortonworks.com
Connecting to localhost as root...
Killed 0 processes on sandbox.hortonworks.com
Connection to localhost closed.
Formatting RamFS: /mnt/ramdisk (1gb)
Starting master @ localhost. Logging to /opt/alluxio-1.0.1/logs
Starting worker @ sandbox.hortonworks.com. Logging to /opt/alluxio-1.0.1/logs
==> /opt/alluxio-1.0.1/logs/user.log <==
2016-05-25 17:22:45,622 INFO logger.type (Format.java:formatFolder) - Formatting JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/
2016-05-25 17:22:45,656 INFO logger.type (Format.java:formatFolder) - Formatting BlockMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/BlockMaster
2016-05-25 17:22:45,657 INFO logger.type (Format.java:formatFolder) - Formatting FileSystemMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/FileSystemMaster
2016-05-25 17:22:45,659 INFO logger.type (Format.java:formatFolder) - Formatting LineageMaster_JOURNAL_FOLDER:/opt/alluxio-1.0.1/journal/LineageMaster
==> /opt/alluxio-1.0.1/logs/worker.log <==
2016-05-26 20:46:19,189 INFO server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:30000
2016-05-26 20:46:19,189 INFO logger.type (UIWebServer.java:startWebServer) - Alluxio Worker Web service started @ 0.0.0.0/0.0.0.0:30000
2016-05-26 20:46:19,191 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with BlockMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:19,199 INFO logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:19,312 INFO logger.type (AlluxioWorker.java:start) - Started worker with id 1
2016-05-26 20:46:19,312 INFO logger.type (AlluxioWorker.java:start) - Alluxio Worker version 1.0.1 started @ sandbox.hortonworks.com/10.0.2.15:29998
2016-05-26 20:46:20,311 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,311 INFO logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,313 INFO logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.1) is trying to connect with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
2016-05-26 20:46:20,314 INFO logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ sandbox.hortonworks.com/10.0.2.15:19998
To validate your install, run the tests:
./bin/alluxio runTests
All tests passed
You can view the files from the command line (http://www.alluxio.org/documentation/en/Command-Line-Interface.html):
[root@sandbox alluxio-1.0.1]# ./bin/alluxio fs ls /default_tests_files
80.00B 05-26-2016 20:49:01:243 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_MUST_CACHE
84.00B 05-26-2016 20:49:02:877 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_MUST_CACHE
80.00B 05-26-2016 20:49:04:432 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_CACHE_THROUGH
84.00B 05-26-2016 20:49:08:236 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_CACHE_THROUGH
80.00B 05-26-2016 20:49:12:342 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_THROUGH
84.00B 05-26-2016 20:49:16:392 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_THROUGH
80.00B 05-26-2016 20:49:20:851 In Memory /default_tests_files/BasicFile_CACHE_PROMOTE_ASYNC_THROUGH
84.00B 05-26-2016 20:49:23:190 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_PROMOTE_ASYNC_THROUGH
80.00B 05-26-2016 20:49:25:152 In Memory /default_tests_files/BasicFile_CACHE_MUST_CACHE
84.00B 05-26-2016 20:49:26:975 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_MUST_CACHE
80.00B 05-26-2016 20:49:28:595 In Memory /default_tests_files/BasicFile_CACHE_CACHE_THROUGH
84.00B 05-26-2016 20:49:32:375 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_CACHE_THROUGH
80.00B 05-26-2016 20:49:36:505 In Memory /default_tests_files/BasicFile_CACHE_THROUGH
84.00B 05-26-2016 20:49:40:823 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_THROUGH
80.00B 05-26-2016 20:49:44:827 In Memory /default_tests_files/BasicFile_CACHE_ASYNC_THROUGH
84.00B 05-26-2016 20:49:47:248 In Memory /default_tests_files/BasicNonByteBuffer_CACHE_ASYNC_THROUGH
80.00B 05-26-2016 20:49:49:614 In Memory /default_tests_files/BasicFile_NO_CACHE_MUST_CACHE
84.00B 05-26-2016 20:49:52:384 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_MUST_CACHE
80.00B 05-26-2016 20:49:55:107 In Memory /default_tests_files/BasicFile_NO_CACHE_CACHE_THROUGH
84.00B 05-26-2016 20:49:59:675 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_CACHE_THROUGH
80.00B 05-26-2016 20:50:03:639 Not In Memory /default_tests_files/BasicFile_NO_CACHE_THROUGH
84.00B 05-26-2016 20:50:07:425 Not In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_THROUGH
80.00B 05-26-2016 20:50:11:384 In Memory /default_tests_files/BasicFile_NO_CACHE_ASYNC_THROUGH
84.00B 05-26-2016 20:50:13:310 In Memory /default_tests_files/BasicNonByteBuffer_NO_CACHE_ASYNC_THROUGH
You can access these files from Spark and Flink. Alluxio has configurable storage tiers (memory, SSD, HDD) and can sit on top of HDFS.
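A minimal sketch of reaching Alluxio from Spark, assuming the Alluxio 1.0.1 client jar is on the Spark classpath and the alluxio:// scheme is registered in Spark's Hadoop configuration (fs.alluxio.impl); the master address below is the default localhost:19998 from the logs above, and the paths are illustrative:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("alluxio-example"))

// Write an RDD through Alluxio; depending on the configured write type
// (the *_THROUGH variants) it is also persisted to HDFS underneath.
sc.parallelize(Seq("alpha", "beta", "gamma"))
  .saveAsTextFile("alluxio://localhost:19998/tmp/words")

// Read it back the same way.
val words = sc.textFile("alluxio://localhost:19998/tmp/words")
println(words.count())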
To browse the Alluxio file system and view metrics:
http://localhost:19999/home
References:
Presentation on Alluxio (formerly Tachyon): http://www.slideshare.net/TachyonNexus/tachyon-presentation-at-ampcamp-6-november-2015
Unified Namespace: http://www.alluxio.com/2016/04/unified-namespace-allowing-applications-to-access-data-anywhere/
Getting Started with Alluxio and Spark: http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
05-25-2016
08:55 PM
1 Kudo
Spark and Hadoop go together like peanut butter and jelly. Check out my slides: https://community.hortonworks.com/content/idea/28342/apache-zeppelin-with-scala-spark-introduction-to-r.html https://community.hortonworks.com/content/kbentry/34784/data-ingest-with-apache-zeppelin-apache-spark-16-h.html I have worked at a few places that used Spark and Spark Streaming to ingest data into HDFS and HBase, then Spark with Spark MLlib and H2O to run machine learning on the data, then Hive and Spark SQL for queries, with reporting through the Hive Thrift server to Tableau. Spark without Hadoop really misses out on a lot. With Spark 1.6 on HDP you get all the benefits of running YARN applications, common security, and locality of data access. I wouldn't run Spark without Hadoop unless you are running Spark standalone for development, and even there, Zeppelin + Spark 1.6 on HDP is an awesome development environment.
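As a rough sketch of that ingest-then-query pattern (Spark 1.6 APIs; the socket source, paths, and table name are illustrative assumptions, not details from any of those projects):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.hive.HiveContext

object IngestAndQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ingest-and-query"))

    // Ingest side: land each 10-second micro-batch of a text stream in HDFS.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)
       .saveAsTextFiles("hdfs:///tmp/ingest/events")
    ssc.start()
    ssc.awaitTerminationOrTimeout(60000)   // run for a minute in this sketch
    ssc.stop(stopSparkContext = false)

    // Query side: expose what landed to Spark SQL (and, via Hive, to the Thrift server).
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._
    val events = sc.textFile("hdfs:///tmp/ingest/events*").toDF("line")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) AS n FROM events").show()
  }
}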
05-25-2016
07:02 PM
Can Sqoop transfer data to HDFS as Snappy-compressed ORC? And can it then go directly into a Hive table?
Labels:
- Apache Sqoop