1973 Posts
1225 Kudos Received
124 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2458 | 04-03-2024 06:39 AM |
|  | 3807 | 01-12-2024 08:19 AM |
|  | 2053 | 12-07-2023 01:49 PM |
|  | 3037 | 08-02-2023 07:30 AM |
|  | 4156 | 03-29-2023 01:22 PM |
06-17-2016
12:42 PM
Lipstick Installation

Resources:
http://www.graphviz.org/Download_linux_rhel.php
https://github.com/Netflix/Lipstick/wiki/Getting-Started

Commands:
sudo yum list available 'graphviz*'
sudo yum -y install 'graphviz*'
./gradlew assemble

I always like to rename gradlew to avengers; then:

./gradlew run-app

Hit your browser to view: http://localhost:9292/ (make sure you add that port / open the firewall, etc.).

2016-06-17 02:36:44,558 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion   PigVersion   UserId   StartedAt   FinishedAt   Features
2.4.0   root   2016-06-17 02:36:40   2016-06-17 02:36:44   HASH_JOIN,FILTER,LIMIT
Success!
Job Stats (time in seconds):
JobId   Maps   Reduces   MaxMapTime   MinMapTIme   AvgMapTime   MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime   Alias   Feature   Outputs
job_local2036219587_0001   2   1   n/a   n/a   n/a   n/a   n/a   n/a   n/a   n/a   fruit_names_join,fruits,limited,names   HASH_JOIN
job_local406327028_0002   1   1   n/a   n/a   n/a   n/a   n/a   n/a   n/a   n/a   fruit_names   file:/tmp/temp195796189/tmp-2027262369,
Input(s):
Successfully read 3 records from: "file:///opt/demo/certification/pig/Lipstick/quickstart/1.dat"
Successfully read 3 records from: "file:///opt/demo/certification/pig/Lipstick/quickstart/2.dat"
Output(s):
Successfully stored 1 records in: "file:/tmp/temp195796189/tmp-2027262369"
Counters:
Total records written : 1
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local2036219587_0001->job_local406327028_0002,
job_local406327028_0002
2016-06-17 02:36:44,568 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2016-06-17 02:36:44,571 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-06-17 02:36:44,582 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-06-17 02:36:44,583 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(orange,ORANGE)

It's a very nice looking visualization.
06-16-2016
11:41 PM
Sometimes it's easy to share files: https://forums.virtualbox.org/viewtopic.php?t=15679

You just pick a directory, set Auto-mount to "Yes" and Access to "Full", and hit OK. For some of us, depending on the versions of VirtualBox, the VM, and the host operating system, things might not work. Sharing can also break when the host operating system or the VM is updated. Log into the VM as root and try this (if it's the HDP Sandbox or another VM running CentOS):

cd /opt/VBoxGuestAdditions-*/init
sudo ./vboxadd setup
modprobe -a vboxguest vboxsf vboxvideo
rm -rf /media/sf_Downloads
mkdir /media/sf_Downloads
mount -t vboxsf Downloads /media/sf_Downloads
For me that worked: my Downloads directory was shared, so I could move files onto and off of my Sandbox for development. There are some other things you can try, and certainly rebooting everything helps.
06-16-2016
02:38 PM
6 Kudos
Most code for current big data projects, and most of the code you are going to write, is going to be JVM based (mostly Java and Scala). There is certainly a ton of R, Python, shell, and other languages, but for this tutorial we will focus on JVM tools. The great thing about that is that Java and Scala static code analysis tools will work for analyzing your code. JUnit tests are great for testing the basic code and for making sure you isolate your functionality from Hadoop- and Spark-specific interfacing.
General Java Tools for Testing
http://junit.org/
http://checkstyle.sourceforge.net/
http://pmd.github.io/pmd-5.4.2/pmd-java/rules/index.html

Testing Hadoop (A Great Overview)
https://github.com/mfjohnson/HadoopTesting
https://www.infoq.com/articles/HadoopMRUnit
Example: I have a Hive UDF written in Java that I can test via JUnit to ensure that the main functionality works (see UtilTest):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Assumes this test class sits in the same package as the Util class under test.
public class UtilTest {

  /**
   * Test method for
   * {@link com.dataflowdeveloper.deprofaner.ProfanityRemover#fillWithCharacter(int, java.lang.String)}.
   */
  @Test
  public void testFillWithCharacterIntString() {
    assertEquals("XXXXX", Util.fillWithCharacter(5, "X"));
  }
}
As you can see this is just a plain old JUnit test, but it's one step in the process of making sure you can test your code before it is deployed. Jenkins and other CI tools are also great at running JUnit tests as part of their continuous build and integration process. A great way to test your application is with a small Hadoop cluster, or a simulated one. Testing against a Sandbox downloaded onto your laptop is a great way as well.

Testing Integration with a Mini-Cluster
https://github.com/hortonworks/mini-dev-cluster
https://github.com/sakserv/hadoop-mini-clusters

Testing HBase Applications
Artem Ervits has a great article on HBase unit testing.
https://community.hortonworks.com/repos/15674/variety-of-hbase-unit-testing-utilities.html
https://github.com/dbist/HBaseUnitTest

Testing Apache NiFi Processors
http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2.0.1/bk_DeveloperGuide/content/instantiate-testrunner.html
http://www.nifi.rocks/developing-a-custom-apache-nifi-processor-unit-tests-partI/

Testing Apache NiFi Scripts
https://github.com/mattyb149/nifi-script-tester
http://funnifi.blogspot.com/2016/06/testing-executescript-processor-scripts.html

Testing Oozie
https://oozie.apache.org/docs/4.2.0/ENG_MiniOozie.html

Testing Hive Scripts
https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL
http://hakunamapdata.com/beetest-a-simple-utility-for-testing-apache-hive-scripts-locally-for-non-java-developers/
https://github.com/klarna/HiveRunner
https://github.com/edwardcapriolo/hive_test
http://finraos.github.io/HiveQLUnit/

Testing Hive UDFs
http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
https://cwiki.apache.org/confluence/display/Hive/PluginDeveloperKit
Use org.apache.hive.pdk.HivePdkUnitTest and org.apache.hive.pdk.HivePdkUnitTests in your Hive plugin so that it will be included in unit tests, as in the sketch below.
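Roughly, that means annotating the UDF class itself. This is only a sketch modeled on the PluginDeveloperKit wiki example; the function, table names, and queries are hypothetical, and it assumes the PDK's query/result and setup/cleanup annotation attributes:

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.apache.hive.pdk.HivePdkUnitTest;
import org.apache.hive.pdk.HivePdkUnitTests;

// Hypothetical example UDF; the @HivePdkUnitTests annotation is what wires it into the PDK unit tests.
@Description(name = "to_upper",
    value = "_FUNC_(str) - returns str upper-cased")
@HivePdkUnitTests(
    setup = "create table onerow (s string); insert into table onerow select 'x' from src limit 1;",
    cleanup = "drop table if exists onerow;",
    cases = {
        @HivePdkUnitTest(
            query = "SELECT to_upper('hive') FROM onerow;",
            result = "HIVE")
    })
public final class ToUpper extends UDF {
  public Text evaluate(Text input) {
    return input == null ? null : new Text(input.toString().toUpperCase());
  }
}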
Testing Pig Scripts
http://pig.apache.org/docs/r0.8.1/pigunit.html
http://www.slideshare.net/Skillspeed/hdfs-and-big-data-tdd-using-pig-unit-webinar
http://www.slideshare.net/SwissHUG/practical-pig-and-pig-unit-michael-noll-july-2012
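Following the PigUnit docs linked above, a Pig script can be asserted on directly from JUnit. This is only a sketch: the script name, parameters, alias, and expected tuples are hypothetical placeholders for your own script.

import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

// Sketch of a PigUnit test; top_queries.pig and its parameters are made up for illustration.
public class TopQueriesPigTest {

  @Test
  public void testTopQueriesScript() throws Exception {
    String[] params = { "input=top_queries_input_data.txt", "n=3" };
    PigTest test = new PigTest("top_queries.pig", params);

    // assertOutput checks the tuples produced for the given alias in the script
    String[] expected = { "(yahoo,25)", "(facebook,15)", "(twitter,7)" };
    test.assertOutput("queries_limit", expected);
  }
}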
Testing Apache Spark Applications (see the local-mode JUnit sketch after these lists)
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
https://github.com/holdenk/spark-testing-base
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://developer.ibm.com/hadoop/2016/03/07/testing-your-apache-spark-code-with-junit-4-0-and-intellij/
http://www.slideshare.net/knoldus/unit-testing-of-spark-applications

Testing Apache Storm Applications
Debugging an Apache Storm Topology
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
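For Spark specifically, the simplest unit test is often just JUnit plus a local-mode SparkContext, no cluster needed. A minimal sketch, assuming the Spark 1.x Java API; the class name, test name, and filter logic are only illustrative, not from any of the projects linked above:

import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Sketch of a local-mode Spark test; "local[2]" runs Spark inside the JUnit JVM.
public class WordFilterTest {

  private transient JavaSparkContext sc;

  @Before
  public void setUp() {
    sc = new JavaSparkContext("local[2]", "WordFilterTest");
  }

  @After
  public void tearDown() {
    sc.stop();
  }

  @Test
  public void filtersOutShortWords() {
    JavaRDD<String> words = sc.parallelize(Arrays.asList("hadoop", "pig", "spark", "io"));
    // The logic under test would normally live in its own class so the real job can reuse it.
    JavaRDD<String> longWords = words.filter(w -> w.length() > 3);
    assertEquals(2, longWords.count());
  }
}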
06-15-2016
08:33 PM
The Brickhouse Collection of UDFs from Klout includes functions for collapsing multiple rows into one, generating top-K lists, a distributed cache, bloom counters, JSON functions, and HBase tools.
Facebook UDF Collection (HIVE-1545), including functions for unescape, finding an element in an array, and finding the max in a set of columns.
UDF collection for various string distances, text classification, and other text mining.
UDF for anonymizing data with Apache Pig.
Hive UDF for various functions like array count.
Curve Computing UDF.
Ngram Functions UDF.
Hive UDFs similar to Oracle functions.
A collection of UDFs for GeocodeIP, Haversine distance, and DecodeURL.
Hive Funnel Analysis UDF by Yahoo (tracking user conversion rates across actions).
Hive UDF collection by LivingSocial for min and max date, MySQL-style LIKE, and more.
Hive UDF with Yahoo Data Sketches, for stochastic streaming algorithms.
Hive UDF to count business days.
User Agent String Parser Hive UDF.
Date Range Generator Hive UDF.
06-10-2016
11:36 PM
Could you provide the GitHub link or upload the template XML before we publish this? It would also be good to show what the tweet looks like before and after processing.
06-14-2016
01:50 AM
Hey Timothy
Great article, and I wanted to thank you for putting it together. Currently I am trying to create a corpus that I will later use to train an RNN article summarizer. I didn't have access to something like Gigaword, so I wrote an article scraper in JavaScript, and now I want to POS tag the title and body using Parsey McParseface. I have gotten to the point where I can pass in a single input file via the params passed to parser_eval, but my JS scraper currently outputs a JSON object in a .json file for each article, which contains the title, body, and some other info. What I want to do is see if there is a way to pass a folder to the params (such as the input field) and have it iterate over all the files in the folder, use Parsey McParseface to POS tag the title and body, and then output that as XML. I have pasted the main entry point below. I can't figure out how to modify the "documents". I figured I would post to see if you have any recommendations on how to go about passing in the data from each of these files. I have been trying to find where in the pipeline I am able to inject / modify the sentences coming in, but have not had success yet. If you have any tips or recommendations on how I might be able to accomplish this, send them my way. Otherwise, thanks again for the article! Time to jump back into the API docs 🙂

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  path_to_json = "%s/tf_files/dataset_raw/nprarticles" % expanduser("~")
  json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
  # we need both the json and an index number so use enumerate()
  for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
      json_text = json.load(json_file)
      title = json_text['title']
      body = json_text['body']
  with tf.Session() as sess:
    src = gen_parser_ops.document_source(batch_size=32,
                                         corpus_name=FLAGS.corpus_name,
                                         task_context=FLAGS.task_context)
    sentence = sentence_pb2.Sentence()
    l_root = ET.Element("root")
    l_headline = ET.SubElement(l_root, "headline").text = "title"
    l_text = ""
    l_text2 = ET.SubElement(l_root, "text")  # sentence.text
    l_sentences = ET.SubElement(l_root, "sentences")
    l_numSentences = 0
    while True:
      documents, finished = sess.run(src)
      # logging.info('Read %d documents', len(documents))
      for d in documents:
        sentence.ParseFromString(d)
        l_sentence = ET.SubElement(l_sentences, "sentence", id="%s" % l_numSentences)
        l_tokens = ET.SubElement(l_sentence, "tokens")
        l_text = "%s %s" % (l_text, sentence.text)
        # print 'Formatting XML'
        formatXml(sentence, l_tokens)
        l_numSentences += 1
06-30-2016
03:00 PM
1 Kudo
There's currently no native integration with Ranger. Nevertheless, Alluxio provides a REST API. You could set up Alluxio as a Knox service and use the Ranger Knox plugin. Or, even simpler, if you use HDFS as your UnderFS, the HDFS plugin could do the work (I haven't tried it yet). It seems like an interesting feature to me. You could file a JIRA for it in Alluxio or Ranger.
05-25-2016
04:30 PM
1 Kudo
Twitter has open sourced another real-time, distributed, fault-tolerant stream processing engine called Heron. They see it as the successor to Storm, and it is backwards compatible with Storm's topology API (a minimal topology sketch is at the end of this post). First I followed the getting started guide, downloading and installing on Mac OS X.

➜ Downloads ./heron-client-install-0.14.0-darwin.sh --user
Heron client installer
----------------------
Uncompressing......
Heron is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
➜ Downloads export PATH=$PATH::/usr/local/bin
➜ Downloads ./heron-tools-install-0.14.0-darwin.sh --user
Heron tools installer
---------------------
Uncompressing......
Heron Tools is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
http://twitter.github.io/heron/docs/getting-started/

Run the example to make sure everything is installed:

heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology
[2016-05-25 16:16:32 -0400] com.twitter.heron.scheduler.local.LocalLauncher INFO: For checking the status and logs of the topology, use the working directory /Users/tspann/.herondata/topologies/local/tspann/ExclamationTopology
INFO: Topology 'ExclamationTopology' launched successfully
INFO: Elapsed time: 4.722s.
heron activate local ExclamationTopology
[2016-05-25 16:19:38 -0400] com.twitter.heron.spi.utils.TMasterUtils SEVERE: Topology is already activateed
INFO: Successfully activated topology 'ExclamationTopology'
INFO: Elapsed time: 2.739s.
Run the UI:

sudo heron-ui
25 May 2016 16:20:31-INFO:main.py:101: Listening at http://192.168.1.5:8889
25 May 2016 16:20:31-INFO:main.py:102: Using tracker url: http://localhost:8888
To avoid stepping on HDP ports, I changed the port:

sudo heron-tracker --port 8881
25 May 2016 16:24:14-INFO:main.py:183: Running on port: 8881
25 May 2016 16:24:14-INFO:main.py:184: Using config file: /usr/local/herontools/conf/heron_tracker.yaml
Look at the Heron tracker website: http://localhost:8881/topologies

{"status": "success", "executiontime": 4.291534423828125e-05, "message": "", "version": "1.0.0", "result": {}}

Let's run the UI:

sudo heron-ui --port 8882 --tracker_url http://localhost:8881
25 May 2016 16:28:53-INFO:main.py:101: Listening at http://192.168.1.5:8882
25 May 2016 16:28:53-INFO:main.py:102: Using tracker url: http://localhost:8881
Look at the Heron Cluster http://localhost:8881/clusters
{"status": "success", "executiontime": 1.9073486328125e-05, "message": "",
"version": "1.0.0", "result": ["localzk", "local"]} Using Heron CLI heron
usage: heron <command> <options> ...
Available commands:
activate Activate a topology
deactivate Deactivate a topology
help Prints help for commands
kill Kill a topology
restart Restart a topology
submit Submit a topology
version Print version of heron-cli
Getting more help:
heron help <command> Prints help and options for <command>
For detailed documentation, go to http://heronstreaming.io
If you need to restart a topology:

heron restart local ExclamationTopology
INFO: Successfully restarted topology 'ExclamationTopology'
INFO: Elapsed time: 3.928s.

Look at my topology: http://localhost:8881/topologies#/all/all/ExclamationTopology
{
"status": "success", "executiontime": 7.104873657226562e-05, "message": "",
"version": "1.0.0",
"result": {"local": {"default": ["ExclamationTopology"]}}
}

Adding --verbose will add a ton of debug logs. Attached are some screen shots. The Heron UI is decent. I am hoping Heron screens will be integrated into Ambari.
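Since Heron is backwards compatible with Storm's topology API, the topology code itself stays plain Storm-style Java. Here is a rough sketch, written against the pre-1.0 backtype.storm package names; the class, bolt, and spout choices are illustrative (they are not taken from the bundled heron-examples jar, and whether the compatibility jar ships TestWordSpout is an assumption):

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class MyExclamationTopology {

  // Appends "!!!" to every word it receives.
  public static class ExclamationBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
      collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
      collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word", new TestWordSpout(), 2);
    builder.setBolt("exclaim", new ExclamationBolt(), 2).shuffleGrouping("word");

    Config conf = new Config();
    conf.setNumWorkers(2);
    // Packaged into a jar, this would be submitted the same way as the example above:
    // heron submit local my-topology.jar MyExclamationTopology ExclamationTopology
    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
  }
}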
05-22-2016
08:29 PM
1 Kudo
Create a Hive table as an ORC file through Spark SQL in Zeppelin:

%sql
create table default.logs_orc_table (clientIp STRING, clientIdentity STRING, user STRING, dateTime STRING, request STRING, statusCode INT, bytesSent FLOAT, referer STRING, userAgent STRING) stored as orc

Load data from a DataFrame into this table:

%sql
insert into table default.logs_orc_table select t.* from accessLogsDF t
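For that insert to work, accessLogsDF has to already be registered as a temporary table in the same Spark SQL context. In Zeppelin that would normally be done in a Spark paragraph; the Java sketch below just shows the mechanism, assuming Spark 1.6-era APIs and a hypothetical JSON copy of the access logs (the real DataFrame was presumably built differently):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

// Illustrative only: shows how a DataFrame becomes visible to %sql as "accessLogsDF".
public class RegisterAccessLogs {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("RegisterAccessLogs");
    JavaSparkContext sc = new JavaSparkContext(conf);
    HiveContext sqlContext = new HiveContext(sc.sc());

    // Hypothetical source file; the real DataFrame was parsed from raw access logs.
    DataFrame accessLogsDF = sqlContext.read().json("/demo/access_logs.json");

    // This name is what the %sql paragraph refers to in its FROM clause.
    accessLogsDF.registerTempTable("accessLogsDF");

    // Same statement as in the %sql paragraph above.
    sqlContext.sql("insert into table default.logs_orc_table select t.* from accessLogsDF t");
  }
}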
I can create a table in the Hive View from Ambari:

CREATE TABLE IF NOT EXISTS survey
( firstName STRING, lastName STRING, gender STRING,
phone STRING, email STRING,
address STRING,
city STRING,
postalcode STRING,
surveyanswer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Then it's really easy to load some data from a CSV file:

LOAD DATA INPATH '/demo/survey1.csv' OVERWRITE INTO TABLE survey;

I can create an ORC-based table in Hive from the Hive View in Ambari, or from the Spark / Spark SQL or Hive areas in Zeppelin:

create table survey_orc(
firstName varchar(255),
lastName varchar(255),
gender varchar(255),
phone varchar(255),
email varchar(255),
address varchar(255),
city varchar(255),
postalcode varchar(255),
surveyanswer varchar(255)
) stored as orc tblproperties
("orc.compress"="NONE");
I can do the same insert from Hive:

%hive
insert into table default.survey_orc select t.* from survey t
I can query Hive tables from Spark SQL or Hive easily.