06-03-2016
09:39 PM
5 Kudos
@Timothy Spann You can also use Pig's OrcStorage to write the data as ORC, then create an external Hive table and point it at the ORC directory. https://pig.apache.org/docs/r0.15.0/func.html#OrcStorage

Data types: most ORC data types have a one-to-one mapping to Pig data types. The exceptions are:

Loader side:
Orc STRING/CHAR/VARCHAR all map to Pig chararray
Orc BYTE/BINARY all map to Pig bytearray
Orc TIMESTAMP/DATE all map to Pig datetime
Orc DECIMAL maps to Pig bigdecimal

Storer side:
Pig chararray maps to Orc STRING
Pig datetime maps to Orc TIMESTAMP
Pig bigdecimal/biginteger all map to Orc DECIMAL
Pig bytearray maps to Orc BINARY
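For reference, a minimal sketch of the external-table step; the HDFS path and column list are placeholders and should match whatever your Pig script actually wrote:

-- Hypothetical example: expose ORC files written by Pig OrcStorage to Hive.
-- Adjust the columns and LOCATION to match your Pig schema and output path.
CREATE EXTERNAL TABLE logs_orc_ext (
  clientip STRING,
  request  STRING,
  bytes    BIGINT
)
STORED AS ORC
LOCATION '/user/demo/logs_orc';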
07-01-2016
02:54 AM
I know this is a silly question, but where can you find the location? I am not able to trace it back. @srai
06-12-2016
01:59 AM
1 Kudo
Updated tutorial:
1) using centos-release-scl
2) wget https://bootstrap.pypa.io/ez_setup.py
Thanks!
06-14-2016
01:50 AM
Hey Timothy
Great article and I wanted to thank you for putting it together. Currently I am trying to create a corpus that I will later use to train an RNN article summarizer. I didn't have access to something like Gigaword, so I wrote an article scraper in JavaScript, and now I want to POS tag the title and body using Parsey McParseface. I have gotten to the point where I can pass in a single input file via the params passed to parser_eval, but my JS scraper currently outputs a JSON object in a .json file for each article, containing the title, body, and some other info.

What I want to do is see if there is a way to pass a folder to the params (such as the input field) and have it iterate over all the files in that folder, use Parsey McParseface to POS tag the title and body, and then output that as XML. I have pasted the main entry point below. I can't figure out how to modify the "documents". I figured I would post to see if you have any recommendations on how to go about passing in the data from each of these files. I have been trying to find where in the pipeline I can inject / modify the sentences coming in, but have not had success yet. If you have any tips or recommendations on how I might be able to accomplish this, send them my way. Otherwise, thanks again for the article! Time to jump back into the API docs 🙂

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  path_to_json = "%s/tf_files/dataset_raw/nprarticles" % expanduser("~")
  json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
  # we need both the json and an index number so use enumerate()
  for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
      json_text = json.load(json_file)
      title = json_text['title']
      body = json_text['body']
  with tf.Session() as sess:
    src = gen_parser_ops.document_source(batch_size=32,
                                         corpus_name=FLAGS.corpus_name,
                                         task_context=FLAGS.task_context)
    sentence = sentence_pb2.Sentence()
    l_root = ET.Element("root")
    l_headline = ET.SubElement(l_root, "headline").text = "title"
    l_text = ""
    l_text2 = ET.SubElement(l_root, "text")  # sentence.text
    l_sentences = ET.SubElement(l_root, "sentences")
    l_numSentences = 0
    while True:
      documents, finished = sess.run(src)
      # logging.info('Read %d documents', len(documents))
      for d in documents:
        sentence.ParseFromString(d)
        l_sentence = ET.SubElement(l_sentences, "sentence", id="%s" % l_numSentences)
        l_tokens = ET.SubElement(l_sentence, "tokens")
        l_text = "%s %s" % (l_text, sentence.text)
        # print 'Formatting XML'
        formatXml(sentence, l_tokens)
        l_numSentences += 1
06-15-2016
07:46 AM
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. An ORC file stores groups of rows called stripes, along with auxiliary information in a file footer. It is just a storage format; it is not tied to a particular engine such as Spark.
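For illustration, a minimal Hive DDL sketch is enough to use ORC from plain Hive (table name, columns, and compression value here are placeholders):

-- Minimal ORC-backed table in plain Hive; no Spark involved.
create table orc_demo (id int, name string)
stored as orc
tblproperties ("orc.compress"="ZLIB");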
06-30-2016
03:00 PM
1 Kudo
There's currently no native integration with Ranger. Nevertheless, Alluxio provides a REST API. You could set up Alluxio as a Knox service and use the Ranger Knox plugin. Or, even simpler, if you use HDFS as your UnderFS, the Ranger HDFS plugin could do the work (I haven't tried it yet). It seems to be an interesting feature to me. You could file a JIRA for it in Alluxio or Ranger.
05-27-2017
09:28 AM
If my Hive table is an external table located on HDFS, could this solution work? Thanks.
05-25-2016
04:30 PM
1 Kudo
Twitter has open sourced another real-time, distributed, fault-tolerant stream processing engine called Heron. They see it as the successor to Storm. It is backwards compatible with Storm's topology API.

First I followed the getting started guide, downloading and installing on Mac OS X.

➜ Downloads ./heron-client-install-0.14.0-darwin.sh --user
Heron client installer
----------------------
Uncompressing......
Heron is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
➜ Downloads export PATH=$PATH:/usr/local/bin
➜ Downloads ./heron-tools-install-0.14.0-darwin.sh --user
Heron tools installer
---------------------
Uncompressing......
Heron Tools is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
http://twitter.github.io/heron/docs/getting-started/

Run the example to make sure everything is installed:

heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology
[2016-05-25 16:16:32 -0400] com.twitter.heron.scheduler.local.LocalLauncher INFO: For checking the status and logs of the topology, use the working directory /Users/tspann/.herondata/topologies/local/tspann/ExclamationTopology
INFO: Topology 'ExclamationTopology' launched successfully
INFO: Elapsed time: 4.722s.
heron activate local ExclamationTopology
[2016-05-25 16:19:38 -0400] com.twitter.heron.spi.utils.TMasterUtils SEVERE: Topology is already activateed
INFO: Successfully activated topology 'ExclamationTopology'
INFO: Elapsed time: 2.739s.
Run the UI:

sudo heron-ui
25 May 2016 16:20:31-INFO:main.py:101: Listening at http://192.168.1.5:8889
25 May 2016 16:20:31-INFO:main.py:102: Using tracker url: http://localhost:8888
To avoid conflicting with HDP ports, I change the port:

sudo heron-tracker --port 8881
25 May 2016 16:24:14-INFO:main.py:183: Running on port: 8881
25 May 2016 16:24:14-INFO:main.py:184: Using config file: /usr/local/herontools/conf/heron_tracker.yaml
Look at the Heron tracker endpoint: http://localhost:8881/topologies

{"status": "success", "executiontime": 4.291534423828125e-05, "message": "", "version": "1.0.0", "result": {}}

Let's run the UI:

sudo heron-ui --port 8882 --tracker_url http://localhost:8881
25 May 2016 16:28:53-INFO:main.py:101: Listening at http://192.168.1.5:8882
25 May 2016 16:28:53-INFO:main.py:102: Using tracker url: http://localhost:8881
Look at the Heron cluster: http://localhost:8881/clusters

{"status": "success", "executiontime": 1.9073486328125e-05, "message": "", "version": "1.0.0", "result": ["localzk", "local"]}

Using the Heron CLI:

heron
usage: heron <command> <options> ...
Available commands:
activate Activate a topology
deactivate Deactivate a topology
help Prints help for commands
kill Kill a topology
restart Restart a topology
submit Submit a topology
version Print version of heron-cli
Getting more help:
heron help <command> Prints help and options for <command>
For detailed documentation, go to http://heronstreaming.io
If you need to restart a topology:

heron restart local ExclamationTopology
INFO: Successfully restarted topology 'ExclamationTopology'
INFO: Elapsed time: 3.928s.

Look at my topology: http://localhost:8881/topologies#/all/all/ExclamationTopology
{
"status": "success", "executiontime": 7.104873657226562e-05, "message": "",
"version": "1.0.0",
"result": {"local": {"default": ["ExclamationTopology"]}}
}

Adding --verbose will add a ton of debug logs. Attached are some screenshots. The Heron UI is decent. I am hoping Heron screens will be integrated into Ambari.
05-22-2016
08:29 PM
1 Kudo
Create a Hive table as an ORC file through Spark SQL in Zeppelin:

%sql
create table default.logs_orc_table (clientIp STRING, clientIdentity STRING, user STRING, dateTime STRING, request STRING, statusCode INT, bytesSent FLOAT, referer STRING, userAgent STRING) stored as orc

Load data from a DataFrame into this table:

%sql
insert into table default.logs_orc_table select t.* from accessLogsDF t
I can create a table in the Hive View from Ambari.

CREATE TABLE IF NOT EXISTS survey
( firstName STRING, lastName STRING, gender STRING,
phone STRING, email STRING,
address STRING,
city STRING,
postalcode STRING,
surveyanswer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Then it is really easy to load some data from a CSV file:

LOAD DATA INPATH '/demo/survey1.csv' OVERWRITE INTO TABLE survey;

I can create an ORC-based table in Hive from the Hive View in Ambari, or from the Spark / Spark SQL or Hive areas in Zeppelin:

create table survey_orc(
firstName varchar(255),
lastName varchar(255),
gender varchar(255),
phone varchar(255),
email varchar(255),
address varchar(255),
city varchar(255),
postalcode varchar(255),
surveyanswer varchar(255)
) stored as orc tblproperties
("orc.compress"="NONE");
I can do the same insert from Hive:

%hive
insert into table default.survey_orc select t.* from survey t
I can query Hive tables from Spark SQL or Hive easily.
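As a quick sanity check, a minimal sketch of such a query (assuming the survey_orc table created above) from the Spark SQL or Hive interpreters in Zeppelin:

%sql
-- example query against the survey_orc table defined earlier
select surveyanswer, count(*) from default.survey_orc group by surveyanswer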