06-14-2016 01:50 AM
Hey Timothy,
Great article, and I wanted to thank you for putting it together. I am currently trying to build a corpus that I will later use to train an RNN article summarizer. I didn't have access to something like Gigaword, so I wrote an article scraper in JavaScript, and now I want to POS tag the titles and bodies with Parsey McParseface. I have gotten to the point where I can pass in a single input file via the params passed to parser_eval, but my JS scraper currently outputs one .json file per article, each containing the title, body, and some other info.

What I want to do is pass a folder via the params (such as the input field), have it iterate over all the files in that folder, use Parsey McParseface to POS tag each title and body, and then output the result as XML. I have pasted my main entry point below. I can't figure out how to modify the "documents", so I figured I would post to see if you have any recommendations on how to pass in the data from each of these files. I have been trying to find where in the pipeline I can inject or modify the incoming sentences, but I have not had success yet. If you have any tips or recommendations on how I might accomplish this, send them my way. Otherwise, thanks again for the article! Time to jump back into the API docs 🙂

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  path_to_json = "%s/tf_files/dataset_raw/nprarticles" % expanduser("~")
  json_files = [pos_json for pos_json in os.listdir(path_to_json)
                if pos_json.endswith('.json')]
  # we need both the json and an index number so use enumerate()
  for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
      json_text = json.load(json_file)
      title = json_text['title']
      body = json_text['body']
  with tf.Session() as sess:
    src = gen_parser_ops.document_source(batch_size=32,
                                         corpus_name=FLAGS.corpus_name,
                                         task_context=FLAGS.task_context)
    sentence = sentence_pb2.Sentence()
    l_root = ET.Element("root")
    l_headline = ET.SubElement(l_root, "headline").text = "title"
    l_text = ""
    l_text2 = ET.SubElement(l_root, "text")  # sentence.text
    l_sentences = ET.SubElement(l_root, "sentences")
    l_numSentences = 0
    while True:
      documents, finished = sess.run(src)
      # logging.info('Read %d documents', len(documents))
      for d in documents:
        sentence.ParseFromString(d)
        l_sentence = ET.SubElement(l_sentences, "sentence", id="%s" % l_numSentences)
        l_tokens = ET.SubElement(l_sentence, "tokens")
        l_text = "%s %s" % (l_text, sentence.text)
        # print 'Formatting XML'
        formatXml(sentence, l_tokens)
        l_numSentences += 1
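
In case it helps show what I am after, here is the folder-iteration and XML half pulled out as standalone helpers, so it can be tested without the TensorFlow session. The function names are my own, and the sentence splitting here is just a placeholder for whatever Parsey McParseface actually returns; my current guess is that I would write each article's text to the input file the task context points at and run the session once per article, but I haven't verified that is the intended approach.

```python
import json
import os
import xml.etree.ElementTree as ET


def articles_from_folder(path_to_json):
    """Yield (filename, title, body) for every .json article in a folder."""
    for name in sorted(os.listdir(path_to_json)):
        if not name.endswith('.json'):
            continue
        with open(os.path.join(path_to_json, name)) as json_file:
            article = json.load(json_file)
        yield name, article['title'], article['body']


def article_to_xml(title, sentences):
    """Build the <root>/<headline>/<sentences> skeleton for one article.

    `sentences` is a list of plain-text sentences; in the real pipeline each
    entry would instead come from a parsed sentence_pb2.Sentence.
    """
    root = ET.Element("root")
    ET.SubElement(root, "headline").text = title
    sents = ET.SubElement(root, "sentences")
    for i, text in enumerate(sentences):
        sent = ET.SubElement(sents, "sentence", id=str(i))
        sent.text = text
    return root
```

With that split, the parser-specific part of main() only has to turn one (title, body) pair into tagged sentences, and everything else stays plain Python.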