06-14-2016 01:50 AM
Hey Timothy,
Great article, and I wanted to thank you for putting it together. I am currently trying to build a corpus that I will later use to train an RNN article summarizer. I didn't have access to something like Gigaword, so I wrote an article scraper in JavaScript, and now I want to POS tag the titles and bodies with Parsey McParseface. I have gotten to the point where I can pass in a single input file via the params passed to parser_eval, but my JS scraper currently outputs one .json file per article, each containing the title, body, and some other info.

What I want to do is pass a folder via the params (such as the input field), have it iterate over all the files in that folder, use Parsey McParseface to POS tag each title and body, and then output the result as XML. I have pasted my main entry point below. I can't figure out how to modify the "documents", so I figured I would post to see if you have any recommendations on how to pass in the data from each of these files. I have been trying to find where in the pipeline I can inject or modify the incoming sentences, but I have not had success yet. If you have any tips or recommendations on how I might accomplish this, send them my way. Otherwise, thanks again for the article! Time to jump back into the API docs 🙂

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  path_to_json = "%s/tf_files/dataset_raw/nprarticles" % expanduser("~")
  json_files = [pos_json for pos_json in os.listdir(path_to_json)
                if pos_json.endswith('.json')]
  # we need both the json and an index number so use enumerate()
  for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
      json_text = json.load(json_file)
      title = json_text['title']
      body = json_text['body']
  with tf.Session() as sess:
    src = gen_parser_ops.document_source(batch_size=32,
                                         corpus_name=FLAGS.corpus_name,
                                         task_context=FLAGS.task_context)
    sentence = sentence_pb2.Sentence()
    l_root = ET.Element("root")
    l_headline = ET.SubElement(l_root, "headline").text = "title"
    l_text = ""
    l_text2 = ET.SubElement(l_root, "text")  # sentence.text
    l_sentences = ET.SubElement(l_root, "sentences")
    l_numSentences = 0
    while True:
      documents, finished = sess.run(src)
      # logging.info('Read %d documents', len(documents))
      for d in documents:
        sentence.ParseFromString(d)
        l_sentence = ET.SubElement(l_sentences, "sentence", id="%s" % l_numSentences)
        l_tokens = ET.SubElement(l_sentence, "tokens")
        l_text = "%s %s" % (l_text, sentence.text)
        # print 'Formatting XML'
        formatXml(sentence, l_tokens)
        l_numSentences += 1
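
In case it helps show what I am after, here is the folder-iteration and XML half pulled out as standalone helpers, so it can be tested without the TensorFlow session. The function names are my own, and the sentence splitting here is just a placeholder for whatever Parsey McParseface actually returns; my current guess is that I would write each article's text to the input file the task context points at and run the session once per article, but I haven't verified that is the intended approach.

```python
import json
import os
import xml.etree.ElementTree as ET


def articles_from_folder(path_to_json):
    """Yield (filename, title, body) for every .json article in a folder."""
    for name in sorted(os.listdir(path_to_json)):
        if not name.endswith('.json'):
            continue
        with open(os.path.join(path_to_json, name)) as json_file:
            article = json.load(json_file)
        yield name, article['title'], article['body']


def article_to_xml(title, sentences):
    """Build the <root>/<headline>/<sentences> skeleton for one article.

    `sentences` is a list of plain-text sentences; in the real pipeline each
    entry would instead come from a parsed sentence_pb2.Sentence.
    """
    root = ET.Element("root")
    ET.SubElement(root, "headline").text = title
    sents = ET.SubElement(root, "sentences")
    for i, text in enumerate(sentences):
        sent = ET.SubElement(sents, "sentence", id=str(i))
        sent.text = text
    return root
```

With that split, the parser-specific part of main() only has to turn one (title, body) pair into tagged sentences, and everything else stays plain Python.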