
How to load a big file of 10 GB once per executor and not per partition?


Hello,

I want to use a pretrained embedding model on each node of my cluster. For that I created the following module (fasttextSpark.py, referenced later):

from gensim.models.fasttext import FastText as FT_gensim

# Load model (loads when this library is being imported)
model = FT_gensim.load_fasttext_format("/project/6008168/bib/wiki.en.bin")

def get_vector(msg):
    pred = model[msg]
    return pred
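
(Side note: to check where and when the load actually happens, I temporarily added a debug print of the process id at the bottom of this module; this is just for my own verification, not part of the real code:)

import os

# debug only: the pid shows which Python worker process performed the
# import-time load; seeing each pid only once would mean one load per process
print("fasttextSpark: model loaded in pid %d" % os.getpid())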

Then, in a job script:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH --mem 20000
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 32
module load python/2.7.14
source "/project/6008168/bib/ENV2.7.14/bin/activate"
module load spark/2.3.0
spark-submit /project/6008168/bib/test.py

Then the test.py:

from __future__ import print_function
import sys
import time
import math
import csv
import datetime
import StringIO
import pyspark
import gensim
from operator import add
from pyspark.sql import *
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from gensim.models.fasttext import FastText as FT_gensim

appName = "bib"
modelpath = "/project/6008168/bib/wiki.en.bin"

conf = (SparkConf()
        .setAppName(appName)
        .set("spark.executor.memory", "12G")
        .set("spark.network.timeout", "800s")
        .set("spark.executor.heartbeatInterval", "20s")
        .set("spark.driver.maxResultSize", "12g")
        .set("spark.executor.instances", 2)
        .set("spark.executor.cores", 30))

sc = SparkContext(conf=conf)
# model = FT_gensim.load_fasttext_format(modelpath)
sc.addFile(modelpath)
sc.addPyFile("/project/6008168/bib/fasttextSpark.py")
import fasttextSpark  # import after addPyFile so the shipped module is found

print("nights = ", fasttextSpark.get_vector("nights"))

print("done")

The vector of "nights" is output correctly. But suppose I have my_rdd = (stringID, sentence) and I want to find the embedding vector of each sentence by summing up its words' embedding vectors. With my current solution, if a sentence consists of 3 words the model will be loaded 3 times, which is not efficient. How can I load the model once per node?
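
To make the question concrete, here is the kind of lazy, once-per-process load I was thinking of (only a sketch; the get_model and sentence_vector helpers are names I made up, and I am not sure this is the right approach):

# fasttextSpark.py (sketch): defer the load so it happens at most once
# per Python worker process instead of at import time
from gensim.models.fasttext import FastText as FT_gensim

_model = None  # module-level cache shared by all tasks in this worker

def get_model():
    global _model
    if _model is None:  # only the first call in this process pays the load cost
        _model = FT_gensim.load_fasttext_format("/project/6008168/bib/wiki.en.bin")
    return _model

def sentence_vector(sentence):
    # embedding of a sentence = sum of its word vectors
    model = get_model()
    return sum(model[word] for word in sentence.split())

and then in test.py something like:

# one embedding per (stringID, sentence) pair; the model is loaded
# lazily inside each executor process, not once per word
vectors = my_rdd.mapValues(fasttextSpark.sentence_vector)

Is this the correct pattern, or would something like a broadcast variable be better for a 10 GB model?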

Thank you