how to improve the speed of the python script

I'm very new to Python. I work in hydrology and want to learn Python to help me process hydrological data.

At the moment I am writing a script to extract bits of information from an Informatica Big Data set. I have three CSV files:

Complete_borelist.csv

Borelist_not_interested.csv

Elevation_info.csv

I want to create a file that has all the bores that are in Complete_borelist.csv but not in Borelist_not_interested.csv. I also want to grab some information from Complete_borelist.csv and Elevation_info.csv for the bores that satisfy the first criterion.

My Python script is as follows:

not_interested_list = []
outfile1 = open('output.csv', 'w')
outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation')
outfile1.write('\n')

with open('Borelist_not_interested.csv', 'r') as f1:
    for line in f1:
        if not line.startswith('Station'):  # ignore header
            line = line.rstrip()
            words = line.split(',')
            station = words[0]
            not_interested_list.append(station)

with open('Complete_borelist.csv', 'r') as f2:
    next(f2)  # ignore header
    for line in f2:
        line = line.rstrip()
        words = line.split(',')
        station = words[0]
        if not station in not_interested_list:
            loc_name = words[1]
            easting = words[4]
            northing = words[5]
            outfile1.write(station+','+easting+','+northing+','+loc_name+',')
            with open('Elevation_info.csv', 'r') as f3:
                next(f3)  # ignore header
                for line in f3:
                    line = line.rstrip()
                    data = line.split(',')
                    bore_id = data[0]
                    if bore_id == station:
                        elevation = data[4]
                        outfile1.write(elevation)
                        outfile1.write('\n')

outfile1.close()

I have two issues with the script:

The first is that Elevation_info.csv doesn't have information for every bore in Complete_borelist.csv. When my loop gets to a station for which it can't find an elevation record, the script doesn't write "null" but continues writing the information for the next station on the same line. Can anyone help me fix this, please?

The second is that my complete bore list has more than 200,000 rows and my script runs through them very slowly. Does anyone have any suggestions to make it run faster?
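For reference, here is a rough sketch of one way both issues might be approached. It is not a tested solution for the real data. It assumes each bore has at most one row in Elevation_info.csv and that the column positions are the ones used in the script above. The idea is to read the "not interested" IDs into a set and the elevations into a dictionary once, so each bore is looked up directly instead of re-reading Elevation_info.csv for every row, and to fall back to "null" when no elevation is found. The function names are made up for illustration, and the header lists only the five fields that are actually written.

```python
import csv


def load_excluded_ids(path):
    """Return the set of station IDs found in the first column of `path`."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        return {row[0] for row in reader if row}


def load_elevations(path, id_col=0, elev_col=4):
    """Return a dict mapping station ID -> elevation string.

    Assumption: at most one elevation row per station; if there are
    several, the last one read wins.
    """
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        return {row[id_col]: row[elev_col] for row in reader if row}


def write_bores(complete_path, excluded_path, elevation_path, out_path):
    """Write the bores that are not excluded, with elevation or 'null'."""
    excluded = load_excluded_ids(excluded_path)   # set: fast membership test
    elevations = load_elevations(elevation_path)  # dict: fast lookup by ID

    with open(complete_path, newline='') as src, \
         open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator='\n')
        next(reader, None)  # skip header
        writer.writerow(['Station_ID', 'Easting', 'Northing',
                         'Location_name', 'Elevation'])
        for row in reader:
            if not row:
                continue  # skip blank lines
            station = row[0]
            if station in excluded:
                continue
            # Every output row gets exactly one elevation field,
            # so a missing record can no longer merge two rows.
            elevation = elevations.get(station, 'null')
            writer.writerow([station, row[4], row[5], row[1], elevation])
```

With the three files from this post, the call would be write_bores('Complete_borelist.csv', 'Borelist_not_interested.csv', 'Elevation_info.csv', 'output.csv'). Each input file is then read only once, which should help a lot with 200,000 rows, although I have not measured it on the real data.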