Member since: 06-09-2018
Posts: 9
Kudos Received: 2
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4527 | 06-11-2018 11:56 PM |
07-12-2018 10:22 PM
Hey Bryan, thanks so much for taking the time! I think I'm almost there. The hint about the unicode issue helped me get past the first slew of errors, but I seem to be running into a length problem now:

```python
@pandas_udf("array<string>")
def stringClassifier(lookupstring, first, last):
    lookupstring = lookupstring.to_string().encode("utf-8")
    first = first.to_string().encode("utf-8")
    last = last.to_string().encode("utf-8")
    # this part takes the 3 strings above and reaches out to another library to do a string match
    result = process.extract(lookupstring, lookup_list, limit=4000)
    match_list = [item for item in result if item[0].startswith(first) and item[0].endswith(last)]
    result2 = process.extractOne(lookupstring, match_list)
    if result2 is not None and result2[0][1] > 75:
        answer = pd.Series(list(result2[0]))
        return answer
    else:
        fail = ["N/A", "0"]
        return pd.Series(fail)
```

This fails with:

```
RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 2
```

I'm initially passing three strings as variables to the function, which then get passed to another library. The result is a tuple, which I convert to a list and then to a pandas Series object. I'm curious how I can make a 2-item array object a length of 1..? I'm obviously missing some basics here. @Bryan C
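For anyone hitting the same error: a scalar pandas_udf receives whole column batches as pandas Series and must return a Series of the same length, one value per input row. The to_string() calls above collapse each batch into a single string, and the function then returns one 2-item Series for the whole batch, hence "expected 1, got 2". A minimal sketch of the required shape, assuming fuzzywuzzy's process module and the question's lookup_list are available on the executors:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from fuzzywuzzy import process  # assumed from the question's use of process.extract

@pandas_udf("array<string>")
def stringClassifier(lookupstring, first, last):
    # Each argument arrives as a pandas Series (one batch of rows),
    # so the matching has to happen per element, not on the whole Series.
    def classify(ls, f, l):
        result = process.extract(ls, lookup_list, limit=4000)  # lookup_list as in the question
        match_list = [item for item in result if item[0].startswith(f) and item[0].endswith(l)]
        result2 = process.extractOne(ls, match_list)
        # extractOne returns a (match, score) tuple, so the score lives at index 1
        if result2 is not None and result2[1] > 75:
            return [result2[0], str(result2[1])]
        return ["N/A", "0"]
    # one array per input row, so the output length matches the input length
    return pd.Series([classify(ls, f, l) for ls, f, l in zip(lookupstring, first, last)])
```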
06-17-2018 04:12 AM
@Alex Witte You should be using Python 2.7.x instead of Python 3. Please find the Python versions certified with HDP here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_support-matrices/content/ch_matrices-ambari.html#ambari_software . hdp-select is a script that ships with the HDP installation, and it uses the Python 2 "print" statement, which is invalid syntax under Python 3. That's why you are getting: SyntaxError: Missing parentheses in call to 'print'. Can you please try setting "HDP_VERSION" in spark-env.sh and then try again?
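Setting HDP_VERSION lets Spark skip shelling out to hdp-select entirely. For example, in spark-env.sh (the version string below is a hypothetical placeholder; use the build actually installed on your cluster, e.g. from `ls /usr/hdp`):

```
# spark-env.sh -- pin the HDP version so Spark does not invoke hdp-select
# (the value below is an example; substitute your cluster's actual version)
export HDP_VERSION=2.6.5.0-292
```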
06-15-2018 10:56 AM
1 Kudo
@Alex Witte, According to your question, you want to transform it to the below format:

Col1 | Col2
---|---
1 | [agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 | [agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I have changed your code a little bit and was able to achieve it. Please check this code and the pyspark execution output:

```python
from pyspark.sql.types import *

# define an explicit schema for the pipe-delimited input file
data_schema = [StructField('id', IntegerType(), False), StructField('route', StringType(), False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt', schema=final_struc)
df.show()

from pyspark.sql.functions import udf

# wrap the comma-separated route string in brackets so it displays like an array
def str_to_arr(my_list):
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

str_to_arr_udf = udf(str_to_arr, StringType())
df = df.withColumn('route_arr', str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()
```

```
>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id| route|
+---+--------------------+
| 1|agakhanpark,scien...|
| 2|agakhanpark,wynfo...|
+---+--------------------+
>>>
>>>
>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
... my_list = my_list.split(",")
... return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id| route_arr|
+---+--------------------+
| 1|[agakhanpark,scie...|
| 2|[agakhanpark,wynf...|
+---+--------------------+
```

Please "Accept" the answer if this helps.

-Aditya
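As a side note, if a true array<string> column is wanted rather than a bracketed string, Spark's built-in split function can do this without a UDF. A short sketch under that assumption:

```python
from pyspark.sql.functions import split

# split the comma-separated route string into a real array<string> column
df = df.withColumn('route_arr', split(df['route'], ','))
df = df.drop('route')
```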
06-11-2018 11:56 PM
Figured this out: my above approach worked, whereby I can order the list and then just grab the last value in the window. Like this (note the station_arr alias on the inner collect_set, which the outer query refers to):

```sql
select *,
       last_value(station_arr) over (partition by journey order by point
           range between unbounded preceding and unbounded following) as route
from (select point, id, station, journey,
             collect_set(station) over (partition by journey order by point) as station_arr
      from source_table);
```

This will take the ordered list and populate it to every row in the partition, as requested above.
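For reference, the same thing expressed with the DataFrame API, assuming a DataFrame df with columns journey, point, and station (a sketch, not tested against the original data):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# running collect_set, ordered within each journey
w_running = Window.partitionBy("journey").orderBy("point")
# same partition, but spanning all rows so every row sees the final, complete set
w_full = w_running.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = (df.withColumn("station_arr", F.collect_set("station").over(w_running))
        .withColumn("route", F.last("station_arr").over(w_full)))
```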
06-10-2018 02:07 AM
1 Kudo
Shu, thanks so much, this did it! Appreciate your detailed answer!