Member since: 06-09-2018
Posts: 9
Kudos Received: 2
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4527 | 06-11-2018 11:56 PM |
07-12-2018 10:22 PM
Hey Bryan, thanks so much for taking the time! I think I'm almost there. The hint about the unicode issue helped me get past the first slew of errors, but I seem to be running into a length problem now:

```python
@pandas_udf("array<string>")
def stringClassifier(lookupstring, first, last):
    lookupstring = lookupstring.to_string().encode("utf-8")
    first = first.to_string().encode("utf-8")
    last = last.to_string().encode("utf-8")
    # this part takes the 3 strings above and reaches out to another library to do a string match
    result = process.extract(lookupstring, lookup_list, limit=4000)
    match_list = [item for item in result if item[0].startswith(first) and item[0].endswith(last)]
    result2 = process.extractOne(lookupstring, match_list)
    if result2 is not None and result2[0][1] > 75:
        answer = pd.Series(list(result2[0]))
        return answer
    else:
        fail = ["N/A", "0"]
        return pd.Series(fail)
```

This fails with:

```
RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 2
```

I'm initially passing three strings as variables to the function, which then get passed to another library. The result is a tuple, which I convert to a list and then to a pandas Series object. I'm curious how I can make a 2-item array object a length of 1..? I'm obviously missing some basics here. @Bryan C
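For anyone hitting the same error: a scalar pandas_udf receives whole column batches as pandas Series and must return a Series of the same length, one value per input row. The to_string() calls above collapse each batch into a single string, and the function then returns one 2-item Series for the whole batch, hence "expected 1, got 2". A minimal sketch of the required shape, assuming fuzzywuzzy's process module and the question's lookup_list are available on the executors:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from fuzzywuzzy import process  # assumed from the question's use of process.extract

@pandas_udf("array<string>")
def stringClassifier(lookupstring, first, last):
    # Each argument arrives as a pandas Series (one batch of rows),
    # so the matching has to happen per element, not on the whole Series.
    def classify(ls, f, l):
        result = process.extract(ls, lookup_list, limit=4000)  # lookup_list as in the question
        match_list = [item for item in result if item[0].startswith(f) and item[0].endswith(l)]
        result2 = process.extractOne(ls, match_list)
        # extractOne returns a (match, score) tuple, so the score lives at index 1
        if result2 is not None and result2[1] > 75:
            return [result2[0], str(result2[1])]
        return ["N/A", "0"]
    # one array per input row, so the output length matches the input length
    return pd.Series([classify(ls, f, l) for ls, f, l in zip(lookupstring, first, last)])
```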
06-17-2018 04:12 AM
@Alex Witte You should be using Python 2.7.x instead of Python 3. Please find the Python versions certified with HDP here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_support-matrices/content/ch_matrices-ambari.html#ambari_software . hdp-select is a script that ships with the HDP installation, and it uses the Python 2 "print" statement, which is invalid syntax under Python 3. That's why you are getting: SyntaxError: Missing parentheses in call to 'print'. Can you please try setting "HDP_VERSION" in spark-env.sh and then try again?
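Setting HDP_VERSION lets Spark skip shelling out to hdp-select entirely. For example, in spark-env.sh (the version string below is a hypothetical placeholder; use the build actually installed on your cluster, e.g. from `ls /usr/hdp`):

```
# spark-env.sh -- pin the HDP version so Spark does not invoke hdp-select
# (the value below is an example; substitute your cluster's actual version)
export HDP_VERSION=2.6.5.0-292
```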
06-15-2018 10:56 AM
1 Kudo
@Alex Witte, According to your question, you want to transform it to the below format:

Col1 | Col2
---|---
1 | [agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 | [agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I have changed your code a little bit and was able to achieve it. Please check this code and the pyspark execution output:

```python
from pyspark.sql.types import *

# define an explicit schema for the pipe-delimited input file
data_schema = [StructField('id', IntegerType(), False), StructField('route', StringType(), False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt', schema=final_struc)
df.show()

from pyspark.sql.functions import udf

# wrap the comma-separated route string in brackets so it displays like an array
def str_to_arr(my_list):
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

str_to_arr_udf = udf(str_to_arr, StringType())
df = df.withColumn('route_arr', str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()
```

```
>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id| route|
+---+--------------------+
| 1|agakhanpark,scien...|
| 2|agakhanpark,wynfo...|
+---+--------------------+
>>>
>>>
>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
... my_list = my_list.split(",")
... return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id| route_arr|
+---+--------------------+
| 1|[agakhanpark,scie...|
| 2|[agakhanpark,wynf...|
+---+--------------------+
```

Please "Accept" the answer if this helps.

-Aditya
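As a side note, if a true array<string> column is wanted rather than a bracketed string, Spark's built-in split function can do this without a UDF. A short sketch under that assumption:

```python
from pyspark.sql.functions import split

# split the comma-separated route string into a real array<string> column
df = df.withColumn('route_arr', split(df['route'], ','))
df = df.drop('route')
```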
06-11-2018 11:56 PM
Figured this out: my above approach worked, whereby I can order the list and then just grab the last value in the window. Like this (note the station_arr alias on the inner collect_set, which the outer query refers to):

```sql
select *,
       last_value(station_arr) over (partition by journey order by point
           range between unbounded preceding and unbounded following) as route
from (select point, id, station, journey,
             collect_set(station) over (partition by journey order by point) as station_arr
      from source_table);
```

This will take the ordered list and populate it to every row in the partition, as requested above.
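For reference, the same thing expressed with the DataFrame API, assuming a DataFrame df with columns journey, point, and station (a sketch, not tested against the original data):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# running collect_set, ordered within each journey
w_running = Window.partitionBy("journey").orderBy("point")
# same partition, but spanning all rows so every row sees the final, complete set
w_full = w_running.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df = (df.withColumn("station_arr", F.collect_set("station").over(w_running))
        .withColumn("route", F.last("station_arr").over(w_full)))
```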
06-10-2018 02:07 AM
1 Kudo
Shu, thanks so much, this did it! Appreciate your detailed answer!