Support Questions

Find answers, ask questions, and share your expertise

Zeppelin 0.6 - How to registerAsTable data that can be shown with "%table" using pyspark

Super Collaborator

I am using pyspark to print filtered data as a table.

final_table_text = "\n".join(table_text_list)
table_prefix = '%table'
print "%s %s" % (table_prefix, final_table_text)

The first line in table_text_list is the table header.

Now I want to do an ORDER BY on this data; one way I thought of was to use registerAsTable.

Is there a way, if I specify the line separator and the column separator, to use registerAsTable?

1 ACCEPTED SOLUTION

Guru

registerTempTable is a Spark function that allows you to write SQL against your DataFrame. Essentially it's a way to give the DataFrame variable a name in the context of SQL.
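
For illustration, a minimal sketch of that naming step (Spark 1.x API, as shipped with Zeppelin 0.6; the column names and values below are made up):

# Hypothetical data: give the DataFrame a SQL-visible name, then order it with SQL
df = sqlContext.createDataFrame([("c1", 10), ("c2", 5)], ["COMPONENT", "TOTAL_TESTS"])
df.registerTempTable("results")
ordered = sqlContext.sql("SELECT * FROM results ORDER BY TOTAL_TESTS DESC")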

If what you're looking to do is display the data from a programmatic DataFrame in a %pyspark paragraph the same way a %sql paragraph does, you're on the right track. However, you need to make sure that the columns are separated by tabs and each row ends with a newline. You also need a newline between the header and the content. So:

# Assuming each element of table_text_list is a list of column strings:
# tab-join the columns of each row, then newline-join the rows
output = "\n".join("\t".join(row) for row in table_text_list)
print "%table\n%s" % output

Note that the output formatting and tab insertion can also be done within Spark, before collecting your results, if you want it to happen across the whole cluster; that may or may not be quicker depending on the size of the data.
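
A sketch of that distributed variant, assuming rows_rdd is a hypothetical RDD whose elements are lists of string columns:

# Format each row on the executors, then collect only the finished lines
lines = rows_rdd.map(lambda cols: "\t".join(cols)).collect()
print "%table\n%s" % "\n".join(lines)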

Note also that this won't automatically limit the number of results the way the %sql paragraph does, so you may want to be a little careful about blowing out the browser's RAM: either put your own limits on the DataFrame, or use take(n) instead of collect() to retrieve the data for display.
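
Continuing the same hypothetical rows_rdd, a sketch of capping the output (1000 is an arbitrary limit):

# take(n) pulls at most n formatted rows to the driver instead of the whole dataset
preview = rows_rdd.map(lambda cols: "\t".join(cols)).take(1000)
print "%table\n%s" % "\n".join(preview)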

4 REPLIES

Super Collaborator

@Simon Elliston Ball

Thanks for the information. I have a related issue:

When I try to query the table it does not return the rows. Am I doing something wrong?

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

result_schema = StructType([
    StructField("ID", StringType(), False),
    StructField("COMPONENT", StringType(), False),
    StructField("TOTAL_TESTS", IntegerType(), False)])
myrdd = sc.parallelize([final_table_text])
df = sqlContext.createDataFrame(myrdd, result_schema)
df.registerTempTable("result_query")
abc = sqlContext.sql("select * from result_query")
print abc

The output is: DataFrame[ID: string, COMPONENT: string, TOTAL_TESTS: int]

Super Collaborator

I missed adding .show() on abc.
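
For reference, a sketch of the fix; show() actually runs the query and prints rows:

abc.show()  # prints the rows; `print abc` only prints the schema string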

Guru

Not sure what is in final_table_text, but I'm almost certain you don't mean to parallelise it from the driver like that: sc.parallelize([final_table_text]) creates an RDD with a single string element rather than rows of (ID, COMPONENT, TOTAL_TESTS), so it won't match your schema. You should really be using the DataFrame read API to distribute the data in a case like this.
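
As a sketch of that approach, assuming the rows already sit in a tab-separated file and the Databricks spark-csv package is available (both assumptions, and the path is hypothetical):

# Read the file distributed and apply the schema, instead of parallelizing a driver-side string
df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("delimiter", "\t") \
    .option("header", "true") \
    .schema(result_schema) \
    .load("/tmp/results.tsv")
df.registerTempTable("result_query")
sqlContext.sql("select * from result_query order by TOTAL_TESTS desc").show()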