Created 07-27-2016 09:26 AM
I am using pyspark to print filtered data as a table.
final_table_text = "\n".join(table_text_list)
table_prefix = '%table'
print """%s
%s""" % (table_prefix, final_table_text)
The first line in table_text_list is the table header.
Now I want to do an ORDER BY on this data; one approach I thought of was to use registerAsTable.
Is there a way to use registerAsTable if I specify the line separator and the column separator?
Created 07-27-2016 10:02 AM
registerTempTable is a Spark function that lets you write SQL against your dataframe. Essentially, it gives the dataframe variable a name in the context of SQL.
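For example, a minimal sketch (assuming df is an existing dataframe, sqlContext is the Zeppelin-provided SQLContext, and SOME_COLUMN is a hypothetical column to sort on), which also covers the ORDER BY part of your question:

df.registerTempTable("my_table")
ordered = sqlContext.sql("SELECT * FROM my_table ORDER BY SOME_COLUMN DESC")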
If what you're looking to do is display the data from a programmatic dataframe in a %pyspark paragraph the same way a %sql paragraph does, you're on the right track. However, you need to make sure the columns are separated by tabs and each row ends with a newline. You also need a newline between the header and the content. So:
# table_text_list is assumed to be a list of rows, each row a list of column strings
output = "\n".join("\t".join(row) for row in table_text_list)
print "%%table\n%s" % output  # %% escapes the literal % for Python's string formatting
Note that the formatting and tab insertion can also be done within Spark, before collecting your results, if you want it to happen across the whole cluster, which may or may not be quicker depending on the size of the data.
Note also that this won't automatically limit the number of results the way the %sql paragraph does, so be a little careful about blowing out the browser's RAM: either put your own limits on the dataframe, or use take(n) instead of collect() to retrieve the data for display.
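To illustrate both points, a rough sketch (assuming df is an existing dataframe; the limit of 100 is arbitrary): the tab-joining runs on the executors, and take(100) caps what reaches the browser.

header = "\t".join(df.columns)
# format each Row on the executors, pull back at most 100 formatted strings
rows = df.rdd.map(lambda r: "\t".join(str(c) for c in r)).take(100)
print "%%table\n%s\n%s" % (header, "\n".join(rows))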
Created 07-27-2016 12:52 PM
Thanks for the information. I have a related issue.
When I try to query the table, it does not return the rows. Am I doing something wrong?
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

result_schema = StructType([
    StructField("ID", StringType(), False),
    StructField("COMPONENT", StringType(), False),
    StructField("TOTAL_TESTS", IntegerType(), False)
])
myrdd = sc.parallelize([final_table_text])
df = sqlContext.createDataFrame(myrdd, result_schema)
df.registerTempTable("result_query")
abc = sqlContext.sql("select * from result_query")
print abc
The output is : DataFrame[ID: string, COMPONENT: string, TOTAL_TESTS: int]
Created 07-29-2016 12:16 PM
You missed adding .show() to abc.
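That is:

abc.show()  # prints the actual rows; print abc only prints the schema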
Created 07-29-2016 12:19 PM
Not sure what is in final_table_text, but I'm almost certain you don't mean to parallelise it from the driver. You should really be using the DataFrame read API to distribute the load in a case like this.
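For example, a rough sketch of that approach (the file path and JSON format are assumptions, not from this thread; TOTAL_TESTS is the column from your schema above):

df = sqlContext.read.json("/path/to/results.json")  # distributed read, no driver-side parallelize
df.registerTempTable("result_query")
sqlContext.sql("select * from result_query order by TOTAL_TESTS desc").show()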