How can I write Python code that reads the files inside a directory and splits each one according to its type?
For example, a CSV file should be split on commas and stored separately,
and a PSV file should be split on pipes and stored separately.
I'm assuming you're referring to how to accomplish this in Spark, since your question is tagged with 'pyspark' and 'spark'.
This is how to do it using PySpark:
## read all files in the directory and parse out the fields needed
## (these files are pipe delimited)
from pyspark.sql import Row

path = "hdfs://my_server:8020/tmp/bkm/clickstream/event=pageview/dt=2015-12-21/hr=*/*"
rows = sc.textFile(path)
fields = rows.map(lambda l: l.split("|"))
## textFile yields strings, so cast the numeric fields explicitly
## (assumed column order: platform, date, hour, order_id, parent_order_uuid)
orders = fields.map(lambda o: Row(platform=o[0], date=int(o[1]), hour=int(o[2]), order_id=o[3], parent_order_uuid=o[4]))
## create a DataFrame and register it as a temp table so SQL can see it
schemaOrders = sqlContext.createDataFrame(orders)
schemaOrders.registerTempTable("schemaOrders")
## run some SQL on the parsed fields
rows = sqlContext.sql("SELECT platform, date, hour, count(*) AS order_count FROM schemaOrders WHERE date = 20151221 AND (order_id <> '' OR order_id IS NOT NULL) AND (parent_order_uuid = '' OR parent_order_uuid IS NULL) AND platform IN ('desktop') GROUP BY platform, date, hour")
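Since the question is really about handling several delimited formats from one directory, here is a minimal sketch (plain Python, no Spark required) of picking the delimiter from the file extension and parsing with the standard `csv` module. The function and mapping names are my own for illustration:

```python
import csv
import io

# Assumed mapping from extension to delimiter: CSV -> comma, PSV -> pipe, TSV -> tab
DELIMITERS = {"csv": ",", "psv": "|", "tsv": "\t"}

def split_rows(filename, text):
    """Split a file's text into rows of fields based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    delimiter = DELIMITERS[ext]
    # csv.reader handles quoting correctly, which a bare str.split would not
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]
```

For example, `split_rows("orders.psv", "a|b|c\n")` yields `[["a", "b", "c"]]`. The same `split_rows` function could be used inside a Spark `map` as well, applied per file.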
I have a doubt: why do I need to cast the data types?
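The reason for the cast is that `sc.textFile` reads everything as strings, so comparisons and arithmetic on a "numeric" field behave lexicographically until you cast. A quick demonstration in plain Python:

```python
# Fields parsed from a text file are all strings
fields = "desktop|20151221|9".split("|")

# String comparison is lexicographic: "9" sorts after "10"
print(fields[2] > "10")       # True, because "9" > "1" character-wise

# After casting, the comparison is numeric
print(int(fields[2]) < 10)    # True
```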
I think I need to do something similar for CSV files, with an if clause to pick the delimiter, right?
My requirement is simple: CSV/PSV/TSV files which I am getting from different feeds. I am thinking of writing a configuration file, because if in the future I get a different type of file, I will only have to change the config, not the complete code.
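The config-driven idea above could be sketched like this: keep the extension-to-delimiter mapping in a JSON config, so supporting a new file type means editing the config only. The config contents and function name here are hypothetical:

```python
import json

# Hypothetical config; in practice this would live in a file such as feeds.json
CONFIG_JSON = '''
{
  "csv": {"delimiter": ","},
  "psv": {"delimiter": "|"},
  "tsv": {"delimiter": "\\t"}
}
'''

config = json.loads(CONFIG_JSON)

def delimiter_for(filename):
    """Look up the delimiter for a file based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return config[ext]["delimiter"]
```

Adding, say, semicolon-separated files would then be a one-line config change (`"ssv": {"delimiter": ";"}`) with no code change.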