How can I write Python code that reads the files inside a directory and splits each one according to its type?
For example, a CSV file should be split on commas and stored separately,
and a PSV file should be split on pipes and stored separately.
I'm assuming you're referring to how to accomplish this in Spark, since your question is tagged with 'pyspark' and 'spark'.
This is how to do it using PySpark:
## read all files in the directory and parse out the fields needed
## (these files are pipe delimited)
from pyspark.sql import Row

path = "hdfs://my_server:8020/tmp/bkm/clickstream/event=pageview/dt=2015-12-21/hr=*/*"
rows = sc.textFile(path)
fields = rows.map(lambda l: l.split("|"))
## textFile yields strings, so cast the numeric fields explicitly
## (assumed column order: platform, date, hour, order_id, parent_order_uuid)
orders = fields.map(lambda o: Row(platform=o[0], date=int(o[1]), hour=int(o[2]), order_id=o[3], parent_order_uuid=o[4]))
## create a DataFrame and register it as a temp table so SQL can see it
schemaOrders = sqlContext.createDataFrame(orders)
schemaOrders.registerTempTable("schemaOrders")
## run some SQL on the parsed fields
rows = sqlContext.sql("SELECT platform, date, hour, count(*) AS order_count FROM schemaOrders WHERE date = 20151221 AND (order_id <> '' OR order_id IS NOT NULL) AND (parent_order_uuid = '' OR parent_order_uuid IS NULL) AND platform IN ('desktop') GROUP BY platform, date, hour")
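Since the question is really about handling several delimited formats from one directory, here is a minimal sketch (plain Python, no Spark required) of picking the delimiter from the file extension and parsing with the standard `csv` module. The function and mapping names are my own for illustration:

```python
import csv
import io

# Assumed mapping from extension to delimiter: CSV -> comma, PSV -> pipe, TSV -> tab
DELIMITERS = {"csv": ",", "psv": "|", "tsv": "\t"}

def split_rows(filename, text):
    """Split a file's text into rows of fields based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    delimiter = DELIMITERS[ext]
    # csv.reader handles quoting correctly, which a bare str.split would not
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter)]
```

For example, `split_rows("orders.psv", "a|b|c\n")` yields `[["a", "b", "c"]]`. The same `split_rows` function could be used inside a Spark `map` as well, applied per file.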
I have a doubt: why do I need to cast the data types?
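The reason for the cast is that `sc.textFile` reads everything as strings, so comparisons and arithmetic on a "numeric" field behave lexicographically until you cast. A quick demonstration in plain Python:

```python
# Fields parsed from a text file are all strings
fields = "desktop|20151221|9".split("|")

# String comparison is lexicographic: "9" sorts after "10"
print(fields[2] > "10")       # True, because "9" > "1" character-wise

# After casting, the comparison is numeric
print(int(fields[2]) < 10)    # True
```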
I think I need to do something similar for CSV files, with an if clause to pick the delimiter, right?
My requirement is simple: CSV/PSV/TSV files which I am getting from different feeds. I am thinking of writing a configuration file, because if in the future I get a different type of file, I will only have to change the config, not the complete code.
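The config-driven idea above could be sketched like this: keep the extension-to-delimiter mapping in a JSON config, so supporting a new file type means editing the config only. The config contents and function name here are hypothetical:

```python
import json

# Hypothetical config; in practice this would live in a file such as feeds.json
CONFIG_JSON = '''
{
  "csv": {"delimiter": ","},
  "psv": {"delimiter": "|"},
  "tsv": {"delimiter": "\\t"}
}
'''

config = json.loads(CONFIG_JSON)

def delimiter_for(filename):
    """Look up the delimiter for a file based on its extension."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return config[ext]["delimiter"]
```

Adding, say, semicolon-separated files would then be a one-line config change (`"ssv": {"delimiter": ";"}`) with no code change.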