
How can I write Python code to read multiple files in a directory


How do I write Python code that reads the files inside a directory and splits each one according to its type?

For example, a CSV file would be split by comma and stored separately.

A PSV file would be split by pipe and stored separately.


Re: How can I write Python code to read multiple files in a directory

I'm assuming you're referring to how to accomplish this in Spark, since your question is tagged with 'pyspark' and 'spark'.

This is how to do it using PySpark:

## read all files in the directory and parse out the fields needed
## the files are pipe-delimited
from pyspark.sql import Row

path = "hdfs://my_server:8020/tmp/bkm/clickstream/event=pageview/dt=2015-12-21/hr=*/*"
rows = sc.textFile(path)
fields = rows.map(lambda l: l.split("|"))

## you can cast data types
orders = fields.map(lambda o: Row(platform=o[101], date=int(o[1]), hour=int(o[2]), order_id=o[29], parent_order_uuid=o[90]))

## create a DataFrame
schemaOrders = sqlContext.createDataFrame(orders)
schemaOrders.registerTempTable("schemaOrders")

## run some SQL on the parsed fields
rows = sqlContext.sql("SELECT platform, date, hour, count(*) AS order_count FROM schemaOrders WHERE date = '20151221' AND (order_id <> '' OR order_id IS NOT NULL) AND (parent_order_uuid = '' OR parent_order_uuid IS NULL) AND platform IN ('desktop') GROUP BY platform, date, hour")

Re: How can I write Python code to read multiple files in a directory


@Binu Mathew

I have a question: why do I need to cast data types?

I think I need to do something similar for CSV files with an if clause, right?

My requirement is simple: CSV/PSV/TSV files that I am getting from different feeds. I am thinking of writing a configuration file, so that if I get a different type of file in the future I will only have to change the config, not the complete code.
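
The configuration idea described here could look something like the sketch below. It is plain Python rather than Spark, and everything in it is an assumption made for illustration: the `CONFIG_JSON` contents, the helper names `load_delimiters` and `read_directory`, and the rule that a file's type is taken from its extension. Supporting a new file type would then mean adding one entry to the config rather than changing the code.

```python
import csv
import json
import os

# Hypothetical config mapping a file extension to its delimiter.
# In practice this text could be read from a JSON file on disk.
CONFIG_JSON = '{"csv": ",", "psv": "|", "tsv": "\\t"}'


def load_delimiters(config_text):
    """Parse the config text into a dict of extension -> delimiter."""
    return json.loads(config_text)


def read_directory(directory, delimiters):
    """Return {filename: list of rows} for every file with a configured extension."""
    parsed = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        # Skip sub-directories and file types that are not in the config.
        if not os.path.isfile(path) or ext not in delimiters:
            continue
        with open(path, newline="") as handle:
            parsed[name] = list(csv.reader(handle, delimiter=delimiters[ext]))
    return parsed
```

Calling `read_directory("/some/feed/dir", load_delimiters(CONFIG_JSON))` (the path is a placeholder) would give one list of parsed rows per file, which can then be stored separately per type.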

Re: How can I write Python code to read multiple files in a directory

@Dinesh Da - Test the code and you will see that it works. If you read my comment in the code, 'you can cast data types', I purposely used the word 'can' to indicate that casting is possible. I'm not saying you have to do it.
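
To illustrate the point about casting being optional, here is a small plain-Python sketch. The helper `parse_line` and the sample line are made up for illustration and are not part of the code above. Splitting a line always yields strings; a cast is applied only to the columns you ask for.

```python
def parse_line(line, delimiter="|", casts=None):
    """Split a delimited line; cast only the column indexes listed in `casts`."""
    fields = line.split(delimiter)
    for index, cast in (casts or {}).items():
        fields[index] = cast(fields[index])
    return fields


# Hypothetical sample record: platform|date|hour|order_id
line = "desktop|20151221|14|A-1001"

print(parse_line(line))                         # every field stays a string
print(parse_line(line, casts={1: int, 2: int})) # date and hour become integers
```

Without casts, the date and hour remain strings, which is fine if you only filter or group on them. Casting matters once you need numeric comparisons or arithmetic.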