Member since: 05-10-2017
Posts: 10
Kudos Received: 1
Solutions: 0
06-07-2018
02:35 PM
Hi Everyone, I would appreciate your responses to the following problem.

Problem Statement:
1. I have two zip files (each about 1 GB) stored in a directory on a Windows server.
2. Each zip file contains 42 .dat files (5 files in the 3-9 GB range, 3 files larger than 128 MB, i.e. larger than the HDFS block size, and the rest only KBs in size). The files contain pipe-delimited records that are SQL Server table dumps; each file holds the dump of one table.
3. The incremental load runs daily. The files can also be pushed to an SFTP location.
4. I need to build an automated data ingestion pipeline for this. Can anyone suggest the best way to do it?

Solution Approach (what I could think of):
1. Have the zipped files pushed to an SFTP server, then run a Python script to pull them onto the edge node.
2. Unzip both zip files on the edge node's local file system and run a shell/Python script that picks the files one by one and loads them into HDFS with hdfs dfs -put (see the sketch below).
3. Create Hive tables on top of the data, to be consumed by business users via Impala/Power BI.
4. NiFi is out of scope due to some business constraints.

Questions:
1. What is the best way to handle the zip files: ask the source system to deliver them unzipped to SFTP, or unzip them on the edge node's local file system? Unzipping both zip files produces 84 files totaling about 40 GB. Note that I would be creating 84 Hive tables, one per file, and appending data to each on every incremental feed.
2. How should I handle the KB-sized small files? I don't think merging is an option, since each is an individual SQL table dump with its own Hive table, and the business wants to process them daily, especially since they use Alteryx for self-service ETL.
3. I am not sure how fast it will be to first land the data on the edge node and then use hdfs dfs -put, given the file sizes. Please suggest a better alternative.

Note: the cluster is hosted on AWS.
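To make step 2 of the approach concrete, here is a minimal sketch of the pull-unzip-load loop. Everything specific in it is an assumption, not part of the original post: the SFTP host, user, key path, staging and HDFS directories are placeholders, and it assumes the paramiko library is installed on the edge node. It is one possible shape for the script, not a definitive implementation.

```python
import os
import subprocess
import zipfile

import paramiko  # assumed to be available on the edge node

SFTP_HOST = "sftp.example.com"         # hypothetical host
SFTP_USER = "ingest_user"              # hypothetical credentials
SFTP_KEY = "/home/ingest/.ssh/id_rsa"
REMOTE_DIR = "/incoming"               # where the source system drops the zips
LOCAL_STAGE = "/data/staging"          # edge-node local file system
HDFS_BASE = "/data/raw"                # HDFS landing area, one subdir per table


def pull_zips():
    """Download every .zip from the SFTP drop directory to the edge node."""
    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username=SFTP_USER,
                      pkey=paramiko.RSAKey.from_private_key_file(SFTP_KEY))
    sftp = paramiko.SFTPClient.from_transport(transport)
    local_paths = []
    for name in sftp.listdir(REMOTE_DIR):
        if name.endswith(".zip"):
            local_path = os.path.join(LOCAL_STAGE, name)
            sftp.get(f"{REMOTE_DIR}/{name}", local_path)
            local_paths.append(local_path)
    sftp.close()
    transport.close()
    return local_paths


def unzip_and_load(zip_path):
    """Unzip locally, then push each .dat file into its table's HDFS directory."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(LOCAL_STAGE)
        for member in zf.namelist():
            if not member.endswith(".dat"):
                continue
            table = os.path.splitext(os.path.basename(member))[0]
            hdfs_dir = f"{HDFS_BASE}/{table}"
            subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
            subprocess.run(["hdfs", "dfs", "-put", "-f",
                            os.path.join(LOCAL_STAGE, member), hdfs_dir],
                           check=True)


if __name__ == "__main__":
    for zip_file in pull_zips():
        unzip_and_load(zip_file)
```

One reason for the per-table HDFS directories in the sketch: each of the 84 Hive tables can then be defined as an external table whose LOCATION points at its own directory, so daily incremental files simply accumulate there without any extra load step.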
Labels: Apache Hadoop
05-11-2017
09:09 PM
Thanks Aver for replying. Initially the data would be dumped into HDFS and, after processing, into HBase (which I am assuming will be less than 2 TB).
05-10-2017
09:19 PM
Hi Everyone, We have a requirement to migrate data from an ODS (plus some social media, web analytics, etc.) into Hadoop, for which we need to create a cluster. Details below:

- It will be Cloudera Enterprise edition, deployed on Azure.
- Initial expected data volume is 7.5 TB (including a replication factor of 3 and 20% overhead; the storage math is sketched below).
- The incremental load is expected to be 1 GB/day.
- We are thinking of having Sqoop, Hive, Oozie, Flume, Spark, Kafka, and HBase as well.
- The initial workload will be mainly data import and ETL (Spark). Later there could be some analytics use cases involving classification, recommendation algorithms, etc.

I have come up with the following sizing for the production environment (NN = NameNode, JN = JournalNode, RM = ResourceManager, ZK = ZooKeeper, CM = Cloudera Manager):

Node Type               | Disk in TB (7200 RPM)                                | RAM (GB) | Cores
NN + JN + RM + ZK       | 1 (OS) + 2 (FSImage & edit logs) + 1 (JN) + 1 (ZK)   | 32       | 14
Standby NN + JN         | Same as NN                                           | 32       | 14
Edge + CM               | 1                                                    | 14       | 4
Cloudera Director node  | 1                                                    | 14       | 4
Data Nodes (4 x 3 TB)   | 3 disks of 1 TB per node (12 TB total)               | 32       | 8

(One of the DataNodes will also act as a JN.)

Questions:
1. Can anyone please confirm whether I need to change anything?
2. Is it mandatory to have a separate RM node in production? If yes, what should its configuration be?
3. Can I have Director on the edge node along with Cloudera Manager?
4. Also, please suggest what I should scale down to set up a dev environment.

Thanks, Gaurav
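For anyone checking the 7.5 TB figure, below is a minimal sketch of the usual raw-to-provisioned storage arithmetic (raw data x replication factor x overhead factor). The ~2 TB raw-data value is back-calculated from the post, not stated by the author, and the 12-month growth horizon is purely an illustrative assumption.

```python
# Minimal sketch of storage sizing, assuming the common formula
#   provisioned = raw_data * replication_factor * (1 + overhead)
# The raw-data figure (~2.08 TB) is inferred from the post's 7.5 TB;
# the 1-year projection is an assumed illustration.

REPLICATION_FACTOR = 3
OVERHEAD = 0.20              # temp/intermediate space, as in the post
DAILY_INCREMENT_GB = 1.0     # 1 GB/day incremental load


def provisioned_tb(raw_tb: float) -> float:
    """Storage to provision for a given amount of raw data."""
    return raw_tb * REPLICATION_FACTOR * (1 + OVERHEAD)


# Back out the raw data implied by the post's 7.5 TB initial volume.
initial_provisioned_tb = 7.5
raw_tb = initial_provisioned_tb / (REPLICATION_FACTOR * (1 + OVERHEAD))
print(f"Implied raw data: {raw_tb:.2f} TB")            # ~2.08 TB

# Project one year of daily increments (assumed horizon).
raw_after_year_tb = raw_tb + DAILY_INCREMENT_GB * 365 / 1024
print(f"Provisioned after 1 year: {provisioned_tb(raw_after_year_tb):.2f} TB")
# ~8.78 TB, still below the 12 TB of raw disk across the 4 proposed DataNodes
```

This only covers the storage dimension; RAM and core counts depend on the Spark/ETL workload mix rather than on data volume alone.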
Labels: Cloudera Director, Cloudera Manager