I have some questions about migration, data modeling, and performance with Hadoop/Impala. Any help or advice would be much appreciated.
1. How to migrate an Oracle application to Cloudera Hadoop/Impala

1.1 How can we replace Oracle stored procedures in Impala, or with M/R or a Java/Python application? For example, the original SP takes several parameters and contains several SQL statements.
1.2 How can we replace unsupported or complex Oracle SQL, such as OVER (PARTITION BY ...), in Impala? Are there any existing examples or Impala UDFs?

1.3 How should we handle update operations, since part of the data has to be updated? For example: use a data timestamp? Use a storage model that supports updates, such as HBase? Or delete all the data/partition/directory and insert it again (INSERT OVERWRITE)?
2. Data store model, partition design, and query performance

2.1 How should we choose between an Impala internal table and an external table backed by CSV, Parquet, or HBase? For example, we have several kinds of data: large existing data imported from Oracle into Hadoop, new business data written into Hadoop, data computed in Hadoop, and frequently updated data in Hadoop. How should we choose the data model for each? Does anything need special attention if these different kinds of data have to be joined? We have XX TB of data from Oracle; do you have any suggestions about the file format, such as CSV or Parquet? After a calculation, should we load the results into an Impala internal table or just write them to HDFS? If these kinds of data can be updated, how should we take that into account?

2.2 How should we partition the table or external table when joining? For example, there is a huge amount of sensor data, and each record includes the measurement data, the acquisition timestamp, and region information. We need to:
1. Calculate measurement data by region.
2. Query a series of measurements during a certain time interval for a specific sensor or region.
3. Query a specific sensor's data from the huge volume of data across all time.
4. Query data for all sensors on a specific date.

Could you please give us some suggestions on how to set up the partitions for an internal table and the directory structure for an external table (CSV)? In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide on this?
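
To make the two candidate layouts concrete, here is a minimal Python sketch of how a record's partition directory would be derived under each scheme, and which directories a date-range query would have to touch. The function names (`flat_partition_path`, `nested_partition_path`, `paths_for_range`) are hypothetical and exist only for illustration; they are not part of Impala or HDFS.

```python
from datetime import date, timedelta


def flat_partition_path(day: date, area: str) -> str:
    """Layout A: a single date key, e.g. date=20090101/area=BEIJING."""
    return f"date={day:%Y%m%d}/area={area}"


def nested_partition_path(day: date, area: str) -> str:
    """Layout B: separate keys, e.g. year=2009/month=01/day=01/area=BEIJING."""
    return f"year={day:%Y}/month={day:%m}/day={day:%d}/area={area}"


def paths_for_range(start: date, end: date, area: str, layout) -> list[str]:
    """List the partition directories covering start..end (inclusive) for one area."""
    days = (end - start).days + 1
    return [layout(start + timedelta(days=i), area) for i in range(days)]


if __name__ == "__main__":
    d = date(2009, 1, 1)
    print(flat_partition_path(d, "BEIJING"))
    print(nested_partition_path(d, "BEIJING"))
    # Directories a query for 1-3 January 2009 in BEIJING would read:
    for p in paths_for_range(d, date(2009, 1, 3), "BEIJING", nested_partition_path):
        print(p)
```

Both layouts produce one leaf directory per day and area; they differ only in how many partition key columns the table exposes and how many directory levels sit above each leaf.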