Since Impala doesn't support DELETE, I assume it differs from Hive. Can I delete/update data via Hive instead? Or is the fundamental table/data structure of data inserted by Impala different from a Hive table?
There is no difference per se. In both cases the INSERT results in files in a directory in HDFS, or rows in an HBase table. And in most cases you can query/load from either Impala or Hive and use them interchangeably, e.g. load via Hive, query via Impala, or vice versa.
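As a minimal sketch of that interchangeability (the table, columns, and HDFS path below are hypothetical), a table created and loaded through Hive can be queried from Impala unchanged, since both engines read the same underlying HDFS files:

```sql
-- In the Hive shell: create a table and load data into it
CREATE TABLE sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/etl/sales.csv' INTO TABLE sales;

-- In impala-shell: tell Impala about the new table, then query the same files
INVALIDATE METADATA sales;
SELECT COUNT(*) FROM sales;
```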
I understand Impala uses bloom filters to build its hashes; is that why DELETE isn't possible? So if we delete data via Hive, what happens on the Impala side? I think that is my real question. Thanks!
No, I am using Hive. I am doing some benchmarking and testing for data warehousing. What I did was import data from MySQL into Hive via Sqoop, and then query the data via Impala. To simplify my questions...
1. Will INSERTing data via Impala versus other methods into Hive tables have a different impact on query performance?
I didn't test this setup, but I'm just wondering whether there is any fundamental difference.
2. I just realized that Hive doesn't support update/delete on a single record. :(
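For context on that limitation, the common workaround in Hive at the time was to rewrite the affected table (or partition) without the unwanted rows using INSERT OVERWRITE; this is a sketch with hypothetical table and column names, not a true row-level delete:

```sql
-- Simulate a single-row delete by rewriting the table without that row
INSERT OVERWRITE TABLE sales
SELECT id, amount FROM sales WHERE id <> 42;
```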
Understanding the stack here is key to your question. Both Impala and Hive operate on data in HDFS (or HBase). How the data gets to HDFS matters much less than the format the data is in. There are a number of different file formats supported by both Impala and Hive so it is worth the time to understand the options.
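As a sketch of those format options (table and column names are hypothetical), the file format is simply a clause on the table definition, and each engine reads whatever format the table declares:

```sql
-- The same logical schema can be stored in different file formats
CREATE TABLE events_text (id INT, msg STRING) STORED AS TEXTFILE;
CREATE TABLE events_seq  (id INT, msg STRING) STORED AS SEQUENCEFILE;
CREATE TABLE events_rc   (id INT, msg STRING) STORED AS RCFILE;
```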
The Impala docs cover all of this.
Neither Hive nor Impala has a DELETE statement, so that's not a relevant consideration for how to do the INSERT.
If some practical difference makes it preferable to use Hive for the INSERT, that's fine. For example, maybe you call some UDFs as part of an INSERT ... SELECT. (Of course, if in the future you can run those UDFs through Impala, I would recommend switching to Impala for the INSERT.) Also, today you would use Hive to insert into Avro, SequenceFile, or RCFile tables.
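A sketch of that Hive-only case (the JAR path, UDF class, function name, and tables below are all hypothetical):

```sql
-- In the Hive shell: register a custom UDF, then use it while inserting
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize AS 'com.example.NormalizeUDF';

-- Apply the UDF during INSERT ... SELECT into an Avro-backed table
INSERT INTO TABLE customers_avro
SELECT id, normalize(name) FROM customers_staging;
```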
I could imagine there being more or less resource usage in Hive or Impala, depending on factors like file format and partitioning; I haven't studied those in depth. With Parquet support in Hive so new, it might be easier right now to get data into Parquet tables through Impala.
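For example, populating a Parquet table entirely through Impala is straightforward (table names are hypothetical):

```sql
-- In impala-shell: create a Parquet table and fill it from an existing table
CREATE TABLE sales_parquet (id INT, amount DOUBLE) STORED AS PARQUET;
INSERT INTO sales_parquet SELECT id, amount FROM sales_text;
```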
When all else is equal, and you have the choice to use either Impala or Hive, I would suggest doing the insert through Impala because whatever future improvements come along for the REFRESH / INVALIDATE METADATA experience will likely work more smoothly when DML statements like INSERT are done via Impala.
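To illustrate the metadata point above: when data or tables are changed outside Impala (e.g. via Hive), Impala has to be told explicitly, whereas inserts done through Impala need no such step on that node (the table name is hypothetical):

```sql
-- In impala-shell, after new data was loaded into an existing table via Hive:
REFRESH sales;

-- After a brand-new table was created via Hive, reload its metadata entirely:
INVALIDATE METADATA sales;
```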