Member since: 01-29-2016
Posts: 38
Kudos Received: 28
Solutions: 0
03-04-2016
12:47 PM
2 Kudos
I need to apply authorization using Sentry, which requires creating local groups on the host where HiveServer2 is running. How do I find out which host HiveServer2 is running on?
Labels:
- Apache Hive
02-08-2016
03:26 PM
We are on CDH. I will have a look at the PPT. Can you also answer my other comment? It is at https://community.hortonworks.com/questions/14313/facing-issues-while-ingesting-data-into-hive.html
02-08-2016
02:50 PM
1. Can you tell me the URL of the presentation, so that I can increase the RAM? 2. I imported data from SQL Server into a Hive table without specifying any file format, and the import succeeded. Now I am trying to copy the data from that Hive table into another table that has the Parquet format defined at table creation, inserting into all the partitions that are possible from the combination of three columns. I used: insert into table t1 partition (c1,c2,c3) select * from t2. So this is a copy from one table to another (Parquet).
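For reference, a minimal sketch of what that dynamic-partition copy can look like in HiveQL; t1, t2 and c1/c2/c3 are the names from the post, while the SET lines are assumptions about what the session needs:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;  -- allow all partition columns to be dynamic

    -- Hive resolves c1, c2, c3 from the LAST three columns of the SELECT list,
    -- so t2's column order must end with the partition columns.
    INSERT INTO TABLE t1 PARTITION (c1, c2, c3)
    SELECT * FROM t2;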
02-08-2016
01:51 PM
I got your point; the load-balancer idea does not make sense. I was just thinking of breaking the data into small datasets so that a query only has to scan a smaller dataset to produce its output. I am moving data from a Hive table (staging, unpartitioned) to another table (production, partitioned). The staging table has 1.7 million rows, but the query is failing with the error: Java heap space. Do I need to increase the memory allocated to the JVM? Staging tables might grow to 5 million rows or more, so what would be a sensible value for the JVM memory?
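For context, a hedged sketch of the kind of per-session memory overrides one might try before rerunning the INSERT; the property names are standard MapReduce 2 settings, but the values are illustrative assumptions, not tuned recommendations:

    -- Raise the YARN container sizes and the child JVM heaps for this session only.
    SET mapreduce.map.memory.mb = 4096;
    SET mapreduce.map.java.opts = -Xmx3276m;      -- heap kept at roughly 80% of the container
    SET mapreduce.reduce.memory.mb = 4096;
    SET mapreduce.reduce.java.opts = -Xmx3276m;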
02-08-2016
01:42 PM
We are now loading our existing historical data into Hive. The major fact tables have around 2 million or more rows. Loading 1.7 million rows took 3 hours in a VirtualBox VM with 6 cores, 24 GB of RAM and a 128 GB disk. I got your point: the load-balancer column should be a dimension column that is mostly used for filtering.
02-08-2016
12:30 PM
Hey Benjamin, would it be good to put one extra column in the PARTITIONED BY clause, like PARTITIONED BY (MONTHS INT, DAY INT, LOADBALANCER INT)? The LOADBALANCER column in the source database, which is SQL Server, will have the value 1 for a normal load. If a source table carries a heavier load, the LOADBALANCER column will take more values, like 1, 2, ... We can create a stored procedure in SQL Server that updates the OLTP LOADBALANCER values whenever we feel we need to partition the data beyond month and day. How would this hold up in the long run, compared with dropping the existing dataset and recreating it?
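A rough DDL sketch of the scheme described above; only the PARTITIONED BY clause reflects the post, while the table name and data columns are hypothetical placeholders:

    CREATE TABLE fact_sales (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (months INT, `day` INT, loadbalancer INT)  -- the extra LOADBALANCER level
    STORED AS PARQUET;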
02-08-2016
10:44 AM
1 Kudo
Hi, we have some fact tables that contain a large number of rows, currently partitioned by month. It is quite likely that in the future we will need to partition by week number as well. Since the UPDATE command is missing in Hive, whenever we need to update historical data we just drop the affected partition and create a new one, so partitioning is necessary for us. I am wondering: is applying partitioning on the existing columns of a Hive table possible? And how do we handle the situation where partitioning has to be applied dynamically based on the load? I think dropping and recreating the table for most such requirements is not a good approach.
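As an illustration of the drop-and-reload pattern mentioned above, a minimal sketch assuming, for simplicity, a hypothetical fact_sales table partitioned only by months, fed from a hypothetical staging_sales table:

    -- Replace one month of history instead of updating rows in place.
    ALTER TABLE fact_sales DROP IF EXISTS PARTITION (months = 201601);

    INSERT OVERWRITE TABLE fact_sales PARTITION (months = 201601)
    SELECT order_id, amount        -- the partition value is fixed, so it is not selected
    FROM staging_sales
    WHERE months = 201601;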
Labels:
- Apache Hive
01-29-2016
12:56 PM
We cannot create the same table once per client, like T_User_Client1, T_User_Client2. What if we create a single table T_User and add a new column, clientId, to identify the client? We would add this column to all tables, or introduce some sort of surrogate key to keep all the data unique. In that case we would have a single schema, and if the schema has to change, not much work is needed.
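A hedged sketch of that single-table idea; T_User comes from the post, everything else is a placeholder:

    -- One shared table for every client; client_id plus the source key stays unique.
    CREATE TABLE t_user (
      client_id INT,       -- new column identifying the owning client
      user_id   BIGINT,    -- primary-key value from each client's SQL Server database
      user_name STRING
    )
    STORED AS PARQUET;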
01-29-2016
12:14 PM
Hi Neeraj, yes, we are not comparing SQL Server and Hive. Could you tell us what would happen if we load the data of all clients into a single database? Would there be any disadvantages to doing this? We are planning to compare data across clients in the future, and keeping everything in the same database would let all clients share the same schema. Thanks
01-29-2016
11:43 AM
1 Kudo
We are creating a POC on the Hadoop framework. We want to load the data of multiple clients into Hive tables. As of now, we have a separate SQL Server database for each client; that infrastructure will remain in place for OLTP, while Hadoop will be used for OLAP.

We have some primary dimension tables that are the same for each client; every client database has the same schema, and these tables reuse the same primary key values. Until now this was fine, because each client had its own database. But if we load multiple clients' data directly into the same Hive tables through a Sqoop job, we will end up with multiple rows carrying the same primary key value. I am thinking of using a surrogate key in the Hive tables; Hive does not support auto-increment, but this can be achieved with a UDF. We do not want to modify the SQL Server data, as it is running production data.

a. What is the standard/generic way to load multiple clients' data into the Hadoop ecosystem? We never want the data of different clients to get mixed, and referential constraints are also missing in Hive.
b. How can the primary key of a SQL Server table be mapped easily to a Hadoop Hive table, so that we can pick data by client name?
c. How can we ensure that one client is never able to see another client's data?

Thanks
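To make questions (b) and (c) concrete, a speculative sketch of one way the per-client SQL Server key could be kept unambiguous after the merge, using a hypothetical dim_customer table partitioned by client name; all names here are assumptions, not the poster's schema:

    -- A partition per client keeps the datasets physically separate (question c),
    -- and the concatenated key stays unique across clients (question b).
    INSERT INTO TABLE dim_customer PARTITION (client_name = 'client1')
    SELECT concat('client1', '-', CAST(customer_id AS STRING)) AS surrogate_key,
           customer_id,
           customer_name
    FROM staging_customer_client1;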
Labels:
- Apache Hadoop
- Apache Hive