Member since: 01-29-2016
Posts: 38
Kudos Received: 28
Solutions: 0
03-04-2016
12:47 PM
2 Kudos
I need to apply authorization using Sentry, which requires creating the local groups on the host where HiveServer2 is running. How can I find out which host HiveServer2 is running on?
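For reference, a minimal sketch of locating the HiveServer2 host on a CDH-style install (the config path and the default port are assumptions and may differ per cluster):
# The bind host is recorded in the client Hive configuration
grep -A1 'hive.server2.thrift.bind.host' /etc/hive/conf/hive-site.xml
# Or, on a candidate host, check whether anything listens on the default HiveServer2 port (10000)
sudo netstat -tlnp | grep 10000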
03-03-2016
12:11 PM
2 Kudos
For the sake of building a POC, I have created local groups in the policy file sentry-provider.ini:

[databases]
# Defines the location of the per-DB policy file for the customers DB/schema
dw_arcticblue_staging = hdfs://quickstart.cloudera:8020/home/cloudera/Desktop/dw_arcticblue_staging_policy.ini

[groups]
# Assigns each Hadoop group to its set of roles
group1 = analyst_role
group2 = admin_role
group3 = analyst_role

[users]
user1 = group1, group2, group3
user2 = group2, group3

[roles]
# The URIs below define a landing skid which
# the user can use to import or export data from the system.
# Since the server runs as the user "hive", files in that directory
# must either have the group hive and read/write set or
# be world read/write.
analyst_role = server=server1->db=dw_arcticblue_staging->table=*->action=select
admin_role = server=server1->db=dw_arcticblue_staging->table=*->action=select, \
    server=server1->db=dw_arcticblue_staging->table=*->action=insert
# Implies everything on server1 -> customers. Privileges for
# customers can be defined in the global policy file even though
# customers has its own policy file. Note that the privileges from
# both the global policy file and the per-DB policy file
# are merged. There is no overriding.
arcticblue_admin_role = server=server1->db=dw_arcticblue_staging
# Implies everything on server1.
admin_role = server=server1

The per-DB policy file, DW_ArcticBlue_staging_policy.ini:

[groups]
group1 = dw_all_access

[roles]
dw_all_access = server=NirvanaServer->db=dw_arcticblue_staging->table=*->action=insert, \
    server=NirvanaServer->db=dw_arcticblue_staging->table=*->action=select

I have updated all configurations in Hive and YARN as recommended. A few things I don't understand:
1. How do I test whether user1 and user2 are authorized or not? I want to test from the Hive/Impala CLI. user1, user2, etc. created in the policy file are not real users that can log in.
2. We need to put the sentry-provider.ini file on HDFS. How can I create an .ini file in Linux? I did not find any relevant document.
Any help on how to test authorization on Hive using Sentry?
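A minimal sketch of the two missing pieces, assuming the paths and host used above (an .ini file is just plain text, so any editor can create it):
# Write the policy file locally, then copy it to the HDFS location your
# Sentry configuration points at for the global policy file
vi sentry-provider.ini
hdfs dfs -mkdir -p /user/hive/sentry
hdfs dfs -put sentry-provider.ini /user/hive/sentry/sentry-provider.ini
# Policy-file authorization is enforced by HiveServer2, so test through it,
# connecting as an OS user that belongs to one of the mapped groups, e.g.:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n user1
Then run a SELECT and an INSERT against dw_arcticblue_staging for each user and check which statements are rejected.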
02-26-2016
11:02 AM
1 Kudo
Hi, on the staging database we receive updated data from a relational database. As HDFS/Hive is append-only, we need to delete the partitions whose data has been modified and replace them with partitions built from the updated data.

Say we have a T_UpdatedData table which is populated by Sqoop every day. If on any day we receive updated data spanning multiple partitions, we need to back up and then delete those partitions from T_ProductionData. Partitions are created on the Year, Month, and Account columns. The backup should be saved in T_HistoryData.

How can I select the affected partitions from T_UpdatedData and copy those partitions from T_ProductionData to T_HistoryData? Inserting dynamic partitions can be done with the following syntax:
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2, ...)
Two ways to select the distinct partition data could be:
1. SELECT * FROM T_ProductionData WHERE (year, month, account) IN (SELECT DISTINCT year, month, account FROM T_UpdatedData) -- supported by Oracle.
2. Using joins. A join could be more costly, as it will run against all of T_ProductionData.
What is the best practice to back up the partitions and insert the modified partitions? Thanks
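A rough HiveQL sketch of option 2, assuming T_HistoryData has the same columns and the same (year, month, account) partitioning as T_ProductionData (the multi-column IN form is Oracle syntax, so a join is the usual substitute in Hive):
-- Dynamic partitioning must be enabled for a multi-column partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Back up only the partitions that appear in T_UpdatedData
INSERT INTO TABLE T_HistoryData PARTITION (year, month, account)
SELECT p.*
FROM T_ProductionData p
LEFT SEMI JOIN
  (SELECT DISTINCT year, month, account FROM T_UpdatedData) u
ON (p.year = u.year AND p.month = u.month AND p.account = u.account);
The affected partitions could then be dropped from T_ProductionData with ALTER TABLE ... DROP PARTITION before reloading them from T_UpdatedData.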
02-23-2016
02:17 PM
@Neeraj Sabharwal I am still facing some weird issues. Copying data from one existing table to a new table with CREATE TABLE clone AS SELECT * FROM t_table works just fine. On the other hand, copying data from an existing table into another existing table with INSERT INTO table_clone SELECT column1, col2, ... FROM t_table throws a heap space error. The source table is the same in both cases. I have tried different sizes for the container, mapper, and reducer (mapreduce.map.java.opts -Xmx5124m and so on) but it throws the same error every time. A few of the settings are:
yarn.scheduler.minimum-allocation-mb: 4 GB
yarn.scheduler.maximum-allocation-mb: 6 GB
Container memory: 18 GB
mapreduce.map.memory.mb: 6 GB
mapreduce.reduce.memory.mb: 8 GB
mapreduce.map.java.opts: -Xmx5124m
mapreduce.reduce.java.opts: -Xmx6144m
I am not able to copy data from one non-partitioned table to another non-partitioned table, though the main requirement is to copy from a non-partitioned table to a partitioned one.
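For what it is worth, the same sizes can also be tried per session from the Hive CLI/Beeline right before the failing statement (a sketch using the values listed above; table and column names follow the example):
SET mapreduce.map.memory.mb=6144;
SET mapreduce.map.java.opts=-Xmx5124m;
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.reduce.java.opts=-Xmx6144m;
INSERT INTO TABLE table_clone SELECT column1, col2 FROM t_table;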
02-18-2016
02:32 PM
1 Kudo
I have tried setting SET mapreduce.map.memory.mb=8192; and SET mapreduce.reduce.memory.mb=8192; in the Hive CLI. In another attempt I also tried export HADOOP_CLIENT_OPTS='-Xmx8g'. Inserting the data from one table to another still throws the same error. Any other pointers?
02-18-2016
12:25 PM
It is showing a Java heap space fatal error. Excerpt from the log:
2016-02-18 03:57:10,320 FATAL [IPC Server handler 0 on 44265] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1455271075351_0034_m_000000_3 - exited : Java heap space
I am trying to copy just 16,629 rows from the source to the destination partitioned table. I think the partitioning is causing the issue. We have created partitions on Year, Month, Day, and Account (the columns used in most of the filters). It is trying to create a total of 2,578 partitions for 16,629 rows; I think this is the cause of the Java heap space error.
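A quick way to confirm that number before running the insert is to count the distinct partition-column combinations in the source (a sketch; the source table name is a placeholder):
SELECT COUNT(*) AS partitions_to_create
FROM (SELECT DISTINCT year, month, day, account FROM t_source) p;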
02-18-2016
12:05 PM
1 Kudo
Hi, I have been copying data from a non-partitioned Hive table to a partitioned Hive table, but it fails with the following error and log message.
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Log:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:2
INFO : Submitting tokens for job: job_1455271075351_0034
INFO : The url to track the job: http://quickstart.cloudera:8088/proxy/application_1455271075351_0034/
INFO : Starting Job = job_1455271075351_0034, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1455271075351_0034/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1455271075351_0034
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
INFO : 2016-02-18 03:56:36,684 Stage-1 map = 0%, reduce = 0%
INFO : 2016-02-18 03:56:45,448 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.35 sec
INFO : 2016-02-18 03:57:10,663 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.35 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 350 msec
ERROR : Ended Job = job_1455271075351_0034 with errors
02-16-2016
01:37 PM
1 Kudo
Hi, I have imported SQL Server data through Sqoop into HDFS with --target-dir /user/hive/warehouse/db_abc.db/t_tableName. I imported into HDFS because I need to import using the --query switch. Now I want to load the data from HDFS into Hive. I have tried LOAD DATA INPATH '/user/hive/warehouse/db_abc.db/t_tableName' INTO TABLE db_abc.t_staging_table but the command throws an "invalid path: no files matching path" error. How can I load data from HDFS into Hive, or from SQL Server into Hive directly through Sqoop? I did not find any way to import directly into Hive when the --query switch is used in Sqoop. Thanks
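For reference, a sketch of what a free-form-query import straight into Hive could look like (connection string, query, and paths are placeholders; Sqoop requires $CONDITIONS in the query and a --target-dir when --query is used):
# First confirm whether the earlier import actually left files at the path LOAD DATA expects
hdfs dfs -ls /user/hive/warehouse/db_abc.db/t_tableName

# One-step variant: free-form query imported directly into a Hive table
sqoop import \
  --connect "jdbc:sqlserver://<host>:1433;database=db_abc" \
  --username sa --password '***' \
  --query 'SELECT * FROM t_tableName WHERE $CONDITIONS' \
  --target-dir /tmp/t_tableName_staging \
  --hive-import --hive-table db_abc.t_staging_table \
  -m 1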
02-12-2016
02:10 PM
1 Kudo
@Artem Ervits We are running 4 jobs, each with the switch -m 2. Two jobs have their source database on Server1 and the other two on Server2. As I can see in the job browser, 3 of the 4 jobs are running and 1 job is queued. A total of 6 mappers are working at the same time on a 6-processor machine. SQL Server Profiler is showing no incoming queries on the source database, so I don't think the database instance is choking.
02-12-2016
11:02 AM
1 Kudo
@Neeraj Sabharwal: I don't find any directory like hdfs://quickstart.cloudera:8020/user/cloudera/tableName on Linux. I am new to the Cloudera QuickStart VM and to Linux too.
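The path in question is an HDFS path rather than a local Linux directory, so it only shows up through the HDFS client, e.g.:
hdfs dfs -ls /user/cloudera/tableName
# or, with the full URI
hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/cloudera/tableName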
02-12-2016
10:36 AM
Earlier I had imported the data from SQL Server at the same time. YARN is also set up, and FIFO is the default Capacity Scheduler behavior. I think the issue is something else.
02-12-2016
10:33 AM
1 Kudo
@Neeraj Sabharwal Now when I try to run sqoop import --hive-import --hive-table with the other parameters again, it throws: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/tableName already exists. When I try to access the table specified in the --hive-table switch in the Hive database, there is no such table. I think only half the task got done.
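A sketch of clearing the leftover output directory from the failed run before retrying (the path is taken from the error message above):
hdfs dfs -rm -r /user/cloudera/tableName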
02-12-2016
10:22 AM
1 Kudo
@Neeraj Sabharwal I am not finding any errors in /var/log/sqoop2/sqoop.log.
02-12-2016
10:04 AM
1 Kudo
YARN is set up. When I restarted YARN, all jobs failed; 4 of them were showing 0% map, 0% reduce. Do you want the Sqoop log or the YARN log?
02-12-2016
07:52 AM
1 Kudo
Hi, we are performing some stress testing: how fast can data be ingested when running 5 parallel imports? We have many clients, so during the POC we would like to check how much load we can put on data ingestion in parallel. While I was importing the data from a single SQL Server database, it worked fine. Now, while trying to import data from 5 SQL Server databases on the same SQL Server instance using 5 different sqoop import commands, no job finishes; all show an import failure message. We have an Oracle VM VirtualBox setup with the following resources:
1. RAM: 24 GB
2. Disk: 256 GB
3. Processors assigned: 6 (hyperthreading enabled)
What could be the potential reasons for the ingestion failures? Do I need to assign more resources to the VM? Moving to a real physical cluster is not an option for us yet. The Cloudera forum is mostly unresponsive, which is why I have been posting questions here.
02-10-2016
09:39 AM
Hi, we are connecting to the Hadoop Cloudera CDH distribution through the ODBC driver. Queries are generated from SSRS. A few queries work fine with parameters supplied through the ? placeholder; a few other queries with parameters supplied through ? do not execute. Error:
[HY000][Cloudera][ImpalaODBC] (100) Error while executing a query in Impala: [HY000] : AnalysisException: syntax error in line 1: where Date >= ? and Date <= ? ^ Encountered: Unexpected character. Expected: CASE... Exception: syntax error.
If I remove where Date >= ? and Date <= ?, or supply hard-coded values, the query works perfectly. A few other queries with the same filter also work perfectly.
1. What should the recommended investigation points be?
2. Where could I get the exact query as Impala received it, to investigate whether the generated query is correct or not?
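To make the failure mode concrete, the error text suggests the ? placeholders reached Impala unsubstituted. A sketch of the difference, with table and column names purely illustrative:
-- What the error implies Impala received (placeholders still present):
SELECT col1 FROM t_sales WHERE `Date` >= ? AND `Date` <= ?;
-- What a working, substituted statement would look like:
SELECT col1 FROM t_sales WHERE `Date` >= '2016-01-01' AND `Date` <= '2016-01-31';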
02-10-2016
07:17 AM
1 Kudo
@Geoffrey Shelton Okot Exporting and importing data is working well. I am facing issues with incremental-mode import against SQL Server; it's still an open issue: https://issues.apache.org/jira/browse/SQOOP-2421
02-09-2016
12:04 PM
Hi Benjamin, as Sqoop incremental mode is not working against SQL Server, the Sqoop query clause is the option left. We need to run the Sqoop query for the last date if there are no changes to the historical data in SQL Server. In case of changes to historical data, the query needs to run for the historical dates where the changes occurred. So, depending on the changes in the historical data, we need different date values in the WHERE clause of the query. Can Oozie variables or parameters handle this situation?
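A sketch of the query-clause approach with the date bound supplied from outside (here as a shell variable; under Oozie it would be a workflow parameter; table and column names are placeholders):
LAST_DATE='2016-02-08'
sqoop import \
  --connect "jdbc:sqlserver://<host>:1433;database=abc" \
  --username sa --password '***' \
  --query "SELECT * FROM t_fact WHERE ModifiedDate >= '${LAST_DATE}' AND \$CONDITIONS" \
  --target-dir "/staging/t_fact/${LAST_DATE}" \
  -m 1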
02-08-2016
03:26 PM
We are on CDH. I will have a look at the PPT. Can you also answer my other comment at https://community.hortonworks.com/questions/14313/facing-issues-while-ingesting-data-into-hive.html ?
02-08-2016
02:50 PM
1. Can you tell me the URL of the presentation so that I can increase the RAM accordingly?
2. I have imported data from SQL Server into a Hive table without specifying any file format, and the data imported successfully. Now I am trying to copy the data from that Hive table to another table, which had the Parquet format defined at table creation. I am trying to copy into all the partitions that are possible based on the combinations of the three columns. I have used INSERT INTO TABLE t1 PARTITION (c1, c2, c3) SELECT * FROM t2, i.e. copying from one table to another (Parquet) table.
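For reference, the statement with dynamic partitioning enabled for the session (a sketch; the partition columns c1, c2, c3 must be the last columns in the SELECT list):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE t1 PARTITION (c1, c2, c3)
SELECT * FROM t2;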
02-08-2016
01:51 PM
I get your point; the load-balancer idea does not make sense. I was just thinking of breaking the data into small datasets so that a query only has to check a smaller dataset to prepare the output. I am moving data from a Hive table (staging, unpartitioned) to another table (production, partitioned). The staging table has 1.7 million rows, but the query is failing with the error: Java heap space. Do I need to increase the memory allocated to the JVM? Staging tables might have 5 million or more rows, so what would be a sensible value for the memory allocated to the JVM?
02-08-2016
01:42 PM
We are now loading our existing historical data into Hive. The major fact tables have around 2 million or more rows. Loading 1.7 million rows took 3 hours in a VirtualBox VM with 6 cores, 24 GB RAM, and a 128 GB disk. I get your point: the load-balancer column should be a dimension column that is mostly used for filtering.
02-08-2016
12:36 PM
As you suggested, I am using the Sqoop saved job from https://community.hortonworks.com/questions/13132/best-practice-to-import-data-from-sql-server-to-hi.html. If I use the WHERE condition, then I have to supply the value in the WHERE clause manually. I have to create 100 tables for each client, and it would be a nightmare to run each job manually and supply the value each time. Do we have any way to do this in a friendlier way? Thanks
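A small wrapper script is one way to avoid running 100 jobs by hand (a sketch; the table list, connection string, and date column are placeholders):
LOAD_DATE='2016-02-08'
for TBL in table1 table2 table3; do
  sqoop job --create "import_${TBL}" -- import \
    --connect "jdbc:sqlserver://<host>:1433;database=abc" \
    --username sa --password-file /user/cloudera/.sqlserver.password \
    --table "${TBL}" \
    --where "ModifiedDate >= '${LOAD_DATE}'" \
    --target-dir "/staging/${TBL}"
done
# Each saved job can then be executed from a scheduler, e.g.:
# sqoop job --exec import_table1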
02-08-2016
12:30 PM
Hey Benjamin, would it be good if I put one extra column in the partition clause, e.g. PARTITIONED BY (months INT, day INT, loadbalancer INT)? The loadbalancer column in the source database (SQL Server) will have the value 1 for a normal load. If a source table gets a larger data load, then the loadbalancer column will take more values, like 1, 2, ... We can create a stored procedure in SQL Server that updates the OLTP loadbalancer column values in case we feel we need to partition the data further, beyond month and day. How would this work out in the long run, compared with dropping the existing dataset and recreating it?
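In DDL terms the idea would look roughly like this (a sketch; the table name and data columns are illustrative):
CREATE TABLE t_fact (
  account_id INT,
  amount     DOUBLE
)
PARTITIONED BY (months INT, day INT, loadbalancer INT)
STORED AS PARQUET;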
02-08-2016
10:44 AM
1 Kudo
Hi, we have some fact tables which contain a large number of rows. We have partitioning applied on the month right now. It is quite likely that in the near future we will need to partition by week number as well. As the UPDATE command is missing in Hive, whenever there is a situation where we must update historical data, we just drop the partition and create a new one, so partitioning is necessary. I am wondering: is applying partitioning on the existing columns of a Hive table possible? How do we handle the situation where we have to apply partitioning dynamically based on the load? I think dropping and recreating the table for most such requirements is not a good thing.
02-05-2016
01:39 PM
1 Kudo
I am trying to execute the following job:
sqoop job --create import_job_staging --hive-import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect "jdbc:sqlserver://asdf.com:1433;database=a" --table asdf --hive-table db.t_a --username sa --password a?
It throws an "Error parsing arguments for job" exception. When I try the --import switch instead of the --hive-import switch, it works fine. Any idea how I could make it work? I executed the job with the --import switch, but I don't know where the data was imported on HDFS or how to insert it into the Hive table.
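One thing worth trying (a sketch, not verified here): the saved-job form of Sqoop expects the tool name after a bare -- separator, with --hive-import passed as an ordinary option of that tool:
sqoop job --create import_job_staging -- import \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --connect "jdbc:sqlserver://asdf.com:1433;database=a" \
  --table asdf \
  --hive-import --hive-table db.t_a \
  --username sa --password 'a?'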
02-05-2016
12:35 PM
3 Kudos
Hi, I am new to the Hadoop ecosystem. For data ingestion into Hive, we are using sqoop import commands. Data ingestion populates the staging tables; we then need to clean up the data and insert it into the production Hive tables. I have written a Hive UDF to simulate an auto-increment feature, which works fine in the Hive shell. However, the Hive query takes very long to clean up the data and generate the auto-incremented number, while Impala queries are performing well. I am wondering if I can use the same Hive UDF in Impala. Is there any way to use the Hive UDF in the impala-shell to generate the auto-incremented number? Thanks
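Impala can register an existing Hive Java UDF from its jar with CREATE FUNCTION; a sketch, with the jar path and class name as placeholders (whether a stateful auto-increment UDF behaves the same under Impala's distributed execution is a separate question):
CREATE FUNCTION auto_increment() RETURNS BIGINT
LOCATION '/user/cloudera/udfs/hive-autoincrement.jar'
SYMBOL='com.example.hive.udf.AutoIncrementUDF';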
02-04-2016
08:08 AM
2 Kudos
Hi, we are ingesting data from SQL Server (OLTP) into Hive. Incremental updates for modified data are not working against SQL Server; it's still an open issue: https://issues.apache.org/jira/browse/SQOOP-2421 Is there any workaround to get the updated data into the Hive tables? Thanks