Member since: 06-02-2016
Posts: 15
Kudos Received: 12
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3053 | 01-10-2017 09:10 PM
| 3226 | 10-04-2016 10:48 PM
| 8036 | 07-13-2016 04:41 AM
09-12-2017
01:32 AM
@Vijay Parmar, I'd suggest running a few tests: concatenation, the temporary-table solution suggested by your DBAs, and whatever else you come up with. Once you get a feel for how the processing works, you'll arrive at the solution that works best for you.
09-10-2017
11:26 PM
@Vijay Parmar, you can concatenate Hive tables to merge small files together. This can happen while the table is active. The syntax is:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
See the Hive documentation for details.
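For instance, a minimal sketch of merging the small files in a single partition (the table and partition names below are hypothetical, and CONCATENATE assumes an ORC or RCFile table):
-- Hypothetical example: merge small files in one daily partition of an ORC table.
ALTER TABLE web_logs PARTITION (log_date = '2017-09-10') CONCATENATE;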
01-10-2017
09:26 PM
1 Kudo
@Neeraj Sabharwal, your ALTER TABLE statement should work. One question: do you place the data for each partition in a subdirectory with the partition name? Hive partitions exist as subdirectories. For example, your user table should have a structure similar to this:
/external_table_path/date=2010-02-22
/external_table_path/date=2010-02-23
/external_table_path/date=2010-02-24
And so on. The ALTER TABLE statement will create the directories as well as add the partition details to the Hive metastore. Once the partitions are created, you can simply drop the right file/s into the right directory (a sketch follows below). Cheers,
Steven.
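As an illustration, adding one of those partitions to an external table might look like the following (the table name is hypothetical, and the partition column is backticked since "date" is reserved in some Hive versions):
-- Hypothetical sketch: register an existing date subdirectory as a partition.
ALTER TABLE my_external_table ADD IF NOT EXISTS PARTITION (`date` = '2010-02-22')
LOCATION '/external_table_path/date=2010-02-22';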
01-10-2017
09:10 PM
1 Kudo
@sagar pavan, instead of using:
--target-dir '/user/tsldp/patelco/'
try:
--warehouse-dir '/user/tsldp/patelco/'
Each table will then be in a subdirectory under '/user/tsldp/patelco/' (sketch below). Cheers,
Steven.
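A minimal sketch of how the option slots into a Sqoop command (the connection string, credentials, and table name are placeholders; a MySQL source is assumed for illustration):
# Hypothetical sketch: with --warehouse-dir, each imported table lands in its own
# subdirectory, e.g. /user/tsldp/patelco/MyTable for the table below.
sqoop import \
  --connect 'jdbc:mysql://dbhost/patelco' \
  --username **** --password **** \
  --table MyTable \
  --warehouse-dir '/user/tsldp/patelco/'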
10-04-2016
10:48 PM
3 Kudos
@Mahesh Mallikarjunappa The HBase storage handler (serde) can be used to load data into HBase via Hive. A simple example follows:
-- Create a Hive-managed HBase table
CREATE TABLE MyHBaseTable (MyKey string, Col1 string, Col2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colfam:col1,colfam:col2")
TBLPROPERTIES ("hbase.table.name" = "MyNamespace:MyTable");

-- Insert data into it
INSERT INTO TABLE MyHBaseTable
SELECT SourceKey, SourceCol1, SourceCol2
FROM SourceHiveTable;

And from Pig, you can read from that same source and write to that same target using the HBase serde as follows:
pig -useHCatalog -f script.pig
Where script.pig is as below:
RawData = LOAD 'SourceHiveTable'
    USING org.apache.hive.hcatalog.pig.HCatLoader();
KeepColumns = FOREACH RawData
    GENERATE SourceKey, SourceCol1, SourceCol2;
STORE KeepColumns
    INTO 'hbase://MyNamespace:MyTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfam:col1,colfam:col2');
Note: you don't specify the key in the STORE statement - the first column is always the key. Hope this helps!
09-14-2016
11:04 PM
2 Kudos
@Arkaprova Saha I'm not sure about the --connection-manager option, but I have successfully performed a Sqoop import from Teradata to Avro using Teradata's JDBC driver as follows:
sqoop import --driver com.teradata.jdbc.TeraDriver \
  --connect 'jdbc:teradata://****/DATABASE=****' \
  --username **** --password **** \
  --table MyTable \
  --target-dir /****/****/**** \
  --as-avrodatafile \
  --num-mappers 1
Just ensure that the JDBC driver, terajdbc4.jar, is in your $SQOOP_LIB folder. For me, on HDP 2.4 that is /usr/hdp/current/sqoop-client/lib.
07-13-2016
04:41 AM
1 Kudo
@Emily Sharpe I believe your issue relates to the way Pig is processing the NULL Avro data. Rather than ignoring those NULL values, Pig passes the key and an empty value to HBase, which dutifully stores it. To avoid storing these values, filter them out. The following Pig code shows how to do this for a single key/value Avro source:
ImageAvro = LOAD '/path/to/RawAvroData'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', 'schema_file', '/path/to/AvroSchemaFile.avsc');
filteredImage = FOREACH (FILTER ImageAvro BY SIZE(ImageColumn) > 0) GENERATE KeyColumn, ImageColumn;
STORE filteredImage INTO 'hbase://namespace:table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col');

Similarly, you can identify the empty cells with the same FILTER operation. Here's how to save a list of keys that have an empty colFamily:col cell:
ImageHBase = LOAD 'hbase://namespace:table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col', '-loadKey true')
    AS (KeyColumn:chararray, ImageColumn:bytearray);
NullImage = FOREACH (FILTER ImageHBase BY SIZE(ImageColumn) == 0) GENERATE KeyColumn;
STORE NullImage INTO '/path/to/flat/file' USING PigStorage();
06-09-2016
10:03 PM
I believe so, yes. The import step with -Dimport.bulk.output can be performed on the target cluster. This will prep the HBase files according to the target's version, number of region servers, etc.
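For reference, a sketch of that import step run on the target cluster (the table name and paths below are hypothetical):
# Hypothetical sketch: run the Import job on the target, writing HFiles instead of Puts.
hbase org.apache.hadoop.hbase.mapreduce.Import \
  -Dimport.bulk.output=/tmp/MyTable_hfiles \
  MyNamespace:MyTable /path/to/exported/files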
06-03-2016
12:43 AM
1 Kudo
Totally agree re bulk import. One additional point: you need to ensure the hbase user has access to read/write the files created by the -Dimport.bulk.output step. If it doesn't, the completebulkload step will appear to hang. The simplest way to achieve this is to run, as the owner of those files:
hdfs dfs -chmod -R 777 <dir containing export files>
completebulkload, running as hbase, simply moves these files to the relevant HBase directories. With the permissions correctly set, this takes fractions of a second.
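Putting the two steps together (the HFile directory and table name below are hypothetical):
# Hypothetical sketch: open up the generated HFiles, then load them into HBase.
hdfs dfs -chmod -R 777 /tmp/MyTable_hfiles
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/MyTable_hfiles MyNamespace:MyTable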