Member since: 03-06-2017
Posts: 9
Kudos Received: 1
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3541 | 04-14-2017 07:01 AM
 | 3566 | 04-07-2017 08:02 AM
 | 5128 | 03-06-2017 08:34 PM
04-14-2017
07:01 AM
To set this in Impala so it executes either as a SQL file or in Hue, you would set the variables as shown in the first two lines below. Below is an example of using static and dynamic partitioning together, where EffDTM and FirstNameLetter are static partitions and LastNameLetter is partitioned dynamically.

```sql
set var:hive.exec.dynamic.partition.mode=nonstrict;
set var:hive.exec.dynamic.partition=true;

INSERT OVERWRITE TABLE person_by_name
PARTITION (EffDTM='201701', FirstNameLetter='a', LastNameLetter)
SELECT
  firstname,
  lastname,
  ..
  ..
  lower(substr(lastname, 0, 1)) AS LastNameLetter
FROM person
WHERE lower(substr(firstname, 0, 1)) = 'a';
```
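For a sense of how the table is used afterwards, a query that filters on the partition columns only has to read the matching partition. A small sketch assuming the person_by_name layout above; the literal values are just for illustration:

```sql
-- With predicates on all three partition keys, only the
-- EffDTM=201701 / FirstNameLetter=a / LastNameLetter=s partition is scanned.
SELECT firstname, lastname
FROM person_by_name
WHERE EffDTM = '201701'
  AND FirstNameLetter = 'a'
  AND LastNameLetter = 's'
  AND lastname LIKE 'Smi%';
```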
04-07-2017
08:02 AM
1 Kudo
Dynamic partitioning is an alternative to this, per the Cloudera Partitioning documentation:

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

INSERT OVERWRITE TABLE T PARTITION (FirstNameLetter, LastNameLetter)
SELECT ...., lower(substr(name, 0, 1)), lower(substr(name, 1, 1));
```
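For context, here is a minimal end-to-end sketch of the same dynamic-partitioning pattern. The table and column names (person_raw, person_by_name_dyn, firstname, lastname, notes) are made up for illustration and are not from the original post:

```sql
-- Hypothetical source and partitioned target tables, for illustration only.
CREATE TABLE person_raw (firstname STRING, lastname STRING, notes STRING);

CREATE TABLE person_by_name_dyn (firstname STRING, lastname STRING, notes STRING)
PARTITIONED BY (FirstNameLetter STRING, LastNameLetter STRING)
STORED AS PARQUET;

-- Allow fully dynamic partitions (no static partition value required).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive routes each row to the partition named by the last two SELECT expressions.
INSERT OVERWRITE TABLE person_by_name_dyn PARTITION (FirstNameLetter, LastNameLetter)
SELECT firstname, lastname, notes,
       lower(substr(firstname, 1, 1)),
       lower(substr(lastname, 1, 1))
FROM person_raw;
```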
04-06-2017
03:19 PM
Wondering if there's a better way to do this type of partitioning. I want to partition the data based on first name starting with a-z and last name starting with a-z, creating 676 buckets, because of how the table is queried and the data volume. I wasn't sure if there is some way to dynamically create the partitions. What I did was create partitions based on the first letter of the names (in the example below I only kept 3 partitions for the sake of typing), then selected from my raw original table and inserted into the partition based on the criteria. Is this the most efficient way of doing this when querying the table by firstname and lastname, or are there better ways to partition this table or improve query performance?

Example:

```sql
CREATE EXTERNAL TABLE name_partition(
  FName STRING,
  LName STRING,
  Notes STRING
)
PARTITIONED BY (FirstNameLetter STRING, LastNameLetter STRING)
STORED AS PARQUET
LOCATION '/data/shared_workspace/';

ALTER TABLE name_partition ADD PARTITION (FirstNameLetter = 'a', LastNameLetter = 'a');
ALTER TABLE name_partition ADD PARTITION (FirstNameLetter = 'a', LastNameLetter = 'b');
ALTER TABLE name_partition ADD PARTITION (FirstNameLetter = 'a', LastNameLetter = 'c');

FROM name_raw
INSERT OVERWRITE TABLE name_partition PARTITION(FirstNameLetter = 'a', LastNameLetter = 'a')
  SELECT FName, LName, Notes WHERE lower(FName) LIKE 'a%' AND lower(LName) LIKE 'a%'
INSERT OVERWRITE TABLE name_partition PARTITION(FirstNameLetter = 'a', LastNameLetter = 'b')
  SELECT FName, LName, Notes WHERE lower(FName) LIKE 'a%' AND lower(LName) LIKE 'b%'
INSERT OVERWRITE TABLE name_partition PARTITION(FirstNameLetter = 'a', LastNameLetter = 'c')
  SELECT FName, LName, Notes WHERE lower(FName) LIKE 'a%' AND lower(LName) LIKE 'c%';
```
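One way to check whether queries against this layout actually benefit is to confirm that the planner prunes partitions. A quick sketch, assuming the name_partition table from the example above; the literal values are illustrative:

```sql
-- List the partitions that were created.
SHOW PARTITIONS name_partition;

-- Inspect the query plan; with predicates on the partition columns, the scan
-- should be limited to the single matching (FirstNameLetter, LastNameLetter) partition.
EXPLAIN
SELECT FName, LName, Notes
FROM name_partition
WHERE FirstNameLetter = 'a'
  AND LastNameLetter = 'b'
  AND lower(FName) LIKE 'ab%';
```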
Labels:
- Apache Impala
04-01-2017
09:25 PM
So I was able to solve the issue, but not using a SerDe. What I did was load the fixed-width file into a single string column and then use substring to parse through the line, as follows:

```sql
CREATE TABLE FIXEDWIDTH_TEMP (
  whole_line STRING
)
LOCATION '/data/shared_workspace/test/fixedwidth_temp';

CREATE EXTERNAL TABLE FIXEDWIDTH_STG (
  ZIP_CODE_3043 STRING,
  ZIP4_CODE_3044 STRING,
  PREFIX_TITLE_8093 STRING,
  STATE_3038 STRING,
  PHONE_3032 STRING,
  ...
  .. 1800 lines more ...
)
LOCATION '/data/shared_workspace/test/fixedwidth_stg';

INSERT INTO FIXEDWIDTH_STG
SELECT
  SUBSTRING(whole_line, 1, 5),
  SUBSTRING(whole_line, 6, 4),
  SUBSTRING(whole_line, 10, 1),
  SUBSTRING(whole_line, 11, 3),
  SUBSTRING(whole_line, 14, 2),
  SUBSTRING(whole_line, 16, 10),
  SUBSTRING(whole_line, 26, 12),
  . . . 1800 lines more
FROM FIXEDWIDTH_TEMP;
```
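As a variation on the same idea, a view over the single-string staging table avoids materializing a second copy of the data. A minimal sketch with made-up names and offsets (fixedwidth_raw, fixedwidth_view, and the column widths are illustrative, not the real 1881-column layout):

```sql
-- Same single-string staging pattern, exposed through a view instead of a second table.
CREATE EXTERNAL TABLE fixedwidth_raw (
  whole_line STRING
)
LOCATION '/data/shared_workspace/test/fixedwidth_raw';

CREATE VIEW fixedwidth_view AS
SELECT
  SUBSTRING(whole_line, 1, 5)  AS zip_code,      -- chars 1-5
  SUBSTRING(whole_line, 6, 4)  AS zip4_code,     -- chars 6-9
  SUBSTRING(whole_line, 10, 1) AS prefix_title   -- char 10
FROM fixedwidth_raw;

SELECT * FROM fixedwidth_view LIMIT 10;
```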
03-31-2017
03:11 PM
Here's what the Hive log says. It's not because there's an unclosed bracket, as far as I can see; it runs out of storage space for the input.regex value. I tested an input.regex with 400 attributes, then went up to 800, and it gave the same error. So, given that the error says "Unclosed counted closure near index 4000" and each attribute takes about 6 characters of the pattern, e.g. (.{5}), I'd bet you can only have around 666 attributes in the regex, provided each element's width is less than 10; otherwise you'd have fewer, since each group would then be 7 characters.

```
(.{1
    ^
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:239)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:337)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:439)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:405)
    at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:257)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:501)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:746)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.regex.PatternSyntaxException: Unclosed counted closure near index 4000
(.{1}) (.{1
           ^
    at java.util.regex.Pattern.error(Pattern.java:1955)
    at java.util.regex.Pattern.closure(Pattern.java:3141)
    at java.util.regex.Pattern.sequence(Pattern.java:2134)
    at java.util.regex.Pattern.expr(Pattern.java:1996)
    at java.util.regex.Pattern.group0(Pattern.java:2905)
    at java.util.regex.Pattern.sequence(Pattern.java:2051)
    at java.util.regex.Pattern.expr(Pattern.java:1996)
    at java.util.regex.Pattern.compile(Pattern.java:1696)
    at java.util.regex.Pattern.<init>(Pattern.java:1351)
    at java.util.regex.Pattern.compile(Pattern.java:1054)
    at org.apache.hadoop.hive.contrib.serde2.RegexSerDe.initialize(RegexSerDe.java:111)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDeWithoutErrorCheck(SerDeUtils.java:542)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:389)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:273)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:266)
    at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:3151)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:378)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1976)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1689)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1421)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1205)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1200)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237)
    ... 15 more
```
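As a quick sanity check on that arithmetic, Hive's repeat() and length() functions can show how long a pattern of N identical groups would get; the (.{5}) group width here is just an example, not the real layout:

```sql
-- Each "(.{5})" group is 6 characters, so roughly 666 groups fit under index 4000.
SELECT length(repeat('(.{5})', 666));   -- 3996: just under the index-4000 cutoff
SELECT length(repeat('(.{5})', 1881));  -- 11286: far past index 4000
```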
03-31-2017
03:03 PM
First: I get no error when creating the table. It's when I try to DESCRIBE the table or SELECT * from it that I get the following error in the beeline command line or the Hue interface:

```
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Unclosed counted closure near index 4000
(.{5}) (.{4}) (.{1}) (.{3}) (.{2}) (.{10}) (.{12}) (.{30}) (.{1}) (.{30}) (.{10}) (.{2}) (.{35}) (.{30}) (.{61}) (.{47}) (.{1}) (.{3}) (.{4}) (.{3}) (.{4}) (.{3}) (.{4}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{1}) (.{2}) (.{2}) (.{2})
```

In Cloudera Manager -> Diagnostics tab -> Logs, I searched Hive for Error or Fatal and nothing came back.
03-31-2017
12:51 PM
I have a large 1881-column fixed-width file format. I want to create an external Hive table mapping to the HDFS file. I was able to successfully do this by using the RegexSerDe, as shown below. But when I try creating this with the 1881-column table, Hive creates the table, yet you cannot run `describe test_fixed_width`; it errors out and throws an exception:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unclosed counted closure near index 4000

It appears that there is a buffer overflow error. How can I resolve this? Are there other techniques that could be applied?

```sql
CREATE EXTERNAL TABLE test_fixed_width (
  Col1 STRING,
  Col2 STRING,
  Col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{3})(.{2})(.{1})",
  "output.format.string" = "%1$s %2$s %3$s"
)
LOCATION '/data/';
```
Labels:
- Apache Hive
03-06-2017
08:34 PM
Argggg. Ok, I need to find a wall and pound my head against it. The issue was that I was running first_etl.pig as `pig -x local first_etl.pig`, which runs it locally and expects a local file, while what I want is to run it on the Hadoop cluster. Running it as `pig first_etl.pig` fires it off on the cluster and finds the file.
03-06-2017
08:13 PM
I'm a newbie at Pig scripting and just walking through some examples (Cloudera on-demand training, to be specific). Anyway, I load a file:

`hdfs dfs -put $ADIR/data/ad_data1.txt /dualcore/`

I check that the directory has proper permissions via `hdfs dfs -ls /` and can see it's chmod 777 for /dualcore, and I also check that /dualcore/ad_data1.txt is set properly in HDFS. Now when I try to run the `pig -x local first_etl.pig` script I get the following

ERROR: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: file:/dualcore/ad_data1.txt

QUESTION: The file is at the root /dualcore/ad_data1.txt. When I cat the file [`hdfs dfs -cat /dualcore/ad_data1.txt`] it displays the data. Do I need to specify something other than LOAD '/dualcore/ad_data1.txt'?

SCRIPT:

```
data = LOAD '/dualcore/ad_data1.txt' USING PigStorage(':')
       AS (keyword:chararray, campaign_id:chararray, date:chararray, time:chararray,
           display_site:chararray, was_clicked:int, cpc:int, country:chararray,
           placement:chararray);

reordered = FOREACH data GENERATE campaign_id, date, time, UPPER(TRIM(keyword)),
            display_site, placement, was_clicked, cpc;

STORE reordered INTO '/dualcore/ad_data1/';
```
Labels:
- Apache Pig
- HDFS
- Training