Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5410 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2612 | 08-03-2016 04:44 PM |
| | 5502 | 08-03-2016 02:53 PM |
| | 1424 | 08-01-2016 02:38 PM |
07-11-2016
04:20 PM
1 Kudo
If you have a general location, you can only use time-based scheduling (every 15 minutes, for example); you cannot use retention, handle late arrivals, etc. All the advanced wait-for-data-to-arrive features in Oozie are essentially out. This essentially mimics the datasets in Oozie: https://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html#a5._Dataset
07-11-2016
03:45 PM
1 Kudo
You need to look into the logs, most likely the YARN logs of the map task of your Oozie launcher. They contain the Sqoop command execution and any errors you would normally see on the command line. You can get them from the ResourceManager UI (click on your Oozie launcher job and drill through to the map task) or with yarn logs -applicationId <applicationId>. Any issues in the actual data transfer show up in the MapReduce job that Sqoop kicks off, which is a separate job.
07-11-2016
03:42 PM
1 Kudo
Pretty sure it's a community feature and not supported yet, even if it has found its way into the release. It is also very limited right now. Can you really replace your scheduling with it? It still seems to need a bit more time in the oven.
07-11-2016
10:31 AM
Regarding how: refer to Sunile. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark works well too. But the question is whether you really want to process 100 GB of data on the sandbox: the memory settings are tiny, there is a single drive, and data is not replicated. If you do it like this, you could just as well use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32 GB of RAM each? That would give you a nice little environment, and you could actually do some fast processing.
07-11-2016
09:50 AM
@Sunile Manjee yeah, it works; Alex tested it as well. Falcon does not point to a single ResourceManager. It goes through yarn-site.xml, finds the HA ResourceManager pair that contains the one you specified, and then tries both.
07-07-2016
11:49 AM
Yeah, not sure why they call it multiple times. I think the record reader classes are simply instantiated multiple times during split creation and elsewhere, for various reasons. In the end your code needs to be able to survive empty calls and avoid duplicating connection objects; I know of no way to fix this. In my example I could see relatively easily whether the conf object was valid, because my config fields (the storage handler parameters) were not always in the object. I then simply initialized the connection object once and made sure not to create it a second time.
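A minimal sketch of that guard. All names here are hypothetical (GuardedRecordReader, the my.storage.url parameter, the AutoCloseable stand-in), and it uses the newer mapreduce API for illustration rather than whatever your storage handler actually extends:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class GuardedRecordReader extends RecordReader<LongWritable, Text> {

    private AutoCloseable connection; // stand-in for the real client object

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        Configuration conf = context.getConfiguration();
        // "my.storage.url" is a made-up storage handler parameter; in the
        // spurious early calls described above it is absent from the conf.
        String url = conf.get("my.storage.url");
        if (url == null) {
            return; // incomplete conf: survive the empty call, do nothing
        }
        if (connection == null) { // never build the connection twice
            connection = openConnection(url);
        }
    }

    private AutoCloseable openConnection(String url) {
        // placeholder for whatever client setup the handler really does
        return () -> { };
    }

    // Minimal stubs so the class compiles; a real reader returns records here.
    @Override public boolean nextKeyValue() { return false; }
    @Override public LongWritable getCurrentKey() { return null; }
    @Override public Text getCurrentValue() { return null; }
    @Override public float getProgress() { return 1.0f; }

    @Override
    public void close() throws IOException {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}
```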
07-06-2016
12:03 PM
You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+. Your version says id1 once or not at all, followed by id2 once or not at all, followed by anything else, so it should work too, I think.
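If you want to sanity-check both patterns outside Hive, a quick standalone snippet with java.util.regex (the engine Hive's column matching uses) will do. The "your version" pattern below is my reconstruction from the description above, so treat it as an assumption:

```java
import java.util.Arrays;

public class ExcludeTwoColumns {
    public static void main(String[] args) {
        String alternation = "(id1|id2)?+.+";    // the suggested pattern
        String sequential  = "(id1)?+(id2)?+.+"; // "your version", as described

        for (String col : Arrays.asList("id1", "id2", "name", "value")) {
            // String.matches() requires the whole column name to match,
            // which mirrors how Hive applies the column regex.
            System.out.printf("%-6s alternation=%-5b sequential=%b%n",
                    col, col.matches(alternation), col.matches(sequential));
        }
        // Both patterns print false for id1 and id2 and true for everything
        // else, i.e. both exclude exactly the two key columns.
    }
}
```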
07-05-2016
09:45 PM
3 Kudos
It depends a bit on what you do with it. Each time you send the data over the network, a native datatype will be much faster: 999999999 would be 4 bytes as an integer but 10 bytes as a string (characters plus length). That holds for shuffles and any operation that needs to copy data (although some functions, like a map, might not be impacted, since in Java strings are not necessarily copied if you only assign them). It also costs you more RAM in the executor, which can be a considerable factor in Spark when doing aggregations/joins/sorts etc. And finally, when you actually need to do computations on these columns, you would have to cast them, which is a pretty costly operation you could otherwise save yourself. So depending on your use case, the performance and RAM difference can vary between not really significant and considerable.
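As a back-of-the-envelope illustration in plain Java (not Spark's actual wire format; each serializer adds its own overhead on top of the payload):

```java
import java.nio.charset.StandardCharsets;

public class NativeVsStringSize {
    public static void main(String[] args) {
        int asInt = 999999999;
        String asString = "999999999";

        // A fixed-width int is 4 bytes wherever it is copied or shuffled...
        System.out.println("int payload:    " + Integer.BYTES + " bytes");

        // ...while the same value as text is one byte per digit (9 here),
        // plus whatever length prefix the serializer adds.
        byte[] encoded = asString.getBytes(StandardCharsets.UTF_8);
        System.out.println("string payload: " + encoded.length + " bytes");

        // Doing arithmetic on the string column first requires a parse/cast,
        // which a native column avoids entirely.
        int parsed = Integer.parseInt(asString);
        System.out.println("values equal:   " + (asInt == parsed));
    }
}
```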
07-05-2016
07:26 PM
3 Kudos
One common problem in SQL is that you join two tables and get duplicate column names from the two tables. When you then want to create a CTAS, for example, you get "duplicate column name" errors. You also often want to exclude the join key from the result set, since it is by definition duplicated. Database schemas often prefix column names with a letter from the table to fix at least the first issue, like TPC-H: lineitem.l_key and orders.o_key. A common approach is to explicitly specify all column names in your SELECT list. However, it looks like Hive has a cool/dirty trick up its sleeve to make this easier: regular expressions to specify column names.

Test setup:

```
describe tb1;
id   int
name string

describe tb2;
id   int
age  int
```

You then have to disable quoted identifiers, because they interfere with the regex:

```
set hive.support.quoted.identifiers=none;
```

Now you can use a Java regex to select all columns from the right table that are NOT the key, so essentially you get all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them as well and re-add them renamed with an AS statement afterwards:

```
create table tb3 as
select tb1.*, tb2.`(id)?+.+`
from tb1, tb2
where tb1.id = tb2.id;
```

You can see that I select all columns from the left table and then use backtick quotes to specify a regular expression for the columns I want from the right side. The regex matches any column name except the exact string id: it asks for "the string id once or not at all (the possessive ?+), followed by at least one more character (.+)". If the whole column name is id, it is not matched, because the remainder of the regex still needs to match something. You could also exclude multiple columns: (id|id2)?+.+ or (id*)?+.+. This gives me a result table with all columns from the left table and all columns but the key column from the right table:

```
describe tb3;
id   int
name string
age  int
```

Hive is really cool.
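To make the role of the possessive quantifier concrete, here is a small standalone java.util.regex check (String.matches() mirrors how Hive applies the column pattern to the full column name). Note how the plain greedy (id)?.+ would backtrack and wrongly keep the id column:

```java
import java.util.Arrays;

public class RegexColumnDemo {
    public static void main(String[] args) {
        String possessive = "(id)?+.+"; // the pattern from the article
        String greedy     = "(id)?.+";  // same pattern without possessiveness

        for (String col : Arrays.asList("id", "id2", "name", "age")) {
            System.out.printf("%-5s possessive=%-5b greedy=%b%n",
                    col, col.matches(possessive), col.matches(greedy));
        }
        // id    possessive=false greedy=true
        // id2   possessive=true  greedy=true
        // name  possessive=true  greedy=true
        // age   possessive=true  greedy=true
        //
        // With ?+ the engine consumes "id" and refuses to backtrack, so for
        // the bare column "id" nothing remains for .+ and the match fails.
        // The greedy version backtracks to the empty match and lets .+ eat
        // "id", which would keep the duplicate key column in the result.
    }
}
```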
07-05-2016
04:14 PM
9 Kudos
I was about to say it's not possible. I assume both tables have a column with the same name, which is one of the reasons database schemas often prefix column names with a letter from the table, like TPC-H: lineitem.l_key and orders.o_key. In that case you would have to suck it up and name all the columns in the join using AS statements. However, it looks like Hive has some cool/dirty tricks up its sleeve: regular expressions to specify column names. Here is my setup:

```
hive> describe tb1;
id   int
name string

hive> describe tb2;
id   int
age  int
```

You then have to disable quoted identifiers, because they interfere with the regex:

```
hive> set hive.support.quoted.identifiers=none;
```

Now you can use a Java regex to select all columns from the right table that are NOT the key, so essentially you get all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them as well and re-add them renamed with an AS statement afterwards:

```
hive> create table tb3 as select tb1.*, tb2.`(id)?+.+` from tb1, tb2 where tb1.id = tb2.id;
```

You can see that I select all columns from the left table and then use backtick quotes to specify a regular expression for the columns I want from the right side. The regex is a mean trick: it asks for the string id once or not at all (the possessive ?+), followed by something. So if a column name is exactly id, it is not matched, because the remainder of the regex still needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+. This gives me a result table with all columns from the left table and all columns but the key column from the right table:

```
hive> describe tb3;
id   int
name string
age  int
```

Neat, you never stop learning something new. Hive is really cool. Edit: I actually made a little article out of this question, because these regexes would have made my life much easier several times before, so giving them a bit more attention seemed to make sense: https://community.hortonworks.com/articles/43510/excluding-duplicate-key-columns-from-hive-using-re.html