Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 7338 | 08-12-2016 01:02 PM |
|  | 2706 | 08-08-2016 10:00 AM |
|  | 3650 | 08-03-2016 04:44 PM |
|  | 7203 | 08-03-2016 02:53 PM |
|  | 1860 | 08-01-2016 02:38 PM |
05-08-2016
06:25 PM
1 Kudo
Storing data in Hive is nothing but storing data in HDFS (Hadoop). Just look in /apps/hive/warehouse/databasename/tablename. Hive is very simple that way: no magical tablespaces or anything like that. A table is just a folder in HDFS with some files in it, plus an entry in the metastore telling Hive what format the data in that folder has. So using ORC in Hive is the same as storing ORC in HDFS. So yes, create an external table on the .tbl file and transform it into an ORC table. All good.
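A minimal sketch of that flow (the database name, path, delimiter and columns below are made up; adjust them to your .tbl file):

```sql
-- External table registered on top of the raw .tbl files already in HDFS;
-- nothing is copied, Hive only records the folder and its format in the metastore.
CREATE EXTERNAL TABLE mydb.lineitem_raw (
  id    INT,
  name  STRING,
  price DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/raw/lineitem';

-- Managed ORC table filled from the raw data; this is the one you query.
CREATE TABLE mydb.lineitem_orc
STORED AS ORC
AS SELECT * FROM mydb.lineitem_raw;
```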
05-08-2016
06:22 PM
1 Kudo
Before answering the questions themselves: I prefer to do this a different way. If you run incremental loads you will have a hard time rolling back imports that may or may not have failed. It is easier to associate each import with a partition in Hive and simply delete that partition in case of failure. That is, if you want to load data hourly, create an hourly-partitioned table, run an hourly Oozie job and use coord:dateformat to provide the min/max parameters for that hour. This way you can just re-run the Oozie instance after any failure and everything will be consistent (see the sketch below). If you do incremental loads into the middle of a time period you don't have much control over the data entering your tables, and if you re-run a job you get duplicate data.

Apart from that:

1) If you want a central metastore for Sqoop jobs that run in Oozie, you need to set up the metastore and then point the jobs at it with the --meta-connect parameter. This JIRA is helpful: https://issues.apache.org/jira/browse/SQOOP-453

2) You can do import-all-tables.

4) It depends. If it is the same source database, I would say no: the bottleneck will most likely be the network or the database returning data, so it is better to increase the number of mappers and run the imports one by one. For small tables, or ones you cannot partition, loading in parallel can be fine. However, you will have a bit of overhead in the cluster, since each parallel Oozie job holds three mostly empty containers (Oozie launcher AM, Oozie launcher map, Sqoop AM); this can add up on small clusters.

5) Yes.
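A hedged sketch of the partition-per-import idea (table, column and partition names are made up; the Oozie coordinator would supply the hour value, e.g. via coord:dateformat):

```sql
-- One partition per hourly import: the Sqoop output for an hour ends up in
-- exactly one partition, so a failed or doubtful run can be rolled back cleanly.
CREATE TABLE staging.orders (
  order_id BIGINT,
  amount   DECIMAL(12,2),
  created  TIMESTAMP
)
PARTITIONED BY (load_hour STRING)
STORED AS ORC;

-- Rolling back a failed import is just dropping that hour's partition ...
ALTER TABLE staging.orders DROP IF EXISTS PARTITION (load_hour = '2016-05-08-18');

-- ... and re-running the Oozie instance repopulates exactly the same partition.
```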
05-08-2016
03:50 PM
1 Kudo
Use ORC in HDP (and Parquet in CDH). They are much, much faster than text, and Spark also supports reading from ORC. Some tips on how to get the best performance out of your table setup (partitioning, predicate pushdown, ...): http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data Oh, and finally: if you want to use mostly SQL, then Hive on Tez with all the new features (CBO, vectorization, ORC, ...) beats the pants off Spark, no questions asked. Spark is cool if you want to do something like streaming or data mining, do programming-style data transformations, or if your data sets are small enough to pin in memory for repeated queries.
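For illustration only (table, column and partition names are made up, and the session settings are the usual switches rather than a tuned configuration), a partitioned ORC table set up so partition pruning and predicate pushdown can do their work:

```sql
-- Partitioned ORC table: partition pruning plus ORC predicate pushdown means
-- most of the data is never read at all for selective queries.
CREATE TABLE sales_orc (
  customer_id BIGINT,
  amount      DECIMAL(12,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

SET hive.vectorized.execution.enabled=true;  -- vectorized operators on ORC data
SET hive.optimize.ppd=true;                  -- push predicates toward the storage layer

SELECT customer_id, SUM(amount) AS total
FROM sales_orc
WHERE sale_date = '2016-05-01'   -- partition pruning
  AND amount > 100               -- candidate for pushdown into the ORC reader
GROUP BY customer_id;
```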
05-08-2016
03:16 PM
2 Kudos
[OVERWRITE] in a syntax diagram means that the OVERWRITE keyword is optional. If you want to append data, remove the keyword; if you want to overwrite the table, keep the keyword and remove the square brackets. (Also, in newer Hive versions Snappy compression is slower than or equivalent to the default zlib, and much more space hungry.)
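For example, with LOAD DATA (path and table name are hypothetical):

```sql
-- Keyword removed: the new files are appended to whatever is already in the table
LOAD DATA INPATH '/tmp/new_rows' INTO TABLE mytable;

-- Keyword kept (brackets removed): the table's current contents are replaced first
LOAD DATA INPATH '/tmp/new_rows' OVERWRITE INTO TABLE mytable;
```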
05-07-2016
07:34 PM
1 Kudo
1) You essentially have two options. Use Sqoop import-all-tables with the exclude option, as you mention. In that case you have a single Sqoop action in Oozie and no parallelism at the Oozie level, although Sqoop itself may provide some; you also have some limitations (only straight imports of all columns, ...). Alternatively, build an Oozie flow that uses a fork and one single-table Sqoop action per table. That gives you fine-grained control over how much runs in parallel. You could, for example, load 4 at a time by doing Start -> Fork -> 4 Sqoop Actions -> Join -> Fork -> 4 Sqoop Actions -> Join -> End.

2) If you want incremental loads, I don't think Sqoop import-all-tables is possible, so one Sqoop action per table it is. You can either use Sqoop's incremental import functionality (using a property file) or use WHERE conditions and pass the date parameter through from the coordinator. You can use coord:dateformat to transform your execution date.

3) Run one coordinator for each table, OR have a Decision action in the Oozie workflow that skips some Sqoop actions, like: Start -> Sqoop1 where date = mydate -> Decision: if mydate % 3 = 0 then Sqoop2, else end.

4) Incremental imports load the new data into a folder in HDFS. If you re-run, that folder needs to be deleted first. If you use append, the old data in HDFS is not deleted. You may ask why you would ever not want append; the reason is that you usually do something with the data afterwards, like importing the new data into a partitioned Hive table. If you used append, it would load the same data over and over.
05-07-2016
06:38 PM
1 Kudo
Essentially, hive.server2.tez.default.queues exists for pre-initialized Tez sessions. Normally, starting an Application Master takes around 10 seconds, so the first query will be noticeably slower. However, you can set hive.server2.tez.initialize.default.sessions=true.
This will initialize hive.server2.tez.sessions.per.default.queue AMs for each of the queues which will then be used for query execution.
For most situations I would not bother with it too much since subsequent queries will reuse existing AMs ( which have an idle wait time ). However if you have strong SLAs you may want to use it.
tez.queue.name is then the actual queue you want to execute in. If you hit one of the default queues, the AM is already there and everything is faster. You might, for example, have distinct queues for big, heavy queries and small interactive ones; you still need to set the queue yourself.
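Setting the queue per session is just a property; a small example with a made-up queue and table name, assuming a queue called "interactive" exists in your scheduler config:

```sql
-- Route this session's queries to a specific YARN queue; if it is one of the
-- pre-initialized default queues, an AM is already waiting there.
SET tez.queue.name=interactive;

SELECT COUNT(*) FROM web_logs WHERE log_date = '2016-05-07';
```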
05-06-2016
04:57 PM
No, you don't have to install a local KDC; you have to configure SSSD to connect to AD for Linux user authentication. As mentioned, AD normally provides Kerberos tickets automatically. To create a new service user in AD you are best off talking to your AD team. Once you have created a hue service user (in the same group as the hdfs etc. users) you should be able to export the keytab. That guide is for a standard KDC, which is also an option; however, if you want a standard KDC you need to add a one-way trust from the AD to your local KDC.
05-06-2016
04:26 PM
1 Kudo
Not directly, I am afraid. You can write a MapReduce job that transforms them into normal delimited data, similar to the way it was done with Tika here (assuming you have lots of small files): https://community.hortonworks.com/repos/4576/apache-tika-integration-with-mapreduce.html You would, however, need to use a Java library like POI instead of Tika: https://poi.apache.org/ To read the files directly in Hive you would need to write a Hive InputFormat. You can use this InputFormat class as an example: https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/ If you return a delimited row for each record and present it to the Hive SerDe as if it were a text input format, you might be able to get it working.
05-06-2016
12:32 PM
For business users you do not need to do anything in Hadoop. You need to configure Linux (with SSSD) to connect to your Active Directory. Once you can log on to Linux with your Active Directory account, you get a valid Kerberos ticket for your realm (that is the default configuration for AD: when you log on you normally get a Kerberos ticket). The rest is done by the auth_to_local rules, which take the principal from the Kerberos ticket, strip out the username, and map it to the Hadoop user. So myuser1@MYREALM.COM will be mapped to the Hadoop user myuser1; there should be a default rule in place for this. So the key point is to configure SSSD on your nodes. https://community.hortonworks.com/articles/14463/auth-to-local-rules-syntax.html
05-06-2016
09:13 AM
HDFS first or Hive first, it doesn't really matter; you can transform your data any way you want afterwards, and CTAS tables are great. I personally would go HDFS first, because then you can easily use other tools like Pig on the raw data as well, for data checking, cleaning and transformation. Pig can be nice for catching formatting issues, for example (you can read delimited data as strings and apply regexes to it more easily than you could in Hive). Commonly in Hadoop I would:

- Have a /data/raw/project1/timestamp folder which contains all your raw ingested data, and keep it as long as you want.
- Run transformation jobs to create clean Hive/ORC tables (see the sketch below).
- Run any aggregation jobs on the ORC tables if possible (faster); you can, for example, export those to Tableau etc.

For Sqoop it doesn't matter so much, because the data that comes out of a database is normally already clean, so going to Hive directly works well; but for other data sources (flat files) you usually have some data quality issues that require some Pig (or whatever) magic.
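A rough sketch of the raw-to-ORC step under those assumptions (database names, path and columns are illustrative only):

```sql
-- Raw layer: external table over the ingested delimited files, kept as-is
CREATE EXTERNAL TABLE raw_project1.events_txt (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/project1/2016-05-06';

-- Clean layer: CTAS into ORC, applying light cleaning on the way in
CREATE TABLE clean_project1.events
STORED AS ORC
AS
SELECT CAST(event_time AS TIMESTAMP) AS event_time,
       TRIM(user_id)                 AS user_id,
       payload
FROM raw_project1.events_txt
WHERE user_id IS NOT NULL;
```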