Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5410 | 08-12-2016 01:02 PM
 | 2204 | 08-08-2016 10:00 AM
 | 2612 | 08-03-2016 04:44 PM
 | 5502 | 08-03-2016 02:53 PM
 | 1424 | 08-01-2016 02:38 PM
02-01-2016
12:57 PM
1 Kudo
Read what I wrote; I think it contains links to most possible approaches for connecting a DB to Kafka. Writing a Sqoop extension that writes directly into Kafka might be possible, but I suppose it is more work than simply writing the MapReduce job yourself using the DBInputFormat and KafkaOutputFormat (a rough sketch follows below). If you want to use Spark for batch, I would read from the same Kafka topic I would use for real-time (Storm, Spark Streaming); if you use Spark Streaming you can just use different timeframes.
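As a rough illustration of that MapReduce route, here is a minimal sketch. The orders table, its columns, the topic and the broker list are made-up placeholders, and instead of the contrib KafkaOutputFormat it simply opens a KafkaProducer inside the mapper, which is the same idea without the extra dependency:

```java
// Sketch only: read rows with DBInputFormat and push them to Kafka from the mapper.
// The "orders" table, its columns, the topic and the broker list are placeholders.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DbToKafkaJob {

  // One row of the source table; DBInputFormat needs a DBWritable implementation.
  public static class OrderRecord implements Writable, DBWritable {
    long id;
    String payload;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      payload = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException { /* read-only job */ }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      payload = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(payload);
    }
  }

  // Each map task opens its own producer and forwards every row to the topic.
  public static class KafkaMapper
      extends Mapper<LongWritable, OrderRecord, NullWritable, NullWritable> {
    private KafkaProducer<String, String> producer;

    protected void setup(Context ctx) {
      Properties p = new Properties();
      p.put("bootstrap.servers", "broker1:6667"); // placeholder broker list
      p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      producer = new KafkaProducer<>(p);
    }
    protected void map(LongWritable key, OrderRecord row, Context ctx) {
      producer.send(new ProducerRecord<>("orders-topic", Long.toString(row.id), row.payload));
    }
    protected void cleanup(Context ctx) {
      producer.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",   // placeholder driver
        "jdbc:mysql://dbhost/shop", "dbuser", "dbpassword");     // placeholder connection
    Job job = Job.getInstance(conf, "db-to-kafka");
    job.setJarByClass(DbToKafkaJob.class);
    job.setMapperClass(KafkaMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(DBInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    // table, conditions, orderBy column, then the field names to select
    DBInputFormat.setInput(job, OrderRecord.class, "orders", null, "id", "id", "payload");
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would then schedule this like any other MapReduce job (Oozie, cron, etc.).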
02-01-2016
10:44 AM
1 Kudo
This! You can, but you shouldn't. Cross-datacenter Hadoop clusters do not perform well. Don't do it. Use data replication tooling like DistCp or Falcon to sync clusters in different data centers instead.
02-01-2016
10:19 AM
3 Kudos
Kafka itself doesn't pull any data; it is a data persistence store. One question: why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However, databases have similar characteristics, so you should normally be able to connect Storm/Spark directly to the RDBMS as well. But let's think about how you could implement real-time streaming from a database:

1) Best way IMO: push data into Kafka at the same time you put it into the database. I.e. don't pull it OUT of the DB; push it into Kafka at the same time you write it to the DB (for example by adding a second hook to the web app that writes the data). You can then use Kafka for all analytics, use it as a source for your warehouse and real-time analytics, and you do not need the ETL that is normally required on the transactional DB. It's also as real-time as it gets. However, this is not always possible.

2) There are some log-replication tools that can integrate with Kafka: http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf GoldenGate for Java seems to fit the bill. Edit: the Kafka guys have an example for Postgres using a log-replication tool called Bottled Water, which is the same approach; the article also explains the problem nicely: http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/ Also pretty real-time.

3) Use some batched service that runs every x seconds/minutes, runs a SQL command that loads data with a new timestamp/unique id, and puts it into Kafka (a small sketch follows below). This could be:
- a little Java producer with a JDBC driver
- Storm: http://storm.apache.org/documentation/storm-jdbc.html
- Spark Streaming might have something as well
- or simply a scheduled job doing the copying, perhaps MapReduce: https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html https://github.com/kafka-dev/kafka/blob/master/contrib/hadoop-producer/src/main/java/kafka/bridge/hadoop/KafkaOutputFormat.java
- pretty sure Flume can do it as well

EDIT: Added Sqoop2 for completeness' sake. Sqoop has a Kafka connector, but it is only available in Sqoop2, not Sqoop1, and unfortunately HDP doesn't currently support Sqoop2, so it would have to be installed manually: http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector

Obviously the last option is not really real-time, and the question is why you would need Kafka in the middle at all if you use Storm or Spark, since the DB is already a persistent store that can replay loads. Hope that helps.
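To make option 3 concrete, here is a minimal sketch of such a "little Java producer with a JDBC driver". The events table, its columns, the topic and the connection settings are placeholders, and in practice you would persist the last-seen id somewhere (a file, ZooKeeper, a control table) instead of keeping it in memory:

```java
// Sketch of option 3: poll new rows by id via JDBC and push them into Kafka.
// The "events" table, column names, topic and connection settings are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class JdbcToKafkaPoller {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:6667"); // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
         Connection con = DriverManager.getConnection(
             "jdbc:postgresql://dbhost/shop", "dbuser", "dbpassword")) { // placeholder DB

      long lastId = 0; // in a real job, persist this offset between runs
      PreparedStatement stmt = con.prepareStatement(
          "SELECT id, payload FROM events WHERE id > ? ORDER BY id");

      while (true) { // "runs every x seconds"
        stmt.setLong(1, lastId);
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            lastId = rs.getLong("id");
            producer.send(new ProducerRecord<>("events-topic",
                Long.toString(lastId), rs.getString("payload")));
          }
        }
        producer.flush();
        Thread.sleep(10_000); // poll interval: 10 seconds
      }
    }
  }
}
```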
01-29-2016
12:02 PM
1 Kudo
a/c) As Neeraj says, Hive provides access controls. You could, for example, create one database per client. Access controls on table/column level can be done with Ranger or SQLStdAuth (or the simple file-system authorization): https://community.hortonworks.com/content/kbentry/597/getting-started-with-sqlstdauth.html If the client data is in the same table you have a bigger problem: Hive does not yet provide row-level authorization. However, you could create one view per customer with a WHERE clause (a small sketch follows below); this is most likely not feasible for a large number of clients.

b) I don't understand this completely; the SQL Server primary key contains the client id? You can put this into Hive the same way you use it in SQL Server, but you have no row-level access controls, if that is what you are asking. Also, as you say, there are no referential constraints, which is actually very normal for warehousing databases, so you need to do the constraint checks during loads.
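As an illustration of the per-customer view idea, here is a minimal sketch using the Hive JDBC driver. The transactions table, the client_id column, the client list and the connection URL are assumptions made up for the example:

```java
// Sketch: create one restricted view per client over a shared table.
// The "transactions" table, "client_id" column, client list and JDBC URL are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateClientViews {
  public static void main(String[] args) throws Exception {
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "admin", ""); // placeholder connection
         Statement stmt = con.createStatement()) {

      String[] clients = {"client_a", "client_b"}; // placeholder client list
      for (String client : clients) {
        // Each client only gets SELECT rights on his own view, not on the base table.
        stmt.execute("CREATE VIEW IF NOT EXISTS " + client + "_transactions AS "
            + "SELECT * FROM transactions WHERE client_id = '" + client + "'");
        // SQLStdAuth-style grant; with Ranger you would define a policy instead.
        stmt.execute("GRANT SELECT ON TABLE " + client + "_transactions TO USER " + client);
      }
    }
  }
}
```

With Ranger you would grant SELECT on each view through a policy rather than the GRANT statement.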
01-29-2016
10:34 AM
3 Kudos
By and large, large ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20 GB should also be OK. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are as big as or bigger than a block; files significantly smaller than a block would be bad, though. If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems; I have sometimes seen very small stripes due to memory restrictions in the writer. So if you want to aggregate a large amount of data as fast as possible, having a single big file is good. However, having one 20 GB ORC file also means you have loaded it with one task, so the load will normally be too slow; you may want a couple of reducers to increase load speed. Alternatively you can use ALTER TABLE CONCATENATE to merge small ORC files together (a small example follows below). More details on how to influence the load can be found here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
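If you do end up with many small ORC files, here is a hedged sketch of both options via the Hive JDBC driver; the table, partition, column names and reducer count are placeholder assumptions for illustration only:

```java
// Sketch: merge small ORC files of one partition, or reload with a fixed reducer count.
// The "sales" table, its partition/columns and the reducer count are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrcFileMaintenance {
  public static void main(String[] args) throws Exception {
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "etl", ""); // placeholder connection
         Statement stmt = con.createStatement()) {

      // Option A: merge the small ORC files of one partition in place.
      stmt.execute("ALTER TABLE sales PARTITION (ds='2016-01-29') CONCATENATE");

      // Option B: reload with a handful of reducers so you get a few large files
      // instead of one huge file or many tiny ones.
      stmt.execute("SET mapreduce.job.reduces=8");
      stmt.execute("INSERT OVERWRITE TABLE sales PARTITION (ds='2016-01-29') "
          + "SELECT id, amount FROM sales_staging WHERE ds='2016-01-29' DISTRIBUTE BY id");
    }
  }
}
```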
01-29-2016
10:02 AM
1 Kudo
Not sure what you mean by "confined" users and "pulled from the KDC". It's just the Hadoop (Linux) users/groups you want to give access to these services. For example, if you have a Linux group hadoopadmins that should be able to run these services, you would specify it. KDC principals are mapped to Linux users by Hadoop using the auth_to_local rules. Normally the Linux users will come from LDAP/AD, but that does not have to be the case.
01-29-2016
09:46 AM
1 Kudo
For the data in HDFS? Sure: https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html The option --fields-terminated-by <char> sets the field separator character.
01-29-2016
09:43 AM
2 Kudos
Yes, this can be done in Oozie. I would suggest a shell action. You need to upload all files you need (libraries etc.) by adding them in file tags. For example, I normally have a shell script that does a kinit for Kerberos if needed (you would need to upload the keytab as well) and then executes the Python script with its parameters, like outputFolder. Note that this can run on any datanode, so all of them need access to your RSS feed; alternatively you could use an SSH action to connect to an edge node.

<action name="mypython">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
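        <!-- Run the uploaded shell script; the env-var values below are passed to it as environment variables -->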
<exec>setupAndRun.sh</exec>
<env-var>outputFolder=${outputFolder}</env-var>
<env-var>targetFolder=${targetFolder}</env-var>
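        <!-- Every file the script needs (the script itself, the python script, libraries, keytabs) must be shipped with a <file> tag -->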
<file>${nameNode}/hdfsfolder/setupAndRun.sh#setupAndRun.sh</file>
<file>${nameNode}/hdfsfolder/mypython.py#mypython.py</file>
</shell>
<ok to="end" />
<error to="kill" />
</action>
01-28-2016
08:35 PM
3 Kudos
If you enable it, you also need to define ACLs for the different YARN services, i.e. define the users and groups that are allowed to execute specific tasks. More details can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ServiceLevelAuth.html#Enable_Service_Level_Authorization
01-28-2016
08:33 PM
1 Kudo
? The path looks good; ${YEAR} is replaced with the current year, and so on. However, what do you see when you look into the ResourceManager as described above?