Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5534 | 08-12-2016 01:02 PM |
 | 2225 | 08-08-2016 10:00 AM |
 | 2653 | 08-03-2016 04:44 PM |
 | 5582 | 08-03-2016 02:53 PM |
 | 1446 | 08-01-2016 02:38 PM |
02-01-2016
12:57 PM
1 Kudo
Read what I wrote; I think it contains links to most of the possible approaches for connecting a DB to Kafka. Writing a Sqoop extension that writes directly into Kafka might be possible, but I suspect it is more work than simply writing the MapReduce job yourself using DBInputFormat and KafkaOutputFormat. If you want to use Spark for batch, I would read from the same Kafka topic I would use for realtime (Storm, Spark Streaming). If you use Spark Streaming, you can just use different timeframes.
02-01-2016
10:44 AM
1 Kudo
This! You can, but you shouldn't. Cross-datacenter Hadoop clusters do not perform well. Don't do it. Use data replication software like DistCp or Falcon to sync clusters in different data centers instead.
02-01-2016
10:19 AM
3 Kudos
Kafka itself doesn't pull any data. It is a data persistence store. One question: why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However, databases have similar characteristics, so you should normally be able to connect Storm/Spark directly to the RDBMS as well.

But let's think about how you could implement real-time streaming from a database:

1) Best way IMO: push data into Kafka at the same time you put it into the database. I.e. don't pull it OUT of the DB, push it into Kafka at the same time you put it into the DB (for example by adding a second hook to the web app that writes the data). You can then use Kafka for all analytics, use it as a source for your warehouse and realtime analytics, and you do not need to do the ETL that is normally needed on the transactional DB. It's also as realtime as it gets. However, this is not always possible.

2) There are some log replication tools that can integrate with Kafka: http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf GoldenGate for Java seems to fit the bill. Edit: The Kafka guys have an example for Postgres using a log replication tool called Bottled Water. This is the same approach, and the article also explains the problem nicely: http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/ Also pretty realtime.

3) Use some batched service that runs every x seconds/minutes and runs a SQL command that loads data with a new timestamp/unique id and puts it into Kafka. This can be:
- a little Java producer with a JDBC driver
- Storm: http://storm.apache.org/documentation/storm-jdbc.html
- Spark Streaming might have one as well
- or simply a scheduled job doing the copying, perhaps MapReduce: https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html https://github.com/kafka-dev/kafka/blob/master/contrib/hadoop-producer/src/main/java/kafka/bridge/hadoop/KafkaOutputFormat.java
- pretty sure Flume can do it as well

EDIT: Added Sqoop2 for completeness' sake. Sqoop has a Kafka connector, but it is only available in Sqoop2, not Sqoop1. Unfortunately HDP doesn't currently support Sqoop2, so it would have to be installed manually. http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector

Obviously the last option is not really realtime, and the question is why you would need Kafka in the middle if you use Storm or Spark, since the DB is already a persisted store that can replay loads. Hope that helps.
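The batched approach in option 3 boils down to remembering the highest id (or timestamp) already copied and selecting only newer rows on each run. Below is a minimal sketch of that loop, not a production implementation. It assumes a hypothetical table `events(id, payload)`, uses the standard-library `sqlite3` module as a stand-in for the real RDBMS, and takes a plain `send` callable as a stand-in for a Kafka producer's send method.

```python
import sqlite3
from typing import Callable


def poll_new_rows(conn: sqlite3.Connection, last_id: int,
                  send: Callable[[bytes], None]) -> int:
    """Hand every row with id > last_id to send() and return the new high-water mark."""
    cursor = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    )
    for row_id, payload in cursor:
        send(payload.encode("utf-8"))  # a real job would publish to a Kafka topic here
        last_id = row_id
    return last_id


# Demo with an in-memory database and a list standing in for the topic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

sent = []
mark = poll_new_rows(conn, 0, sent.append)      # first run copies the existing rows
conn.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
mark = poll_new_rows(conn, mark, sent.append)   # second run copies only the new row
```

A scheduler would call `poll_new_rows` every x seconds and persist `mark` between runs so that a restart does not resend old rows.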
01-29-2016
12:02 PM
1 Kudo
a/c) As Neeraj says, Hive provides access controls. You could, for example, create one database per client. Access controls at the table and column level can be done with Ranger or SQLStdAuth (or the simple file system based authorization): https://community.hortonworks.com/content/kbentry/597/getting-started-with-sqlstdauth.html
If the client data is in the same table you have a bigger problem: Hive does not yet provide row-level access control. However, you could create one view for each customer with a WHERE clause. That is most likely not feasible for a large number of clients.

b) I don't understand this completely. Does the SQL Server primary key contain the clientid? You can put this into Hive the same way you use it in SQL Server; however, you have no row-level access controls, if that is what you are asking. Also, as you say, there are no referential constraints, which is actually very normal for warehousing databases. So you need to do the constraint check during loads.
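To make the one-view-per-customer idea concrete, here is a small sketch that generates the `CREATE VIEW` statements. The table name `sales`, the column name `clientid`, and the view naming scheme are hypothetical, chosen only for illustration; access to each view would still have to be granted separately through Ranger or SQLStdAuth.

```python
def client_view_ddl(table: str, client_column: str, client_id: int) -> str:
    """Build a CREATE VIEW statement that exposes only one client's rows."""
    client_id = int(client_id)  # force a numeric id so it is safe to inline
    view_name = f"{table}_client_{client_id}"
    return (
        f"CREATE VIEW {view_name} AS "
        f"SELECT * FROM {table} WHERE {client_column} = {client_id}"
    )


# One statement per client; with many clients this list grows quickly,
# which is why the approach does not scale well.
statements = [client_view_ddl("sales", "clientid", cid) for cid in (1, 2)]
```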
01-29-2016
10:34 AM
3 Kudos
By and large, large ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20GB should also be OK. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are as big as or bigger than a block. Files significantly smaller than a block would be bad though. If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems. I have sometimes seen very small stripes due to memory restrictions in the writer.

So if you want to aggregate a large amount of data as fast as possible, having a single big file would be good. However, having one 20GB ORC file also means you have loaded it with one task, so the load will normally be too slow. You may want to have a couple of reducers to increase load speed. Alternatively, you can use ALTER TABLE CONCATENATE to merge small ORC files together. More details on how to influence the load can be found here: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
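As a rough back-of-the-envelope check of the "one map task per block" statement, the number of blocks is just the file size divided by the block size, rounded up. The sketch below assumes a 128 MB HDFS block size; your cluster may be configured differently.

```python
def block_count(file_bytes: int, block_bytes: int = 128 * 1024**2) -> int:
    """Number of HDFS blocks a file occupies (ceiling division)."""
    return -(-file_bytes // block_bytes)


twenty_gb = 20 * 1024**3
tasks_for_big_file = block_count(twenty_gb)        # one 20GB file
tasks_for_small_file = block_count(10 * 1024**2)   # a 10MB file still needs a task
```

Under that assumption the 20GB file spans 160 blocks, so it can still be read by many tasks in parallel, while every tiny file costs a full task of its own.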
01-29-2016
10:02 AM
1 Kudo
Not sure what you mean by "confined" users and pulled from KDC. It's just the Hadoop (Linux) users/groups you want to give access to these services. For example, if you have a Linux group hadoopadmins whose members should be able to run these services, you would specify that group. KDC principals are mapped to Linux users by Hadoop using the auth_to_local rules. Normally the Linux users will come from LDAP/AD, but that does not have to be the case.
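To illustrate what mapping a principal to a Linux user means, here is a simplified sketch of what the DEFAULT auth_to_local rule does: for a principal in the cluster's default realm, it keeps the first component of the name. This is an illustration only, not Hadoop's implementation; custom RULE entries can rewrite names in other ways, and the realm and principals shown are hypothetical.

```python
def default_short_name(principal: str, default_realm: str) -> str:
    """Map a Kerberos principal to a local user name, DEFAULT-rule style."""
    name, _, realm = principal.partition("@")
    if realm != default_realm:
        # Principals from other realms need an explicit rule.
        raise ValueError(f"no rule applies to {principal}")
    # "user" stays "user"; "service/host" becomes "service".
    return name.split("/", 1)[0]


user = default_short_name("alice@EXAMPLE.COM", "EXAMPLE.COM")
service = default_short_name("hdfs/nn1.example.com@EXAMPLE.COM", "EXAMPLE.COM")
```

The resulting short name is then looked up as an ordinary Linux user, and that user's groups are what the service-level ACLs are checked against.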
01-29-2016
09:46 AM
1 Kudo
For the data in HDFS? Sure. https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
--fields-terminated-by <char>    Sets the field separator character
01-29-2016
09:43 AM
2 Kudos
Yes, this can be done in Oozie. I would suggest a shell action. You need to upload all the files you need (libraries etc.) by adding them in file tags. For example, I normally have a shell script that does a kinit for Kerberos if needed (you would need to upload the keytab as well) and then executes the Python scripts with parameters like outputFolder. This can run on any datanode, so all of them need access to your RSS feed. Alternatively, you could use an SSH action to connect to an edge node.

<action name="mypython">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <!-- The script that is executed; it must also be listed in a file tag below -->
        <exec>setupAndRun.sh</exec>
        <!-- Passed to the script as environment variables -->
        <env-var>outputFolder=${outputFolder}</env-var>
        <env-var>targetFolder=${targetFolder}</env-var>
        <!-- Files copied from HDFS into the working directory of the action -->
        <file>${nameNode}/hdfsfolder/setupAndRun.sh#setupAndRun.sh</file>
        <file>${nameNode}/hdfsfolder/mypython.py#mypython.py</file>
    </shell>
    <ok to="end" />
    <error to="kill" />
</action>
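On the Python side, the values set with env-var arrive as ordinary environment variables, which the script inherits from setupAndRun.sh. A minimal sketch of how mypython.py might pick them up is shown below. The function name `read_settings` is hypothetical, and what the script does with the folders afterwards is left out.

```python
import os
from typing import Dict, Mapping


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the folders passed in by the Oozie shell action's env-var tags."""
    settings = {}
    for key in ("outputFolder", "targetFolder"):
        value = env.get(key)
        if not value:
            # Fail early so the Oozie action goes to its error transition.
            raise SystemExit(f"environment variable {key} is not set")
        settings[key] = value
    return settings


# Example with an explicit mapping instead of the real environment.
example = read_settings({"outputFolder": "/tmp/out", "targetFolder": "/tmp/target"})
```

A non-zero exit code from the script makes the shell action fail, which is what routes the workflow to the `error` transition.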
01-28-2016
08:35 PM
3 Kudos
If you enable it, you also need to define ACLs for the different YARN services, i.e. define the users and groups that can execute specific tasks. More details can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ServiceLevelAuth.html#Enable_Service_Level_Authorization
01-28-2016
08:33 PM
1 Kudo
The path looks good. ${YEAR} is replaced with the current year, and so on. However, what do you see when you look into the ResourceManager as described above?