Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
  | 8191 | 06-19-2018 08:52 AM
  | 3147 | 06-13-2018 07:54 AM
  | 3575 | 06-02-2018 06:27 PM
  | 3881 | 05-01-2018 12:28 PM
  | 5403 | 04-24-2018 11:38 AM
05-12-2016
07:46 AM
Hi @Ryan Cicak, Cloudbreak can provision Ambari clusters on different cloud providers (OpenStack, AWS, GCP, Azure) on your behalf: http://sequenceiq.com/cloudbreak-docs/latest/architecture/ If you would like to run a multi-node Ambari cluster on your local machine, the following project can do that using Docker containers:
https://github.com/sequenceiq/docker-ambari Br, Tamas
05-10-2016
11:02 AM
Contrary to popular belief, Spark is not in-memory only.

a) Simple read, no shuffle (no joins, ...): For the initial read, Spark, like MapReduce, reads the data as a stream and processes it as it comes along. That is, unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so, however, if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (though that can be done too). So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.

b) Shuffle: This is done very similarly to MapReduce: the map outputs are written to disk and read by the reducers over HTTP. However, Spark uses an aggressive filesystem buffer strategy on the Linux filesystem, so if the OS has memory available the data will not actually be written to physical disk.

c) After the shuffle: RDDs after a shuffle are normally cached by the engine (otherwise a failed node or RDD would require a complete re-run of the job); however, as Abdelkrim mentions, Spark can spill these to disk unless you overrule that.

d) Spark Streaming: This is a bit different. Spark Streaming expects all data to fit in memory unless you override the settings.
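A minimal PySpark sketch of the caching point in (a) and (c) above; the HDFS path and app name are made up for illustration. By default the filtered RDD is never fully materialized; persist() is the explicit opt-in, and MEMORY_AND_DISK lets Spark spill partitions to disk when they do not fit in memory.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="caching-sketch")

# Plain read + filter: processed partition by partition as a stream;
# nothing forces the full dataset to be held in memory.
lines = sc.textFile("hdfs:///data/weblogs/20130901")   # hypothetical path
errors = lines.filter(lambda line: "ERROR" in line)

# Explicit opt-in to caching; MEMORY_AND_DISK spills partitions
# to disk when they do not fit in memory.
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())   # first action materializes (and caches) the RDD
print(errors.count())   # second action reuses the cached partitions

sc.stop()
```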
05-07-2016
07:34 PM
1 Kudo
1) You essentially have two options. Use Sqoop import-all-tables with its exclude option, as you mention. In that case you have a single Sqoop action in Oozie and no parallelism at the Oozie level (Sqoop itself might provide some), and you have some limitations (only straight imports of all columns, ...). Alternatively, you build an Oozie flow that uses a fork and one Sqoop action per table. In that case you have fine-grained control over how much you want to run in parallel. (You could, for example, load 4 at a time by doing Start -> Fork -> 4 Sqoop Actions -> Join -> Fork -> 4 Sqoop Actions -> Join -> End.)

2) If you want incremental loads, I don't think Sqoop import-all-tables is possible, so one Sqoop action per table it is. Essentially you can either use Sqoop's incremental import functionality (using a property file) or use WHERE conditions and pass the date parameter through from the coordinator. You can use coord:dateformat to transform your execution date.

3) Run one coordinator for each table, OR have a Decision node in the Oozie workflow that skips some Sqoop actions, like Start -> Sqoop1 where date = mydate -> Decision: if mydate % 3 = 0 then Sqoop2, else end.

4) Incremental imports load the new data into a folder in HDFS. If you re-run the import, that folder needs to be deleted. If you use append, it doesn't delete the old data in HDFS. Now you may ask why you would ever not want append, and the reason is that you usually do something with the data afterwards, like importing the new data into a partitioned Hive table. If you used append, the same data would be loaded over and over.
05-07-2016
06:38 PM
1 Kudo
Essentially, hive.server2.tez.default.queues exists for pre-initialized Tez sessions. Normally, starting an Application Master takes around 10 seconds, so the first query will be significantly slower. However, you can set hive.server2.tez.initialize.default.sessions=true.
This will initialize hive.server2.tez.sessions.per.default.queue AMs for each of those queues, which will then be used for query execution.
For most situations I would not bother with it too much, since subsequent queries will reuse existing AMs (which have an idle wait time). However, if you have strong SLAs you may want to use it.
tez.queue.name is then the actual queue you want to execute in. If you hit one of the default queues, the AM is already there and everything is faster. You might have distinct queues for big heavy queries and for small interactive queries, but you still need to set the queue yourself.
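To illustrate that last point, here is a hedged Python sketch of setting the queue per session from a HiveServer2 client; it assumes the PyHive package, and the host, username, queue name, and table are all made up. Any HS2 client (beeline, JDBC, ODBC) can issue the same SET statement.

```python
from pyhive import hive  # assumes the PyHive package is installed

# Hypothetical connection details for illustration only.
conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# Route this session's queries to a specific YARN queue.
# If the queue is one of hive.server2.tez.default.queues (with
# initialized sessions), a pre-started AM is reused and the first
# query avoids the AM startup delay.
cursor.execute("SET tez.queue.name=interactive")

cursor.execute("SELECT COUNT(*) FROM web_logs")  # hypothetical table
print(cursor.fetchone())

cursor.close()
conn.close()
```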
05-09-2016
05:53 PM
Awesome. Thanks Abdelkrim for the great info, this helps.
05-06-2016
05:14 PM
4 Kudos
Hi @Indrajit swain, you are hitting the Elasticsearch instance that Atlas runs in the background for its own operations. That is why you get an older version of ES when you curl port 9200. To check it, stop your ES instance and see whether something is still listening on port 9200:
netstat -npl | grep 9200
You should still see something listening even when your ES is down. You can see the configuration of the existing ES in the Atlas configuration in Ambari. When ES starts and finds its port (9200) already in use, it picks the next available one, so your ES instance will be running on port 9201. You can see it in the startup logs (as in my example):
[2016-05-06 17:09:41,452][INFO ][http ] [Speedball] publish_address {127.0.0.1:9201}, bound_addresses {127.0.0.1:9201}
You can curl both ports to verify:
[root@sandbox ~]# curl localhost:9200
{
"status" : 200,
"name" : "Gravity",
"version" : {
"number" : "1.2.1",
"build_hash" : "6c95b759f9e7ef0f8e17f77d850da43ce8a4b364",
"build_timestamp" : "2014-06-03T15:02:52Z",
"build_snapshot" : false,
"lucene_version" : "4.8"
},
"tagline" : "You Know, for Search"
}
[root@sandbox ~]# curl localhost:9201
{
"name" : "Speedball",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.3.2",
"build_hash" : "b9e4a6acad4008027e4038f6abed7f7dba346f94",
"build_timestamp" : "2016-04-21T16:03:47Z",
"build_snapshot" : false,
"lucene_version" : "5.5.0"
},
"tagline" : "You Know, for Search"
}
You can also change the ES port to whatever you want (the http.port setting in elasticsearch.yml). Hope this helps.
10-08-2017
10:34 PM
It is easy to integrate NiFi -> Kafka -> Spark, Storm, Flink, or Apex. Also NiFi -> Site-to-Site (S2S) -> Spark / Flink / ...
04-28-2017
04:10 PM
1 Kudo
This is a good article by our intern James Medel on protecting against accidental deletion:

USING HDFS SNAPSHOTS TO PROTECT IMPORTANT ENTERPRISE DATASETS

Some time back, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors. HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system, and they are:

Performant and reliable: snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree.
Scalable: snapshots do not create extra copies of blocks on the file system. They are highly optimized in memory and stored along with the NameNode's file system namespace.

In this blog post we'll walk through how to administer and use HDFS snapshots.

ENABLE SNAPSHOTS

In an example scenario, web server logs are loaded into HDFS a few times a day for processing and long-term storage, and the dataset is organized into directories that each hold one day's log files:

/data/weblogs
/data/weblogs/20130901
/data/weblogs/20130902
/data/weblogs/20130903

Since the web server logs are stored only in HDFS, it is imperative that they are protected from deletion. To provide data protection and recovery for the log data, snapshots are enabled for the parent directory:

hdfs dfsadmin -allowSnapshot /data/weblogs

Snapshots need to be explicitly enabled per directory. This gives system administrators the granular control they need to manage data in HDP.

TAKE POINT-IN-TIME SNAPSHOTS

The following command creates a point-in-time snapshot of the /data/weblogs directory and its subtree:

hdfs dfs -createSnapshot /data/weblogs

This creates a snapshot and gives it a default name that matches the timestamp at which it was created; users can provide an optional snapshot name instead. With the default name, the created snapshot path is /data/weblogs/.snapshot/s20130903-000941.091.

Users can schedule a CRON job to create snapshots at regular intervals. For example, the CRON entry 30 18 * * * rm /home/someuser/tmp/* deletes the contents of the tmp folder at 18:30 every day. In the same way, the entry 30 18 * * * hdfs dfs -createSnapshot /data/weblogs schedules a snapshot of /data/weblogs to be created each day at 18:30.

To view the state of the directory at the recently created snapshot:

hdfs dfs -ls /data/weblogs/.snapshot/s20130903-000941.091
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/.snapshot/s20130903-000941.091/20130901
drwxr-xr-x - web hadoop 0 2013-09-02 00:55 /data/weblogs/.snapshot/s20130903-000941.091/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/.snapshot/s20130903-000941.091/20130903

RECOVER LOST DATA

As new data is loaded into the web logs dataset, there could be an erroneous deletion of a file or directory. For example, an application could delete the set of logs for September 2nd, 2013 stored in the /data/weblogs/20130902 directory. Since /data/weblogs has a snapshot, the snapshot protects the file blocks from being removed from the file system; the deletion only modifies the metadata to remove /data/weblogs/20130902 from the working directory.

To recover from this deletion, data is restored by copying it back from the snapshot path:

hdfs dfs -cp /data/weblogs/.snapshot/s20130903-000941.091/20130902 /data/weblogs/

This restores the lost set of files to the working data set:

hdfs dfs -ls /data/weblogs
Found 3 items
drwxr-xr-x - web hadoop 0 2013-09-01 23:59 /data/weblogs/20130901
drwxr-xr-x - web hadoop 0 2013-09-04 12:10 /data/weblogs/20130902
drwxr-xr-x - web hadoop 0 2013-09-03 23:57 /data/weblogs/20130903

Since snapshots are read-only, HDFS also protects against user or application deletion of the snapshot data itself. The following operation will fail:

hdfs dfs -rmdir /data/weblogs/.snapshot/s20130903-000941.091/20130902

NEXT STEPS

With HDP 2.1, you can use snapshots to protect your enterprise data from accidental deletion, corruption, and errors. Download HDP to get started.
03-29-2016
07:47 PM
4 Kudos
Hi @Vadim, OpenCV is well known for image processing in general and has several tools for image and face recognition. Here is an example of how to do face recognition with OpenCV: tutorial. In terms of integration with Hadoop, there is a framework called HIPI, developed by the University of Virginia, for leveraging HDFS and MapReduce for large-scale image processing; this framework supports OpenCV too. Finally, for image processing in motion, you can use HDF with an OpenCV processor like the one published here.
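For a feel of the OpenCV side before any Hadoop integration, here is a minimal Python sketch of Haar-cascade face detection (detection rather than full recognition); the image paths are made up, and cv2.data.haarcascades assumes a reasonably recent opencv-python package.

```python
import cv2

# Hypothetical input/output paths for illustration only.
IMAGE_IN = "faces.jpg"
IMAGE_OUT = "faces_detected.jpg"

# Load one of the Haar cascade models that ship with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread(IMAGE_IN)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces; tune scaleFactor/minNeighbors for your images.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a bounding box around each detected face and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite(IMAGE_OUT, image)
print("Detected %d face(s)" % len(faces))
```

The same per-image logic is what a HIPI or NiFi-based pipeline would run at scale, with HDFS providing the image storage.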