Member since: 02-10-2016
Posts: 50
Kudos Received: 14
Solutions: 5
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1371 | 02-08-2017 05:53 AM |
 | 1068 | 02-02-2017 11:39 AM |
 | 3269 | 01-27-2017 06:17 PM |
 | 1417 | 01-27-2017 04:43 PM |
 | 1924 | 01-27-2017 01:57 PM |
02-12-2017
01:30 PM
Repo Description
Superset is a data exploration platform designed to be visual, intuitive and interactive. [This project used to be named Caravel, and Panoramix in the past.]
Screenshots & Gifs
View Dashboards
View/Edit a Slice
Query and Visualize with SQL Lab
Superset
Superset's main goal is to make it easy to slice, dice and visualize data. It empowers users to perform analytics at the speed of thought. Superset provides:
- A quick way to intuitively visualize datasets by allowing users to create and share interactive dashboards
- A rich set of visualizations to analyze your data, as well as a flexible way to extend the capabilities
- An extensible, high-granularity security model allowing intricate rules on who can access which features, and integration with major authentication providers (database, OpenID, LDAP, OAuth & REMOTE_USER through Flask AppBuilder)
- A simple semantic layer, allowing you to control how data sources are displayed in the UI by defining which fields should show up in which dropdown and which aggregations and functions (metrics) are made available to the user
- Deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, real-time datasets
- Fast-loading dashboards with configurable caching

Repo Info
Github Repo URL: https://github.com/airbnb/superset
Github account name: airbnb
Repo name: superset
02-12-2017
10:46 AM
Repo Description
Thrill is an EXPERIMENTAL C++ framework for algorithmic distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing. For more information on its goals and mission, see http://project-thrill.org. For easy steps on Getting Started, refer to the Live Documentation.

Repo Info
Github Repo URL: https://github.com/thrill/thrill/
Github account name: thrill
Repo name: thrill
02-11-2017
01:21 PM
1 Kudo
Repo Description
The Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies. It is designed to deliver uncompromised performance for a wide set of in-memory computing use cases, from high-performance computing to the industry's most advanced data grid, highly available service grid, and streaming.

Advanced Clustering
Ignite nodes can automatically discover each other. This helps to scale the cluster when needed, without having to restart the whole cluster. Developers can also leverage Ignite's hybrid cloud support, which allows establishing a connection between a private cloud and public clouds such as Amazon Web Services, providing them with the best of both worlds.

Data Grid (JCache)
The Ignite data grid is an in-memory distributed key-value store which can be viewed as a distributed partitioned hash map, with every cluster node owning a portion of the overall data. This way, the more cluster nodes we add, the more data we can cache. Unlike other key-value stores, Ignite determines data locality using a pluggable hashing algorithm. Every client can determine which node a key belongs to by plugging it into a hashing function, without the need for any special mapping servers or name nodes (a minimal usage sketch follows this repo description). The Ignite data grid supports local, replicated, and partitioned data sets and allows you to freely cross-query between these data sets using standard SQL syntax. Ignite supports standard SQL for querying in-memory data, including support for distributed SQL joins. Our data grid offers many features, some of which are:
- Primary & backup copies
- Near caches
- Cache queries and SQL queries
- Continuous queries
- Transactions
- Off-heap memory
- Affinity collocation
- Persistent store
- Automatic persistence
- Data loading
- Eviction and expiry policies
- Data rebalancing
- Web session clustering
- Hibernate L2 cache
- JDBC driver
- Spring caching
- Topology validation

Streaming & CEP
Ignite streaming allows you to process continuous, never-ending streams of data in a scalable and fault-tolerant fashion. The rates at which data can be injected into Ignite can be very high and easily exceed millions of events per second on a moderately sized cluster. Real-time data is ingested via data streamers. We already offer streamers for JMS 1.1, Apache Kafka, MQTT, Twitter, Apache Flume and Apache Camel, and we keep adding new ones every release. The data can then be queried within sliding windows, if needed.

Compute Grid
Distributed computations are performed in a parallel fashion to gain high performance, low latency, and linear scalability. The Ignite compute grid provides a set of simple APIs that allow users to distribute computations and data processing across multiple computers in the cluster. Distributed parallel processing is based on the ability to take any computation, execute it on any set of cluster nodes and return the results. We support these features, amongst others:
- Distributed closure execution
- MapReduce & ForkJoin processing
- Clustered executor service
- Collocation of compute and data
- Load balancing
- Fault tolerance
- Job state checkpointing
- Job scheduling

Service Grid
The Service Grid allows for deployments of arbitrary user-defined services on the cluster. You can implement and deploy any service, such as custom counters, ID generators, hierarchical maps, etc. Ignite allows you to control how many instances of your service should be deployed on each cluster node and will automatically ensure proper deployment and fault tolerance of all the services.

Ignite File System
The Ignite File System (IGFS) is an in-memory file system that allows working with files and directories over the existing cache infrastructure. IGFS can either work as a purely in-memory file system or delegate to another file system (e.g. various Hadoop file system implementations), acting as a caching layer. In addition, IGFS provides an API to execute map-reduce tasks over file system data.

Distributed Data Structures
Ignite supports complex data structures in a distributed fashion:
- Queues and sets: ordinary, bounded, collocated, non-collocated
- Atomic types: AtomicLong and AtomicReference
- CountDownLatch
- ID generators

Distributed Messaging
Distributed messaging allows for topic-based cluster-wide communication between all nodes. Messages with a specified message topic can be distributed to all or a sub-group of nodes that have subscribed to that topic. Ignite messaging is based on the publish-subscribe paradigm, where publishers and subscribers are connected by a common topic. When one of the nodes sends a message A for topic T, it is published on all nodes that have subscribed to T.

Distributed Events
Distributed events allow applications to receive notifications when a variety of events occur in the distributed grid environment. You can automatically get notified of task executions, read, write or query operations occurring on local or remote nodes within the cluster.

Hadoop Accelerator
Our Hadoop Accelerator provides a set of components allowing for in-memory Hadoop job execution and file system operations.
- MapReduce: an alternate high-performance implementation of the job tracker which replaces standard Hadoop MapReduce. Use it to boost your Hadoop MapReduce job execution performance.
- IGFS - In-Memory File System: a Hadoop-compliant IGFS file system implementation over which Hadoop can run in a plug-n-play fashion, significantly reducing I/O and improving both latency and throughput.
- Secondary File System: an implementation of SecondaryFileSystem. This implementation can be injected into an existing IGFS, allowing for read-through and write-through behavior over any other Hadoop FileSystem implementation (e.g. HDFS). Use it if you want your IGFS to become an in-memory caching layer over disk-based HDFS or any other Hadoop-compliant file system.

Supported Hadoop distributions: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Apache BigTop.

Spark Shared RDDs
Apache Ignite provides an implementation of the Spark RDD abstraction which makes it easy to share state in memory across Spark jobs. The main difference between a native Spark RDD and an IgniteRDD is that the IgniteRDD provides a shared in-memory view of data across different Spark jobs, workers, or applications, while a native Spark RDD cannot be seen by other Spark jobs or applications.

Repo Info
Github Repo URL: https://github.com/apache/ignite
Github account name: apache
Repo name: ignite
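To make the data grid section concrete, here is a minimal Java sketch of the JCache-style key-value usage, assuming a node started with default configuration; the cache name and sample keys are made up for illustration:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteCacheSketch {
    public static void main(String[] args) {
        // Start (or join) an Ignite node; with no explicit config it uses
        // defaults and discovers peers automatically.
        try (Ignite ignite = Ignition.start()) {
            // getOrCreateCache returns a distributed key-value cache;
            // "profileCache" is an arbitrary name chosen for this sketch.
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("profileCache");

            // Keys are hashed to cluster nodes by Ignite's affinity function,
            // so puts and gets work the same regardless of cluster size.
            cache.put(1, "alpha");
            cache.put(2, "beta");

            System.out.println("key 1 -> " + cache.get(1));
        }
    }
}
```

Because keys are mapped to nodes by the affinity function, the same code works unchanged whether you run one node or a whole cluster.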
02-11-2017
01:19 PM
1 Kudo
Repo Description
Apache JMeter features include:

Ability to load and performance test many different server/protocol types:
- Web - HTTP, HTTPS
- SOAP / REST
- FTP
- Database via JDBC
- LDAP
- Message-oriented Middleware (MOM) via JMS
- Mail - SMTP(S), POP3(S) and IMAP(S)
- Native commands or shell scripts
- TCP

Full multi-threading framework allows concurrent sampling by many threads and simultaneous sampling of different functions by separate thread groups.

Careful GUI design allows faster Test Plan building and debugging.

Caching and offline analysis/replaying of test results.

Highly extensible core:
- Pluggable Samplers allow unlimited testing capabilities.
- Several load statistics may be chosen with pluggable timers.
- Data analysis and visualization plugins allow great extensibility and personalization.
- Functions can be used to provide dynamic input to a test or provide data manipulation.
- Scriptable Samplers (Groovy, BeanShell, BSF- and JSR223-compatible languages)

Repo Info
Github Repo URL: https://github.com/apache/jmeter
Github account name: apache
Repo name: jmeter
02-10-2017
06:33 PM
2 Kudos
Repo Description
Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flink at http://flink.apache.org/

Features
- A streaming-first runtime that supports both batch processing and data streaming programs
- Elegant and fluent APIs in Java and Scala
- A runtime that supports very high throughput and low event latency at the same time
- Support for event time and out-of-order processing in the DataStream API, based on the Dataflow Model
- Flexible windowing (time, count, sessions, custom triggers) across different time semantics (event time, processing time); a small example follows this repo description
- Fault tolerance with exactly-once processing guarantees
- Natural back-pressure in streaming programs
- Libraries for graph processing (batch), machine learning (batch), and complex event processing (streaming)
- Built-in support for iterative programs (BSP) in the DataSet (batch) API
- Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
- Compatibility layers for Apache Hadoop MapReduce and Apache Storm
- Integration with YARN, HDFS, HBase, and other components of the Apache Hadoop ecosystem

Repo Info
Github Repo URL: https://github.com/apache/flink
Github account name: apache
Repo name: flink
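To give a feel for the DataStream API and its windowing, here is a minimal Java sketch. It assumes a local socket source (`nc -lk 9999`) purely for illustration, and the window size is arbitrary:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Socket source assumed only for this sketch: run `nc -lk 9999`
        // locally and type lines of text.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)                        // group by the word field
            .timeWindow(Time.seconds(5))     // tumbling 5-second windows
            .sum(1);                         // sum the counts per window

        counts.print();
        env.execute("Windowed word count");
    }
}
```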
02-09-2017
08:28 PM
No, it is not possible: "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns" Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
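For illustration, a minimal Java sketch of the pivot-as-aggregation behavior described in the quote; the input file and the column names (year, product, amount) are made up:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pivot-sketch")
                .master("local[*]")
                .getOrCreate();

        // "sales.json" and its columns are placeholders for this sketch.
        Dataset<Row> sales = spark.read().json("sales.json");

        // pivot() is an aggregation: the distinct values of the pivoted
        // column ("product") become individual output columns.
        Dataset<Row> byYear = sales
                .groupBy("year")
                .pivot("product")
                .sum("amount");

        byYear.show();
        spark.stop();
    }
}
```

Because the distinct values become columns in the aggregated result, the original row-level data cannot be recovered from the pivoted output.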
02-08-2017
05:53 AM
1 Kudo
Very good question! Let's dig into Hadoop's source to find this out. The audit log uses java.net.InetAddress's toString() method to obtain a text format of the address: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L7049 InetAddress's toString() returns the information in "hostname/ip" format. If the hostname is not resolvable (reverse lookup is not working), then you get a leading slash: http://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#toString()
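You can reproduce both formats in isolation with a small sketch (the raw address below is just an example; the exact resolved output depends on your DNS setup):

```java
import java.net.InetAddress;

public class InetAddressToStringDemo {
    public static void main(String[] args) throws Exception {
        // Created with a hostname, so toString() prints "hostname/ip".
        InetAddress resolved = InetAddress.getByName("localhost");
        System.out.println(resolved);        // e.g. localhost/127.0.0.1

        // Created from raw bytes with no hostname; without a reverse lookup
        // filling in the name, toString() starts with "/".
        InetAddress unresolved = InetAddress.getByAddress(new byte[] {10, 0, 0, 1});
        System.out.println(unresolved);       // /10.0.0.1
    }
}
```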
02-03-2017
01:56 PM
It really depends on your use case and latency requirements. If you need to store Storm's results in HDFS, then you can use a Storm HDFS Bolt. If you only need to store the source data, I'd suggest storing it directly from Kafka or Flume. That'll result in lower latency on the Storm topology and better decoupling.
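As a rough sketch of what wiring up the storm-hdfs bolt looks like (the HDFS URL, output path, delimiter and rotation size are placeholders, not recommendations):

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltSketch {
    public static HdfsBolt buildHdfsBolt() {
        // Write tuple fields separated by "|" into the output files.
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

        // Sync the filesystem after every 1000 tuples.
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);

        // Rotate files once they reach 128 MB.
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(128.0f, Units.MB);

        // "/storm/output/" is a made-up HDFS path for this sketch.
        FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/storm/output/");

        // "hdfs://namenode:8020" is a placeholder for your NameNode URL.
        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}
```

The bolt is then attached to the topology with TopologyBuilder.setBolt like any other bolt.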
02-02-2017
12:15 PM
In Storm's nomenclature, 'nimbus' is the cluster manager: http://storm.apache.org/releases/1.0.1/Setting-up-a-Storm-cluster.html
Spark calls the cluster manager the 'master': http://spark.apache.org/docs/latest/spark-standalone.html
02-02-2017
11:39 AM
Hello,

Both Storm & Spark support local mode. In Storm you need to create a LocalCluster instance and then submit your topology to it. You can find a description and an example at these links:
http://storm.apache.org/releases/1.0.2/Local-mode.html
https://github.com/apache/storm/blob/1.0.x-branch/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java#L98

Spark's approach to local mode is somewhat different. The allocation is controlled through the Spark master setting, which can be set to local (or local[*], or local[N] where N is a number). If local is specified, the executors are started on your machine.

Both Storm and Spark have monitoring capabilities through a web interface. You can find details about them here:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_storm-component-guide/content/using-storm-ui.html
http://spark.apache.org/docs/latest/monitoring.html

YARN is not a requirement but an option for distributed mode; both Spark & Storm are able to function on their own.
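For the Storm side, a minimal sketch of local-mode submission; it uses the TestWordSpout that ships with storm-core just to have something to run, and the topology name and run duration are arbitrary. The Spark equivalent is simply setting the master to local[*] when building the SparkConf or SparkSession.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class LocalModeSketch {
    public static void main(String[] args) throws Exception {
        // Build a trivial topology; TestWordSpout just emits random words,
        // which is enough to see local mode running.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);

        Config conf = new Config();
        conf.setDebug(true);

        // LocalCluster runs Nimbus, Supervisor and workers inside this JVM,
        // so no Storm daemons need to be installed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-demo", conf, builder.createTopology());

        Thread.sleep(10000);       // let the topology run for a while
        cluster.killTopology("local-demo");
        cluster.shutdown();
    }
}
```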