Member since
07-11-2016
05-10-2017
09:17 PM
Repo Description
Airpal is a web-based query execution tool that leverages Facebook's PrestoDB to make authoring queries and retrieving results simple for users. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, and then submit queries, all in a web interface. Once queries are running, users can track query progress and, when they finish, get the results back through the browser as a CSV (download it or share it with friends). The results of a query can be used to generate a new Hive table for subsequent analysis, and Airpal maintains a searchable history of all queries run within the tool.
Repo Info
Github Repo URL: https://github.com/airbnb/airpal
Github account name: airbnb
Repo name: airpal
05-10-2017
09:16 PM
Repo Description
omniduct is a Python 2/3 package that provides a uniform interface for connecting to and extracting data from a wide variety of (potentially remote) data stores (including HDFS, Hive, Presto, MySQL, etc). It is especially useful in contexts where the data stores are only available via remote gateway nodes, where omniduct can automatically manage port forwarding over SSH to make these data stores available locally. It also provides convenient magic functions for use in IPython and Jupyter Notebooks. omniduct has been extensively tested internally, but until our 1.0.0 release, we offer no guarantee of API stability. Documentation for both users and developers will be arriving shortly, but the code is currently being offered for early adopters.
Repo Info
Github Repo URL: https://github.com/airbnb/omniduct
Github account name: airbnb
Repo name: omniduct
05-10-2017
09:14 PM
Repo Description
Deployment is automated: simple, safe and repeatable for any AWS account
Easily scalable from megabytes to terabytes per day
Infrastructure maintenance is minimal, no devops expertise required
Infrastructure security is a default, no security expertise required
Supports data from different environments (ex: IT, PCI, Engineering)
Supports data from different environment types (ex: Cloud, Datacenter, Office)
Supports different types of data (ex: JSON, CSV, Key-Value, or Syslog)
Supports different use-cases like security, infrastructure, compliance and more
Repo Info
Github Repo URL: https://github.com/airbnb/streamalert
Github account name: airbnb
Repo name: streamalert
05-10-2017
09:08 PM
Repo Description
There are two ways to think of SSDs in system design. One is to think of an SSD as an extension of disk, where it plays the role of making disks fast. The other is to think of it as an extension of memory, where it plays the role of making memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected, SSD-backed memory makes sense if we design the system so that network latencies dominate SSD latencies by a large factor.

To understand why network-connected SSD makes sense, it is important to understand the role distributed memory plays in large-scale web architecture. In recent years, terabyte-scale, distributed, in-memory caches have become a fundamental building block of any web architecture. In-memory indexes, hash tables, key-value stores and caches are increasingly incorporated for scaling throughput and reducing latency of persistent storage systems. However, power consumption, operational complexity and single-node DRAM cost make horizontally scaling this architecture challenging. The current cost of DRAM per server increases dramatically beyond approximately 150 GB, and power cost scales similarly as DRAM density increases.

Fatcache extends a volatile, in-memory cache by incorporating SSD-backed storage. SSD-backed memory presents a viable alternative for applications with large workloads that need to maintain a high hit rate for high performance. SSDs have higher capacity per dollar and lower power consumption per byte, without degrading random read latency beyond network latency. Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria:
Minimize disk reads on cache hit
Eliminate small, random disk writes

The latter is important due to SSDs' unique write characteristics. Writes and in-place updates to SSDs degrade performance due to an erase-and-rewrite penalty and garbage collection of dead blocks. Fatcache batches small writes to obtain consistent performance and increased disk lifetime.

SSD reads happen at page-size granularity, usually 4 KB. Single-page read access times are approximately 50 to 70 usec, and a single commodity SSD can sustain nearly 40K read IOPS at a 4 KB page size. A 70 usec read latency means that disk latency will overtake typical network latency after a small number of reads. Fatcache reduces disk reads by maintaining an in-memory index for all on-disk data.
Repo Info
Github Repo URL: https://github.com/twitter/fatcache
Github account name: twitter
Repo name: fatcache
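The two fatcache design criteria described above (an in-memory index so a hit costs at most one disk read, and batching small writes into large sequential ones) can be illustrated with a toy sketch. This is not fatcache's actual code or on-disk format: the class name BatchedSlabCache, the slab_size parameter and the use of an in-memory file object in place of an SSD are all assumptions made for illustration.

```python
import io


class BatchedSlabCache:
    """Toy SSD-backed cache: in-memory index plus batched, append-only writes."""

    def __init__(self, device, slab_size=4096):
        self.device = device        # any seekable binary file stands in for the SSD
        self.slab_size = slab_size  # hypothetical batch size for one device write
        self.buffer = bytearray()   # small writes accumulate here in memory
        self.flushed = 0            # number of bytes already written to the device
        self.index = {}             # key -> (offset, length), held entirely in memory
        self.device_reads = 0
        self.device_writes = 0

    def set(self, key, value):
        # Flush first if this value would overflow the current batch.
        if self.buffer and len(self.buffer) + len(value) > self.slab_size:
            self.flush()
        offset = self.flushed + len(self.buffer)
        self.buffer += value
        self.index[key] = (offset, len(value))  # an overwrite leaves the old bytes dead

    def flush(self):
        # One large sequential write instead of many small random ones.
        if not self.buffer:
            return
        self.device.seek(self.flushed)
        self.device.write(bytes(self.buffer))
        self.device_writes += 1
        self.flushed += len(self.buffer)
        self.buffer = bytearray()

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None  # a miss is answered from memory, with no device read
        offset, length = entry
        if offset >= self.flushed:
            start = offset - self.flushed
            return bytes(self.buffer[start:start + length])  # still in the write buffer
        self.device.seek(offset)
        self.device_reads += 1  # a hit on flushed data costs exactly one read
        return self.device.read(length)


# Usage: io.BytesIO plays the role of the SSD.
cache = BatchedSlabCache(io.BytesIO(), slab_size=16)
cache.set("a", b"12345678")
cache.set("b", b"abcdefgh")
cache.set("c", b"XYZ")  # does not fit in the 16-byte batch, so "a" and "b" are flushed together
```

Because the index lives in memory, the sketch never touches the device to decide whether a key exists, and the two 8-byte values reach the device in a single write.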
05-10-2017
09:06 PM
Repo Description
About
Apache DistributedLog (DL) is a high-throughput, low-latency replicated log service, offering durability, replication and strong consistency as essentials for building reliable real-time applications.
Status
The Apache DistributedLog project is in the process of incubating. This includes the creation of project resources, the refactoring of the initial code submissions, the formulation of project documentation and planning, and the improvement of existing user and operations documents. Any feedback and contributions are welcome.
Repo Info
Github Repo URL: https://github.com/twitter/distributedlog
Github account name: twitter
Repo name: distributedlog
05-10-2017
09:02 PM
Repo Description
Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc. The majority of these are in production at Twitter running over data every day.
Repo Info
Github Repo URL: https://github.com/twitter/elephant-bird
Github account name: twitter
Repo name: elephant-bird
05-10-2017
08:59 PM
Repo Description
GraphJet is a real-time graph processing library written in Java that maintains a full graph index over a sliding time window in memory on a single server. This index supports a variety of graph algorithms including personalized recommendation algorithms based on collaborative filtering. These algorithms power a variety of real-time recommendation services within Twitter, notably content (tweets/URLs) recommendations that require collaborative filtering over a heterogeneous, rapidly evolving graph. GraphJet is able to support rapid ingestion of edges in an evolving graph while concurrently serving lookup queries through a combination of compact edge encoding and a dynamic memory allocation scheme. Each GraphJet server can ingest up to one million graph edges per second, and in steady state, computes up to 500 recommendations per second, which translates into several million edge read operations per second. More information about the internals of GraphJet can be found in the VLDB'16 paper.
Repo Info
Github Repo URL: https://github.com/twitter/GraphJet
Github account name: twitter
Repo name: GraphJet
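To make the idea of a graph index over a sliding time window concrete, here is a minimal sketch of one simple way to do it: edges go into time-bucketed segments and the oldest segment is dropped as the window advances. It is an illustration only, not GraphJet's Java API or its edge encoding. The class name SlidingWindowGraph and the parameters segment_seconds and max_segments are hypothetical, and the sketch assumes edges arrive with non-decreasing timestamps.

```python
from collections import defaultdict, deque


class SlidingWindowGraph:
    """Toy user -> item edge index that keeps only the most recent time segments."""

    def __init__(self, segment_seconds, max_segments):
        self.segment_seconds = segment_seconds  # hypothetical width of one segment
        self.max_segments = max_segments        # hypothetical window length, in segments
        self.segments = deque()                 # (segment_id, adjacency dict), oldest first

    def add_edge(self, user, item, timestamp):
        # Assumes timestamps are non-decreasing, so new edges land in the newest segment.
        segment_id = int(timestamp // self.segment_seconds)
        if not self.segments or self.segments[-1][0] < segment_id:
            self.segments.append((segment_id, defaultdict(list)))
            # Expire whole segments that have fallen out of the window.
            while self.segments[0][0] <= segment_id - self.max_segments:
                self.segments.popleft()
        self.segments[-1][1][user].append(item)

    def neighbors(self, user):
        # A lookup scans the live segments from oldest to newest.
        result = []
        for _, adjacency in self.segments:
            result.extend(adjacency.get(user, ()))
        return result


# Usage: 10-second segments, window of 2 segments.
graph = SlidingWindowGraph(segment_seconds=10, max_segments=2)
graph.add_edge("u1", "t1", 0)
graph.add_edge("u1", "t2", 12)
graph.add_edge("u2", "t3", 25)  # opens a third segment, so the oldest one is dropped
```

Dropping a whole segment at a time keeps expiry cheap: no individual edge ever has to be deleted.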
05-10-2017
08:55 PM
Repo Description
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
Repo Info
Github Repo URL: https://github.com/twitter/scalding
Github account name: twitter
Repo name: scalding
05-10-2017
08:52 PM
Repo Description
Algebird was originally developed as part of Scalding's Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it became clear that the code had broader application within Scalding and on other projects within Twitter. See the Algebird website for more information.
Repo Info
Github Repo URL: https://github.com/twitter/algebird
Github account name: twitter
Repo name: algebird
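For readers unfamiliar with the algebraic terms in the description above, a monoid is a type with an associative combine operation and an identity element, which is what lets partial aggregates be merged in any grouping. The sketch below illustrates the idea in Python; it is not Algebird's Scala API, and the names Monoid, int_add and map_monoid are assumptions chosen for this example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An identity element plus an associative binary operation."""
    zero: T
    plus: Callable[[T, T], T]

    def sum(self, values: Iterable[T]) -> T:
        # Fold all values together, starting from the identity element.
        return reduce(self.plus, values, self.zero)


# Integers under addition form a monoid with identity 0.
int_add = Monoid(0, lambda a, b: a + b)


def map_monoid(value_monoid):
    """Lift a monoid on values to a monoid on dicts by merging key by key."""
    def plus(left, right):
        merged = dict(left)
        for key, value in right.items():
            merged[key] = value_monoid.plus(merged.get(key, value_monoid.zero), value)
        return merged
    return Monoid({}, plus)


# Usage: merge per-partition word counts into one total.
counts = map_monoid(int_add)
total = counts.sum([{"a": 1}, {"a": 2, "b": 5}])
```

Because plus is associative, partial results can be computed per partition and then merged, which is the property aggregation systems rely on.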
05-10-2017
08:46 PM
Repo Description
With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education?
Repo Info
Github Repo URL: https://github.com/datasciencemasters/go
Github account name: datasciencemasters
Repo name: go