Cloud Computing and Virtualization

User-level System Services with Libservices, Polytropon and Danaus

Containers are commonly used to run the data-intensive applications of different tenants in cloud infrastructures. The storage I/O of the colocated tenants is typically handled by the shared system kernel of the container host. When a data-intensive container competes with a noisy neighbor, the kernel I/O services can cause performance variability and slowdown. This is a challenging problem for which several approaches have already been explored. Although dynamic resource allocation, kernel structure replication, and hardware-level virtualization are helpful, they incur high implementation complexity and execution overhead. As a realistic, cost-effective alternative, we isolate the I/O path of each tenant by running dedicated storage systems at user level on reserved resources.

We introduce the libservices as a unified user-level storage abstraction to dynamically provision per-tenant container root filesystems, application data filesystems, and image repositories. The Polytropon toolkit is a collection of user-level components configurable to build several types of filesystems. The toolkit provides an application library that invokes the standard I/O calls, a user-level path that isolates the tenant I/O traffic to private host resources, and user-level filesystem services distinct per tenant. Furthermore, we introduce the RCQB concurrent queue, which relaxes operation ordering for improved communication throughput, and the SMO pipelined memory copy, which is faster than standard methods. We use Polytropon to build the Danaus client of a distributed filesystem, optionally combined with a union filesystem.
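As an illustration of the ordering relaxation behind RCQB, the following minimal sketch implements a bounded queue in which producers and consumers claim slot tickets with atomic fetch-and-add and move past busy slots instead of preserving strict FIFO order. The data layout, the ticket scheme, and all names here are illustrative assumptions, not the actual Polytropon implementation.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QSIZE 1024  /* power of two, so index wrap is a cheap mask */

    struct rqueue {
        _Atomic(void *) slot[QSIZE];  /* NULL means the slot is empty */
        atomic_uint     enq;          /* next ticket for producers */
        atomic_uint     deq;          /* next ticket for consumers */
    };

    /* Producers grab a ticket and publish into the matching slot; if the
     * slot is still occupied they move to the next ticket instead of
     * waiting, which relaxes FIFO order but avoids convoying. */
    static bool rqueue_enqueue(struct rqueue *q, void *item)
    {
        for (int attempt = 0; attempt < QSIZE; attempt++) {
            unsigned t = atomic_fetch_add(&q->enq, 1) & (QSIZE - 1);
            void *expected = NULL;
            if (atomic_compare_exchange_strong(&q->slot[t], &expected, item))
                return true;   /* slot was empty; item is now visible */
        }
        return false;          /* queue looks full after a bounded scan */
    }

    /* Consumers perform the symmetric scan, claiming the first occupied
     * slot they find; items may be consumed slightly out of order. */
    static void *rqueue_dequeue(struct rqueue *q)
    {
        for (int attempt = 0; attempt < QSIZE; attempt++) {
            unsigned t = atomic_fetch_add(&q->deq, 1) & (QSIZE - 1);
            void *item = atomic_exchange(&q->slot[t], NULL);
            if (item)
                return item;
        }
        return NULL;           /* nothing found in a bounded scan */
    }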

Multitenant Access Control with Dike

In a virtualization environment that serves multiple tenants (independent organizations), storage consolidation at the filesystem level is desirable because it enables data sharing, administration efficiency, and performance optimizations. The scalable deployment of filesystems in such environments is challenging due to the intermediate translation layers required for networked file access or identity management. First we define the entities involved in a multitenant filesystem and present relevant security requirements. Then we introduce the design of the Dike authorization architecture, which combines native access control with tenant namespace isolation and compatibility with object-based filesystems. We introduce secure protocols to authenticate the participating entities and authorize the data access over the network.
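To make the interplay of namespace isolation and native access control concrete, here is a hedged sketch of a tenant-scoped authorization check: a request is rejected outright unless the principal and the object share a tenant namespace, and only then is the per-object ACL consulted. The types and field names are assumptions for illustration, not the Dike protocol itself.

    #include <stdbool.h>
    #include <stdint.h>

    #define PERM_READ  0x1
    #define PERM_WRITE 0x2

    struct ace { uint32_t uid; uint32_t perms; };  /* access-control entry */

    struct object {
        uint32_t   tenant_id;   /* owning tenant namespace */
        struct ace acl[8];      /* native per-object ACL */
        int        nace;
    };

    struct principal { uint32_t tenant_id; uint32_t uid; };

    /* Namespace isolation first, native ACL evaluation second: a request
     * never even consults the ACL of an object in a foreign tenant. */
    static bool authorize(const struct principal *p,
                          const struct object *o, uint32_t want)
    {
        if (p->tenant_id != o->tenant_id)
            return false;
        for (int i = 0; i < o->nace; i++)
            if (o->acl[i].uid == p->uid)
                return (o->acl[i].perms & want) == want;
        return false;
    }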

Client-side Journaling for Durable Shared Storage with Arion

Hardware consolidation in the datacenter often leads to scalability bottlenecks from heavy utilization of critical resources, such as the storage and network bandwidth. Client-side caching on durable media is already applied at block level to reduce the storage backend load, but has received criticism for added overhead, restricted sharing, and possible data loss at client crash. We introduce a journal to the kernel-level client of an object-based distributed filesystem in order to improve durability at high I/O performance and reduced shared resource utilization. Storage virtualization at the file interface achieves clear consistency semantics across data and metadata, supports native file sharing among clients, and provides flexible configuration of durable data staging at the host. Over a prototype that we have implemented, we experimentally quantify the performance and efficiency of the proposed Arion system in comparison to a production system.
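The client-side journaling idea can be summarized in a short sketch: a write is appended durably to a local journal device and acknowledged without touching the shared backend, and is staged asynchronously to the object store later. The record layout and helper below are assumptions, not the Arion on-disk format.

    #include <stdint.h>
    #include <unistd.h>

    struct jrec_hdr {
        uint64_t ino;     /* file identity */
        uint64_t off;     /* file offset of the data */
        uint32_t len;     /* payload length */
        uint32_t csum;    /* checksum; computation omitted in this sketch */
    };

    /* Append one record and force it to stable media before returning,
     * so the client can acknowledge durability without touching the
     * shared backend on the write path. */
    static int journal_append(int jfd, uint64_t ino, uint64_t off,
                              const void *buf, uint32_t len)
    {
        struct jrec_hdr h = { .ino = ino, .off = off, .len = len, .csum = 0 };
        if (write(jfd, &h, sizeof h) != (ssize_t)sizeof h)
            return -1;                    /* header append failed */
        if (write(jfd, buf, len) != (ssize_t)len)
            return -1;                    /* payload append failed */
        return fdatasync(jfd);            /* durable on the local journal */
    }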

Search Privacy with Lethe

Secure keyword search in shared infrastructures prevents stored documents from leaking confidential information to unauthorized users. We assume that a shared index provides confidentiality if it can only be used by users authorized to search all the documents contained in the index. We introduce the Lethe indexing workflow to improve query and update efficiency in secure keyword search. Lethe clusters together documents with similar sets of authorized users, and only creates shared indices for configurable volumes of documents with common users. Based on the published statistics of an existing dataset, we show that Lethe generates an indexing organization that simultaneously achieves both low search and update cost.
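The clustering decision at the heart of the workflow can be sketched as follows: documents are grouped by the exact set of users authorized to search them, and a shared index is created only when a group accumulates a configurable volume of documents. The hashing scheme and threshold below are illustrative assumptions, not the Lethe implementation.

    #include <stdint.h>

    #define SHARED_THRESHOLD 64  /* assumed volume justifying a shared index */

    /* Hash the sorted list of authorized user ids (FNV-1a), so documents
     * with identical user sets land in the same cluster. */
    static uint64_t userset_hash(const uint32_t *uids, int n)
    {
        uint64_t h = 1469598103934665603ULL;
        for (int i = 0; i < n; i++) {
            h ^= uids[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    struct cluster {
        uint64_t key;    /* userset_hash of the common user set */
        int      ndocs;  /* documents accumulated in this cluster */
    };

    /* Route a document: its cluster gets a shared index only once it is
     * large enough; smaller clusters keep private per-document indices. */
    static const char *index_for(const struct cluster *c)
    {
        return c->ndocs >= SHARED_THRESHOLD ? "shared" : "private";
    }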

Key-Value Stores with Rangetable/Rangemerge

Datastores are distributed systems that manage enormous amounts of structured data for online serving and batch processing applications. The NoSQL datastores weaken the traditional relational and transactional model in favor of horizontal scalability. They usually support concurrent operations with demanding throughput and latency requirements which may vary across different workload types. A typical tradeoff between query handling and total update costs frequently leads to design decisions that sacrifice query responsiveness for higher update throughput. Central to this tradeoff is a storage layer at each server that schedules data transfers between memory and disk. After considering a similar function in full-text search engines, we systematically examine alternative approaches for the online maintenance of structured data over a datastore. Subsequently, we introduce and analyze the Rangemerge algorithm, which minimizes search time at reasonable total insertion cost. We implement several representative algorithms and experimentally evaluate their update and query cost under the same conditions. We conclude that Rangemerge achieves minimal search time and substantially reduces the data insertion time of known methods that have comparable search efficiency.
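A minimal sketch of the flush decision may help convey the intuition: memory is partitioned into key ranges, each backed by a single sorted rangefile on disk, and on memory pressure only the range holding the most buffered bytes is merged back, so each key resides in at most one memory and one disk location. The structures and the merge placeholder below are assumptions, not the evaluated implementation.

    #include <stddef.h>

    struct range {
        const char *lo, *hi;    /* key boundaries of this range */
        size_t      mem_bytes;  /* bytes currently buffered in memory */
    };

    /* Choose the flush victim: the range with the most buffered data,
     * so a single merge releases the most memory per disk operation. */
    static struct range *pick_victim(struct range *r, int nranges)
    {
        struct range *best = &r[0];
        for (int i = 1; i < nranges; i++)
            if (r[i].mem_bytes > best->mem_bytes)
                best = &r[i];
        return best;
    }

    /* On memory pressure, merge only the victim's buffered items with
     * its on-disk rangefile; merge_with_rangefile is a placeholder for
     * that single sequential read-merge-write pass. */
    static void flush_on_pressure(struct range *r, int nranges,
                                  void (*merge_with_rangefile)(struct range *))
    {
        merge_with_rangefile(pick_victim(r, nranges));
    }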

Storage on Solid-State Disks with JLFS

Flash storage increasingly offers competitive advantages as either a standalone storage device in mobile systems or a distinct layer in the storage hierarchy of enterprise servers. Compatibility with legacy systems is possible through a typical block interface offered by a flash-translation layer embedded in the device controller. Alternatively, the filesystem running on top can exploit application semantic information to improve critical functions of the flash device. In particular, a log-structured filesystem (LFS) may be directly applied to effectively manage the flash storage, but it is also hindered by garbage-collection overheads. In order to address this issue, we propose JLFS, which proactively cleans garbage from written data by journaling the LFS itself. In our design, we place both the journal and the LFS on the flash device. From memory, we temporarily transfer recently updated data to the journal, and permanently copy to the LFS the data that has remained unmodified sufficiently long. In ongoing research, we are developing and evaluating a JLFS prototype implementation.
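The aging policy can be sketched compactly: a page staged in the on-flash journal migrates to the LFS only after it has remained unmodified for a threshold period, so the LFS receives mostly cold data and its cleaner has little garbage to collect. The structures and the threshold value below are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define COLD_AFTER_SEC 30  /* assumed aging threshold */

    struct page {
        uint64_t blkno;        /* block address on the flash device */
        time_t   last_write;   /* when the page was last dirtied */
        bool     in_journal;   /* currently staged in the journal area */
    };

    /* A page migrates to the LFS only once it has aged past the
     * threshold; hot pages keep being rewritten inside the journal,
     * sparing the LFS segments from short-lived garbage. */
    static bool should_migrate(const struct page *p, time_t now)
    {
        return p->in_journal && (now - p->last_write) >= COLD_AFTER_SEC;
    }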

Full-text Search with Proteus

In dynamic environments with frequent content updates, we require online full-text search that scales to large data collections and achieves low search latency. Several recent methods that support fast incremental indexing of documents typically keep on disk multiple partial index structures that they continuously update as new documents are added. However, spreading indexing information across multiple locations on disk tends to considerably decrease the search responsiveness of the system. In the present paper, we take a fresh look at the problem of online full-text search with consideration of the architectural features of modern systems. We introduce Selective Range Flush, a greedy method that manages the system index by organizing the data on disk in fixed-size blocks and dynamically keeping the cost of data transfers between memory and disk low. As we experimentally demonstrate with the Proteus prototype implementation that we developed, we retrieve indexing information at latency that matches the lowest achieved by existing methods. Additionally, we reduce the total index building cost by 30% in comparison to methods with similar retrieval time.
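The greedy flavor of Selective Range Flush can be conveyed with a short sketch: when memory fills, the ranges with the most buffered postings are flushed first, each flush becoming one large sequential write toward the fixed-size disk blocks. The structures and watermark below are assumptions rather than the Proteus code.

    #include <stddef.h>

    struct prange { size_t mem_bytes; /* buffered postings of this range */ };

    /* Placeholder for appending a range's postings to its disk blocks. */
    static size_t flush_range(struct prange *p)
    {
        size_t freed = p->mem_bytes;
        p->mem_bytes = 0;
        return freed;
    }

    /* Flush greedily by descending memory footprint until usage drops
     * below the watermark; each flush is a large sequential transfer,
     * amortizing the disk-access cost over many postings. */
    static void selective_range_flush(struct prange *r, int n,
                                      size_t used, size_t low_watermark)
    {
        while (used > low_watermark) {
            struct prange *best = &r[0];
            for (int i = 1; i < n; i++)
                if (r[i].mem_bytes > best->mem_bytes)
                    best = &r[i];
            if (best->mem_bytes == 0)
                break;                 /* nothing left to flush */
            used -= flush_range(best);
        }
    }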

Durable Streaming Storage with Okeanos

Synchronous small writes play a critical role in the reliability and availability of file systems and applications that use them to safely log recent state modifications and quickly recover from failures. However, storage stacks usually enforce page-sized granularity in their data transfers from memory to disk. We experimentally show that subpage writes may lead to storage bandwidth waste and high disk latencies. To address the issue in a journaled file system, we propose wasteless journaling as a mount mode that coalesces synchronous concurrent small writes of data into full page-sized blocks before transferring them to the journal. Additionally, we propose selective journaling that automatically applies wasteless journaling on data writes whose size lies below a fixed preconfigured threshold. In the Okeanos prototype implementation that we developed, we use microbenchmarks and application-level workloads to show substantial improvements in write latency, transaction throughput and storage bandwidth requirements.
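A compact sketch of the coalescing path, under an assumed buffer layout: concurrent subpage writes are packed into a shared page-sized journal buffer that is submitted as a single block write, and selective journaling routes only writes below a preconfigured size threshold through this path.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct jpage {
        uint8_t  buf[PAGE_SIZE];  /* one journal block in the making */
        uint32_t used;            /* bytes packed so far */
    };

    /* Pack one small synchronous write into the shared journal page; the
     * caller submits the page as a single page-sized I/O when it fills
     * or a commit deadline expires. Returns -1 if the write won't fit. */
    static int jpage_pack(struct jpage *jp, const void *data, uint32_t len)
    {
        if (jp->used + len > PAGE_SIZE)
            return -1;                    /* page full: submit, then retry */
        memcpy(jp->buf + jp->used, data, len);
        jp->used += len;
        return 0;
    }

    /* Selective journaling: only data writes below the preconfigured
     * threshold take the coalescing journal path; larger writes go to
     * their final disk location directly. */
    static int take_journal_path(uint32_t len, uint32_t threshold)
    {
        return len < threshold;
    }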

Decentralized Data Sharing

Secure Federation

The integration of storage resources across different administrative domains can serve as a building block for the development of efficient collaboration environments. In order to improve application portability across such environments, we target data-sharing facilities that securely span multiple domains at the filesystem rather than the application level. We introduce the hypergroup as a heterogeneous two-layer construct, where the upper layer consists of administrative domains and the lower layer of users from each participating domain. We use public keys to uniquely identify users and domains, but rely on credentials to securely bind users and domains to hypergroups. Each domain is responsible for authenticating its local users across the federation, and employs access control lists to specify the rights of individual users and hypergroups over local storage resources. In comparison to existing systems, we show both analytically and experimentally reduced transfer cost of remote authorizations and improved scalability properties.
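The two-layer structure can be sketched directly as data: the upper layer lists member domains identified by public key, and each domain contributes the local users it admits, so hypergroup membership is the conjunction of both layers. The types below are illustrative assumptions; the credential verification that binds users and domains to the hypergroup is elided.

    #include <stdbool.h>
    #include <string.h>

    struct domain {
        char         pubkey[64];  /* identifies the domain */
        const char **users;       /* local users admitted to the hypergroup */
        int          nusers;
    };

    struct hypergroup {
        struct domain *domains;   /* upper layer: member domains */
        int            ndomains;
    };

    /* A user belongs to the hypergroup iff their domain is a member and
     * the domain has admitted that user into the lower layer. */
    static bool hg_member(const struct hypergroup *hg,
                          const char *domain_key, const char *user)
    {
        for (int d = 0; d < hg->ndomains; d++) {
            if (strcmp(hg->domains[d].pubkey, domain_key) != 0)
                continue;
            for (int u = 0; u < hg->domains[d].nusers; u++)
                if (strcmp(hg->domains[d].users[u], user) == 0)
                    return true;
        }
        return false;
    }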

Disk Caching

Current trends in business and research collaboration encourage secure data sharing over wide-area networks with minimal intervention from the end user. Instead of having users explicitly initiate traditional file transfers that replicate datasets close to computation resources, it would be preferable to have a caching proxy automatically replicate datasets and hide transfer delays during the repetitive use of data. Lately, the design of caching proxies for distributed filesystems has been attracting research interest from the systems community, mostly in terms of making existing distributed filesystems interoperate with local filesystems for persistent caching. We point out the need for efficient storage management in caching proxies, so that accesses of cached data from the local disk of the proxy achieve performance comparable to or better than direct disk accesses through the local file system.

Grid Storage

Data Grids

Data grids are middleware systems that offer secure shared storage of massive scientific datasets over wide-area networks. The main challenge in their design is to provide reliable storage, search, and transfer of numerous or large files over geographically dispersed heterogeneous platforms. The Storage Resource Broker (SRB) is an example of such a system that has been deployed in multiple high-performance scientific projects during the past few years. In the present paper, we take a detailed look at several of its functional features, and examine its scalability using synthetic and trace-based workloads. Unlike traditional file systems, SRB uses a commodity database to manage both system and user-defined metadata. We quantitatively evaluate this decision, and draw insightful conclusions about its implications for the system architecture and performance characteristics. We find that the bulk transfer facilities of SRB demonstrate good scalability properties, and we identify the bottleneck resources across different data search and transfer tasks. We examine the sensitivity to several configuration parameters, and provide details about how different internal operations contribute to the overall performance.

Active Storage with Lerna

We examine the problem of supporting application-specific computation within a network file server. Our objectives are i) to introduce an easy-to-use yet powerful architecture for executing both custom-developed and legacy applications close to the stored data, ii) to investigate the potential performance improvement that I/O-intensive processing gains from this approach, and iii) to exploit the I/O-traffic information available within the file server for more effective resource management. One main difference from previous active storage research is our emphasis on the expressive power and usability of the network server interface. We describe an extensible active storage framework that we built in order to demonstrate the feasibility of the proposed system design. We show that accessing large datasets over a wide-area network through a regular file system can penalize the system performance, unless application computation is moved close to the stored data. Our conclusions are substantiated through experimentation with a popular multi-layer map warehouse application.

Scalable Download Servers with Circus

File downloading remains one of the most popular and cost-effective methods for digital content dissemination to vast audiences. However, due to the growing popularity of multimedia data types and the constantly added sophistication of modern software packages, downloaded files tend to become larger over time. This trend is further encouraged by the growing number of users with broadband network access. Practical limitations in the main memory of a network server make increasingly common the case of download servers storing files larger than their buffer cache. Serving such files considerably undermines the benefits of reference locality and reduces the effectiveness of existing caching techniques to keep the server disk access load under control.

In order to address this issue, we propose departing from the traditional approach that requires each file to be transmitted sequentially to the client. Instead, we advocate sending data blocks to each client out of order, as long as duplication is avoided and the entire file can be reliably reassembled at the receiving side. We introduce a new algorithm (Circus) for choosing which data blocks to send to each client. Our algorithm takes advantage of data-sharing opportunities among concurrently served requests, and we compare its performance with existing download server designs using a prototype implementation that we developed. We find that our system keeps the file download time close to the minimum across different workloads, and several factors below what existing implementations achieve. We also show that our method more than doubles the server network bandwidth, while reducing the measured disk bandwidth by an order of magnitude in several cases.
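The block-selection idea can be illustrated with a hedged sketch: for each client the server first tries to send a cached block the client still misses, exploiting sharing with concurrent requests at no disk cost, and falls back to a disk fetch only when no cached block helps. The bitmap bookkeeping below is an assumption for illustration, not the actual Circus algorithm.

    #include <stdbool.h>

    #define NBLOCKS 1024

    struct client {
        bool received[NBLOCKS];   /* which file blocks this client has */
    };

    /* Pick the next block for a client: first a cached block it still
     * needs (free data sharing with concurrent requests), otherwise the
     * lowest missing block, triggering one disk read that later clients
     * can reuse from the cache. */
    static int next_block(const struct client *c, const bool cached[NBLOCKS])
    {
        for (int b = 0; b < NBLOCKS; b++)
            if (cached[b] && !c->received[b])
                return b;                  /* serve from cache, no disk I/O */
        for (int b = 0; b < NBLOCKS; b++)
            if (!c->received[b])
                return b;                  /* fall back to a disk fetch */
        return -1;                         /* file fully delivered */
    }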

Scalable Media Servers with Exedra

Despite impressive advances in storage and networking technology, widespread distribution of acceptable-quality video over the Internet remains an unfulfilled promise. Results from previous research show that the resource requirements involved can be significantly reduced by using variable bit-rate encoding. However, they leave mostly open the hard problems of efficiently storing and transferring such data, when compared to the simpler case of dealing with constant bit-rate streams.

In our recent work, we have focused on the related disk storage problem, and demonstrated that managing variable bit-rate video can be both feasible and efficient. Special disk striping techniques can achieve high resource utilization and system throughput, while the number of supported users scales almost linearly as the number of available disks increases. Exedra is a video server prototype that we have designed and built in order to support the above arguments. It uniquely combines the following key features:

  • Support of variable bit-rate streams with deterministic QoS guarantees
  • Striping of stream data across multiple disks
  • Detailed reservation of system resources over time
  • Fault-tolerant operation that remains uninterrupted during device failures

We investigate data replication techniques and disk bandwidth reservation policies that distribute the access load of a failed disk or node equally across the surviving ones, and we experimentally demonstrate that single device failures can be tolerated at minimally reduced throughput during normal operation.

For the first time, we formulate the need to balance bandwidth and memory utilization when prefetching stream data from disks as a smoothing problem. Subsequently, we propose a new algorithm for optimal stream smoothing that takes into consideration the shared nature of server memory and the mechanical-movement overhead involved in disk access delays. We demonstrate that the method is applicable to both homogeneous and heterogeneous disk arrays.
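To make the smoothing tradeoff tangible, the toy sketch below caps per-round disk reads at a bandwidth budget and prefetches ahead into a shared buffer of bounded size, reporting infeasibility when consumption outruns prefetching. The units and the greedy policy are illustrative assumptions, not the proposed optimal algorithm.

    #include <stdio.h>

    /* demand[t]: bytes a stream consumes in round t.
     * cap:       max bytes the disk may deliver per round.
     * mem:       max bytes that may sit prefetched in the buffer.
     * Returns 0 if the greedy schedule is feasible, -1 on underflow. */
    static int smooth(const long *demand, int rounds, long cap, long mem)
    {
        long buffered = 0;
        for (int t = 0; t < rounds; t++) {
            long read = cap;                        /* read as much as allowed */
            if (buffered + read - demand[t] > mem)
                read = mem + demand[t] - buffered;  /* don't overflow memory */
            buffered += read - demand[t];
            if (buffered < 0)
                return -1;  /* cap too low: consumption outran prefetching */
            printf("round %d: read %ld, buffered %ld\n", t, read, buffered);
        }
        return 0;
    }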

Operating-System Support for Parallel Applications

Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to schedule the processors among different applications most effectively. Processor scheduling algorithms that were successful for shared-memory machines have proven inadequate for distributed-memory environments due to the high costs of remote memory accesses and data redistribution.

We investigated how knowledge of system load and application characteristics can be used in scheduling parallel applications on workstation clusters. We proposed a new algorithm which, by properly exploiting both types of information, performed better than other non-preemptive scheduling rules and nearly as well as idealized versions of preemptive rules.