Requirements: RDF and social applications
Update 2009-08-06: added more information on named graphs, a reference to AliBaba, and a clarification on text handling.
RDF data is managed in a decentralized manner, which makes it ideal for social applications (where many people collaborate). In this post, I've collected requirements for RDF engines on which a social application can be built. The main features are:
- Named graphs: supported by almost all RDF engines, they partition the RDF repository. Social applications should authorize access at graph granularity; that way, some graphs can be private and others public. RDF allows one to hide the “seams” between graphs at will, and an RDF repository should support this by letting one show and hide graphs on the fly, during access. Both SPARQL and Sesame can do this: the former by constraining the graph URI, the latter by passing a set of contexts to RepositoryConnection.getStatements() (see the named-graph sketch after this list).
- Distributed version control: provides two capabilities. First, versioning is useful both for personal use (history, undo) and for collaborative use (conflict management, tracking who made which changes). Second, peer-to-peer synchronization is useful for offline use, backup, and collaboration. Pastwatch is an example of very clever (file-based) distributed version control.
- Text handling: to make long texts stored in RDF literals more accessible, one should be able to configure which property values are indexed for full-text search (see the text-indexing sketch after this list). Ideally, version control would store only the changes between versions (as opposed to the complete text). As an alternative to storing the text in the RDF repository, one can let the property point to an external document management system; the need for version control remains either way.
- Record the author of a statement: so that a social application can track who contributed what (one graph-based approach is sketched after this list).
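Here is a minimal sketch of restricting access to named graphs, using the Sesame 2 API; the graph and resource URIs are made-up placeholders, and a real application would decide per user which contexts to pass.

```java
import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryResult;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

public class NamedGraphDemo {
    public static void main(String[] args) throws Exception {
        // In-memory repository for the sketch; a real setup would use a persistent store.
        Repository repo = new SailRepository(new MemoryStore());
        repo.initialize();

        RepositoryConnection conn = repo.getConnection();
        try {
            ValueFactory vf = conn.getValueFactory();

            // Hypothetical graph URIs: one private, one public.
            URI privateGraph = vf.createURI("http://example.org/graphs/alice-private");
            URI publicGraph  = vf.createURI("http://example.org/graphs/public");

            URI alice = vf.createURI("http://example.org/people/alice");
            URI knows = vf.createURI("http://xmlns.com/foaf/0.1/knows");
            URI bob   = vf.createURI("http://example.org/people/bob");

            // The trailing argument(s) of add() are the contexts, i.e. the named graphs.
            conn.add(alice, knows, bob, publicGraph);
            conn.add(bob, knows, alice, privateGraph);

            // Sesame: show only the graphs the current user may see by passing
            // exactly those contexts to getStatements().
            RepositoryResult<Statement> visible =
                    conn.getStatements(null, null, null, false, publicGraph);
            while (visible.hasNext()) {
                System.out.println(visible.next());
            }

            // SPARQL: the same restriction, by constraining the graph URI.
            String query = "SELECT ?s ?p ?o WHERE { "
                    + "GRAPH <http://example.org/graphs/public> { ?s ?p ?o } }";
            TupleQueryResult result =
                    conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            while (result.hasNext()) {
                System.out.println(result.next());
            }
        } finally {
            conn.close();
        }
    }
}
```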
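For text indexing, the following sketch assumes the LuceneSail add-on for Sesame and its LUCENE_DIR_KEY parameter; the paths are placeholders, and whether indexing can be restricted to specific properties depends on the add-on's configuration options, so treat this as an outline rather than a recipe.

```java
import java.io.File;

import org.openrdf.repository.Repository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.lucene.LuceneSail;
import org.openrdf.sail.nativerdf.NativeStore;

public class TextIndexSetup {
    public static void main(String[] args) throws Exception {
        // Persistent store for the triples themselves (path is a placeholder).
        NativeStore store = new NativeStore(new File("/tmp/rdf-store"));

        // Wrap the store in a LuceneSail so that literal values become
        // full-text searchable (class and parameter name assumed from the
        // LuceneSail add-on; adjust to the version you actually use).
        LuceneSail luceneSail = new LuceneSail();
        luceneSail.setBaseSail(store);
        luceneSail.setParameter(LuceneSail.LUCENE_DIR_KEY, "/tmp/rdf-index");

        Repository repo = new SailRepository(luceneSail);
        repo.initialize();

        // ... add data and query as usual; text matches are then exposed
        // through the sail's search vocabulary in SPARQL queries.

        repo.shutDown();
    }
}
```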
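One engine-independent way to record authorship is to put each contribution into its own named graph and attach a creator triple to that graph's URI. The sketch below uses the Sesame API; the graph-naming scheme and the choice of dcterms:creator are my own assumptions, not a fixed convention.

```java
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;

public class Provenance {
    /**
     * Adds a statement inside its own named graph and records the author of
     * that graph in a separate provenance graph.
     */
    static void addWithAuthor(RepositoryConnection conn,
                              URI subject, URI predicate, URI object,
                              URI author) throws RepositoryException {
        ValueFactory vf = conn.getValueFactory();

        // Hypothetical URIs; a real application would mint stable identifiers.
        URI contributionGraph = vf.createURI(
                "http://example.org/graphs/contribution/" + System.currentTimeMillis());
        URI provenanceGraph = vf.createURI("http://example.org/graphs/provenance");
        URI dctCreator = vf.createURI("http://purl.org/dc/terms/creator");

        // The statement itself, stored in its own graph ...
        conn.add(subject, predicate, object, contributionGraph);
        // ... and a provenance triple naming the author of that graph.
        conn.add(contributionGraph, dctCreator, author, provenanceGraph);
    }
}
```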
Less important features:
- Support for XML literals in SPARQL
- Ease of use: the engine should be easy to install and use, and should focus on core RDF repository features.
Several RDF engines come close:
- Open Anzo: an RDF engine that supports versioning, user-based authentication, and text indexing. Replication is possible, but not in a distributed manner. Open Anzo’s philosophy is very much in line with this post.
- IBM Semantic Layered Research Platform: does not seem to be updated any more. Poorly documented. I'm not sure if it can do distributed synchronization. Update: This is Open Anzo's precursor (see comments below).
- OpenLink Data Spaces: powerful, and offers all kinds of import and export services. But the free version does not include replication, and I'm not sure how far beyond two-way replication its features go.
- KiWi (Knowledge in a Wiki): an intriguing social content platform that rolls its own RDF engine. Its content model deviates from pure RDF, and it cannot do distributed synchronization. Not publicly available yet.
- AliBaba: a new project from the Sesame developers that provides repository federation and change logging.
Related technologies:
- Changesets: an RDF vocabulary for keeping a history of changes. Useful for exporting data from a repository that supports versioning (a small example follows below).
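A changeset describes an update as a set of added and removed (reified) statements about one resource. The sketch below builds such a description with the Sesame API; the property names (cs:subjectOfChange, cs:addition, cs:createdDate, cs:changeReason) are taken from the Changesets schema as I remember it, so check the vocabulary before relying on them.

```java
import org.openrdf.model.BNode;
import org.openrdf.model.Graph;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.GraphImpl;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDF;

public class ChangesetExample {
    public static void main(String[] args) {
        Graph graph = new GraphImpl();
        ValueFactory vf = ValueFactoryImpl.getInstance();
        String CS = "http://purl.org/vocab/changeset/schema#";

        // The triple whose addition we want to record.
        URI alice    = vf.createURI("http://example.org/people/alice");
        URI foafName = vf.createURI("http://xmlns.com/foaf/0.1/name");

        // Reify the added triple as an rdf:Statement.
        BNode added = vf.createBNode();
        graph.add(added, RDF.TYPE, RDF.STATEMENT);
        graph.add(added, RDF.SUBJECT, alice);
        graph.add(added, RDF.PREDICATE, foafName);
        graph.add(added, RDF.OBJECT, vf.createLiteral("Alice"));

        // Describe the change itself (property names as I remember the schema;
        // a real changeset would use a typed xsd:dateTime for createdDate).
        BNode changeSet = vf.createBNode();
        graph.add(changeSet, RDF.TYPE, vf.createURI(CS + "ChangeSet"));
        graph.add(changeSet, vf.createURI(CS + "subjectOfChange"), alice);
        graph.add(changeSet, vf.createURI(CS + "addition"), added);
        graph.add(changeSet, vf.createURI(CS + "createdDate"), vf.createLiteral("2009-08-06"));
        graph.add(changeSet, vf.createURI(CS + "changeReason"), vf.createLiteral("Added Alice's name"));
    }
}
```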