Thursday, June 27, 2013

How are Publishers Rewarded for Exposing Linked-Data?

Disclaimer: This document poses the question asked in the title without offering anything which can be reasonably called an answer.  It is my hope that members of the relevant communities who know more than I do on the topic can provide some insight into potential answers.

Utilization of linked-data by applications is predicated upon the existence of accessible linked-data.  In much the same way that publishers were once told they could put their content online in formats like HTML, we now tell them they can expose their information as linked-data using formats like RDFa and JSON-LD.  However, where the former had the fairly obvious benefit of making the publisher’s content visible to human consumers, the latter seems to lack any immediately realizable end.

Lofty visions of automated agents and reasoning engines operating over an ever-expanding web of linked-data have been touted since around the time the phrase “Semantic Web” was coined.  The suggestion was that, by exposing their information as linked-data, publishers could “hook in” to these agents, making themselves visible to the agents’ users.  Such agents, however, have yet to materialize and, from my observation, seem to be touted less and less, which I feel is unfortunate, but that’s an entirely different post.

Many “Semantic Web Applications” which I have seen, either in writings online or at conferences, are in fact Semantically-Enabled applications: they use some Semantic technologies, some of which were born of the forge of the Semantic Web, in combination with other technologies (AI, NLP, etc.) in order to build up a triple store and reason over or otherwise operate upon it.  These have been interesting applications, but they are not Semantic Web applications, as they go well beyond the boundary of utilizing exposed linked-data.  Further, they often operate in specialized domains over semantically enabled datasets and not over arbitrarily exposed information on publishers’ sites.  As such, in and of themselves, such applications provide no reward to the average content publisher.

Search Engines have taken up the torch to some extent in the form of Schema.org.  This gives publishers a reason to expose their data as well as a concrete vocabulary to use in its exposition, but it positions the “Semantic Web” to be re-branded as “SEO 2.0,” which in my mind would be a loss of the initial vision.  It is, however, from what I can find, the only realizable end of publishing linked-data along with your content.
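To make this concrete, below is a rough sketch of the kind of description a publisher might attach to an article using the Schema.org vocabulary.  I use Python's rdflib here purely for illustration (and assume a version recent enough to ship the JSON-LD serializer); the article URL and property values are invented for the example.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
# Hypothetical URL for the page being described.
article = URIRef("http://example.com/articles/my-first-article")

g.add((article, RDF.type, SCHEMA.Article))
g.add((article, SCHEMA.headline, Literal("How are Publishers Rewarded for Exposing Linked-Data?")))
g.add((article, SCHEMA.datePublished, Literal("2013-06-27")))

# Serialized as JSON-LD, this description could be embedded in the page
# in a script block of type "application/ld+json" for crawlers to pick up.
print(g.serialize(format="json-ld"))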

When talking about, or attempting to explain, the Semantic Web to friends, family, and co-workers, I often employ the Chicken or the Egg metaphor to account for why this concept has not yet become ubiquitous (though I am sure some would disagree with the statement that it is not ubiquitous).  If we take the Chicken to be the accessible data and the Egg to be applications, we may be getting closer to the Chicken, with the help of efforts such as Schema.org to an extent, which would give the Egg a raison d’ĂȘtre.  In my experience, the lack of a reasonable Egg to point to greatly complicates the task of encouraging publishers to expose their information as linked-data.

A final note: I would be very happy to be corrected on my observations and to be told that the Egg already exists (ideally by being pointed to such an Egg).

Saturday, April 21, 2012

Stability and Fragility of Namespaces

While working on a blog post which will soon be published (and linked to) on CITYTECH, Inc.'s site, I mentally ran across the subject of updating a namespace definition within a domain of data.  More concretely, I was considering why Apache Jackrabbit does not allow updates (or unregistrations, for that matter) to namespaces once they are established within a given Repository.  It seemed to me initially that allowing changes to namespaces would be valuable, for example, as new versions of an ontology were published.  Considering the matter further, however, I began to realize how dangerous such a practice would be.

Consider the following scenario.  Let us say that I told you that I bought a new shirt, the color of which was blue.  However, instead of saying that, I said "I bought a new shirt, tar klonor blue."  You would look quizzically at me and perhaps question your hearing of my statement, because what I didn't tell you was that I had contrived a new phrase, "tar klonor," which meant "having the color".

This example is somewhat absurd in and of itself but it is essentially what would happen to a machine's ability to understand linked-data statements if a namespace were changed in the domain of the data being represented.

Consider now a more concrete example.  Let us say that I have created a food ontology identified by the URI http://example.com/food/v1.0/.  Now let us say that I have two documents containing food information.  I present these documents in listing 1 and listing 2 respectively.

@prefix food: <http://example.com/food/v1.0/> .
@prefix ex: <http://somesite.com/things/> .


ex:americanCheese a food:Cheese .
ex:lambChop a food:Meat .
ex:apple a food:Fruit .
ex:provolone a food:Cheese .

Listing 1

@prefix food: <http://example.com/food/v1.0/> .
@prefix me: <http://self.com/items/> .


me:camembert a food:Cheese . 
me:brusselSprouts a food:Vegetable .

Listing 2

If I were to search over this dataset for all resources which are http://example.com/food/v1.0/Cheese, I would find three things.  Now, let us say that I create a new version of the ontology and identify it with the URI http://example.com/food/v2.0/, but I only update document 1 with the new namespace.  Now, if I perform the same search, I only find one thing.  I know in my heart of hearts that I meant for http://example.com/food/v1.0/Cheese to be semantically equivalent to http://example.com/food/v2.0/Cheese, however a system looking at this data has no reason to make this connection (nor should it).  It is equivalent to me creating the new phrase "tar klonor" and then assuming that you will understand the meaning of my sentences including said phrase.  One solution to the problem would be to update the second document along with the first, however this assumes that all documents and systems utilizing the URI of this ontology are under your control.  If your ontology is more widely used, this is not viable.
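To make the breakage concrete, here is a rough sketch (using Python's rdflib, though any triple store would demonstrate the same point) that loads the two listings above and counts the resources typed as Cheese, before and after document 1 is moved to the v2.0 namespace.

from rdflib import Graph, URIRef
from rdflib.namespace import RDF

doc1 = """
@prefix food: <http://example.com/food/v1.0/> .
@prefix ex: <http://somesite.com/things/> .

ex:americanCheese a food:Cheese .
ex:lambChop a food:Meat .
ex:apple a food:Fruit .
ex:provolone a food:Cheese .
"""

doc2 = """
@prefix food: <http://example.com/food/v1.0/> .
@prefix me: <http://self.com/items/> .

me:camembert a food:Cheese .
me:brusselSprouts a food:Vegetable .
"""

cheese_v1 = URIRef("http://example.com/food/v1.0/Cheese")

# Both documents as published: the search finds three cheeses.
g = Graph()
g.parse(data=doc1, format="turtle")
g.parse(data=doc2, format="turtle")
print(len(list(g.subjects(RDF.type, cheese_v1))))  # 3

# Re-publish document 1 against v2.0 and the same search sees only
# the one cheese still described against v1.0 in document 2.
g = Graph()
g.parse(data=doc1.replace("/v1.0/", "/v2.0/"), format="turtle")
g.parse(data=doc2, format="turtle")
print(len(list(g.subjects(RDF.type, cheese_v1))))  # 1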

OWL does expose some mechanisms for handling this (see http://www.w3.org/TR/2004/REC-owl-guide-20040210/#OntologyVersioning), however these seem cumbersome and rely on a system to implement the understanding of the versioning constraints.  Further, some of the more robust constraints are only available in OWL Full, the implementation and usage of which is far from trivial.  And this only covers ontology versioning.  What about specifications which are not ontologies?
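For reference, the versioning annotations the OWL guide describes boil down to a few statements made about the ontology itself.  A sketch of what declaring v2.0 as the successor of v1.0 might look like (again with rdflib, simply because it is handy; the vocabulary terms are OWL's own):

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import OWL, RDF

v1 = URIRef("http://example.com/food/v1.0/")
v2 = URIRef("http://example.com/food/v2.0/")

g = Graph()
g.add((v2, RDF.type, OWL.Ontology))
g.add((v2, OWL.versionInfo, Literal("2.0")))
g.add((v2, OWL.priorVersion, v1))
# Compatibility still has to be asserted explicitly, and a consuming
# system still has to understand and honor the assertion.
g.add((v2, OWL.backwardCompatibleWith, v1))

print(g.serialize(format="turtle"))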

Some time ago, a version 1.0 of Dublin Core existed, and there was talk of creating a version 2.0 after version 1.1 (some old notes on this and on translations of DC).  Imagine if you already had your data published against DC 1.0 when 1.1 was pushed out.  The change to version 1.1 updated the URI of the specification and, as such, made your data obsolete for all intents and purposes.  Given this, it's clear why the RDF URI still has "1999" in it.  Also, on some specification sites (such as FOAF) you will find statements concerning the stability of the specification URI, specifically, that it is not changing.

Coming to the end of this rather long-winded discussion, I suppose the bottom line is that Jackrabbit does not need to support changes to namespaces, because namespaces shouldn't change.  Updating a namespace in your domain of data is equivalent to updating all nodes of data using that namespace, which should not be taken lightly.


DnL8Tar
-PCM

Friday, January 20, 2012

Automatic Shopping List - A Use Case for Linked Data

For some time now I've wanted shopping lists automatically generated from recipes.  In fact, I suspect there are sites which will perform this action on a single-recipe basis, though I don't have the patience to search for them now.  From a single recipe it is fairly trivial to generate a shopping list; one could simply print the recipe, as there is normally an ingredients list included.  Working this way, however, one would need to go shopping every single time one wanted to cook something, or one would need to print a number of recipes and reconcile the overlap in the list manually. 


Consider then an application which would take n recipes and aggregate the ingredients into a shopping list.  Conceptually this is of value in situations where one is disciplined enough to plan their meals for the whole week. In a family setting I imagine the value is increased as you can plan meals for the whole family for a period of time and make sure you are minimizing the trips to the grocery store.  


There is a question, however, of how the application would receive information about the recipes for which it is generating a list.  This is where open linked data comes in.  If recipe providers (Food Network, All Chefs, etc.) were to expose their recipe data as linked data, it could be collected into a single system and generally reasoned upon, presuming it followed, or was coerced into, a somewhat standard ontology (or set thereof).  A user would enter a URL for a recipe into the application, indicating when they planned to prepare the dish.  After entering a number of recipes, the user would elect to generate a shopping list encompassing a certain time period, and the system would generate the list based on all of the recipes at once.
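A rough sketch of the aggregation step follows.  The recipe: vocabulary, property names, and recipe URLs are all invented for illustration; the point is only that once the data is exposed as triples, the aggregation itself is trivial.

from collections import Counter
from rdflib import Graph, Namespace

RECIPE = Namespace("http://example.com/recipe/")  # hypothetical vocabulary

def shopping_list(recipe_urls):
    """Aggregate ingredient counts across a set of recipe documents."""
    needed = Counter()
    for url in recipe_urls:
        g = Graph()
        g.parse(url)  # assumes each provider exposes the recipe as RDF at this URL
        for _, _, ingredient in g.triples((None, RECIPE.hasIngredient, None)):
            needed[str(ingredient)] += 1
    return needed

# Hypothetical recipe documents planned for the week.
plan = [
    "http://foodsite.example/recipes/beef-stew",
    "http://othersite.example/recipes/shepherds-pie",
]
# print(shopping_list(plan))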


One can imagine a number of optimizations to the results, but one which comes to mind, and is most often made manifest in my personal life, is a reconciliation of the shopping list with the contents of the pantry.  Last weekend, I was preparing beef stew.  Knowing that I would need Worcestershire sauce, I picked up a bottle, not remembering if I already had some.  When I arrived at home I found an unopened bottle already sitting in my cupboard.  Had I known this while I was at the store, I could have avoided the expenditure.  Similarly, if the system which I am endeavoring to contrive with this post had access to exposed data concerning the user's larder, it could adjust the list, making sure it included only items which the user would need to buy to supplement their current stock.  
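Continuing the sketch above, the pantry reconciliation is then little more than a subtraction, assuming the contents of the larder can be obtained in the same ingredient terms (a strong assumption; the pantry source here is hypothetical).

def reconcile(needed, pantry):
    """Drop items already in stock from an aggregated shopping list."""
    return {item: count for item, count in needed.items() if item not in pantry}

pantry = {"http://example.com/ingredients/worcestershire-sauce"}  # hypothetical stock data
# print(reconcile(shopping_list(plan), pantry))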


Considering the concept in complete reverse, the system could also suggest recipes based on what you currently have "in stock."  This feature may be more useful than those previously described depending on your lifestyle.


DnL8Tar
-PCM

Sunday, May 15, 2011

On Data Not Accessible in an Expected Tuple Format - A Continuation of the Magic Hat Discussion

In my prior post I brought up the concept of using a "Magic Hat" mechanism to obtain resources regardless of the "physical" location of the resources.  An assumption which this mechanism makes is that those resources which are to be retrieved can be retrieved in a standardized format, in this case, data tuples.  There are however many systems housing interesting and useful data which do not serve this data in such a format directly.  

What is to be done, then, when access through the Magic Hat to data contained in such a system is desirable depends largely on what mechanisms the system affords for acquiring its data in general.  If the system provides no API via which to pull the data but does expose the ability to edit the template code with which the data is rendered, RDFa could be added to the template code in order to add semantics to the rendered data, making it more accessible to a tuple store.  This approach is quite limited, however, as, while it facilitates the pulling of a single resource through the magic hat, it does little to ease the asking of a question about all of the data contained in the system.  To elaborate, consider the request "provide all data created by [A] which concerns the topic [B] and was written after the date [C]."  Such a request would be hard to satisfy in such a system, as we are largely limited to considering a single resource at a time.
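For illustration, a request of that shape maps quite naturally onto a SPARQL query once the data is sitting in a tuple store; the sketch below uses rdflib and Dublin Core terms, though the exact properties any given system uses would of course vary.

from rdflib import Graph

g = Graph()  # assume this has been populated with the system's data

query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?doc WHERE {
    ?doc dcterms:creator <http://example.com/people/A> ;
         dcterms:subject <http://example.com/topics/B> ;
         dcterms:created ?date .
    FILTER (?date > "2011-01-01"^^<http://www.w3.org/2001/XMLSchema#date>)
}
"""

for row in g.query(query):
    print(row.doc)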
 
The provision of an API has become quite standard in online systems, however, and any newer system or application which does not expose one is most likely a) in beta with an API on the way, or b) not worth using.  How requests are to be made to the API and the format of data returned by the API are left to the designer of the API and are not guaranteed to match the request and data formats which the Magic Hat expects.  As such, some coercion on both the request and response ends is necessary.  I've taken this approach with the retrieval of Blogger data and will speak further on it in my next post.  For now, suffice it to say that such a mechanism allows for much more robust requests but is limited by the complexity of coercion necessary to make more and more complex requests.
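A sketch of what that coercion might look like on the response end: pull JSON from a hypothetical API and re-express the pieces of interest as triples the rest of the system already understands.  The endpoint, the JSON shape, and the choice of Dublin Core terms are all assumptions made for the example.

import json
from urllib.request import urlopen

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

def posts_as_triples(api_url):
    """Coerce a JSON feed of posts into a graph of triples."""
    with urlopen(api_url) as response:
        feed = json.load(response)
    g = Graph()
    for entry in feed["items"]:  # assumed shape of the API's response
        post = URIRef(entry["url"])
        g.add((post, DCTERMS.title, Literal(entry["title"])))
        g.add((post, DCTERMS.created, Literal(entry["published"])))
    return g

# g = posts_as_triples("http://example.com/api/posts")  # hypothetical endpoint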

An approach which I have seen / heard taken by some applications is to retrieve all data of interest from a system and house it locally in a tuple format.  While this has the upside of allowing for robust data requests and combinations using expected request and response formats, it does have the significant overhead of maintaining consistency between the local system and the remote system which is still the owner of the data.

DnL8Tar
-PCM

Monday, April 25, 2011

A Magic Hat

I have not been to many magic shows in my life so my perspective here may not be ideal.  Looking back on the few I have attended however, I do not recall ever seeing a magician pull anything out of their iconic top hat, nor even wearing said hat.  The pervasiveness of this image in popular media however assures me that the concept is not lost on society.


Consider the magician's hat.  A simple, undecorated top hat, from which the trained user is able to pull any number of objects.  With a wave and a wink the magician is able to materialize whatever object suits their fancy, be it a bouquet of flowers, a deck of playing cards, or the infamous bunny rabbit.  If one breaks oneself from the mundane reality of the trick, one is free to consider where these objects must have originated.  The bunny rabbit, for example, surely had been hopping along a field of grass and clover prior to its arrival at the hat.  At the will of the magician it was plucked from its prior location in time and space, traveling at the speed of thought across the gap in these dimensions between the field and the hat, arriving finally in the magician's trained hand, ideally no worse for wear. 





The consumer of the trick, namely the audience, need not concern themselves with where the object appearing from the hat originated; they need only concern themselves with the fact that the objects, which certainly were not in the hat to begin with, have been made manifest in its confines and have been produced to the satisfaction of all onlookers.  


An application of this concept to the procurement of resources by an application is one of the projects keeping me occupied at the moment.  My "Magic Hat" is what I call a "Resource Container."  The principle of the container is the same as that of the magic hat insomuch as the user of the container is able to bring forth from the container a required resource, not needing to concern themselves with where that resource is coming from.  The user may require a rabbit and a deck of playing cards, and while the rabbit may be pulled from a grassy field and the playing cards from a drug store shelf, the user's only concern is that they started off not having these two resources and they now have both of them.

A Resource Container discovers sources of resources via some discovery mechanism (the simplest of which is user configuration, however more robust auto-discovery mechanisms are conceptually possible).  When a particular resource is requested of the container, if the container does not have a cached version, or if the cached version is stale, it determines, based on its known sources of resources and its knowledge of those sources, which sources to probe for information about the requested resource.  The container then produces this information in a standardized format to the satisfaction of the requester, who need not know the source of the information (so long as the source is trustworthy by the user's definition of trust).
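A minimal sketch of the container's outward behavior follows, with discovery and probing stubbed out; the source interface (may_have, fetch) and the staleness policy are placeholders of my own invention, not a finished design.

import time

class ResourceContainer:
    """The magic hat: hands back resources without exposing their origin."""

    def __init__(self, sources, max_age_seconds=300):
        self.sources = sources   # discovered, or configured by the user
        self.max_age = max_age_seconds
        self.cache = {}          # uri -> (timestamp, data)

    def get(self, uri):
        cached = self.cache.get(uri)
        if cached and time.time() - cached[0] < self.max_age:
            return cached[1]
        for source in self.sources:
            if source.may_have(uri):      # the source's knowledge of its own contents
                data = source.fetch(uri)  # returned in the standardized format
                if data is not None:
                    self.cache[uri] = (time.time(), data)
                    return data
        return None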

To make this description more concrete, consider a resume display application.  The user may define a layout in which they want their resume to appear, indicating where they want their work history to appear, their education history, their personal information, etc.  The user may then attach this application to a Resource Container in order to acquire the actual information necessary to populate the various sections.  Imagine further an ideal online environment in which every place you have worked hosts some open API via which you are able to query your work history, so long as you have the proper authority to do so.  Similarly, imagine that educational institutions have such a system as well.  When you request all work history tied to yourself, the Resource Container would be tasked with finding all known sources which would have resources of type "Work History" and would further be tasked with acquiring these resources.  You as a user, and your front end application, would not care that your work history is coming from three, four, five, etc. different sources.  Using the same mechanisms to pull education history from your various schools, and personal information from, perhaps, some government site or from another service such as LinkedIn, the Resource Container would be able to provide to the front end application all necessary data, acquired, ideally, from the canonical sources of said data, abstracting from the front end the mundane details of the data sources.


Immediate Difficulties 


In implementing the aforementioned system I've run into a couple of immediate difficulties which I'm working through now and which are worth documenting.


Requested Resources vs Containing Resource: When requesting a particular resource, the subject of the resource being requested is often not enough information to determine where to acquire data concerning the resource.  Going back to the prior example, consider a resume application where any number of users are able to format their resumes online, connecting the front end to sources of resource information applicable to them (i.e., their particular employers and educators).  If the system were requested to look up all work history data tied to a particular user, one could imagine the system being able to determine, based on pre-set user configuration, which sources to look at for this information.  However, if the system were asked to pull up data for a particular work history resource, without being told much more than the URI of that resource, the system would have a hard time determining where to pull this data from.  In such an application it would quickly become impractical to ask every known employer data source whether it had information about the URI.  As such, a scheme must be arrived at to give the Resource Container more information about the particular request than just the URI.  


Performance Concerns: While I have not attempted any type of benchmarking with what's been coded thus far, I presume that performance is not going to be amazing in situations where data is needed from multiple sources at one time.  If one has a home page which is pulling data from 15 different sources, one can imagine that this may not be quick, and the speed of the population of this page will largely be dependent on the speed of the sources providing the data.  In such situations it is probably prudent to populate large portions of the page asynchronously which, while not adding much, if any, complexity to the Resource Container itself, adds to the complexity of the implementing front ends.
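One way to mitigate this, sketched below, is for the container to probe its candidate sources concurrently rather than one after another, with the front end filling in sections as results arrive.  The fetch_from function is a stand-in for whatever per-source retrieval actually happens.

import asyncio

async def fetch_from(source, uri):
    """Placeholder for a single source lookup (network call, coercion, etc.)."""
    await asyncio.sleep(0)  # real I/O would happen here
    return None

async def gather_resource(sources, uri):
    # Ask every candidate source at once instead of serially.
    results = await asyncio.gather(*(fetch_from(source, uri) for source in sources))
    return [data for data in results if data is not None]

# asyncio.run(gather_resource(configured_sources, "http://example.com/resources/42"))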


DnL8Tar
-PCM

Friday, March 25, 2011

On Information and Trust

Within a Semantic Web application (I suppose it does not have to be a Semantic Web application, however that's the context I have been working in and thus is most immediate to this thought) you, as a user, may be presented with and using data from multiple sources.  Since the data is all semantically linked, it does not necessarily need to reside in a single store but can be obtained from another authority on the subject of the resource in question.  


A question this brought up for me was one of trust.  Trust specifically concerning the validity of the information retrieved from a data source.  Since anyone can post information about a subject given a canonical URI, the "truthfulness" of the data is questionable at best.  For instance, presuming the URI http://example.com/people/ChuckNorris was intended to reference the Chuck Norris, I could, on my website, post a tuple such as


<http://example.com/people/ChuckNorris> foaf:nick "Prissy Pants Magee" .


which would establish that Chuck Norris has at least one nickname and that nickname is "Prissy Pants Magee." 


Perhaps because I was not thinking clearly, this was a concern of mine for some time.  However, I eventually came to the realization that simply moving towards a Semantic Web does not remove from the user the onus of understanding the sources of data.  As has always been the case, the consumer of information must know that the information is coming from a source that the consumer trusts.  A user trusting one newspaper and not another will be analogous to a user trusting one data source concerning Chuck Norris facts and not another.  And similarly, any tuple I may post on my website will be used only insofar as I (or my website) am (is) regarded as an authority on the subject matter.
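One way to make that trust decision operational is to remember where each statement came from, for example by loading each source into its own named graph and only consulting the graphs the consumer has chosen to trust.  A rough sketch with rdflib's Dataset (the source URLs are invented):

from rdflib import Dataset, URIRef

ds = Dataset()

trusted = URIRef("http://trusted-source.example/data")
dubious = URIRef("http://my-website.example/data")

# Each source's statements live in their own named graph.
ds.graph(dubious).parse(data="""
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.com/people/ChuckNorris> foaf:nick "Prissy Pants Magee" .
""", format="turtle")

chuck = URIRef("http://example.com/people/ChuckNorris")
nick = URIRef("http://xmlns.com/foaf/0.1/nick")

# A consumer that only consults the trusted graph never sees the dubious claim.
print(list(ds.graph(trusted).triples((chuck, nick, None))))  # []
print(list(ds.graph(dubious).triples((chuck, nick, None))))  # the nickname statement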


DnL8Tar
-PCM

Saturday, October 30, 2010

Concerning mod_rewrite as Related to URI Manifestation

While not a tenet of the Semantic-Web concept, certainly an ideal is the substantial manifestation of resource URIs, insomuch as a web-page can be "substantial."  As defined by Semantic-Web methodologies (and RDF in general), each Resource is identified by a URI.  These URIs, being identifiers and not true locations as URLs are, are not required to point to anything substantial.  It is, however, helpful if they do, as this exposes much more robust access to the Resource.  


To exemplify this point, let us say that you have identified a particular resource, for the sake of example we'll say this resource is a single comic strip, with the URI http://example.com/comics/1.2/ns/a-funny-comic-1.  As a URI this is completely valid, regardless of whether the identifier points to anything useful when typed into a web-browser.  Let us further say that, while attributing identifiers to comics, you also set yourself to the task of creating an application housing these identified resources and exposing access to them via a webpage.  A user of the webpage is to enter the URI of the resource they wish to see and, if said resource exists within your system, it is returned to them.  Here we have a situation where the resource itself is uniquely identified by a URI and is accessible by an application.  


It would be much more convenient however if, instead of needing to know both the URI of the desired resource and the location and usage of the application via which said resource may be retrieved, the seeker of a resource could simply type the URI into a third application with which the user is readily familiar, such as a web-browser, in order to retrieve the desired resource.  In the particular example which we have been exploring this is somewhat trivial assuming you own the domain "http://example.com" (obviously, in practice, http://example.com would be replaced by a domain you do in fact own).  You could create a directory structure on your server equivalent to the URI requested and place an index file within the final child directory.  Then a request to the URI specified above would retrieve http://example.com/comics/1.2/ns/a-funny-comic-1/index.html.   


This is a reasonable starting place, but the pitfalls are numerous and some not so obvious.  There is the clear downside of maintenance, be it manual or automatic.  There is also the less obvious downside of URI support.  Taking this approach forces you to use URLs as the URIs of all of your resources, limiting your naming potential.  To illustrate this, let us say that instead of the URI chosen above, you have identified your comic with the slightly different URI http://example.com/comics/1.2/ns#a-funny-comic-1.  The change is a single character, but how a web-browser will handle this request is vastly different from how it would handle the previous request.  As per section 2.2 of RFC 1738,


The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.


You might be tempted to skirt around the issue by percent-encoding such characters, however this is not a viable solution either since, per section 2.2 of RFC 3986 (coincidence that these notes are both from section 2.2 of their respective documents?), "#" is a reserved general delimiter and, 


URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.


Given this, let us return to a previous portion of the example.  What would be ideal is if you could leverage the full scope of URI delimiters and, instead of the URI typed into the web-browser being treated as a URL, it were passed to a subsequent application which would properly treat it as a URI and return to the user the identified resource.

mod_rewrite

mod_rewrite is an Apache module which allows you to rewrite a requested URL based on a series of conditions and rules.  The rewriting happens on the server side, so the user is not made aware of the change in request (that is to say, the address which the user has entered into the address bar does not change).  This module gives us the ability to "catch" a requested URI before it is fully processed as a URL and rewrite the request into something the server will actually be able to process as the user expects.  mod_rewrite may be implemented at either the server or directory level, and while it is always best to implement such modules at the server level, there are cases where it is more appropriate (or necessary, in the case of a shared server) to implement the module at a directory level via a .htaccess file.

Returning again to our example, we might write rewrite rules stating "any time a user makes a request to an address starting with http://example.com/comics/1.2, hand that request to our resource retrieval application (introduced earlier in this post), passing the fully requested URI as a GET parameter."  Such a rule might look like this (assuming implementation at the directory level starting at the root directory of example.com):

RewriteEngine on

RewriteRule ^comics/1\.2/(.*) request-resource.php?uri=http://example.com/comics/1.2/$1 [L]

Breaking this down, the RewriteRule first specifies the pattern against which a request is matched, presented as a Perl-compatible regular expression.  The regular expression in this example states that the rule will fire whenever it encounters a request starting with "comics/1.2/" and followed by any number of characters.  The RewriteRule then specifies the action which should be taken when the rule fires.  In this case, the request is rewritten as a request to the script request-resource.php, and the requested URI is passed as a GET parameter (the $1 within the URI stands for the remaining portion of the request, which was captured by "(.*)" in the first portion of the RewriteRule).  Finally, a set of optional flags may be presented to the RewriteRule.  In this example we present [L], meaning that this is the last rule which should fire.

Given this code (and the assumption that the previously discussed application is in fact accessible via the PHP file request-resource.php), users may now retrieve resources defined within the http://example.com/comics/1.2/ns/ namespace simply by entering the URI into a web browser.  
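For completeness, here is a rough sketch of what the receiving script might do with the rewritten request.  I present it in Python rather than PHP purely for illustration; the look_up_resource function is a stand-in for whatever resource store the application actually queries.

from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def look_up_resource(uri):
    """Placeholder: fetch the resource identified by this URI from the application's store."""
    return "You asked for the resource identified by " + uri + "\n"

def app(environ, start_response):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    uri = params.get("uri", [""])[0]  # populated by the rewrite rule above
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [look_up_resource(uri).encode("utf-8")]

# make_server("", 8000, app).serve_forever()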

Note: this technique will not handle all URIs, for example, "urn:oasis:names:specification:docbook:dtd:xml:4.1.2" would not be accessible via this methodology though it still stands as a valid URI.




DnL8Tar
-PCM