Sunday, May 15, 2011

On Data Not Accessible in an Expected Tuple Format - A Continuation of the Magic Hat Discussion

In my prior post I brought up the concept of using a "Magic Hat" mechanism to obtain resources regardless of the "physical" location of the resources.  An assumption which this mechanism makes is that those resources which are to be retrieved can be retrieved in a standardized format, in this case, data tuples.  There are however many systems housing interesting and useful data which do not serve this data in such a format directly.  

What can be done when access through the Magic Hat to data held in such a system is desirable depends largely on what mechanisms the system affords for acquiring its data in general.  If the system provides no API through which to pull the data but does expose the ability to edit the template code with which the data is rendered, RDFa could be added to the template code in order to add semantics to the rendered data, making it more accessible to a tuple store.  This approach is quite limited, however: while it facilitates the pulling of a single resource through the Magic Hat, it does little to ease the asking of a question about all of the data contained in the system.  To elaborate, consider the request "provide all data created by [A] which concerns the topic [B] and was written after the date [C]."  Such a request would be hard to satisfy in such a system, as we are largely limited to considering a single resource at a time.

The provision of an API has become quite standard in online systems, however, and any newer system or application which does not expose one is most likely a) in beta with an API on the way, or b) not worth using.  How requests are to be made to the API and the format of the data it returns are left to the designer of the API and are not guaranteed to match the request and data formats which the Magic Hat expects.  As such, some coercion on both the request and response ends is necessary.  I've taken this approach with the retrieval of Blogger data and will speak further on it in my next post.  For now, suffice it to say that such a mechanism allows for much more robust requests but is limited by the complexity of the coercion needed as requests grow more complex.
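
As a rough illustration of coercion on both ends, here is a minimal Python sketch.  Everything named in it is an assumption made for the example: the predicate names, the parameter names (author, label, published_after) and the shape of the JSON payload are invented and are not Blogger's actual API.

```python
from datetime import date

# Hypothetical mapping from tuple predicates to the remote API's parameter names.
PREDICATE_TO_PARAM = {
    "dc:creator": "author",
    "dc:subject": "label",
    "dc:date>": "published_after",
}


def coerce_request(pattern):
    """Turn a {predicate: value} pattern into query parameters for the API."""
    params = {}
    for predicate, value in pattern.items():
        if predicate not in PREDICATE_TO_PARAM:
            # This is where the coercion runs out: the API cannot express it.
            raise ValueError(f"cannot express {predicate!r} against this API")
        text = value.isoformat() if isinstance(value, date) else str(value)
        params[PREDICATE_TO_PARAM[predicate]] = text
    return params


def coerce_response(payload):
    """Turn the (assumed) JSON payload into (subject, predicate, object) tuples."""
    tuples = []
    for entry in payload.get("entries", []):
        subject = entry["url"]
        tuples.append((subject, "dc:title", entry["title"]))
        tuples.append((subject, "dc:creator", entry["author"]))
        tuples.append((subject, "dc:date", entry["published"]))
    return tuples
```

The ValueError branch is the limitation in miniature: every new kind of request needs another entry in the mapping, and some requests will have no equivalent on the API side at all.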

An approach which I have seen, or heard of, some applications taking is to retrieve all data of interest from a system and house it locally in a tuple format.  While this has the upside of allowing for robust data requests and combinations using expected request and response formats, it carries the significant overhead of maintaining consistency between the local system and the remote system, which is still the owner of the data.
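
To show why local tuples make the richer requests easy, here is a small Python sketch of the "created by [A], concerning [B], written after [C]" request run against an in-memory list of tuples.  The list stands in for a real tuple store, and the subjects, predicates and values are all made up for the example.

```python
from datetime import date

# A tiny in-memory stand-in for a local tuple store: (subject, predicate, object).
STORE = [
    ("post:1", "dc:creator", "A"),
    ("post:1", "dc:subject", "B"),
    ("post:1", "dc:date", date(2011, 5, 1)),
    ("post:2", "dc:creator", "A"),
    ("post:2", "dc:subject", "B"),
    ("post:2", "dc:date", date(2010, 1, 1)),
    ("post:3", "dc:creator", "someone-else"),
    ("post:3", "dc:subject", "B"),
    ("post:3", "dc:date", date(2011, 5, 2)),
]


def subjects_with(store, predicate, test):
    """Subjects having at least one tuple with this predicate whose object passes test."""
    return {s for s, p, o in store if p == predicate and test(o)}


def created_by_about_after(store, creator, topic, after):
    """Subjects created by `creator`, concerning `topic`, dated after `after`."""
    return (
        subjects_with(store, "dc:creator", lambda o: o == creator)
        & subjects_with(store, "dc:subject", lambda o: o == topic)
        & subjects_with(store, "dc:date", lambda o: o > after)
    )
```

Calling created_by_about_after(STORE, "A", "B", date(2011, 1, 1)) yields only post:1.  The sketch says nothing about the hard part, which is keeping STORE in step with the remote system.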


Monday, April 25, 2011

A Magic Hat

I have not been to many magic shows in my life so my perspective here may not be ideal.  Looking back on the few I have attended however, I do not recall ever seeing a magician pull anything out of their iconic top hat, nor even wearing said hat.  The pervasiveness of this image in popular media however assures me that the concept is not lost on society.

Consider the magician's hat.  A simple, undecorated top hat, from which the trained user is able to pull any number of objects.  With a wave and a wink the magician is able to materialize whatever object suits their fancy, be it a bouquet of flowers, a deck of playing cards, or the infamous bunny rabbit.  If one breaks from the mundane reality of the trick, one is free to consider where these objects must have originated.  The bunny rabbit, for example, surely had been hopping along a field of grass and clover prior to its arrival at the hat.  At the will of the magician it was plucked from its prior location in time and space, traveling at the speed of thought across the gap in these dimensions between the field and the hat, arriving finally in the magician's trained hand, ideally no worse for wear.

The consumer of the trick, namely the audience, need not concern themselves with where the object appearing from the hat originated.  They need only concern themselves with the fact that the objects, which certainly were not in the hat to begin with, have been made manifest in its confines and have been produced to the satisfaction of all onlookers.

An application of this concept to the procurement of resources by an application is one of the projects keeping me occupied at the moment.  The "Magic Hat" is what I call a "Resource Container."  The principle of the container is the same as that of the magic hat, insomuch as the user of the container is able to bring forth from the container a required resource without needing to concern themselves with where that resource is coming from.  The user may require a rabbit and a deck of playing cards, and while the rabbit may be pulled from a grassy field and the playing cards from a drug store shelf, the user's only concern is that they started off not having these two resources and they now have both of them.

A Resource Container discovers sources of resources via some discovery mechanism (the simplest of which is user configuration, although more robust auto-discovery mechanisms are conceptually possible).  When a particular resource is requested of the container, if the container does not have a cached version, or if the cached version is stale, it determines, based on its known sources of resources and its knowledge of those sources, which sources to probe for information about the requested resource.  The container then produces this information in a standardized format to the satisfaction of the requester, who need not know the source of the information (so long as the source is trustworthy by the user's definition of trust).
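
The flow just described can be sketched in a few lines of Python.  This is only an illustration under my own assumptions, not the code I am actually working on: the class and method names are hypothetical, discovery is reduced to the caller registering sources by hand, and staleness is a simple maximum age.

```python
import time


class ResourceContainer:
    """Sketch of a container that caches resources and probes registered sources."""

    def __init__(self, max_age_seconds=300.0, clock=time.monotonic):
        self._sources = []   # (can_provide, fetch) pairs, in registration order
        self._cache = {}     # uri -> (fetched_at, tuples)
        self._max_age = max_age_seconds
        self._clock = clock

    def register_source(self, can_provide, fetch):
        """Discovery by user configuration: the caller registers each source."""
        self._sources.append((can_provide, fetch))

    def get(self, uri):
        """Return tuples about uri, using the cache unless it is missing or stale."""
        cached = self._cache.get(uri)
        if cached is not None and self._clock() - cached[0] <= self._max_age:
            return cached[1]
        tuples = []
        for can_provide, fetch in self._sources:
            # Only probe the sources that claim knowledge of this resource.
            if can_provide(uri):
                tuples.extend(fetch(uri))
        self._cache[uri] = (self._clock(), tuples)
        return tuples
```

The requester only ever calls get; which sources were probed, and whether the answer came from the cache, stays inside the hat.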

To make this description more concrete, consider a resume display application.  The user may define a layout in which they want their resume to appear, indicating where they want their work history to appear, their education history, their personal information, etc.  The user may then attach this application to a Resource Container in order to acquire the actual information necessary to populate the various sections.  Imagine further an ideal online environment in which every place you have worked hosts some open API via which you are able to query your work history, so long as you have proper authority to do so.  Similarly, imagine that educational institutions have such a system as well.  When you request all work history tied to yourself, the Resource Container would be tasked with finding all known sources which would have resources of type "Work History" and would further be tasked with acquiring these resources.  You as a user, and your front end application, would not care that your work history is coming from three, four, five or more different sources.  Using the same mechanisms to pull education history from your various schools and personal information from, perhaps, some government site, or from another service such as LinkedIn, the Resource Container would be able to provide to the front end application all necessary data, acquired ideally from the canonical sources of said data, abstracting from the front end the mundane details of the data sources.

Immediate Difficulties 

In implementing the aforementioned system I've run into a couple of immediate difficulties which I'm working through now and which are worth documenting.

Requested Resources vs Containing Resource: When requesting a particular resource, the subject of the resource being requested is often not enough information to determine where to acquire data concerning the resource.  Going back to the prior example, consider a resume application where any number of users are able to format their resumes online, connecting the front end to sources of resource information applicable to them (i.e., their particular employers and educators).  If the system were requested to look up all work history data tied to a particular user, one could imagine the system being able to determine, based on pre-set user configuration, which sources to look at for this information.  However, if the system were asked to pull up data for a particular work history resource, without being told much more than the URI of that resource, the system would have a hard time determining where to pull this data from.  In such an application it would quickly become impractical to ask every known employer data source whether it had information about the URI.  As such, a scheme must be arrived at to give the Resource Container more information about the particular request than just the URI.
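
I have not settled on such a scheme.  One possible shape, sketched in Python purely as an assumption, is to have the request carry hints alongside the URI (here a resource type and the owning user) and to look sources up by those hints.  The names ResourceRequest and SourceRegistry and all of the fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceRequest:
    """A request that carries hints in addition to the bare URI."""
    uri: str
    resource_type: str = ""   # e.g. "WorkHistory"
    owner: str = ""           # the user on whose behalf the request is made


@dataclass
class SourceRegistry:
    """Maps (owner, resource_type) to the sources that user has configured."""
    by_owner_and_type: dict = field(default_factory=dict)

    def configure(self, owner, resource_type, source):
        key = (owner, resource_type)
        self.by_owner_and_type.setdefault(key, []).append(source)

    def candidates(self, request):
        """Sources worth probing; empty when the hints match no configuration."""
        key = (request.owner, request.resource_type)
        return list(self.by_owner_and_type.get(key, []))
```

A request with only a URI gets no candidates back, which is exactly the difficulty described above; the hints narrow the search to the handful of sources the user has configured.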

Performance Concerns: While I have not attempted any type of benchmarking with what's been coded thus far, I presume that performance is not going to be amazing in situations where data is needed from multiple sources at one time.  If one has a home page which is pulling data from 15 different sources, one can imagine that this may not be quick, and the speed of the population of this page will largely be dependent on the speed of the sources providing the data.  In such situations it is probably prudent to populate large portions of the page asynchronously which, while not adding much, if any, complexity to the Resource Container itself, adds to the complexity of the implementing front ends.
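
I have no measurements to offer, but the general idea of not waiting on each source in turn can be sketched with Python's asyncio.  This shows concurrent fetching in a single process rather than asynchronous population of a page in the browser, and the source names and delays are stand-ins for real network calls.

```python
import asyncio


async def fetch_from_source(name, delay):
    """Stand-in for a network call; `delay` simulates the source's latency."""
    await asyncio.sleep(delay)
    return name, [(f"{name}:item", "ex:from", name)]


async def populate_page(sources):
    """Query all sources concurrently and collect the results by source name."""
    results = await asyncio.gather(
        *(fetch_from_source(name, delay) for name, delay in sources.items())
    )
    return dict(results)


if __name__ == "__main__":
    page = asyncio.run(populate_page({"employer": 0.2, "school": 0.1}))
    print(sorted(page))
```

With the requests issued together, the total wait should track the slowest source rather than the sum of all of them, though the page is still only as fast as that slowest source.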


Friday, March 25, 2011

On Information and Trust

Within a Semantic Web application (I suppose it does not have to be a Semantic Web application, however that's the context I have been working in and thus is most immediate to this thought) you, as a user, may be presented with and using data from multiple sources.  Since the data is all semantically linked, it does not necessarily need to reside in a single store but can be obtained from another authority on the subject of the resource in question.

A question this brought up for me was one of trust, specifically trust in the validity of the information retrieved from a data source.  Since anyone can post information about a subject given a canonical URI, the "truthfulness" of the data is questionable at best.  For instance, presuming the URI was intended to reference the Chuck Norris, I could, on my website, post a tuple such as foaf:nick "Prissy Pants Magee" 

which would establish that Chuck Norris has at least one nickname and that nickname is "Prissy Pants Magee." 

Perhaps because I was not thinking clearly, this was a concern of mine for some time.  However, I eventually came to the realization that simply moving towards a Semantic Web does not remove from the user the onus of understanding the sources of data.  As has always been the case, the consumer of information must know that the information is coming from a source that the consumer trusts.  A user trusting one newspaper and not another will be analogous to a user trusting one data source concerning Chuck Norris facts and not another.  And similarly, any tuple I may post on my website will be used only insofar as I (or my website) am (is) regarded as an authority on the subject matter.
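
In code, this amounts to keeping track of where each tuple came from and letting the consumer decide which sources count.  A minimal Python sketch follows; the subject is a placeholder rather than a real URI, and the source addresses and the second tuple are made up for the example.

```python
# Placeholder subject; this is not a real URI for anyone.
SUBJECT = "urn:example:some-person"

# Each statement carries the source it was retrieved from (all values hypothetical).
STATEMENTS = [
    (SUBJECT, "foaf:nick", "Prissy Pants Magee", "http://my-site.example"),
    (SUBJECT, "foaf:name", "Chuck Norris", "http://trusted.example"),
]


def trusted_statements(statements, trusted_sources):
    """Keep only the (subject, predicate, object) tuples from trusted sources."""
    return [
        (s, p, o)
        for s, p, o, source in statements
        if source in trusted_sources
    ]
```

A consumer who trusts only http://trusted.example never sees the nickname; one who trusts my site does.  The decision stays with the consumer, as it does with newspapers.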