Meta information searches on the Gnutella Network

Sumeet Thadani, Lime Peer Technologies LLC (LimeWire)

 

Abstract: LimeWire and other clients on the Gnutella network currently respond to searches by string matching the query against the names of the files in each host’s library.  Consequently, searches are restricted to strings that can be contained in a filename.

           

We propose a technique for allowing richer querying. Every file in a host’s library may have some meta-data associated with it. Query Requests will encode the richer queries, and responses will contain results based on the rich query searches in addition to the regular results. The proposed scheme ensures that the protocol continues to work with older clients, which do not understand the embedded rich queries.

 

 

Introduction

 

LimeWire and other clients on the Gnutella network currently respond to searches by string matching the query against the names of the files in each host’s library.  Consequently, searches are restricted to strings that can be contained in a filename and directory path.

 

For example, suppose you were looking for a book titled “The Big Bang – Origin of the universe”, which was written by Mr. John Doe and published by the ABC Publishing Co. in March 1997.

 

If anyone actually had the book, the file would probably be called “The Big Bang.txt”. If a user searched for “Big Bang” on the current system, she would get the correct response along with perhaps a thousand others that are of no interest to her.

 

The idea is to allow users to specify more information in the query, and to be able to search more efficiently for the information that really concerns them. Each file in a user’s library can be associated with multiple sets of meta-data tags.

 

To use the running example, the file “The Big Bang.txt” could be associated with multiple sets of meta-data tags. One such tagset (a set of related tags that together describe one aspect of the file) could be the “publishing information” tagset, which could contain information like:

 

·        Publisher = ABC Publishing Co.

·        Publish Date = March 1997.

·        Author = John Doe

 

Similarly, other tagsets may be associated with the file, such as the “general information” tagset, which would contain general information about the book. This set may look like:

 

·        Title = “The Big Bang – Origin of the universe”

·        Number of chapters = 23

·        Genre = Non-fiction.

 

Once this information is associated with a file, other users can search on the basis of author (perhaps together with other information in the publishing tagset), or on the basis of genre (with some other information in the general tagset).

 

We now propose how this can be accomplished.

 

Implementation

 

Encoding

 

The meta-information will be encoded in XML. Note that it’s possible to use other encoding systems (like binary encoding) to represent the grammar we wish other clients to be able to understand (and respond to). However, we choose XML based on the following advantages:

 

·        Parsers are easily available.

·        XML is a well-understood standard.

 

Please see appendix A for a brief discussion on encoding the meta-information in binary.

 

To use the running example from above, each tagset in the examples above has a particular format. For instance, the “general information” tagset contains information about Title, Number of Chapters and Genre. But in what format should we encode Queries and Replies to carry information pertaining to this tagset?

 

In order for this scheme to be effective we require all clients to be aware of the format of this encoding of the information. We use a template for this.

 

Each template contains a “Reply Element” and a “Request Element”. The purpose of having these embedded in the template is that, once everyone has the same template, there is a standard about what fields one can expect in the query (corresponding to Request Element) and what fields are possible in the reply (corresponding to the fields in the Reply Element).

 

Typically the set of fields in the Reply Element will be a superset of the fields in the Request Element, because some fields that can appear in replies are not really searchable. For example, we could have a “comment” field in the Reply Element but not in the Request Element. Once someone searches for a file on the basis of the fields in the Request Element, the reply could contain an additional field holding the previous owner’s comments about the file, even though the comment field itself is not searchable.

 

Note that there would be no need for separate Request and Reply Elements if they had the same set of fields; both could be the same element. However, as discussed above, there may be fields in the Reply that we do not expect in the Request, hence the need for two different Elements.

 

So, both the Reply Element and the Request Element contain references to various fields, which are defined later in the template. After the Reply and Request Elements have been defined in the template, each field is individually defined.  Obviously, the set of all the field elements defined at this stage is the union of all the fields defined in the Request Element and the Reply Element.

 

Our implementation requires three types of fields: String fields, Integer fields and Choice fields. Choice fields are made up of user-defined choices. Examples of all three types appear in the XML encoding of the template for the “books general template”:

 

 

<Template uri="http://limewire.com/books.gmlt" userDefined="yes">
      <Request title="Book Search" keywords="Book">
            <FieldRef name="TIT1" />
            <FieldRef name="CH" />
            <FieldRef name="GEN" />
      </Request>

      <Reply title="Book Info">
            <FieldRef name="TIT1" />
            <FieldRef name="CH" />
            <FieldRef name="GEN" />
            <FieldRef name="COMM" />
      </Reply>

      <Field name="TIT1" displayString="Title">
            <String/>
      </Field>

      <Field name="CH" displayString="Number of Chapters">
            <Integer/>
      </Field>

      <Field name="COMM" displayString="Comment">
            <String/>
      </Field>

      <Field name="GEN" displayString="Genre">
            <Choice value="0" displayString="Fiction" />
            <Choice value="1" displayString="Non-Fiction" />
            <Choice value="2" displayString="Thriller" />
      </Field>
</Template>
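As a rough illustration of how a client might load such a template, the Python sketch below parses it with the standard library's xml.etree.ElementTree. The function name parse_template and the dictionary layout it returns are assumptions made for this example; the proposal itself does not prescribe any in-memory representation.

```python
import xml.etree.ElementTree as ET

# The sample "books" template from the text, with whitespace tidied.
TEMPLATE_XML = """
<Template uri="http://limewire.com/books.gmlt" userDefined="yes">
  <Request title="Book Search" keywords="Book">
    <FieldRef name="TIT1"/><FieldRef name="CH"/><FieldRef name="GEN"/>
  </Request>
  <Reply title="Book Info">
    <FieldRef name="TIT1"/><FieldRef name="CH"/>
    <FieldRef name="GEN"/><FieldRef name="COMM"/>
  </Reply>
  <Field name="TIT1" displayString="Title"><String/></Field>
  <Field name="CH" displayString="Number of Chapters"><Integer/></Field>
  <Field name="COMM" displayString="Comment"><String/></Field>
  <Field name="GEN" displayString="Genre">
    <Choice value="0" displayString="Fiction"/>
    <Choice value="1" displayString="Non-Fiction"/>
    <Choice value="2" displayString="Thriller"/>
  </Field>
</Template>
"""


def parse_template(xml_text):
    """Turn a template document into plain dictionaries and lists."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for field in root.findall("Field"):
        choices = {c.get("value"): c.get("displayString")
                   for c in field.findall("Choice")}
        if choices:
            kind = "choice"
        elif field.find("Integer") is not None:
            kind = "integer"
        else:
            kind = "string"
        fields[field.get("name")] = {
            "label": field.get("displayString"),
            "type": kind,
            "choices": choices,
        }
    return {
        "uri": root.get("uri"),
        "user_defined": root.get("userDefined") == "yes",
        # Field order matters later, so the references are kept as lists.
        "request": [r.get("name") for r in root.find("Request").findall("FieldRef")],
        "reply": [r.get("name") for r in root.find("Reply").findall("FieldRef")],
        "fields": fields,
    }


template = parse_template(TEMPLATE_XML)
```

Keeping the Request and Reply field references as ordered lists makes the later ordering rule for rich queries easy to check.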

 

 

We now go through each of the elements in the sample template above and explain it in some detail.

 

The template is associated with a URI. If a host receives a query (or a reply) that refers to a template it does not have, it should download the template from that URI in order to understand the rich query (or the rich reply). The URI is specified in all Queries and Replies.
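One way to picture the download-on-demand behaviour is a small cache keyed by template URI. The sketch below is only an illustration: the class name TemplateCache and the injected fetch callable are hypothetical, and the actual download mechanism is deliberately left out.

```python
class TemplateCache:
    """Keeps templates keyed by URI and fetches unknown ones on demand."""

    def __init__(self, fetch):
        # fetch is any callable taking a URI and returning the template text.
        # It is injected so the sketch does not assume a particular HTTP library.
        self._fetch = fetch
        self._templates = {}

    def get(self, uri):
        if uri not in self._templates:
            self._templates[uri] = self._fetch(uri)
        return self._templates[uri]


# Usage with a stand-in fetcher that records which URIs were requested.
requested = []


def fake_fetch(uri):
    requested.append(uri)
    return "<Template uri='%s'/>" % uri


cache = TemplateCache(fake_fetch)
first = cache.get("http://limewire.com/books.gmlt")
second = cache.get("http://limewire.com/books.gmlt")  # served from the cache
```

With such a cache, each template would be downloaded at most once per host, no matter how many queries or replies refer to it.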

 

The template also specifies whether it is user defined or whether the meta-data is embedded in the file itself. If the meta-data is embedded in the file, the template must be created accordingly, but the template still has to be defined by users.

 

We now discuss how Queries and Query Replies work. We assume that the files in the user’s library are somehow associated with some meta-data. In a later section of this paper we explain the process of creating meta-data and associating it with files in the user’s library.

 

 

Query Requests

 

We will use the running example to illustrate the creation of a Query Request. When the user indicates that she wants to do a rich search, she will be shown a list of templates the system is aware of. The user chooses one such template on the basis of which she wishes to search, and populates the appropriate fields of the template as per her requirements.

 

This is illustrated below:

  1. Select tag search. The system shows you the tags you can search on.

  2. Populate the fields of the search criterion.

[Screenshot: rich search form with the search fields populated]

  3. Click on search.

 

 

Note: There may or may not be a regular query in the Query Request.

 

The Query Request is created with this format:

 

·        Bytes 0 and 1: Minimum speed (Not changed)

·        Bytes 2 to null termination (say position x): normal string query (Not changed)

·        Bytes x+1 to next null termination: rich Query (Added)
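The layout above can be sketched in a few lines of Python. The function name build_query_payload is hypothetical, and two details are assumptions that the proposal does not state: the minimum speed is packed as a little-endian 16-bit value, and the strings are encoded as UTF-8.

```python
import struct


def build_query_payload(min_speed, query, rich_query=""):
    """Assemble the payload of a Query Request.

    Bytes 0-1 hold the minimum speed, then the normal query string and its
    null terminator, then (optionally) the rich query and its own null.
    """
    # Assumption: little-endian byte order for the two-byte minimum speed.
    payload = struct.pack("<H", min_speed)
    payload += query.encode("utf-8") + b"\x00"
    if rich_query:
        payload += rich_query.encode("utf-8") + b"\x00"
    return payload


plain = build_query_payload(0, "Big Bang")
rich = build_query_payload(0, "Big Bang", "<GML/>")
```

A request without a rich query is byte-for-byte the same as an original Query Request, which is what keeps the change compatible with older clients.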

 

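This is illustrated in the figures below:

Original Query:

[Figure: minimum speed (2 bytes) | normal query string | null]

The proposed new Query Request will look like this:

[Figure: minimum speed (2 bytes) | normal query string | null | rich query | null]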

Older clients that do not understand rich queries will simply ignore everything after the first null. Newer clients that wish to take advantage of the rich query will understand that the first null is not the end of the packet and that a rich query follows it.
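The difference between the two kinds of client can be sketched as two parsers reading the same payload. The function names are hypothetical, and the little-endian speed field and UTF-8 text are assumptions made for the example.

```python
import struct


def parse_legacy(payload):
    """What an older client sees: the speed and the text up to the first null."""
    (min_speed,) = struct.unpack_from("<H", payload, 0)
    first_null = payload.index(b"\x00", 2)
    return min_speed, payload[2:first_null].decode("utf-8")


def parse_rich(payload):
    """What a newer client sees: it keeps reading after the first null."""
    (min_speed,) = struct.unpack_from("<H", payload, 0)
    first_null = payload.index(b"\x00", 2)
    query = payload[2:first_null].decode("utf-8")
    rest = payload[first_null + 1:]
    second_null = rest.find(b"\x00")
    # No second null means there is no rich query (an original-style request).
    rich = rest[:second_null].decode("utf-8") if second_null != -1 else ""
    return min_speed, query, rich


new_style = b"\x00\x00Big Bang\x00<GML/>\x00"
old_style = b"\x00\x00Big Bang\x00"
```

Both parsers start searching for the null at byte 2, so a zero byte inside the minimum speed field is never mistaken for the end of the query string.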

 

Here is an example of a rich query part of the Query Request.

 

<GML template="http://limewire.com/books.gmlt">
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <CH>23</CH>
            <GEN>1</GEN>
</GML>

 

The above rich Query is looking for a file with Title = “The Big Bang – Origin of the Universe” which has 23 chapters and is a non-fiction book. In order to understand these things, of course the other clients will need to have access to the template.

 

Examples of a few more legal rich queries are:

<GML template="http://limewire.com/books.gmlt">
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <CH>23</CH>
</GML>

<GML template="http://limewire.com/books.gmlt">
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <GEN>1</GEN>
</GML>

Note: The above Rich Query String corresponds to the picture of the rich search in the screen shot shown earlier.

<GML template="http://limewire.com/books.gmlt">
            <CH>23</CH>
            <GEN>1</GEN>
</GML>

 

The rich queries need not contain information for all the fields of the template, only the ones the user wishes to base her search on. But it is illegal to switch the order of the fields. For instance, this rich query is illegal:

<GML template="http://limewire.com/books.gmlt">
            <CH>23</CH>
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <GEN>1</GEN>
</GML>

 

Nor is it legal to use fields like “COMM” that appear in the Reply Element but not in the Request Element of the template, even though the field is defined in the template.

 

Needless to say, we do not allow the use of fields like “ABC” (which are not in the template at all) in Rich Query Requests, because this field would be completely arbitrary, and hence would make no sense.

 

We do not permit out-of-order rich queries for efficiency reasons. Besides, the template already specifies an order.
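
The three legality rules (fields must be defined in the template, must belong to the Request Element, and must keep the template's order) can be checked in a single pass. The Python sketch below is one possible reading of those rules; validate_rich_query and its arguments are hypothetical names, and treating a repeated field as illegal is an assumption.

```python
def validate_rich_query(query_fields, request_fields, defined_fields):
    """Return None if the field sequence is legal, else a reason string.

    query_fields   -- field names in the order they appear in the rich query
    request_fields -- field names of the template's Request Element, in order
    defined_fields -- every field name defined anywhere in the template
    """
    position = -1
    for name in query_fields:
        if name not in defined_fields:
            return f"{name} is not defined in the template"
        if name not in request_fields:
            return f"{name} may only appear in replies"
        index = request_fields.index(name)
        if index <= position:
            # Assumption: a repeated field is rejected like an out-of-order one.
            return f"{name} is out of order or repeated"
        position = index
    return None


REQUEST = ["TIT1", "CH", "GEN"]
DEFINED = {"TIT1", "CH", "GEN", "COMM"}
```

Because the order is fixed, the check only has to remember the position of the last field it saw, which is one way the ordering rule can help efficiency.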


Query Reply

 

If a rich query reaches an older client that does not recognize rich queries, it will do a normal search and send out a “normal reply”, based on the normal part of the Query and a string-matching algorithm run against the file names in the user’s library.

 

If the rich query reaches a newer client that understands rich queries, the search proceeds on the basis of the rich query (and also the normal query). The host creates a Query Reply from the results of this search and sends it out just like a normal reply.

 

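The original Query Reply with two matches for files is illustrated below:

[Figure: original Query Reply containing two file matches]

The proposed new Query Reply will look like this:

[Figure: proposed Query Reply, with the meta-data for each file placed between the two nulls that end that file’s entry]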

Here is an example of the XML part of a response (corresponding to one file).

 

 

<GMLReplyCollection identifier="BigBang.txt">
      <GML template="http://limewire.com/books.gmlt">
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <CH>23</CH>
            <GEN>1</GEN>
      </GML>

      <GML template="http://limewire.com/publishing.gmlt">
            <PUB>ABC Publishing Co.</PUB>
            <DTE>March 1997</DTE>
            <AUT>John Doe</AUT>
      </GML>
</GMLReplyCollection>

 

When a search is successful, we return only the meta-data of the corresponding file, but we return all the meta-data corresponding to that file.

 

Note, if a certain file has no meta-data and a match is found for that file, the meta-data part of the Query Reply will be empty for that part of the response.
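On the receiving side, the meta-data part of a response could be read roughly as follows. This is a sketch using Python's xml.etree.ElementTree; the function name parse_reply_collection and the returned shape (an identifier plus a list of template URI and field dictionary pairs) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

REPLY_XML = """<GMLReplyCollection identifier="BigBang.txt">
  <GML template="http://limewire.com/books.gmlt">
    <TIT1>The Big Bang – Origin of the Universe</TIT1>
    <CH>23</CH>
    <GEN>1</GEN>
  </GML>
  <GML template="http://limewire.com/publishing.gmlt">
    <PUB>ABC Publishing Co.</PUB>
    <DTE>March 1997</DTE>
    <AUT>John Doe</AUT>
  </GML>
</GMLReplyCollection>"""


def parse_reply_collection(xml_text):
    """Return (identifier, [(template_uri, {field: value}), ...])."""
    if not xml_text.strip():
        # A file without meta-data has an empty meta-data part in the reply.
        return None, []
    root = ET.fromstring(xml_text)
    tagsets = []
    for gml in root.findall("GML"):
        fields = {child.tag: (child.text or "").strip() for child in gml}
        tagsets.append((gml.get("template"), fields))
    return root.get("identifier"), tagsets


identifier, tagsets = parse_reply_collection(REPLY_XML)
```

Each GML element carries its own template URI, so a client can look up the matching template before displaying or storing the fields.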

 

We have reason to believe that the proposed changes will not affect existing clients in any way, since we have observed that Gnotella already embeds some XML data within its query replies, which other clients currently ignore.

 

So, in the Query Reply, for each file that we are sending a response about, we also include the GMLReplyCollection information (if there is any).

 

Meta-Data creation

 

First, some information about tags, for those wondering what GML is all about. GML stands for Gnutella Markup Language (some people in our company also call it Greg’s Markup Language).

 

There can be two types of meta-information associated with each file in the library: embedded meta-data and user-defined meta-data.

 

Embedded meta-data: In some cases, the file itself contains embedded meta-data. This embedded meta-data corresponds to one tagset. The format of the information in this tagset is pre-defined and is not editable (a template for the tagset still has to be defined, with its userDefined attribute set to “no”).  LimeWire has the capability to scan files and find any internal meta-data.

 

User defined meta-data: Some files may be associated with meta-data according to certain user-defined template(s). In LimeWire there are two possible ways for a file to be associated with user-defined templates.

 

1.      The user adds information to the various fields of a certain template and the resulting tagset is associated with the file. The various steps are:

a.      Select the file and right click, choose “View/Edit tags”.

b.      A user interface shows up that asks you to add tags.

c.      Click the add button and choose the template according to which you want to populate the tags.

d.      Populate the fields.

e.      Save.

[Figures: screenshots of the “View/Edit tags” dialog for steps a to d]

 

 

Note that each file may be associated with multiple tagsets, and many tagsets may correspond to the same template (The template to tagsets relationship is a one-to-many relationship for each document).

 

2.      The user searches on the basis of a tagset (say X) and gets a response for a file (say from user A). On user A’s machine, the file is also associated with another tagset (say Y). When the user successfully downloads the file, it will be associated with both tagsets X and Y on her machine.

 

Note that X and Y may be either user defined or embedded.

 

 

Thus each file is associated with (multiple) tagsets.
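The association between files and tagsets, including the way a download carries the remote tagsets along, might be modelled like this. The Library class and its method names are hypothetical; the sketch only mirrors the two rules described above (several tagsets per file, possibly several per template, and tagsets inherited on download).

```python
class Library:
    """Maps each file name to the list of tagsets associated with it."""

    def __init__(self):
        # file name -> list of (template URI, {field name: value}) pairs
        self._tagsets = {}

    def add_tagset(self, filename, template_uri, fields):
        entry = (template_uri, dict(fields))
        existing = self._tagsets.setdefault(filename, [])
        if entry not in existing:  # skip exact duplicates only
            existing.append(entry)

    def tagsets(self, filename):
        return list(self._tagsets.get(filename, []))

    def on_download(self, filename, remote_tagsets):
        """Attach every tagset the remote host reported for the file."""
        for template_uri, fields in remote_tagsets:
            self.add_tagset(filename, template_uri, fields)


BOOKS = "http://limewire.com/books.gmlt"
PUBLISHING = "http://limewire.com/publishing.gmlt"

library = Library()
library.on_download("BigBang.txt", [
    (BOOKS, {"CH": "23", "GEN": "1"}),      # tagset X, the one searched on
    (PUBLISHING, {"AUT": "John Doe"}),      # tagset Y, known only to user A
])
```

Since a reply returns all the meta-data of a matching file, the downloading host receives both X and Y without any extra request.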

 

We use the running example to create two tagsets for the file BigBang.txt.

 

Note that the two tagsets in the example below correspond to two different templates. Only the template for the first one was defined above; the template for the second tagset is not defined in this example.

 

 

<GMLReplyCollection identifier="BigBang.txt">
      <GML template="http://limewire.com/books.gmlt">
            <TIT1>The Big Bang – Origin of the Universe</TIT1>
            <CH>23</CH>
            <GEN>1</GEN>
      </GML>

      <GML template="http://limewire.com/publishing.gmlt">
            <PUB>ABC Publishing Co.</PUB>
            <DTE>March 1997</DTE>
            <AUT>John Doe</AUT>
      </GML>
</GMLReplyCollection>

 

 

Meta-data creation involves making templates of the information we wish to encode, and posting the template to a URI where it’s accessible to all. Once the template is out there, many people could start using it to encode and search for the data they really want. Some files have embedded information. These require templates to be made for them and posted as well.

Appendix A

 

Binary format for encoding meta-data

 

It’s possible to encode the meta-information in binary (although LimeWire’s alpha implementation uses XML encoding). The obvious advantage is the conservation of bandwidth. But there is also a price to pay: we lose the ability to have nested meta-data.

 

First, let’s take a look at how this scheme can work.

 

The template for the schema will still be in XML.

 

In Queries and Query Replies, meta-data will be in the same “position” within the message (i.e. between the double null in the Query Reply and after the first null in the Query). However, the structure of the Queries and Query Replies will be different than described in the sections above.

 

Query Requests

 

Let’s use the running example. A Query Request in our example would normally have three fields (perhaps fewer) in the Rich Query String. When using XML tags, the values of those fields would be encoded within the tags, to indicate what data they represent. The template would be very important in deciding what the data means.

 

When using binary, the template becomes even more important. The first part of the rich query must contain the URI of the template, followed by a delimiter. This is followed by the individual values for each of the fields in the template, separated by delimiters. If the user did not populate a particular field when doing the search, that field must still appear in the Rich Query String with a null (or zero) value. The reason is that only the template now defines which field to expect in which position, so order becomes very important and no field can be skipped.

 

Query Replies

           

The concept is the same as for Query Requests. The various fields in the Rich Query Reply must be delimiter separated and they must all be in the same order as specified in the template. Again, if a certain field does not have a value, it should still be included in the Rich Query Reply to ensure that the ordering remains consistent with the template.
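The positional scheme for both requests and replies can be sketched as a pair of functions. Everything specific here is an assumption: the appendix names no delimiter, so the ASCII unit separator (0x1F) is used as a stand-in, unpopulated fields are written as empty values, and the function names are hypothetical.

```python
DELIMITER = b"\x1f"  # assumption: the proposal does not name a delimiter


def encode_positional(template_uri, field_order, values):
    """Write the template URI, then one value per template field, in order."""
    parts = [template_uri.encode("utf-8")]
    for name in field_order:
        # An unpopulated field still occupies its slot, as an empty value.
        parts.append(str(values.get(name, "")).encode("utf-8"))
    return DELIMITER.join(parts)


def decode_positional(data, field_order):
    """Recover the URI and the populated fields using the template's order."""
    parts = data.split(DELIMITER)
    uri, raw_values = parts[0].decode("utf-8"), parts[1:]
    if len(raw_values) != len(field_order):
        raise ValueError("value count does not match the template")
    values = {name: raw.decode("utf-8")
              for name, raw in zip(field_order, raw_values) if raw}
    return uri, values


REQUEST_ORDER = ["TIT1", "CH", "GEN"]
encoded = encode_positional("http://limewire.com/books.gmlt",
                            REQUEST_ORDER, {"CH": 23, "GEN": 1})
decoded = decode_positional(encoded, REQUEST_ORDER)
```

The same two functions would serve for replies by passing the field order of the Reply Element instead of the Request Element.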

 

 

 

 

 

 

The trade-offs

 

The clear benefit of using binary encoding is that the size of the Rich Query Requests and Replies will be much smaller than if XML were sent on the wire. This will save some bandwidth, which definitely means a lot in a network where bandwidth is a big constraint.

 

But binary encoding does not come free of cost; the price is paid in the flexibility of the meta-information encoding. With binary encoding it is not possible to have meta-information nested to a depth greater than one. That is, we cannot have a tag embedded in a tag embedded in another tag, and so on (we lose the recursive nature of XML, which accounts for much of its power).