Valo has a powerful semi-structured repository – perfect for storing and searching millions of documents at low latency. In this very quick tutorial, I’ll show you how to build a search engine for log files, so you can go on to build something more elaborate yourself!

The Architectural Building Blocks

First, some architectural basics. Valo has a RESTful interface that wraps around Apache Lucene, a Java-based indexing and search library. Because the interface is plain HTTP, it is easy to build with Valo in your language of choice, and you can painlessly scale your data across multiple nodes.

The Data

In this example, I have used a sample of 10,000 Apache log entries from NASA. Applications and machines can generate millions of events per day, so a log search tool is a useful way to find the needle in the haystack. We’ve already published a tutorial on how to load the data into Valo using Logstash.

Querying the log files

Once the data has been stored (and indexed) in Valo’s semi-structured repository, it is time to start querying!

To perform a simple search, we first need to create a JSON document, which I named ‘search.json’, with the following content:

{
  "uris"  : [
    "/streams/demo/infrastructure/apache"
  ],
  "query": {
    "base": "search('takeoff')"
  }
}

In the JSON above, I have specified that I want to search only the apache stream for the keyword ‘takeoff’.

To get the results, I ran the following HTTP POST command in the terminal:

curl -H "Content-Type: application/json" -X POST --data @search.json http://localhost:8888/ssr/demo/_search
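
If you would rather issue the same request from code, here is a minimal Python sketch using only the standard library. It assumes the endpoint and JSON body shown above; the helper names build_search_request and run_search are my own, not part of Valo.

```python
import json
import urllib.request

# Endpoint copied from the curl command above; adjust host and port to your setup.
SEARCH_URL = "http://localhost:8888/ssr/demo/_search"


def build_search_request(query_body, url=SEARCH_URL):
    """Build (but do not send) the POST request for a search document."""
    payload = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(query_body, url=SEARCH_URL):
    """Send the search and return the decoded JSON response (needs a running Valo node)."""
    request = build_search_request(query_body, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Same content as search.json.
takeoff_query = {
    "uris": ["/streams/demo/infrastructure/apache"],
    "query": {"base": "search('takeoff')"},
}
```

Calling run_search(takeoff_query) against a running node should be equivalent to the curl command; building the request separately makes it easy to inspect what will be sent before anything goes over the wire.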

It turns out there is only one HTTP log event containing the search term ‘takeoff’:

{
    "count": 1,
    "atLeastNMoreResults": 0,
    "items": [
        {
            "uri": "/streams/demo/infrastructure/apache",
            "score": 0.6154206395149231,
            "data": {
                "message": "ix-sd6-10.ix.netcom.com - - [01/Jul/1995:03:33:25 -0400] \"GET /htbin/wais.pl?takeoff HTTP/1.0\" 200 6898",
                "path": "/Users/jpatani/Desktop/sample-apache-log 2",
                "host": "ITRSLP101",
                "clientip": "ix-sd6-10.ix.netcom.com",
                "ident": "-",
                "auth": "-",
                "timestamp": "01/Jul/1995:03:33:25 -0400",
                "verb": "GET",
                "request": "/htbin/wais.pl?takeoff",
                "httpversion": 1.0,
                "response": 200,
                "bytes": 6898,
                "utctimestamp": "1995-07-01T07:33:25Z",
                "version": 1
            }
        }
    ]
}
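
As a rough illustration of how a client might consume this response, the Python sketch below parses a trimmed copy of the JSON above and prints one summary line per hit. The function name summarise_hits is hypothetical, and the sketch assumes every item carries the "data" fields shown in this example.

```python
import json

# Trimmed copy of the response shown above (only some "data" fields kept).
response_text = """
{
    "count": 1,
    "atLeastNMoreResults": 0,
    "items": [
        {
            "uri": "/streams/demo/infrastructure/apache",
            "score": 0.6154206395149231,
            "data": {
                "clientip": "ix-sd6-10.ix.netcom.com",
                "verb": "GET",
                "request": "/htbin/wais.pl?takeoff",
                "response": 200,
                "bytes": 6898,
                "utctimestamp": "1995-07-01T07:33:25Z"
            }
        }
    ]
}
"""


def summarise_hits(response):
    """Return one short line per hit: timestamp, client, verb, request, status."""
    lines = []
    for item in response["items"]:
        data = item["data"]
        lines.append(
            f'{data["utctimestamp"]} {data["clientip"]} '
            f'{data["verb"]} {data["request"]} -> {data["response"]}'
        )
    return lines


response = json.loads(response_text)
summary = summarise_hits(response)
for line in summary:
    print(line)
# prints: 1995-07-01T07:33:25Z ix-sd6-10.ix.netcom.com GET /htbin/wais.pl?takeoff -> 200
```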

We can also make the search a little more sophisticated if we change 'search.json' to the following:

{
  "uris"  : [
    "/streams/demo/infrastructure/apache"
  ],
  "query": {
    "base": "search('404 orbit*')",
    "filters": [
    { "type": "range", "field": "utctimestamp", "from": "1995-07-01", "fromInclusive": true, "to": "1995-07-02", "toInclusive": true }
    ],
    "count":10
  }
}

In the above search, we are looking for 404 errors whose message contains a term beginning with orbit (the asterisk in the search term denotes a wildcard). I've also added two restrictions. The first is a time range filter, which limits the results to events between the first and second of July 1995, with both bounds inclusive. The second is the count setting, which caps the output at 10 results.

We can also combine the search with further conditions in the query itself using the '&&' or '||' operators:
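
If you generate such documents from code, a small helper keeps the structure in one place. The sketch below is only an illustration: build_query and its parameters are names I made up, and it assumes nothing about Valo beyond the JSON shape shown in this post, using the dates described in the text.

```python
import json


def build_query(stream_uri, base, time_from=None, time_to=None, count=None):
    """Assemble a search document in the shape used in this post."""
    query = {"base": base}
    if time_from is not None and time_to is not None:
        # Inclusive range filter on the event timestamp field.
        query["filters"] = [
            {
                "type": "range",
                "field": "utctimestamp",
                "from": time_from,
                "fromInclusive": True,
                "to": time_to,
                "toInclusive": True,
            }
        ]
    if count is not None:
        # Cap on the number of results returned.
        query["count"] = count
    return {"uris": [stream_uri], "query": query}


document = build_query(
    "/streams/demo/infrastructure/apache",
    "search('404 orbit*')",
    time_from="1995-07-01",
    time_to="1995-07-02",
    count=10,
)
print(json.dumps(document, indent=2))
```

The printed document can be saved as 'search.json' and posted exactly as before.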

{
  "uris"  : [
    "/streams/demo/infrastructure/apache"
  ],
  "query": {
    "base": "search('gif') && verb == 'GET' && hour(utctimestamp) > 22"
  }
}

The above query searches for the keyword 'gif' in events where the 'verb' field is GET and the hour of 'utctimestamp' is greater than 22, that is, events from 11pm onwards (assuming hour() returns the hour in 24-hour form).

Beyond searching for keywords, there are other ways to make life easier when searching, such as taxonomies. Taxonomies are a way to classify documents when they are indexed to allow users to easily drill down into search results. I'll show you some simple examples to illustrate this in a future blog post.
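
To make the logic of that expression concrete, here is a local Python re-statement of the three conditions. It is not how Valo evaluates the query: the substring check is a simplification of a tokenised full-text search, hour is assumed to be in the range 0 to 23, and the three sample events are made up purely for illustration.

```python
from datetime import datetime


def matches(event):
    """Local re-statement of: search('gif') && verb == 'GET' && hour(utctimestamp) > 22."""
    timestamp = datetime.strptime(event["utctimestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return (
        "gif" in event["message"].lower()  # simplified stand-in for full-text search
        and event["verb"] == "GET"
        and timestamp.hour > 22  # only hour 23 satisfies this
    )


# Hypothetical events, not taken from the NASA sample.
events = [
    {"message": "GET /images/logo.gif", "verb": "GET", "utctimestamp": "1995-07-01T23:15:00Z"},
    {"message": "GET /images/logo.gif", "verb": "GET", "utctimestamp": "1995-07-01T22:45:00Z"},
    {"message": "POST /images/logo.gif", "verb": "POST", "utctimestamp": "1995-07-01T23:30:00Z"},
]

matching = [event["utctimestamp"] for event in events if matches(event)]
print(matching)  # prints ['1995-07-01T23:15:00Z']
```

Note that the 22:45 event is excluded: its hour is 22, which is not greater than 22.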

A simple front-end interface

To avoid using the command line, I also built a very rudimentary front end, which lets you search through the logs in a more user-friendly way. The code is available on GitHub.

What are other uses of text search?

There are plenty of other use cases where full-text search is relevant, even outside of analytics: for example, blog search and e-commerce product search. You can even follow our Twitter streaming tutorial to store thousands of tweets and build a tweet search engine!