Semi-structured repository

The semi-structured repository (SSR) is based on Lucene, which provides very powerful text search capabilities. Although Lucene is geared towards indexing text, it also has very good index support for numerical data. However, if the data contains only numerical fields, the TSR might be a better fit as a repository.

In contrast to other systems based on Lucene, the SSR does not use the index to store the raw payloads. Instead, it stores the payloads separately in a highly efficient and compacting data store. The data in the data store is indexed in Lucene to provide fast execution of historical queries.
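The separation between payload storage and the index can be pictured with a toy model. This is only a conceptual sketch, not how the SSR is implemented: the class `ToyRepository`, its methods, and the whitespace tokenisation are all hypothetical. It shows an index that holds only document ids, while the raw payloads are kept in a separate store and fetched after the index lookup.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ToyRepository:
    """Toy model: payloads live in one store, the index only holds document ids."""

    def __init__(self) -> None:
        # Stand-in for the separate payload data store.
        self.payloads: List[dict] = []
        # Stand-in for the index: (field, term) -> ids of matching documents.
        self.index: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def add(self, payload: dict) -> int:
        doc_id = len(self.payloads)
        self.payloads.append(payload)
        for field, value in payload.items():
            # Assumption: lower-cased whitespace tokens, purely for illustration.
            for term in str(value).lower().split():
                self.index[(field, term)].add(doc_id)
        return doc_id

    def search(self, field: str, term: str) -> List[dict]:
        # Look up ids in the index, then fetch the raw payloads from the store.
        doc_ids = sorted(self.index.get((field, term.lower()), set()))
        return [self.payloads[i] for i in doc_ids]


repo = ToyRepository()
repo.add({"host": "ITRSPC173", "message": "Disk almost full"})
repo.add({"host": "ITRSPC174", "message": "Service restarted"})
hits = repo.search("message", "disk")
```

Here `hits` contains the full first payload even though the index itself never stored it.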

By default, the SSR indexes all fields in a payload, including fields that are not present in the schema. Fields without a schema mapping are indexed using a default index that best suits the data type. The correct schema type can be set at a later point by updating the schema. Although fields that are not defined in the schema cannot be used in a query, they can be used in a native SSR search. See Search API (SSR).

Indexing of fields can be disabled via the SSR config. Doing this reduces the overall space required to index payloads in the SSR, but it introduces a performance overhead at query time for any unindexed field that is used in a query.

The SSR stores and indexes data that can be represented as a JSON-like document. A document can represent a line in a log file, an operating system event, a trade, etc. Documents can be nested and can contain collections, as the following example illustrates:

{
    "id" : "9a2df62c-f97d-4971-932d-0099bc4efa49",
    "host" : {
      "name" : "ITRSPC173",
      "ip"   : "192.168.220.43"
    },
    "os" : {
      "name"          : "Windows",
      "version"       : "8.1",
      "manufacturer"  : "Microsoft"
    },
    "processors" : {
      "count" : 2,
      "items" : [
        { "model" : "Core i7" },
        { "model" : "Core i7" }
      ]
    }
}
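Nested fields in a document like this are referred to by dotted paths (the repository config later on this page uses `os.name`). The sketch below flattens the example document into such paths. It is illustrative only: the helper `flatten` is hypothetical, and the way collection items are addressed (sharing the path of the collection) is an assumption, not documented SSR behavior.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(document: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf value in a nested document."""
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, list):
            # Assumption: items in a collection share the path of the collection.
            for item in value:
                if isinstance(item, dict):
                    yield from flatten(item, path)
                else:
                    yield path, item
        else:
            yield path, value


document = {
    "id": "9a2df62c-f97d-4971-932d-0099bc4efa49",
    "host": {"name": "ITRSPC173", "ip": "192.168.220.43"},
    "os": {"name": "Windows", "version": "8.1", "manufacturer": "Microsoft"},
    "processors": {"count": 2, "items": [{"model": "Core i7"}, {"model": "Core i7"}]},
}

fields = list(flatten(document))
```

Under these assumptions `fields` contains pairs such as `("host.ip", "192.168.220.43")` and two entries for `processors.items.model`.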

Repository configuration

Storing documents in the repository is enabled by specifying a repository mapping as outlined in Streams API. At a minimum, the following mapping should be specified:

{
  "name" : "ssr"
}

By default, the repository will store all documents and index all fields in the document. If a stream schema is defined it will apply the appropriate index type and analyzer. If no schema is defined, it will use a default index type appropriate to the native field type.

The index behavior for a particular stream can be overridden in the repository config:

{
  "name" : "ssr",
  "config" : {
    "defaultAnalyser" : "DefaultAnalyser",
    "fields" : [
      { "field": "id", "analyzer": "KeywordAnalyser" },
      { "field": "os.name", "index": false, "store": true  }
    ]
  }
}

The defaultAnalyser setting specifies the default analyzer to use for string fields when none is specified.

Note

A field that has a schema mapping might not use the default analyzer, because it will use the analyzer that best suits the schema type. See Index fields for the default values used by each schema field.

The indexing behavior of each payload field can be overridden by listing the field in the fields section. The analyzer field overrides the default analyzer. If index is set to false, the field will not be indexed but will still be available in the original payload. If store is set to true, the value will be stored in an efficient columnar format in the index. Storing the value in the index enables very fast histograms to be executed within the repository.

Note

Not storing the field in the index does not mean the field cannot be retrieved. The field will still be available in the original payload, which is stored separately from the index. What it means is that the repository cannot perform certain optimized queries, such as a histogram, on that particular field.
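The override rules described above can be sketched as a small lookup. This is a hedged illustration rather than the repository's actual resolution logic: the function `field_settings` and the `STRING_DEFAULTS` table are hypothetical, the keys are copied from the example config, and the String defaults are taken from the Index fields section.

```python
from typing import Any, Dict

# The example repository mapping from this page, expressed as a Python dict.
repository_mapping: Dict[str, Any] = {
    "name": "ssr",
    "config": {
        "defaultAnalyser": "DefaultAnalyser",
        "fields": [
            {"field": "id", "analyzer": "KeywordAnalyser"},
            {"field": "os.name", "index": False, "store": True},
        ],
    },
}

# Documented defaults for a String field: indexed, not stored.
STRING_DEFAULTS = {"index": True, "store": False}


def field_settings(mapping: Dict[str, Any], field: str,
                   type_defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Combine the type defaults with any per-field override in the config."""
    config = mapping.get("config", {})
    settings = {"analyzer": config.get("defaultAnalyser"), **type_defaults}
    for override in config.get("fields", []):
        if override["field"] == field:
            # Explicit per-field values win over the defaults.
            settings.update({k: v for k, v in override.items() if k != "field"})
    return settings


os_name = field_settings(repository_mapping, "os.name", STRING_DEFAULTS)
host_name = field_settings(repository_mapping, "host.name", STRING_DEFAULTS)
```

With this config, `os_name` ends up not indexed but stored, while `host_name` keeps the String defaults because it has no override.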

Index fields

This section outlines how the schema maps to index types. The settings for indexed, stored, and analyzer listed for each type are the default settings. These can be overridden in the repository config, but in most cases the defaults are well suited to the type.

If a schema is defined, the repository will use the appropriate index types. The schema below only partially defines the example JSON document shown earlier. Payload fields with no schema mapping will be indexed using a default index type that best matches the raw payload type.

{
 "version": "1.0.0",
 "topDef": {
   "type": "record",
   "properties": {
     "id" : { "type": "contributor" },
     "host" : {
         "type": "record",
          "properties": {
            "name" : { "type": "string" },
            "ip" : { "type": "ip" }
          }
     }
   }
 }
}
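To see which payload fields this partial schema covers, the sketch below walks the schema and compares it with the dotted field paths of the example document. The helper `schema_fields` and the hard-coded `payload_fields` list are hypothetical and for illustration only; the repository performs its own mapping internally.

```python
from typing import Any, Dict

schema: Dict[str, Any] = {
    "version": "1.0.0",
    "topDef": {
        "type": "record",
        "properties": {
            "id": {"type": "contributor"},
            "host": {
                "type": "record",
                "properties": {
                    "name": {"type": "string"},
                    "ip": {"type": "ip"},
                },
            },
        },
    },
}


def schema_fields(definition: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Return {dotted_path: schema_type} for every non-record property."""
    result: Dict[str, str] = {}
    for name, prop in definition.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if prop.get("type") == "record":
            result.update(schema_fields(prop, path))
        else:
            result[path] = prop["type"]
    return result


mapped = schema_fields(schema["topDef"])

# Dotted paths of the example document, listed by hand for this sketch.
payload_fields = [
    "id", "host.name", "host.ip",
    "os.name", "os.version", "os.manufacturer",
    "processors.count", "processors.items.model",
]
unmapped = [field for field in payload_fields if field not in mapped]
```

The fields in `unmapped` (the `os` and `processors` fields) are the ones that would fall back to a default index type.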

String

Indexed: true, Stored: false, Analyzer: default

The search function applies a text search on the field. See Search Syntax Documentation.

Example: from cpu where search(host, "ABC123")

prefix(field, pattern)

Matches on fields starting with the specified pattern.

Example: from cpu where prefix(host, "ABC")

Byte

Indexed: true, Stored: true

Short

Indexed: true, Stored: true

Int

Indexed: true, Stored: true

Long

Indexed: true, Stored: true

Double

Indexed: true, Stored: true

Boolean

Indexed: true, Stored: true

Duration

Indexed: true, Stored: true

Contributor

Indexed: true, Stored: true, Analyzer: LowerCaseKeywordAnalyzer

Supports the same functions as a string index.
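The choice of analyzer determines which terms end up in the index. The sketch below contrasts a lower-casing keyword style of analysis, where the whole value becomes a single term, with a word-splitting style. Both functions are hypothetical stand-ins written for illustration. They are assumptions about what the analyzer names suggest, not the actual implementation of LowerCaseKeywordAnalyzer or of the default analyzer.

```python
import re
from typing import List


def lowercase_keyword_tokens(value: str) -> List[str]:
    """Keyword-style analysis: the entire value is kept as one lower-cased term."""
    return [value.lower()]


def word_tokens(value: str) -> List[str]:
    """Word-style analysis (assumed): split on non-alphanumerics, lower-case each term."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", value.lower()) if token]


keyword_terms = lowercase_keyword_tokens("ITRSPC173-Agent")
word_terms = word_tokens("ITRSPC173-Agent")
```

Under these assumptions, an identifier-like value stays intact as one term with keyword-style analysis, which suits exact and prefix matching on ids, whereas word-style analysis breaks it into separately searchable terms.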

Datetime

Indexed: true, Stored: true

year(field)

Extracts the year from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby year(tradedate) sum(value)

month(field)

Extracts the month from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby month(tradedate) sum(value)

day(field)

Extracts the day from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby day(tradedate) sum(value)

hour(field)

Extracts the hour from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby hour(tradedate) sum(value)

minute(field)

Extracts the minute from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby minute(tradedate) sum(value)

second(field)

Extracts the second from a datetime. Requires the field to be stored in the index.

Example: from cpu groupby second(tradedate) sum(value)
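What a query that groups by an extracted datetime part computes can be sketched in plain Python. The sample rows and the helper `group_sum` are made up for illustration and do not reflect how the repository executes the query. The sketch only shows the grouping and summing that the example queries above describe.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List

# Hypothetical sample rows with a datetime field and a numeric value.
trades: List[dict] = [
    {"tradedate": datetime(2014, 3, 1, 9, 30), "value": 10.0},
    {"tradedate": datetime(2014, 11, 5, 14, 0), "value": 5.0},
    {"tradedate": datetime(2015, 1, 20, 9, 45), "value": 7.5},
]


def group_sum(rows: List[dict], part: str) -> Dict[int, float]:
    """Group rows by one part of tradedate (e.g. 'year', 'hour') and sum value."""
    totals: Dict[int, float] = defaultdict(float)
    for row in rows:
        totals[getattr(row["tradedate"], part)] += row["value"]
    return dict(totals)


by_year = group_sum(trades, "year")  # analogous to: groupby year(tradedate) sum(value)
by_hour = group_sum(trades, "hour")  # analogous to: groupby hour(tradedate) sum(value)
```

For these rows, `by_year` is `{2014: 15.0, 2015: 7.5}` and `by_hour` is `{9: 17.5, 14: 5.0}`.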

Date

Indexed: true, Stored: true

year(field)

Extracts the year from a date. Requires the field to be stored in the index.

Example: from cpu groupby year(tradedate) sum(value)

month(field)

Extracts the month from a date. Requires the field to be stored in the index.

Example: from cpu groupby month(tradedate) sum(value)

day(field)

Extracts the day from a date. Requires the field to be stored in the index.

Example: from cpu groupby day(tradedate) sum(value)

Time

Indexed: true, Stored: true

hour(field)

Extracts the hour from a time. Requires the field to be stored in the index.

Example: from cpu groupby hour(tradedate) sum(value)

minute(field)

Extracts the minute from a time. Requires the field to be stored in the index.

Example: from cpu groupby minute(tradedate) sum(value)

second(field)

Extracts the second from a time. Requires the field to be stored in the index.

Example: from cpu groupby second(tradedate) sum(value)

Email

Indexed: true, Stored: true

Ip

Indexed: true, Stored: true

URI

Indexed: true, Stored: true

UUID

Indexed: true, Stored: true