marisa

package module
v1.0.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 15, 2025 License: MIT Imports: 26 Imported by: 0

README

go-marisa

Go Reference Test Attest marisa build

Go bindings for marisa-trie.

MARISA is a read-only space-efficient trie data structure optimized for lookup, reverse lookup, commmon prefix search (keys which are prefixes of the query), and predictive search (keys starting with the query).

This library wraps a WebAssembly build of MARISA using wazero.

Getting started
var trie marisa.Trie
if err := trie.Build(EnglishWords(), marisa.Config{}); err != nil {
    panic(err)
}

id, ok, err := trie.Lookup("iterate")
if err != nil {
    panic(err)
}
if !ok {
    fmt.Println("not found")
}

key, ok, err := trie.ReverseLookup(id)
if err != nil {
    panic(err)
}
if !ok {
    fmt.Println("not found")
}

fmt.Println("l", id, key)

for id, key := range trie.PredictiveSearchSeq("iterat")(&err) {
    fmt.Println("p", id, key)
}
if err != nil {
    panic(err)
}

for id, key := range trie.CommonPrefixSearchSeq("iterated")(&err) {
    fmt.Println("c", id, key)
}
if err != nil {
    panic(err)
}
Limitations

This library supports little-endian MARISA dictionaries up to 4 GiB. On 32-bit systems, the size is limited to around 2 GiB. These are limitations of MARISA itself.

Big-endian dictionaries (i.e., ones generated with the native tools on big-endian hosts) are not supported.

Memory-mapped dictionaries are only supported on unix-like platforms.

Performance

On platforms which support wazero's JIT compiler, it's about 2-3x slower than the native library. Using the interpreter, it's about 70-150x slower.

The memory usage should be around the same other than a ~115K overhead per trie. Keys are copied to/from Go strings when used.

Design

The API is stable, type-safe, idiomatic, and does not leak implementation details of the marisa-trie library. All errors are handled appropriately and returned.

It supports common Go interfaces like encoding.BinaryMarshaler, encoding.BinaryUnmarshaler, encoding.BinaryAppender, io.WriterTo, and io.WriterFrom.

It provides optimized iterable APIs for queries.

Tries are garbage collected automatically along with other Go objects when there are no more references to it or iterators derived from it.

This module also includes drop-in replacements for the native command-line tools. The have compatible input/output and exit codes, but the error messages may differ.

Concurrency

Each trie can be used across goroutines, but must not be used concurrently by multiple goroutines.

Multiple iterators can be active at once, but must not be called concurrently by multiple goroutines with each other or the trie it came from.

Testing

The wasm blob is fully reproducible and verified.

The tries written by this wrapper will be bit-identical to the ones generated by the native marisa-build.

There are comprehensive unit tests for all exposed functionality.

Documentation

Overview

Package marisa contains bindings for the marisa-trie library.

Example (Load)
trie, err := marisa.Open("words.dat")
if err != nil {
	panic(err)
}
fmt.Println(trie)
Example (Marshal)
var trie marisa.Trie
if err := trie.Build(slices.Values(testdata.Words), marisa.Config{}); err != nil {
	panic(err)
}

buf, err := trie.MarshalBinary()
if err != nil {
	panic(err)
}

fmt.Println(trie.Size(), trie.DiskSize(), len(buf))

if err := trie.UnmarshalBinary(buf); err != nil {
	panic(err)
}

fmt.Println(trie.Size())
Output:

466550 1413352 1413352
466550
Example (Save)
var trie marisa.Trie
if err := trie.Build(slices.Values(testdata.Words), marisa.Config{}); err != nil {
	panic(err)
}

f, err := os.Create("words.dat")
if err != nil {
	panic(err)
}
defer f.Close()

if _, err := trie.WriteTo(f); err != nil {
	panic(err)
}

if err := f.Close(); err != nil {
	panic(err)
}

Index

Examples

Constants

View Source
const (
	MinNumTries = int(_MARISA_MIN_NUM_TRIES)
	MaxNumTries = int(_MARISA_MAX_NUM_TRIES)
)

NumTries specifies the number of tries to use. Usually, more tries make a dictionary space-efficient but time-inefficient.

The limits are an implementation detail.

Variables

This section is empty.

Functions

func Initialize

func Initialize() error

Initialize compiles the wasm binary.

This is called implicitly when Trie is used for the first time.

Types

type CacheLevel

type CacheLevel int

CacheLevel specifies the cache size. A larger cache enables faster search but takes a more space.

const (
	HugeCache CacheLevel = iota + 1
	LargeCache
	NormalCache
	SmallCache
	TinyCache
)

func (CacheLevel) String added in v0.0.2

func (c CacheLevel) String() string

type Config

type Config struct {
	NumTries   int
	CacheLevel CacheLevel
	TailMode   TailMode
	NodeOrder  NodeOrder
}

Config specifies options for a dictionary. Any unspecified options will be set to their default.

type Key

type Key struct {
	ID  uint32
	Key string
}

type NodeOrder

type NodeOrder int

NodeOrder specifies the arrangement of nodes, which affects the time cost of matching and the order of predictive search.

const (
	// LabelOrder arranges nodes in ascending label order. LabelOrder is useful
	// if an application needs to predict keys in label order.
	LabelOrder NodeOrder = iota + 1

	// WeightOrder arranges nodes in descending weight order. WeightOrder is
	// generally a better choice because it enables faster matching.
	WeightOrder
)

func (NodeOrder) String added in v0.0.2

func (t NodeOrder) String() string

type TailMode

type TailMode int

TailMode specifies the kind of TAIL implementation.

const (
	// TextTail merges last labels as zero-terminated strings. So, it is
	// available if and only if the last labels do not contain a NULL character.
	// If TextTail is specified and a NULL character exists in the last labels,
	// the setting is automatically switched to MARISA_BINARY_TAIL.
	TextTail TailMode = iota + 1

	// BinaryTail also merges last labels but as byte sequences. It uses a bit
	// vector to detect the end of a sequence, instead of NULL characters. So,
	// BinaryTail requires a larger space if the average length of labels is
	// greater than 8.
	BinaryTail
)

func (TailMode) String added in v0.0.2

func (t TailMode) String() string

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

Trie is a read-only in-memory little-endian MARISA dictionary.

At the moment, it must not be used concurrently. This restriction may be lifted in the future. It's okay to nest iterators or call methods from within one.

On 64-bit systems, the maximum dictionary size is 4GiB. On 32-bit systems, the maximum dictionary size is 2 GiB. Note that if you build/load the same trie twice, it needs twice the amount of memory since it swaps it at the end.

Example (Query)
var trie marisa.Trie
if err := trie.Build(slices.Values(testdata.Words), marisa.Config{}); err != nil {
	panic(err)
}

id, ok, err := trie.Lookup("iterate")
if err != nil {
	panic(err)
}
if !ok {
	fmt.Println("not found")
}

key, ok, err := trie.ReverseLookup(id)
if err != nil {
	panic(err)
}
if !ok {
	fmt.Println("not found")
}

fmt.Println("l", id, key)

for id, key := range trie.PredictiveSearchSeq("iterat")(&err) {
	fmt.Println("p", id, key)
}
if err != nil {
	panic(err)
}

for id, key := range trie.CommonPrefixSearchSeq("iterated")(&err) {
	fmt.Println("c", id, key)
}
if err != nil {
	panic(err)
}
Output:

l 262491 iterate
p 352923 iterative
p 413344 iteratively
p 413345 iterativeness
p 352924 iteration
p 413346 iterations
p 352925 iterating
p 262491 iterate
p 352926 iterated
p 352927 iterately
p 352928 iterates
p 262492 iterator
p 352929 iterator's
p 352930 iterators
c 4 i
c 2235 ite
c 17192 iter
c 262491 iterate
c 352926 iterated

func Load

func Load(r io.Reader) (*Trie, error)

Load is shorthand for initializing a dictionary with Trie.ReadFrom.

func New

func New(b []byte) (*Trie, error)

New is shorthand for initializing a dictionary with Trie.UnmarshalBinary. Using Load with a bytes.Reader may result in a more optimal in-memory layout.

func Open

func Open(name string) (*Trie, error)

Open opens a dictionary from a file.

func (*Trie) AppendBinary

func (t *Trie) AppendBinary(b []byte) ([]byte, error)

func (*Trie) Build

func (t *Trie) Build(keys iter.Seq[string], cfg Config) error

Build builds a dictionary out of the specified set of keys, with a weight of 1 for each.

Example
keys := []string{
	"a",
	"a/b",
	"a/b/c",
	"b",
	"b/c",
	"c",
}

var trie marisa.Trie
if err := trie.Build(slices.Values(keys), marisa.Config{}); err != nil {
	panic(err)
}

fmt.Println(&trie)
Output:

marisa.Trie(size=6 io_size=4088 total_size=3403 num_tries=3 num_nodes=7 tail_mode=text node_order=weight)

func (*Trie) BuildWeights

func (t *Trie) BuildWeights(keys iter.Seq2[string, float32], cfg Config) error

BuildWeights builds a dictionary out of the specified set of keys and weights. If a key is specified multiple times, the weights are accumulated.

Example
keys := map[string]float32{
	"a":     1.0,
	"a/b":   1.0,
	"a/b/c": 1.0,
	"b":     1.0,
	"b/c":   5.0,
	"c":     3.5,
}
// note: weights are cumulative and include the weights of all children

cfg := marisa.Config{
	NodeOrder: marisa.WeightOrder,
}

var trie marisa.Trie
if err := trie.BuildWeights(maps.All(keys), cfg); err != nil {
	panic(err)
}

fmt.Println(&trie)

var err error
for _, key := range trie.DumpSeq()(&err) {
	fmt.Println(key)
}
if err != nil {
	panic(err)
}
Output:

marisa.Trie(size=6 io_size=4088 total_size=3403 num_tries=3 num_nodes=7 tail_mode=text node_order=weight)
b
b/c
c
a
a/b
a/b/c

func (*Trie) CommonPrefixSearch

func (t *Trie) CommonPrefixSearch(query string, limit int) ([]Key, error)

CommonPrefixSearchSeq returns keys which equal any prefix of the query string. If the limit is -1, all keys are returned.

func (*Trie) CommonPrefixSearchSeq

func (t *Trie) CommonPrefixSearchSeq(query string) func(*error) iter.Seq2[uint32, string]

CommonPrefixSearchSeq returns keys which equal any prefix of the query string.

func (*Trie) DiskSize

func (t *Trie) DiskSize() uint32

DiskSize returns the serialized size of the dictionary.

func (*Trie) Dump

func (t *Trie) Dump(limit int) ([]Key, error)

Dump dumps all keys. If the limit is -1, all keys are returned.

func (*Trie) DumpSeq

func (t *Trie) DumpSeq() func(*error) iter.Seq2[uint32, string]

DumpSeq dumps all keys.

func (*Trie) Lookup

func (t *Trie) Lookup(key string) (uint32, bool, error)

Lookup checks whether a key is registered or not, returning its ID.

func (*Trie) MapFile

func (t *Trie) MapFile(f *os.File, offset int64, length int64) error

MapFile mmaps a file and loads the dictionary from it. On error, the trie is left unchanged. If not supported by the current platform, an error matching errors.ErrUnsupported is returned.

func (*Trie) MarshalBinary

func (t *Trie) MarshalBinary() ([]byte, error)

MarshalBinary serializes the dictionary.

func (*Trie) NodeOrder added in v0.0.2

func (t *Trie) NodeOrder() NodeOrder

NodeOrder returns the tail mode of the dictionary. If unknown, it returns zero.

func (*Trie) NumNodes

func (t *Trie) NumNodes() uint32

NumNodes returns the number of nodes in the dictionary.

func (*Trie) NumTries

func (t *Trie) NumTries() uint32

NumTries returns the number of tries in the dictionary

func (*Trie) PredictiveSearch

func (t *Trie) PredictiveSearch(query string, limit int) ([]Key, error)

PredictiveSearch returns keys starting with a query string. If the limit is -1, all keys are returned.

func (*Trie) PredictiveSearchSeq

func (t *Trie) PredictiveSearchSeq(query string) func(*error) iter.Seq2[uint32, string]

PredictiveSearch returns keys starting with a query string.

func (*Trie) ReadFrom

func (t *Trie) ReadFrom(r io.Reader) (int64, error)

ReadFrom reads a dictionary from r. On success, it will have read exactly the size of the dictionary. On error, the trie is left unchanged.

func (*Trie) ReverseLookup

func (t *Trie) ReverseLookup(id uint32) (string, bool, error)

ReverseLookup gets a key by its ID.

func (*Trie) Size

func (t *Trie) Size() uint32

Size returns the number of keys in the dictionary. Key are numbered from 0 to Size-1.

func (*Trie) String

func (t *Trie) String() string

String returns a human-readable description of the dictionary.

func (*Trie) TailMode added in v0.0.2

func (t *Trie) TailMode() TailMode

TailMode returns the tail mode of the dictionary. If unknown, it returns zero.

func (*Trie) TotalSize

func (t *Trie) TotalSize() uint32

TotalSize returns the in-memory size of the dictionary.

func (*Trie) UnmarshalBinary

func (t *Trie) UnmarshalBinary(b []byte) error

UnmarshalBinary copies b and maps the trie directly from it. This is faster than Trie.ReadFrom, but may have a less optimal memory layout. On error, the trie is left unchanged.

func (*Trie) WriteTo

func (t *Trie) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the dictionary to w.

Directories

Path Synopsis
cmd
marisa-benchmark command
Command marisa-benchmark is a Go re-implementation of the same command.
Command marisa-benchmark is a Go re-implementation of the same command.
marisa-build command
Command marisa-build is a Go re-implementation of the same command.
Command marisa-build is a Go re-implementation of the same command.
marisa-common-prefix-search command
Command marisa-common-prefix-search is a Go re-implementation of the same command.
Command marisa-common-prefix-search is a Go re-implementation of the same command.
marisa-dump command
Command marisa-dump is a Go re-implementation of the same command.
Command marisa-dump is a Go re-implementation of the same command.
marisa-lookup command
Command marisa-lookup is a Go re-implementation of the same command.
Command marisa-lookup is a Go re-implementation of the same command.
marisa-predictive-search command
Command marisa-predictive-search is a Go re-implementation of the same command.
Command marisa-predictive-search is a Go re-implementation of the same command.
marisa-reverse-lookup command
Command marisa-reverse-lookup is a Go re-implementation of the same command.
Command marisa-reverse-lookup is a Go re-implementation of the same command.
cxxerr
Package cxxerr wraps C++ exceptions.
Package cxxerr wraps C++ exceptions.
walloc
Package walloc implements allocators for wazero.
Package walloc implements allocators for wazero.
wexcept
Package wexcept throws and catches Go and C++ exceptions.
Package wexcept throws and catches Go and C++ exceptions.
wexport
Package wexport contains type-safe wrappers for exporting host functions.
Package wexport contains type-safe wrappers for exporting host functions.
wwrap
Package wwrap wraps wazero modules.
Package wwrap wraps wazero modules.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL