构建大型单词到索引到单词字典的最有效数据结构是什么?

问题描述 投票:0回答:2

我想索引非常多的字符串(将每个字符串映射到数值),但也能够从其数字索引中检索每个字符串。

由于内存问题,使用散列表或python dict不是一个选项所以我决定使用radix trie来存储字符串,我可以非常快速地检索任何字符串的索引并处理大量的字符串。

我的问题是我还需要从它们的数字索引中检索字符串,如果我维护一个“反向索引”列表[string1,string2,...,stringn],我将失去Trie的内存优势。

我想也许“反向索引”可能是指向某种Trie结构的最后一个节点的指针列表,但首先,python中没有指针,其次我不确定我是否可以拥有“节点级” “访问我目前正在使用的Trie结构。

这种数据结构是否已经存在?如果不是,你会如何在python中做到这一点?

python dictionary data-structures trie natural-language-processing
2个回答
2
投票

根据What data structure to use to have O(log n) key AND value lookup?,您需要两个用于键和值查找的同步数据结构,每个结构都保存对另一个叶节点的引用。

ID查找的结构可以是任何具有足够效率的结构 - 平衡树,哈希表,另一个trie。

为了能够从叶节点引用中提取值,trie需要允许1)叶节点引用本身(不一定是真正的Python引用,它的API可以使用的任何东西); 2)走上特里从该参考中提取单词。

请注意,引用实际上是一个唯一的整数,因此如果您的ID不大于整数,则将某些内容重用为ID是有意义的 - 例如trie节点引用自己。然后,如果特里API可以验证这样的引用(即告诉它是否有一个带有这样引用的已使用节点),这将作为ID查找,你根本不需要第二个结构!这样,虽然'cuz引用值(有效的内存地址)在进程和运行之间发生变化,但ID将是非持久的。


0
投票

我正在回答自己,因为我最终创建了自己的数据结构,非常适合我使用python3内置函数的单词到索引到单词的问题。

我试图让它干净而有效,但显然还有改进的余地,C绑定会更好。

所以最终结果是一个indexedtrie类,它看起来像一个python dict(如果用default_factory参数调用它,则是defaultdict),但也可以像列表那样查询,因为会自动维护一种“反向索引”。

存储在内部基数trie中的键可以是任何可订阅对象(字节,字符串,元组,列表)以及要在其中存储任何内容的值。

indextrie类也是可选择的,你可以从radix尝试关于“前缀搜索”和这类事情的优势中受益!

trie中的每个键都与一个唯一的整数索引相关联,您可以使用索引检索键或使用键检索索引,整个过程快速且内存安全,所以我个人认为这是最好的数据结构之一。世界,它应该集成在python标准库:)。

足够的说话,这是代码,随意适应和使用它:

"""
A Python3 indexed trie class.
An indexed trie's key can be any subscriptable object. 
Keys of the indexed trie are stored using a "radix trie", a space-optimized data-structure which has many advantages (see https://en.wikipedia.org/wiki/Radix_tree).
Also, each key in the indexed trie is associated to a unique index which is build dynamically.

Indexed trie is used like a python dictionary (and even a collections.defaultdict if you want to) but its values can also be accessed or updated (but not created) like a list!

Example:
    >>> t = indextrie()
    >>> t["abc"] = "hello"
    >>> t[0]
    'hello'
    >>> t["abc"]
    'hello'
    >>> t.index2key(0)
    'abc'
    >>> t.key2index("abc")
    0
    >>> t[:]
    [0]
    >>> print(t)
    {(0, 'abc'): hello}
"""

__author__ = "@fbparis"

_SENTINEL = object()

class _Node(object):
    """
    A single node in the trie.
    """
    __slots__ = "_children", "_parent", "_index", "_key"

    def __init__(self, key, parent, index=None):
        self._children = set()
        self._key = key
        self._parent = parent
        self._index = index
        self._parent._children.add(self)

class IndexedtrieKey(object):
    """
    A pair (index, key) acting as an indexedtrie's key
    """
    __slots__ = "index", "key"

    def __init__(self, index, key):
        self.index = index
        self.key = key

    def __repr__(self):
        return "(%d, %s)" % (self.index, self.key)

class indexedtrie(object):
    """
    The indexed trie data-structure.
    """
    __slots__ = "_children", "_indexes", "_values", "_nodescount", "_default_factory"

    def __init__(self, items=None, default_factory=_SENTINEL):
        """
        A list of items can be passed to initialize the indexed trie.
        """
        self._children = set()
        self.setdefault(default_factory)
        self._indexes = []
        self._values = []
        self._nodescount = 0 # keeping track of nodes count is purely informational
        if items is not None:
            for k, v in items:
                if isinstance(k, IndexedtrieKey):
                    self.__setitem__(k.key, v)
                else:
                    self.__setitem__(k, v)

    @classmethod
    def fromkeys(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):
        """
        Build a new indexedtrie from a list of keys.
        """
        obj = cls(default_factory=default_factory)
        for key in keys:
            if value is _SENTINEL:
                if default_factory is not _SENTINEL:
                    obj[key] = obj._default_factory()
                else:
                    obj[key] = None
            else:
                obj[key] = value
        return obj

    @classmethod
    def fromsplit(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):
        """
        Build a new indexedtrie from a splitable object.
        """
        obj = cls(default_factory=default_factory)
        for key in keys.split():
            if value is _SENTINEL:
                if default_factory is not _SENTINEL:
                    obj[key] = obj._default_factory()
                else:
                    obj[key] = None
            else:
                obj[key] = value
        return obj

    def setdefault(self, factory=_SENTINEL):
        """
        """
        if factory is not _SENTINEL:
            # indexed trie will act like a collections.defaultdict except in some cases because the __missing__
            # method is not implemented here (on purpose).
            # That means that simple lookups on a non existing key will return a default value without adding
            # the key, which is the more logical way to do.
            # Also means that if your default_factory is for example "list", you won't be able to create new
            # items with "append" or "extend" methods which are updating the list itself.
            # Instead you have to do something like trie["newkey"] += [...]
            try:
                _ = factory()
            except TypeError:
                # a default value is also accepted as default_factory, even "None"
                self._default_factory = lambda: factory
            else:
                self._default_factory = factory
        else:
            self._default_factory = _SENTINEL

    def copy(self):
        """
        Return a pseudo-shallow copy of the indexedtrie.
        Keys and nodes are deepcopied, but if you store some referenced objects in values, only the references will be copied.
        """
        return self.__class__(self.items(), default_factory=self._default_factory)

    def __len__(self):
        return len(self._indexes)

    def __repr__(self):
        if self._default_factory is not _SENTINEL:
            default = ", default_value=%s" % self._default_factory()
        else:
            default = ""
        return "<%s object at %s: %d items, %d nodes%s>" % (self.__class__.__name__, hex(id(self)), len(self), self._nodescount, default)

    def __str__(self):
        ret = ["%s: %s" % (k, v) for k, v in self.items()]
        return "{%s}" % ", ".join(ret)

    def __iter__(self):
        return self.keys()

    def __contains__(self, key_or_index):
        """
        Return True if the key or index exists in the indexed trie.
        """
        if isinstance(key_or_index, IndexedtrieKey):
            return key_or_index.index >= 0 and key_or_index.index < len(self)
        if isinstance(key_or_index, int):
            return key_or_index >= 0 and key_or_index < len(self)
        if self._seems_valid_key(key_or_index):
            try:
                node = self._get_node(key_or_index)
            except KeyError:
                return False
            else:
                return node._index is not None
        raise TypeError("invalid key type")

    def __getitem__(self, key_or_index):
        """
        """
        if isinstance(key_or_index, IndexedtrieKey):
            return self._values[key_or_index.index]
        if isinstance(key_or_index, int) or isinstance(key_or_index, slice):
            return self._values[key_or_index]
        if self._seems_valid_key(key_or_index):
            try:
                node = self._get_node(key_or_index)
            except KeyError:
                if self._default_factory is _SENTINEL:
                    raise
                else:
                    return self._default_factory()
            else:
                if node._index is None:
                    if self._default_factory is _SENTINEL:
                        raise KeyError
                    else:
                        return self._default_factory()
                else:
                    return self._values[node._index]            
        raise TypeError("invalid key type")

    def __setitem__(self, key_or_index, value):
        """
        """
        if isinstance(key_or_index, IndexedtrieKey):
            self._values[key_or_index.index] = value
        elif isinstance(key_or_index, int):
            self._values[key_or_index] = value
        elif isinstance(key_or_index, slice):
            raise NotImplementedError
        elif self._seems_valid_key(key_or_index):
            try:
                node = self._get_node(key_or_index)
            except KeyError:
                # create a new node
                self._add_node(key_or_index, value)
            else:
                if node._index is None:
                    # if node exists but not indexed, we index it and update the value
                    self._add_to_index(node, value)
                else:
                    # else we update its value
                    self._values[node._index] = value
        else:
            raise TypeError("invalid key type")

    def __delitem__(self, key_or_index):
        """
        """
        if isinstance(key_or_index, IndexedtrieKey):
            node = self._indexes[key_or_index.index]
        elif isinstance(key_or_index, int):
            node = self._indexes[key_or_index]
        elif isinstance(key_or_index, slice):
            raise NotImplementedError
        elif self._seems_valid_key(key_or_index):
            node = self._get_node(key_or_index)
            if node._index is None:
                raise KeyError
        else:
            raise TypeError("invalid key type")
        # switch last index with deleted index (except if deleted index is last index)
        last_node, last_value = self._indexes.pop(), self._values.pop()
        if node._index != last_node._index:
            last_node._index = node._index
            self._indexes[node._index] = last_node
            self._values[node._index] = last_value
        if len(node._children) > 1:
            #case 1: node has more than 1 child, only turn index off
            node._index = None
        elif len(node._children) == 1:
            # case 2: node has 1 child
            child = node._children.pop()
            child._key = node._key + child._key
            child._parent = node._parent
            node._parent._children.add(child)
            node._parent._children.remove(node)
            del(node)
            self._nodescount -= 1
        else:
            # case 3: node has no child, check the parent node
            parent = node._parent
            parent._children.remove(node)
            del(node)
            self._nodescount -= 1
            if hasattr(parent, "_index"):
                if parent._index is None and len(parent._children) == 1:
                    node = parent._children.pop()
                    node._key = parent._key + node._key
                    node._parent = parent._parent
                    parent._parent._children.add(node)
                    parent._parent._children.remove(parent)
                    del(parent)
                    self._nodescount -= 1

    @staticmethod
    def _seems_valid_key(key):
        """
        Return True if "key" can be a valid key (must be subscriptable).
        """
        try:
            _ = key[:0]
        except TypeError:
            return False
        return True

    def keys(self, prefix=None):
        """
        Yield keys stored in the indexedtrie where key is a IndexedtrieKey object.
        If prefix is given, yield only keys of items with key matching the prefix.
        """
        if prefix is None:
            for i, node in enumerate(self._indexes):
                yield IndexedtrieKey(i, self._get_key(node))
        else:
            if self._seems_valid_key(prefix):
                empty = prefix[:0]
                children = [(empty, prefix, child) for child in self._children]
                while len(children):
                    _children = []
                    for key, prefix, child in children:
                        if prefix == child._key[:len(prefix)]:
                            _key = key + child._key
                            _children.extend([(_key, empty, _child) for _child in child._children])
                            if child._index is not None:
                                yield IndexedtrieKey(child._index, _key)
                        elif prefix[:len(child._key)] == child._key:
                            _prefix = prefix[len(child._key):]
                            _key = key + prefix[:len(child._key)]
                            _children.extend([(_key, _prefix, _child) for _child in child._children])
                    children = _children
            else:
                raise ValueError("invalid prefix type")

    def values(self, prefix=None):
        """
        Yield values stored in the indexedtrie.
        If prefix is given, yield only values of items with key matching the prefix.
        """
        if prefix is None:
            for value in self._values:
                yield value
        else:
            for key in self.keys(prefix):
                yield self._values[key.index]

    def items(self, prefix=None):
        """
        Yield (key, value) pairs stored in the indexedtrie where key is a IndexedtrieKey object.
        If prefix is given, yield only (key, value) pairs of items with key matching the prefix.
        """
        for key in self.keys(prefix):
            yield key, self._values[key.index]

    def show_tree(self, node=None, level=0):
        """
        Pretty print the internal trie (recursive function).
        """
        if node is None:
            node = self
        for child in node._children:
            print("-" * level + "<key=%s, index=%s>" % (child._key, child._index))
            if len(child._children):
                self.show_tree(child, level + 1)

    def _get_node(self, key):
        """
        Return the node associated to key or raise a KeyError.
        """
        children = self._children
        while len(children):
            notfound = True
            for child in children:
                if key == child._key:
                    return child
                if child._key == key[:len(child._key)]:
                    children = child._children
                    key = key[len(child._key):]
                    notfound = False
                    break
            if notfound:
                break
        raise KeyError

    def _add_node(self, key, value):
        """
        Add a new key in the trie and updates indexes and values.
        """
        children = self._children
        parent = self
        moved = None
        done = len(children) == 0
        # we want to insert key="abc"
        while not done:
            done = True
            for child in children:
                # assert child._key != key # uncomment if you don't trust me
                if child._key == key[:len(child._key)]:
                    # case 1: child's key is "ab", insert "c" in child's children 
                    parent = child
                    children = child._children
                    key = key[len(child._key):]
                    done = len(children) == 0
                    break
                elif key == child._key[:len(key)]:
                    # case 2: child's key is "abcd", we insert "abc" in place of the child
                    # child's parent will be the inserted node and child's key is now "d"
                    parent = child._parent
                    moved = child
                    parent._children.remove(moved)
                    moved._key = moved._key[len(key):]
                    break
                elif type(key) is type(child._key): # don't mess it up
                    # find longest common prefix
                    prefix = key[:0]
                    for i, c in enumerate(key):
                        if child._key[i] != c:
                            prefix = key[:i]
                            break
                    if prefix:
                        # case 3: child's key is abd, we spawn a new node with key "ab"
                        # to replace child ; child's key is now "d" and child's parent is 
                        # the new created node.
                        # the new node will also be inserted as a child of this node
                        # with key "c"
                        node = _Node(prefix, child._parent)
                        self._nodescount += 1
                        child._parent._children.remove(child)
                        child._key = child._key[len(prefix):]
                        child._parent = node
                        node._children.add(child)
                        key = key[len(prefix):]
                        parent = node
                        break
        # create the new node
        node = _Node(key, parent)
        self._nodescount += 1
        if moved is not None:
            # if we have moved an existing node, update it
            moved._parent = node
            node._children.add(moved)
        self._add_to_index(node, value)

    def _get_key(self, node):
        """
        Rebuild key from a terminal node.
        """
        key = node._key
        while node._parent is not self:
            node = node._parent
            key = node._key + key
        return key

    def _add_to_index(self, node, value):
        """
        Add a new node to the index.
        Also record its value.
        """
        node._index = len(self)
        self._indexes.append(node)
        self._values.append(value)

    def key2index(self, key):
        """
        key -> index
        """
        if self._seems_valid_key(key):
            node = self._get_node(key)
            if node._index is not None:
                return node._index
            raise KeyError
        raise TypeError("invalid key type")

    def index2key(self, index):
        """
        index or IndexedtrieKey -> key.
        """
        if isinstance(index, IndexedtrieKey):
            index = index.index
        elif not isinstance(index, int):
            raise TypeError("index must be an int")
        if index < 0 or index > len(self._indexes):
            raise IndexError
        return self._get_key(self._indexes[index])
© www.soinside.com 2019 - 2024. All rights reserved.