Thursday, February 25, 2010

Scenes from LA Ruby Conference - 2010 - Data Structures

Take Away

Fascinating talk on using probabilistic data structures to save oodles of search time if a limited number of false indicators can be tolerated.

Introduction

Tyler McMullen - Scribd

Different Data Structures

Why?
  • Speed
  • Memory
  • Clarity
Some very interesting structures
  • Bloom Filter
  • BK-tree
  • Splay Tree
  • Trie
Bloom Filters
  • Tests for existence in a set
  • Probabilistic
  • Minimal memory use
Example: 100million strings in a set
Tradition set: 10gb minimum vs 280mb

How does it work?
Binary sequence. Uses hash
In places where occasional false positives are okay

BK-tree

find items within a distance of a target
reduces search space
works inside a metric space

Triangle Inequality
If we know the distace between 2 of 3 points, then we can make assumptions about the distance between the remaining "unmeasured" two points.

Uses:
  • Most often used for spelling corrections
  • Work in any metric space
  • Reduce the search space.
Splay Tree
  • Self-blancing binary tree
  • Brings most accessed items toward root
  • The more uneven the access pattern, the better the performance.
Good for caches, garbage collectors, etc.

Trie

(pronounced "try")
  • O(1) (order 1) on lookup, add, removal
  • Ordered traversals
  • Prefix matchine
  • Excellent memory management.

Useful as an autocompleter.

Interesting, he implemented this as a rack filter.

No comments:

Post a Comment