tiktoken 1.0.3

Haskell implementation of tiktoken

  • License: BSD-3-Clause

  • Maintainer: GenuineGabriella@gmail.com

  • Uploader: GabrielGonzalez

  • Lottery factor: 1

    tiktoken

    This is a Haskell implementation of tiktoken that covers only the tokenization logic. In other words, given an existing encoding (such as cl100k_base), you can tokenize a string into smaller strings or into token ranks.

    This means that you can't (yet) use this package to create your own new encodings, but you can use it to consume existing ones. In particular, this comes in handy for prompt engineering, where you want to use as many of the available prompt tokens as possible (which requires counting tokens accurately).
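As a rough illustration of the prompt-budgeting use case, the sketch below keeps as many context chunks as fit within a token budget. It deliberately does not call this package: `standInTokenize`, `countTokens`, and `fitToBudget` are hypothetical names invented for this example, and the whitespace splitter is only a placeholder for a real tokenizer of the target encoding.

```haskell
module Main where

-- | Hypothetical stand-in tokenizer: splits on whitespace. This is NOT how a
-- BPE encoding such as cl100k_base tokenizes; a real program would pass in a
-- tokenizer for the target encoding instead.
standInTokenize :: String -> [String]
standInTokenize = words

-- | Count tokens using whichever tokenizer is supplied.
countTokens :: (String -> [a]) -> String -> Int
countTokens tokenize = length . tokenize

-- | Keep the longest prefix of chunks whose summed token counts fit the budget.
fitToBudget :: (String -> [a]) -> Int -> [String] -> [String]
fitToBudget tokenize budget = go 0
  where
    go _ [] = []
    go used (chunk : rest)
      | used' <= budget = chunk : go used' rest
      | otherwise       = []
      where
        used' = used + countTokens tokenize chunk

main :: IO ()
main = do
  let chunks =
        [ "first context passage here"
        , "second passage"
        , "a third and much longer passage of text"
        ]
  -- With the stand-in tokenizer the chunks cost 4, 2, and 8 tokens, so a
  -- budget of 6 keeps only the first two.
  mapM_ putStrLn (fitToBudget standInTokenize 6 chunks)
```

Summing per-chunk counts is an approximation: with a real byte-pair encoding, tokenizing the concatenated prompt can yield a slightly different total than tokenizing each chunk separately, so an exact check should count tokens on the final assembled prompt.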

    Encoding speed is ≈2.6-3.1 MB/s on an M1 MacBook Pro, using only one core because this package does not yet support parallel tokenization:

    All
      Encode 10 MB of Wikipedia
        r50k_base:   OK (23.88s)
          3.356 s ± 151 ms
        p50k_base:   OK (10.39s)
          3.445 s ±  31 ms
        p50k_edit:   OK (11.13s)
          3.693 s ± 240 ms
        cl100k_base: OK (11.16s)
          3.685 s ± 143 ms
        o200k_base:  OK (11.01s)
          3.648 s ± 134 ms
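To relate the mean times above to a throughput figure, the following sketch divides the corpus size by each mean time. It takes the "10 MB" label at face value and ignores the reported ± spread; `throughputMBps` and `meanTimes` are names made up for this example, not part of the package.

```haskell
module Main where

import Text.Printf (printf)

-- | Throughput in MB/s for a corpus of the given size (MB) encoded in the
-- given wall-clock time (seconds).
throughputMBps :: Double -> Double -> Double
throughputMBps megabytes seconds = megabytes / seconds

-- | Mean encoding times in seconds, copied from the benchmark output above.
meanTimes :: [(String, Double)]
meanTimes =
  [ ("r50k_base",   3.356)
  , ("p50k_base",   3.445)
  , ("p50k_edit",   3.693)
  , ("cl100k_base", 3.685)
  , ("o200k_base",  3.648)
  ]

main :: IO ()
main = mapM_ report meanTimes
  where
    -- Assumes the benchmark corpus is exactly 10 MB, as the label states.
    report :: (String, Double) -> IO ()
    report (name, seconds) =
      printf "%-12s %.2f MB/s\n" name (throughputMBps 10 seconds)
```

For these mean times the computed values fall between roughly 2.7 and 3.0 MB/s, which sits inside the approximate range quoted above once the reported variation is taken into account.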