Show HN: Chonky – neural text semantic chunking goes multilingual

chonky_mmbert_small_multilingual_1

Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in Retrieval-Augmented Generation (RAG) systems. 🆕 Now multilingual!

Model Description

The model processes text and divides it into semantically coherent segments. These chunks can be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
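As a rough illustration of that pipeline, the sketch below embeds Chonky's chunks and ranks them against a query. It assumes the sentence-transformers package is installed; the embedding model name and the query are placeholders, not part of Chonky.

from sentence_transformers import SentenceTransformer, util

# Chunks produced by the splitter shown later in this card
chunks = ["...chunk 1...", "...chunk 2..."]

# Placeholder multilingual embedding model, not part of Chonky
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

query_embedding = embedder.encode("example query", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
best_chunk = chunks[int(scores.argmax())]  # most relevant chunk for the query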

⚠️ Note: This model was fine-tuned on sequences of length 1024, although the base mmBERT model supports sequence lengths up to 8192 tokens.
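For documents longer than that, one simple workaround (our sketch, not part of the chonky API) is to pre-split the text into windows of at most 1024 tokens and run each window through the splitter shown below. A naive version:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_mmbert_small_multilingual_1")

# Naive windowing: boundaries can fall mid-sentence, so overlapping or
# sentence-aware windows may work better in practice
def token_windows(text, max_tokens=1024):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    for i in range(0, len(ids), max_tokens):
        yield tokenizer.decode(ids[i:i + max_tokens])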

How to Use

A small Python library named chonky provides a convenient interface to the model. Example usage:

from chonky import ParagraphSplitter

# On first run, this will download the transformer model
splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu"
)

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines: "
    "CPU, disk drives, printer, card reader sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
    print(chunk)
    print()

Sample Output:

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines: CPU, disk drives, printer, card reader sitting up on a raised floor under bright fluorescent lights.

Alternative Usage with Standard NER Pipeline

You can also use this model with the standard Named Entity Recognition (NER) pipeline from Hugging Face’s Transformers library:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_mmbert_small_multilingual_1"

# The model was fine-tuned with a 1024-token context, so limit inputs accordingly
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Binary token-classification scheme: "separator" marks the end of a chunk
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# "ner" is an alias for the token-classification pipeline;
# aggregation_strategy="simple" merges sub-word tokens into whole words
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines "
    "CPU, disk drives, printer, card reader sitting up on a raised floor under bright fluorescent lights."
)

print(pipe(text))

Sample Output:

[
  {
    'entity_group': 'separator',
    'score': 0.66304857,
    'word': ' deep',
    'start': 332,
    'end': 337
  }
]
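The pipeline returns only the separator spans, so recovering the chunks themselves takes one more step: cut the text at each span's end offset. The helper below is our own sketch, not part of chonky or Transformers:

# split_at_separators is a hypothetical helper, not a library function
def split_at_separators(text, entities):
    chunks, start = [], 0
    for ent in entities:  # each detected separator marks the end of a chunk
        chunks.append(text[start:ent["end"]].strip())
        start = ent["end"]
    chunks.append(text[start:].strip())  # remainder after the last separator
    return [c for c in chunks if c]

for chunk in split_at_separators(text, pipe(text)):
    print(chunk)
    print()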

Training Data

The model was trained on the paragraph-splitting task using datasets such as MiniPile, BookCorpus, and Project Gutenberg.

Metrics

Performance is measured using token-based F1-score on various English datasets including a Project Gutenberg validation set.
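To make the metric concrete: every token gets a binary label (separator or not), and F1 is computed over the separator class. The snippet below is our reading of that metric, not the authors' evaluation code:

def token_f1(pred, gold):
    # F1 over the positive ("separator") class of per-token binary labels
    tp = sum(p == g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(token_f1([0, 1, 0, 1], [0, 1, 1, 0]))  # 0.5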

Hardware

This model was fine-tuned on a single NVIDIA H100 GPU over several hours.

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
