Splitter#

In the previous guide on DataLoader, we learned how to load documents. Some of these documents can be quite large, making them too big to process directly in LLM operations such as embedding.

To break these documents into processable chunks, we use a Splitter, which splits each Document into a list of processable Node objects.

First, we get an instance of a Splitter and configure it to fit our use case. Then we call its split method, passing a list of Document and getting back a list of Node. Each node's text content is sized according to our configuration.

For example, we will take Paul Graham's essays loaded in the previous DataLoader guide, split them using the sentence splitter, and get back a list of Node objects with the desired text length.
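At a high level, the whole flow is only a few lines. Here is a condensed sketch using the same `get_splitter` and `split` calls that the cells below walk through step by step (it assumes `docs` is the list of Document objects loaded via a DataLoader):

from bodhilib import get_splitter

# configure the splitter for the desired chunk size and overlap
splitter = get_splitter("text_splitter", max_len=300, overlap=30)
# split the list of Document into a list of Node, each carrying a chunk of text
nodes = splitter.split(docs)
print(nodes[0].text)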

Setup#

Ensure bodhilib is installed.

Essential Splitter plugins are packaged along with the bodhilib library. We will use the sentence splitter from these plugins.

[1]:
!pip install -q bodhilib
[2]:
# Load the Paul Graham essays from data/data-loader directory using `file` DataLoader
import os
from pathlib import Path
from bodhilib import get_data_loader

# Get data directory path and add it to data_loader
current_dir = Path(os.getcwd())
data_dir = current_dir / ".." / "data" / "data-loader"
data_loader = get_data_loader('file')
data_loader.add_resource(dir=str(data_dir))
docs = data_loader.load()
[3]:
# Get an instance of the sentence splitter
# Configure it to split documents into chunks of at most 300 words, with an overlap of 30 words between chunks

from bodhilib import get_splitter

splitter = get_splitter("text_splitter", max_len=300, overlap=30)
[4]:
# Split the documents using the splitter

nodes = splitter.split(docs)
[5]:
# analyze the nodes
import textwrap

print("Total number of nodes:", len(nodes))
print("Content of first node:")
print(textwrap.fill(nodes[0].text, 100))
print("---")
print("Content of last node:")
print(textwrap.fill(nodes[-1].text, 100))
Total number of nodes: 47
Content of first node:
          How to Get New Ideas  January 2023  (Someone fed my essays into GPT to make something that
could answer questions based on them, then asked it where good ideas come from. The answer was ok,
but not what I would have said. This is what I would have said.)  The way to get new ideas is to
notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life
(much of standup comedy is based on this), but the best place to look for them is at the frontiers
of knowledge.  Knowledge grows fractally. From a distance its edges look smooth, but when you learn
enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will
seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps
yields whole new fractal buds.
---
Content of last node:
and engage directly with their audience is probably a good idea.  [29] It may be helpful always to
walk or run the same route, because that frees attention for thinking. It feels that way to me, and
there is some historical evidence for it.    Thanks to Trevor Blackwell, Daniel Gackle, Pam Graham,
Tom Howard, Patrick Hsu, Steve Huffman, Jessica Livingston, Henry Lloyd-Baker, Bob Metcalfe, Ben
Miller, Robert Morris, Michael Nielsen, Courtenay Pipkin, Joris Poort, Mieke Roos, Rajat Suri, Harj
Taggar, Garry Tan, and my younger son for suggestions and for reading drafts.
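As a quick sanity check, we can confirm the chunk sizes respect the configuration above (assuming max_len and overlap are measured in whitespace-separated words, as configured earlier):

# Optional check: inspect node sizes against the configured max_len of 300 words
word_counts = [len(node.text.split()) for node in nodes]
print("Longest node (words):", max(word_counts))
print("Shortest node (words):", min(word_counts))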

🎉 We just split the Documents into Nodes, and can now process them for our LLM use cases.

Next, let’s see how we can embed these nodes using an Embedder.