DataLoader#

The DataLoader component allows you to load data from various data sources using a uniform interface.

To use the DataLoader service, get an instance of the data loader using the get_data_loader method, passing it the service identifier service_name.

To configure the DataLoader, use the add_resource method. To load the data lazily as a Python generator, iterate over the DataLoader with a for loop; to eagerly load all the data synchronously, use the load method. The DataLoader.load method reads the configuration, loads the resource, and returns it as a list of Document objects.
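For example, a minimal sketch of both access patterns (handle_document is a placeholder for your own processing, not part of the library):

# Lazy: iterate over the DataLoader; documents are yielded one at a time
for doc in data_loader:
    handle_document(doc)  # placeholder for your own processing

# Eager: load everything synchronously as a list of Document objects
docs = data_loader.load()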

Let’s load local files and see the DataLoader in action.

Setup#

Ensure bodhilib is installed. A number of DataLoader plugins are packaged along with the bodhilib library. We will use the file data loader from these plugins.

[1]:
!pip install -q bodhilib
[2]:
# Local data
# We have a few essays by Paul Graham to load and test out the DataLoader component
! ls -l ../data/data-loader
total 144
-rw-r--r--@ 1 amir36  staff  67153 Oct  4 15:02 pg-great-work.txt
-rw-r--r--@ 1 amir36  staff    821 Oct  4 15:02 pg-new-ideas.txt
[3]:
# Get an instance of local file DataLoader
from bodhilib import get_data_loader

data_loader = get_data_loader('file')
[4]:
import os
from pathlib import Path

# Get data directory path and add it to data_loader
current_dir = Path(os.getcwd())
data_dir = current_dir / ".." / "data" / "data-loader"
data_loader.add_resource(dir=str(data_dir))
[5]:
# load the data

docs = data_loader.load()
[6]:
# analyze the loaded documents
import json

print("Number of documents:", len(docs))
print("List of filenames loaded:", json.dumps([{'filename': doc.metadata['filename']} for doc in docs], indent=2))
Number of documents: 2
List of filenames loaded: [
  {
    "filename": "pg-new-ideas.txt"
  },
  {
    "filename": "pg-great-work.txt"
  }
]
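
Each entry above is a Document object carrying the loaded content along with its metadata. A quick peek at the content (the text attribute is assumed here alongside metadata, for illustration):

# Print the first 200 characters of the first document's text
# (the `text` attribute is an assumption alongside `metadata`)
print(docs[0].text[:200])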

🎉 We just loaded documents from local files using our DataLoader.

Next, let’s see how we can split these documents into processable entities using the Splitter.