@hackage dataframe-persistent0.3.0.0

Persistent database integration for the dataframe library

dataframe-persistent

Load SQLite into a dataframe. The schema is read for you: at runtime when you're exploring, or at compile time when you want type safety. You don't write a persistent entity, a persistLowerCase block, or any instances.

This README is a runnable scripths notebook. Every Haskell block runs top-to-bottom in one shared session against ./data/chinook.db. Reproduce every output below with scripths docs/base_scripts/base_readme.md -o README.md run from dataframe-persistent/.

Three ways in

Tier You write You get
Runtime readTable db "artists" a DataFrame, types inferred from the schema
Typed $(declareTable db "artists") + readTableTyped @Schema a compile-time schema type; columns checked by col @"Name"
Persistent $(declareEntity db "artists") a full persistent entity: typed Filter DSL, write-back

Tier 0: runtime reads

This is the quickest way in, similar to pandas' read_sql_table / read_sql. You point at a file and a table.

-- cabal: build-depends: dataframe, dataframe-persistent, text
-- cabal: default-extensions: OverloadedStrings, TemplateHaskell, DataKinds
-- cabal: default-extensions: TypeApplications, TypeOperators, FlexibleContexts
import qualified DataFrame as D
import DataFrame.IO.Persistent.Read
import Data.Function ((&))

What tables are in this database?

listTables "./data/chinook.db"

["albums","artists","customers","employees","genres","invoice_items","invoices","media_types","playlist_track","playlists","tracks"]

describeTable shows a table's columns and their inferred types before you load it (it returns a describeColumns-style DataFrame):

D.toMarkdown' <$> describeTable "./data/chinook.db" "artists"
Column Name
Text
Type
Text
SQLite Type
Text
Nullable
Bool
Primary Key
Bool
ArtistId Int INTEGER False True
Name Maybe Text NVARCHAR(120) True False

Load the whole table. The column types (and nullability) come from the schema:

D.toMarkdown' . D.take 5 <$> readTable "./data/chinook.db" "artists"
ArtistId
Int
Name
Maybe Text
1 Just "AC/DC"
2 Just "Accept"
3 Just "Aerosmith"
4 Just "Alanis Morissette"
5 Just "Alice In Chains"

Tables larger than memory? Filter them in the database. Raw SQL works as you'd expect:

D.toMarkdown' <$> readSql "./data/chinook.db" "SELECT * FROM artists WHERE Name LIKE 'A%' LIMIT 5"
ArtistId
Int
Name
Text
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains

Or push a WHERE / LIMIT down to a named table with a ReadQuery value:

D.toMarkdown' <$> readTableWith "./data/chinook.db" "artists" (allRows & limit 3)
ArtistId
Int
Name
Maybe Text
1 Just "AC/DC"
2 Just "Accept"
3 Just "Aerosmith"

Tier 1: typed schema

declareTable reads the schema at compile time and emits just the schema type. You read into it with readTableTyped, where the schema is a type argument and the database and table are ordinary values. Column references go through col @"Name", checked against the schema, so a typo or a wrong type is a compile error. Nothing is keyed on a generated function name.

import qualified DataFrame.Typed as DT
import DataFrame.Typed ((.==.))
import DataFrame.IO.Persistent.Schema (declareTable)

The splice brings one thing into scope, the type type ArtistsSchema = '[Column "ArtistId" Int, Column "Name" (Maybe Text)]:

$(declareTable "./data/chinook.db" "artists")

readTableTyped @ArtistsSchema reads any database/table into a TypedDataFrame ArtistsSchema (it validates the schema as it reads). You can bind your own reader:

artists = readTableTyped @ArtistsSchema "./data/chinook.db" "artists"

thaw drops a typed frame back to an untyped DataFrame:

D.toMarkdown' . D.take 5 . DT.thaw <$> artists
ArtistId
Int
Name
Maybe Text
1 Just "AC/DC"
2 Just "Accept"
3 Just "Aerosmith"
4 Just "Alanis Morissette"
5 Just "Alice In Chains"

Column access is checked against the schema. col @"Name" only compiles because "Name" is a column of ArtistsSchema (its element type is Maybe Text):

DT.columnAsList @"Name" . DT.take 3 <$> artists

[Just "AC/DC",Just "Accept",Just "Aerosmith"]

A filter on a column that doesn't exist (or has the wrong type) is a compile error rather than a runtime surprise:

D.toMarkdown' . DT.thaw . DT.filterWhere (DT.col @"Name" .==. DT.lit (Just "Accept")) <$> artists
ArtistId
Int
Name
Maybe Text
2 Just "Accept"

Because the database is a value, reading the same table from two sources to join them is just two calls with the same @ArtistsSchema:

a <- readTableTyped @ArtistsSchema "europe.sqlite" "artists"
b <- readTableTyped @ArtistsSchema "us.sqlite"     "artists"
-- DT.thaw a / DT.thaw b, then DataFrame.innerJoin on "ArtistId", etc.

(readSqlTyped @cols db "SELECT ... JOIN ..." does the same for an arbitrary query. The Postgres section below reads this same ArtistsSchema from a different backend.)

Tier 2: generate a persistent entity

If you want the full persistent experience (the typed Filter DSL, relations, write-back), declareEntity builds the entity from the live schema, so you skip the persistLowerCase block. It needs the usual persistent extensions and unqualified Text / Int64 in scope:

-- cabal: build-depends: persistent, persistent-sqlite
-- cabal: default-extensions: GADTs, TypeFamilies, FlexibleInstances, MultiParamTypeClasses
-- cabal: default-extensions: StandaloneDeriving, DerivingStrategies, GeneralizedNewtypeDeriving
-- cabal: default-extensions: UndecidableInstances
import Data.Int (Int64)
import Data.Text (Text)
import Database.Persist (Filter, SelectOpt (..), (<-.), (==.))
import Database.Persist.Sqlite (runSqlite)
import DataFrame.IO.Persistent.Schema (declareEntity)

This generates the Albums entity (Id, title, artistId), bound to the real albums / AlbumId / Title / ArtistId SQLite names, plus its EntityField constructors:

$(declareEntity "./data/chinook.db" "albums")

Now load it with selectToDataFrame. It's a generic loader: it works for any persistent entity, so the one declareEntity splice is all the boilerplate.

selectToDataFrame
    :: (MonadIO m, PersistEntity r, PersistEntityBackend r ~ SqlBackend)
    => [Filter r]      -- persistent's typed WHERE DSL, e.g. [AlbumsArtistId ==. 1]
    -> [SelectOpt r]   -- ordering / paging,        e.g. [Asc AlbumsTitle, LimitTo 10]
    -> ReaderT SqlBackend m DataFrame

It runs persistent's selectList filters opts and turns the [Entity Albums] into a DataFrame: the entity key becomes an id column and each field becomes a column (named by its Haskell field, e.g. title, artistId), with element types decoded from the stored values. The [Filter Albums] and [SelectOpt Albums] you pass are persistent's own, checked against the entity, so a filter for the wrong field won't compile.

Find AC/DC's albums (ArtistId 1):

D.toMarkdown' <$> runSqlite "./data/chinook.db" (selectToDataFrame [AlbumsArtistId ==. 1] [])
id
Int
title
Text
artistId
Int
1 For Those About To Rock We Salute You 1
4 Let There Be Rock 1

You can use both arguments together: filter and order/page in one query. Here are the albums by AC/DC or Alanis Morissette (ArtistId 1 or 4), sorted by title, capped at 5. (<-. is persistent's "field in list"; an empty filter list loads everything.)

D.toMarkdown' <$> runSqlite "./data/chinook.db"
    (selectToDataFrame [AlbumsArtistId <-. [1, 4]] [Asc AlbumsTitle, LimitTo 5])
id
Int
title
Text
artistId
Int
1 For Those About To Rock We Salute You 1
6 Jagged Little Pill 4
4 Let There Be Rock 1

Hand-off between dataframe and persistent

The ...Conn readers run on a ReaderT SqlBackend m, so a raw read, an entity read, and ordinary persistent queries all compose in one runSqlite transaction (this works on Postgres/MySQL too). Here both reads share one connection and return each frame's (rows, columns):

runSqlite "./data/chinook.db" $ do
    artists <- readTableConn "artists"                       -- raw → DataFrame
    albums  <- selectToDataFrame ([] :: [Filter Albums]) []  -- entity → DataFrame
    pure (D.dimensions artists, D.dimensions albums)

((275,2),(347,3))

PostgreSQL (and other backends)

Everything except the file-path helpers and the compile-time splices is backend-agnostic: the ...Conn readers and selectToDataFrame run on any persistent SqlBackend. Introspection picks PRAGMA for SQLite and information_schema for PostgreSQL/MySQL. The library doesn't depend on persistent-postgresql; you bring the connection.

These examples run against a real PostgreSQL with the same artists / albums data, and the notebook starts it itself. The block below uses GHCi's :! shell escape to spin up a throwaway cluster on port 54329 and load the two tables from the SQLite fixture. It needs postgres and sqlite3 on PATH (e.g. brew install postgresql@16). A matching teardown runs at the end of the section.

:! bash scripts/pg-setup.sh

postgres ready on port 54329 (db chinook: artists + albums)

Open a connection (you supply persistent-postgresql; runPg is the usual runSqlConn wrapper):

-- cabal: build-depends: persistent, persistent-postgresql, transformers, resourcet, monad-logger
import Control.Monad.Trans.Reader (ReaderT)
import Control.Monad.Trans.Resource (ResourceT, runResourceT)
import Control.Monad.Logger (NoLoggingT, runNoLoggingT)
import Database.Persist.Sql (SqlBackend)
import Database.Persist.Postgresql (runSqlConn, withPostgresqlConn)

runPg :: ReaderT SqlBackend (ResourceT (NoLoggingT IO)) a -> IO a
runPg act =
    runNoLoggingT (runResourceT (withPostgresqlConn "host=localhost port=54329 dbname=chinook user=postgres" (runSqlConn act)))

Discovery and reads use the same functions as SQLite, just ...Conn wrapped in runPg:

runPg listTablesConn

["albums","artists"]

D.toMarkdown' <$> runPg (describeTableConn "artists")
Column Name
Text
Type
Text
SQLite Type
Text
Nullable
Bool
Primary Key
Bool
ArtistId Int integer False True
Name Maybe Text text True False
D.toMarkdown' . D.take 5 <$> runPg (readTableConn "artists")
ArtistId
Int
Name
Maybe Text
1 Just "AC/DC"
2 Just "Accept"
3 Just "Aerosmith"
4 Just "Alanis Morissette"
5 Just "Alice In Chains"

The typed reader is backend-agnostic too. The same ArtistsSchema generated from the SQLite file in Tier 1 validates this PostgreSQL read, so it's one schema type across two databases:

D.toMarkdown' . D.take 3 . DT.thaw <$> runPg (readTableTypedConn @ArtistsSchema "artists")
ArtistId
Int
Name
Maybe Text
1 Just "AC/DC"
2 Just "Accept"
3 Just "Aerosmith"

The Albums entity generated from SQLite at compile time is a plain persistent entity, so the same declareEntity splice and selectToDataFrame run unchanged against PostgreSQL:

D.toMarkdown' <$> runPg (selectToDataFrame [AlbumsArtistId <-. [1, 4]] [Asc AlbumsTitle, LimitTo 5])
id
Int
title
Text
artistId
Int
1 For Those About To Rock We Salute You 1
6 Jagged Little Pill 4
4 Let There Be Rock 1

Tear the throwaway cluster back down (again via :!):

:! bash scripts/pg-teardown.sh

postgres stopped and removed

How types are inferred

SQLite type affinities map to Haskell element types as below. NOT NULL columns are non-null, everything else becomes Maybe. You can override a column with typeOverride in DeclareOptions (declareTableWith / declareEntityWith).

SQLite declared type Haskell type
INTEGER, INT... Int
REAL, FLOAT, DOUBLE, NUMERIC, DECIMAL Double
TEXT, VARCHAR, CHAR, CLOB Text
BOOLEAN Bool
BLOB / no declared type ByteString
DATE Day
DATETIME, TIMESTAMP UTCTime
TIME TimeOfDay

Install

build-depends: dataframe, dataframe-persistent, persistent-sqlite

The classic path still works and is unchanged: hand-write a persistent entity with share/persistLowerCase and derivePersistentDataFrame, then load with fromPersistent.