Suitability/Performance of SQLite for large time-series data
I've got time-series data that I'd like to store in a database, with this format:
- group : string
- date : date
- val1 : number
- val2 : number
- ... valn
This database is mostly for reads. Searches are for rows belonging to a group within a date range (e.g. group = xxx AND date >= start AND date <= end; see the sketch below).
The data set is big: hundreds of millions of rows. Is SQLite able to handle this kind of data? The appealing thing about SQLite is that it is serverless, and I'd like to use it if I can.
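For illustration, a typical search would be something like this (the table and column names are made up):

select * from mydata where grp = 'group-xxx' and date between '2000-01-01' and '2000-12-31';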
Updated Answer
I tried putting the 100 million record database on a RAM disk. I don't mean an in-memory database, I mean RAM formatted as a filesystem with the database on it, and the same query takes 11 seconds rather than the 147 seconds below! That may be an option for you if, say, your queries are largely read-only: copy the database from disk to a RAM drive in the morning, then access it nice and fast on the RAM drive all day, or until a reboot, without worrying about losing it if the power goes or the machine crashes. You could run any writes against the disk-based copy, and they would be reflected the next time you copied the database to RAM, whenever you wanted to.
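For example, on macOS (these tests ran on an iMac, see the P.S. below) you could create a RAM disk and load the database along these lines; the size, volume name and paths here are illustrative, not part of the original test:

# Create a ~12 GB RAM disk; the size is given in 512-byte sectors
diskutil erasevolume HFS+ 'RAMDisk' $(hdiutil attach -nomount ram://25165824)

# Copy the 11 GB database onto it and query it from there
cp 100m.db /Volumes/RAMDisk/
time sqlite3 /Volumes/RAMDisk/100m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'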
Original Answer
I created three test databases modelled on your needs and inserted 1 million, 10 million and 100 million records into them respectively, using the following Perl code. I synthesised group names from the prefix "group-" plus the record number, generated random dates between 1900 and 2000, and made the value data random.
#!/usr/bin/perl
use strict;
use DBI;

my $dsn      = "dbi:SQLite:dbname=test.db";
my $user     = '';
my $password = '';
my %attr     = ( RaiseError => 1, AutoCommit => 0 );

my $dbh = DBI->connect($dsn, $user, $password, \%attr)
    or die "Can't connect to database: $DBI::errstr";

$dbh->do("DROP TABLE IF EXISTS timeseries;");
$dbh->do("CREATE TABLE timeseries (grp TEXT, date TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER, PRIMARY KEY(grp,date))");

my $sql = qq{ INSERT INTO timeseries VALUES ( ?, ?, ?, ?, ?, ? ) };
my $sth = $dbh->prepare($sql);

for (my $i = 0; $i < 100000000; $i++) {
    # Synthesize a group name from the record number
    my $group = sprintf("group-%d", $i);
    $sth->bind_param(1, $group);

    # Generate a random date between 1900 and 2000
    my $year  = int(rand(100)) + 1900;
    my $month = int(rand(12)) + 1;
    my $day   = int(rand(28)) + 1;
    my $date  = sprintf("%d-%02d-%02d 00:00:00.0", $year, $month, $day);
    $sth->bind_param(2, $date);

    # Random values for val1..val4
    $sth->bind_param(3, int(rand(1000000)));
    $sth->bind_param(4, int(rand(1000000)));
    $sth->bind_param(5, int(rand(1000000)));
    $sth->bind_param(6, int(rand(1000000)));
    $sth->execute();

    # Commit every 1,000 rows to keep transactions manageable
    if (($i % 1000) == 0) { printf "$i\n"; $dbh->commit(); }
}
$dbh->commit();
$sth->finish();
$dbh->disconnect();
The file sizes come out like this for 1M, 10M and 100M records:
-rw-r--r--  1 mark  staff  103M  4 Feb 14:16 1m.db
-rw-r--r--  1 mark  staff  1.0G  4 Feb 14:18 10m.db
-rw-r--r--  1 mark  staff   11G  4 Feb 15:10 100m.db
A few random records look like this:
group-794|1927-12-14 00:00:00.0|233545|700623|848770|61504
group-797|1927-06-13 00:00:00.0|315357|246334|276825|799325
group-840|1927-09-28 00:00:00.0|682335|5651|879688|247996
group-907|1927-05-19 00:00:00.0|148547|595716|516884|820007
group-1011|1927-06-01 00:00:00.0|793543|479096|433073|786200
I then tried querying all the records for 1927, like this:
time sqlite3 1m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'
The query times came out as follows:
All records in year 1927 from 1M record database   => 2.7 seconds
All records in year 1927 from 10M record database  => 14 seconds
All records in year 1927 from 100M record database => 147 seconds
Those numbers look pretty linear to me, which makes sense: the query above filters only on date, and the PRIMARY KEY(grp,date) can't help with a date-only predicate, so each query ends up scanning the whole table. This is a simplistic test, though, and you may want to take the code and play around with it some more; a sketch of where I would start follows.
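For instance, if every query names a group as well as a date range, as in the question, the PRIMARY KEY(grp,date) above should already serve it, and a date-only query like mine could be helped by an extra index. Something along these lines might be a starting point (an untested sketch; the index name is made up and I haven't benchmarked it):

# Date-only range queries can use a plain index on date instead of a full scan
sqlite3 100m.db 'create index timeseries_date on timeseries(date)'
time sqlite3 100m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'

# Queries that also name a group can use the existing primary key
time sqlite3 100m.db 'select * from timeseries where grp = "group-907" and date between "1927-01-01" and "1927-12-31"'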
P.S. I was running this on a decent spec iMac with an SSD.