Suitability/Performance of SQLite for large time-series data
I've got time-series data that I'd like to store in a database, with this format:
- group : string
- date : date
- val1 : number
- val2 : number
- ... valn
This database is mostly for reads. Searches are for rows belonging to a group within a date range (e.g. group = xxx AND date >= start AND date <= end; see the sketch below).
The data set is big: hundreds of millions of rows. Is SQLite able to handle this kind of data? The appealing thing about SQLite is that it is serverless, and I'd like to use it if I can.
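For illustration, a typical search would be something like this (the table and column names are made up):

select * from mydata where grp = 'group-xxx' and date between '2000-01-01' and '2000-12-31';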
Updated Answer
I tried putting the 100 million record database on a RAM disk. I don't mean an in-memory database, I mean RAM formatted as a filesystem with the database on it, and the same query takes 11 seconds rather than the 147 seconds below! That may be an option for you if, say, your queries are largely read-only: copy the database from disk to a RAM drive in the morning, then access it nice and fast on the RAM drive all day, or until a reboot, without worrying about losing it if the power goes or the machine crashes. You could run any writes against the disk-based copy, and they would be reflected the next time you copied the database to RAM, whenever you wanted to.
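For example, on macOS (these tests ran on an iMac, see the P.S. below) you could create a RAM disk and load the database along these lines; the size, volume name and paths here are illustrative, not part of the original test:

# Create a ~12 GB RAM disk; the size is given in 512-byte sectors
diskutil erasevolume HFS+ 'RAMDisk' $(hdiutil attach -nomount ram://25165824)

# Copy the 11 GB database onto it and query it from there
cp 100m.db /Volumes/RAMDisk/
time sqlite3 /Volumes/RAMDisk/100m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'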
Original Answer
I created three test databases modelled on your needs and inserted 1 million, 10 million and 100 million records into them respectively, using the following Perl code. I synthesised group names from the prefix "group-" plus the record number, generated random dates between 1900 and 2000, and made the value data random.
#!/usr/bin/perl
use strict;
use DBI;

my $dsn      = "dbi:SQLite:dbname=test.db";
my $user     = '';
my $password = '';
my %attr     = ( RaiseError => 1, AutoCommit => 0 );

my $dbh = DBI->connect($dsn, $user, $password, \%attr)
    or die "Can't connect to database: $DBI::errstr";

$dbh->do("DROP TABLE IF EXISTS timeseries;");
$dbh->do("CREATE TABLE timeseries (grp TEXT, date TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER, PRIMARY KEY(grp,date))");

my $sql = qq{ INSERT INTO timeseries VALUES ( ?, ?, ?, ?, ?, ? ) };
my $sth = $dbh->prepare($sql);

for (my $i = 0; $i < 100000000; $i++) {
    # Synthesize a group name from the record number
    my $group = sprintf("group-%d", $i);
    $sth->bind_param(1, $group);

    # Generate a random date between 1900 and 2000
    my $year  = int(rand(100)) + 1900;
    my $month = int(rand(12)) + 1;
    my $day   = int(rand(28)) + 1;
    my $date  = sprintf("%d-%02d-%02d 00:00:00.0", $year, $month, $day);
    $sth->bind_param(2, $date);

    # Random values for val1..val4
    $sth->bind_param(3, int(rand(1000000)));
    $sth->bind_param(4, int(rand(1000000)));
    $sth->bind_param(5, int(rand(1000000)));
    $sth->bind_param(6, int(rand(1000000)));
    $sth->execute();

    # Commit every 1,000 rows to keep transactions manageable
    if (($i % 1000) == 0) { printf "$i\n"; $dbh->commit(); }
}
$dbh->commit();
$sth->finish();
$dbh->disconnect();
The file sizes come out like this for 1M, 10M and 100M records:
-rw-r--r--  1 mark  staff  103M  4 Feb 14:16 1m.db
-rw-r--r--  1 mark  staff  1.0G  4 Feb 14:18 10m.db
-rw-r--r--  1 mark  staff   11G  4 Feb 15:10 100m.db
A few random records look like this:
group-794|1927-12-14 00:00:00.0|233545|700623|848770|61504
group-797|1927-06-13 00:00:00.0|315357|246334|276825|799325
group-840|1927-09-28 00:00:00.0|682335|5651|879688|247996
group-907|1927-05-19 00:00:00.0|148547|595716|516884|820007
group-1011|1927-06-01 00:00:00.0|793543|479096|433073|786200
I then tried querying all the records for 1927, like this:
time sqlite3 1m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'
The query times came out as follows:
All records in year 1927 from 1M record database   => 2.7 seconds
All records in year 1927 from 10M record database  => 14 seconds
All records in year 1927 from 100M record database => 147 seconds
Those numbers look pretty linear to me, which makes sense: the query above filters only on date, and the PRIMARY KEY(grp,date) can't help with a date-only predicate, so each query ends up scanning the whole table. This is a simplistic test, though, and you may want to take the code and play around with it some more; a sketch of where I would start follows.
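For instance, if every query names a group as well as a date range, as in the question, the PRIMARY KEY(grp,date) above should already serve it, and a date-only query like mine could be helped by an extra index. Something along these lines might be a starting point (an untested sketch; the index name is made up and I haven't benchmarked it):

# Date-only range queries can use a plain index on date instead of a full scan
sqlite3 100m.db 'create index timeseries_date on timeseries(date)'
time sqlite3 100m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'

# Queries that also name a group can use the existing primary key
time sqlite3 100m.db 'select * from timeseries where grp = "group-907" and date between "1927-01-01" and "1927-12-31"'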
P.S. I was running this on a decent spec iMac with an SSD.