More often than not, I've got some kind of data in a PostgreSQL database that is described by timestamps in one way or another.
Simple test table like this:
create table testdata (
    id serial,
    d numeric,
    ts timestamp without time zone default now()
);
Using bash, I'll populate it a bit with not-too-unrealistic time intervals.
while true
do
    psql -c "insert into testdata (d) values (trunc(random()*100));"
    sleep 1.$RANDOM
done
And to make matters "worse", in another window I started a second loop:
while true
do
    psql -c "insert into testdata (d) values (trunc(random()*1000));"
    sleep 3.$RANDOM
done
And let it run for a bit...
So now I have some data I can start working with. Eventually data of this kind would probably end up being fed through matplotlib, or converted to JSON and handed to d3 to make fancy graphs. (Everybody likes a fancy graph.)
But looking at the data with the standard date_trunc function, I've got something like
select date_trunc('second',ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
Returning an output like
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 14:44:07 | 2 | 79 | 543 | 311 | 328 | 232
2014-02-23 14:44:06 | 1 | 62 | 62 | 62 | | 0
2014-02-23 14:44:04 | 1 | 57 | 57 | 57 | | 0
2014-02-23 14:44:03 | 2 | 33 | 65 | 49 | 23 | 16
2014-02-23 14:44:02 | 1 | 36 | 36 | 36 | | 0
2014-02-23 14:44:00 | 2 | 56 | 354 | 205 | 211 | 149
2014-02-23 14:43:59 | 1 | 69 | 69 | 69 | | 0
2014-02-23 14:43:58 | 1 | 11 | 11 | 11 | | 0
2014-02-23 14:43:57 | 1 | 945 | 945 | 945 | | 0
2014-02-23 14:43:56 | 1 | 96 | 96 | 96 | | 0
(10 rows)
And there's
select date_trunc('minute',ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
Returning
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 14:45:00 | 38 | 6 | 804 | 182 | 236 | 233
2014-02-23 14:44:00 | 62 | 2 | 879 | 137 | 193 | 191
2014-02-23 14:43:00 | 49 | 0 | 969 | 107 | 202 | 200
2014-02-23 14:42:00 | 44 | 0 | 99 | 54 | 29 | 29
2014-02-23 14:41:00 | 46 | 1 | 99 | 51 | 31 | 31
2014-02-23 14:40:00 | 4 | 7 | 68 | 48 | 29 | 25
2014-02-23 14:39:00 | 58 | 3 | 936 | 188 | 282 | 280
2014-02-23 14:38:00 | 62 | 1 | 908 | 161 | 244 | 242
2014-02-23 14:37:00 | 62 | 10 | 989 | 185 | 273 | 271
2014-02-23 14:36:00 | 62 | 6 | 951 | 173 | 251 | 249
(10 rows)
All of which gives me wildly varying sample sizes and a slightly awkward data resolution, making any graph drawn from the result somewhat ugly and jagged.
Next one up is
select date_trunc('hour',ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
Giving me
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 14:00:00 | 1771 | 0 | 989 | 92 | 151 | 151
(1 row)
This will actually work out for the grand overview over time. The next ones up are 'day', 'week', 'month' and 'year', and if those specific intervals happen to be what I want, I'm in luck, because they're the only ones date_trunc can give me.
What I gathered from the net and some experiments was a tiny function that aggregates time much more flexibly, letting me specify the size of the aggregation interval in seconds: 60 seconds = 1 minute, 300 seconds = 5 minutes, 3600 seconds = 1 hour, 86400 seconds = 1 day.
CREATE OR REPLACE FUNCTION round_time(integer,timestamp without time zone)
RETURNS timestamp without time zone AS
$BODY$
-- Truncate $2 down to the nearest multiple of $1 seconds since the epoch.
SELECT ('1970-01-01 00:00:00'::timestamp + trunc(extract(epoch from $2)/$1)*$1 * interval '1 second')::timestamp without time zone
$BODY$
LANGUAGE sql IMMUTABLE; -- deterministic for a given input, so IMMUTABLE rather than VOLATILE
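To see what the function actually does, here's a quick worked example: extract(epoch from ...) converts the timestamp to seconds since 1970-01-01, trunc(.../300)*300 snaps that number down to the nearest multiple of 300 seconds, and adding the result back onto the epoch gives the rounded timestamp. Since midnight is a multiple of 300 seconds, the 300-second boundaries land on :00, :05, :10 and so on.
select round_time(300, timestamp '2014-02-23 14:43:17');
-- returns 2014-02-23 14:40:00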
Using it is fairly straightforward:
select round_time(300,ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
Gives me 5-minute interval results
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 14:50:00 | 161 | 0 | 950 | 168 | 242 | 241
2014-02-23 14:45:00 | 316 | 0 | 973 | 186 | 249 | 249
2014-02-23 14:40:00 | 205 | 0 | 969 | 91 | 150 | 150
2014-02-23 14:35:00 | 286 | 1 | 989 | 180 | 254 | 253
2014-02-23 14:30:00 | 215 | 0 | 484 | 89 | 103 | 102
2014-02-23 14:25:00 | 223 | 0 | 99 | 50 | 28 | 28
2014-02-23 14:20:00 | 227 | 0 | 99 | 51 | 28 | 28
2014-02-23 14:15:00 | 221 | 0 | 99 | 50 | 30 | 30
2014-02-23 14:10:00 | 191 | 0 | 99 | 50 | 30 | 30
2014-02-23 14:05:00 | 60 | 0 | 99 | 48 | 31 | 31
(10 rows)
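As a sanity check, round_time with a 60-second interval should agree with date_trunc('minute', ...) on every row, since minute boundaries fall on exact multiples of 60 epoch seconds:
select bool_and(round_time(60, ts) = date_trunc('minute', ts)) as all_equal
from testdata;
-- all_equal should come back t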
select round_time(900,ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
Returns 15-minute interval results
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 14:45:00 | 547 | 0 | 973 | 183 | 253 | 253
2014-02-23 14:30:00 | 706 | 0 | 989 | 126 | 194 | 194
2014-02-23 14:15:00 | 671 | 0 | 99 | 51 | 29 | 29
2014-02-23 14:00:00 | 251 | 0 | 99 | 49 | 31 | 31
(4 rows)
Allowing me to pick a much more practical granularity, which in the end most likely means much more understandable output and much prettier graphs.
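And since the plan was to feed this to d3 anyway, the same aggregate can be shipped straight out of the database as a JSON array. This is just a sketch; it assumes PostgreSQL 9.3 or later for json_agg:
select json_agg(row_to_json(t))
from (select round_time(300,ts) as gt,
             count(d) as samples,
             round(avg(d)) as avg
      from testdata
      group by gt
      order by gt desc
      limit 10) t;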
Oh, and finally, although I don't normally use sub-minute intervals, selecting something like a 10-second interval:
select round_time(10,ts) as gt,
count(d) as samples,
min(d) as min,
max(d) as max,
round(avg(d)) as avg,
round(stddev_samp(d)) as devsamp,
round(stddev_pop(d)) as devpop
from testdata
group by gt
order by gt desc
limit 10;
gt | samples | min | max | avg | devsamp | devpop
---------------------+---------+-----+-----+-----+---------+--------
2014-02-23 15:01:10 | 10 | 28 | 833 | 201 | 259 | 245
2014-02-23 15:01:00 | 10 | 37 | 628 | 154 | 197 | 187
2014-02-23 15:00:50 | 11 | 1 | 867 | 165 | 275 | 262
2014-02-23 15:00:40 | 10 | 16 | 797 | 152 | 255 | 242
2014-02-23 15:00:30 | 10 | 7 | 868 | 276 | 387 | 367
2014-02-23 15:00:20 | 9 | 6 | 787 | 125 | 250 | 236
2014-02-23 15:00:10 | 10 | 2 | 583 | 137 | 179 | 170
2014-02-23 15:00:00 | 10 | 30 | 806 | 172 | 244 | 231
2014-02-23 14:59:50 | 10 | 0 | 824 | 178 | 287 | 272
2014-02-23 14:59:40 | 10 | 0 | 808 | 139 | 238 | 226
(10 rows)
I find it kind of amusing that even though it's two separate processes, sleeping different intervals (roughly 1 and 3 seconds), they fairly accurately manage to put 10 samples in every 10-second interval.