As part of the performance evaluation for my research, I designed an experiment comparing certain operations of our approach with DSpace. I had initially estimated that ingesting the last of my 15 workloads would take 7 days; today, however, marks day 11 since I started ingesting its 1,638,400 records, even though the results from this workload will not change anything I have already inferred from the other 14 workloads. So I decided to try and figure out why performance has degraded…
I have scripts that drive DSpace’s metadata-import utility, and after evaluating varying batch sizes I resorted to using batch files of 1k records. As expected, there is a correlation between batch size and ingest time, but larger batch sizes come at a cost: more RAM.
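The batching step looks roughly like the sketch below. This is a minimal illustration, not my actual script: the helper name, file names, and paths are hypothetical, and the commented-out line only marks where the DSpace metadata-import call would go.

```shell
# Sketch: split a large metadata CSV into 1,000-record batch files.
ingest_batches() {
    local csv=$1 size=${2:-1000}
    # Preserve the CSV header so every batch file is self-describing.
    head -n 1 "$csv" > header.csv
    tail -n +2 "$csv" | split -l "$size" - batch_
    for f in batch_*; do
        cat header.csv "$f" > "$f.csv" && rm "$f"
        # [dspace]/bin/dspace metadata-import -f "$f.csv" -e admin@example.org
        echo "prepared $f.csv"
    done
}
```

Keeping batches at 1k keeps the utility’s memory footprint predictable; each file is a self-contained CSV, so a failed batch can be re-run on its own.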
My initial guess was that the performance degradation was due to larger batch files, so I awk’d the now 2.2GB nohup log file to determine how long the processing is taking, drilling down to the hour level –the plots below show daily and hourly ingest times. However, the batch files I’ve been processing have a fairly consistent size, so no problem here…
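For reference, the hourly drill-down can be done with an awk one-liner along these lines. This is a sketch only: it assumes every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp in the first two fields, which will not match the real metadata-import output verbatim.

```shell
# Count log lines per hour as a rough throughput proxy: bucket each line
# by its date field and the hour portion of its time field.
hourly_counts() {
    awk '{ key = $1 " " substr($2, 1, 2) ":00"; n[key]++ }
         END { for (k in n) print k, n[k] }' "$1" | sort
}
```

Running the log through something like this yields one row per hour; a steady decline in the per-hour counts over the run is exactly the degradation pattern to look for.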
At this point, I am still unsure why the ingest process is slowing down –I’ve decided to investigate this further when I have the time, but my guess is that the bottleneck is PostgreSQL.
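When I do get to it, a first pass at checking that suspicion might look like the sketch below. These are hypothetical commands: the database name `dspace` is an assumption, though the two views queried are PostgreSQL’s standard statistics views.

```shell
# Sketch: two quick checks against PostgreSQL's statistics views.
pg_quick_check() {
    local db=${1:-dspace}   # database name is an assumption
    if ! command -v psql >/dev/null 2>&1; then
        echo "psql not available; skipping checks"
        return 0
    fi
    # Longest-running sessions first: slow statements show up here.
    psql -d "$db" -c "SELECT pid, now() - query_start AS runtime, state
                      FROM pg_stat_activity
                      ORDER BY runtime DESC NULLS LAST LIMIT 5;" \
        || echo "could not query $db"
    # Dead tuples accumulate on heavily updated tables (bloat) and can
    # slow things down until VACUUM catches up.
    psql -d "$db" -c "SELECT relname, n_live_tup, n_dead_tup
                      FROM pg_stat_user_tables
                      ORDER BY n_dead_tup DESC LIMIT 5;" \
        || echo "could not query $db"
}
```

If the dead-tuple counts on the heavily written tables keep climbing over an 11-day run, autovacuum falling behind would be a plausible explanation for a gradual slowdown.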