As part of the performance evaluation for my research, I designed an experiment comparing certain operations of our approach with DSpace. I had initially estimated that ingesting the last of my 15 workloads would take 7 days; today, however, marks day 11 since I started ingesting its 1,638,400 records, even though the results from this workload will not change anything I have already inferred from the other 14 workloads. So I decided to try and figure out why performance has degraded…
I have scripts that drive DSpace’s metadata-import utility, and after evaluating varying batch sizes I resorted to using batch files of 1k records. As expected, there is a correlation between batch size and ingest time, but larger batch sizes come at a cost: more RAM.
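The batching step looks roughly like the sketch below. This is a minimal illustration, not my actual script: the helper name, file names, and paths are hypothetical, and the commented-out line only marks where the DSpace metadata-import call would go.

```shell
# Sketch: split a large metadata CSV into 1,000-record batch files.
ingest_batches() {
    local csv=$1 size=${2:-1000}
    # Preserve the CSV header so every batch file is self-describing.
    head -n 1 "$csv" > header.csv
    tail -n +2 "$csv" | split -l "$size" - batch_
    for f in batch_*; do
        cat header.csv "$f" > "$f.csv" && rm "$f"
        # [dspace]/bin/dspace metadata-import -f "$f.csv" -e admin@example.org
        echo "prepared $f.csv"
    done
}
```

Keeping batches at 1k keeps the utility’s memory footprint predictable; each file is a self-contained CSV, so a failed batch can be re-run on its own.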
My initial guess was that the performance degradation was due to larger batch files, so I awk’d the now 2.2GB nohup log file to determine how long the processing is taking, drilling down to the hour level –the plots below show daily and hourly ingest times. However, the batch files I’ve been processing have a fairly consistent size, so no problem here…
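For reference, the hourly drill-down can be done with an awk one-liner along these lines. This is a sketch only: it assumes every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp in the first two fields, which will not match the real metadata-import output verbatim.

```shell
# Count log lines per hour as a rough throughput proxy: bucket each line
# by its date field and the hour portion of its time field.
hourly_counts() {
    awk '{ key = $1 " " substr($2, 1, 2) ":00"; n[key]++ }
         END { for (k in n) print k, n[k] }' "$1" | sort
}
```

Running the log through something like this yields one row per hour; a steady decline in the per-hour counts over the run is exactly the degradation pattern to look for.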
At this point, I am still unsure why the ingest process is slowing down –I’ve decided to investigate this further when I have the time, but my guess is that the bottleneck is PostgreSQL.
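When I do get to it, a first pass at checking that suspicion might look like the sketch below. These are hypothetical commands: the database name `dspace` is an assumption, though the two views queried are PostgreSQL’s standard statistics views.

```shell
# Sketch: two quick checks against PostgreSQL's statistics views.
pg_quick_check() {
    local db=${1:-dspace}   # database name is an assumption
    if ! command -v psql >/dev/null 2>&1; then
        echo "psql not available; skipping checks"
        return 0
    fi
    # Longest-running sessions first: slow statements show up here.
    psql -d "$db" -c "SELECT pid, now() - query_start AS runtime, state
                      FROM pg_stat_activity
                      ORDER BY runtime DESC NULLS LAST LIMIT 5;" \
        || echo "could not query $db"
    # Dead tuples accumulate on heavily updated tables (bloat) and can
    # slow things down until VACUUM catches up.
    psql -d "$db" -c "SELECT relname, n_live_tup, n_dead_tup
                      FROM pg_stat_user_tables
                      ORDER BY n_dead_tup DESC LIMIT 5;" \
        || echo "could not query $db"
}
```

If the dead-tuple counts on the heavily written tables keep climbing over an 11-day run, autovacuum falling behind would be a plausible explanation for a gradual slowdown.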