Too tired to think straight and with very little time left, I came up with a primitive script to randomly spawn workloads for a series of experiments I was running. As it turns out, the natural order of the dataset [1] I am using needs a bit of randomisation: the records in the different SetSpecs have inconsistent structures. The moral of the story is that free-styling isn’t always appropriate 😉
I used the sample function from the random module… you’ll notice that I recursively swallow the entire dataset (~2 million records) into memory and then draw a random sample whose size is set by the ‘bin’ value. I did mention that I was tired and not thinking straight; however, I did check that I was doing the right thing [2].
#!/usr/bin/python
import os
import random
import shutil


def spawnrandomworkload(dataset, destination, bin):
    """Spawns random sets of experiment workloads.

    keyword arguments:
    dataset --location of original dataset to spawn
    destination --base location where workloads will be created
    bin --name of the workload to spawn, 'w1' to 'w8'
    """
    workloads = (('w1', 100), ('w2', 200), ('w3', 400), ('w4', 800),
                 ('w5', 1600), ('w6', 3200), ('w7', 6400), ('w8', 12800))
    # Swallow the whole dataset: one (source file, target directory) pair per record.
    swallowdataset = []
    for root, dirs, files in os.walk(dataset):
        for filename in files:
            if filename.endswith('.metadata'):
                source = os.path.abspath(os.path.join(root, filename))
                # Mirror the record's directory, relative to the dataset, under destination/bin.
                target = os.path.join(destination, bin,
                                      os.path.dirname(os.path.relpath(source, dataset)))
                swallowdataset.append((source, target))
    for workload in workloads:
        if workload[0] == bin:
            # Sample without replacement; workload[1] is the number of files to copy.
            payload = random.sample(swallowdataset, workload[1])
            for cargo in payload:
                if not os.path.exists(cargo[1]):
                    os.makedirs(cargo[1])
                shutil.copy2(cargo[0], cargo[1])
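To make the sampling step easier to see on its own, here is a minimal sketch that builds a tiny throwaway dataset and draws a sample from it. The helper name `sample_metadata_files`, the `seed` parameter and the `setA`/`setB` directory names are hypothetical and are not part of the script above; the sketch only assumes that records are files ending in `.metadata`.

```python
import os
import random
import tempfile


def sample_metadata_files(dataset, size, seed=None):
    """Return `size` distinct '.metadata' paths picked at random under `dataset`."""
    # Collect every candidate path first; random.sample needs a sequence.
    # Sorting makes the result repeatable for a given seed.
    candidates = sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(dataset)
        for name in files
        if name.endswith('.metadata')
    )
    # A private Random instance keeps the seed from touching global state.
    rng = random.Random(seed)
    # Sampling is without replacement, so no file is picked twice.
    # random.sample raises ValueError if size exceeds len(candidates).
    return rng.sample(candidates, size)


# Build a tiny throwaway dataset (hypothetical layout) to try the helper on.
demo_root = tempfile.mkdtemp()
for setspec in ('setA', 'setB'):
    os.makedirs(os.path.join(demo_root, setspec))
    for i in range(5):
        open(os.path.join(demo_root, setspec, 'rec%d.metadata' % i), 'w').close()

picked = sample_metadata_files(demo_root, 4, seed=42)
print(len(picked), len(set(picked)))  # prints: 4 4
```

Passing a seed is optional; it is only there so that a given workload can be regenerated exactly if an experiment needs to be repeated.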
In addition, I had a companion bash ‘workhorse’ script do most of the dirty work…
#!/bin/sh
for workloads in `seq 1 8`
do
    echo $workloads
    w=w$workloads
    echo $w
    echo Processing directory.... /home/lphiri/datasets/ndltd/random/workload/$w
    echo Copying contents to... /home/lphiri/datasets/ndltd/random/workload2/$w
    python -c "import simplyctperformance; simplyctperformance.spawnstructworkload('/home/lphiri/datasets/ndltd/random/workload/$w', '/home/lphiri/datasets/ndltd/random/workload2', '$w')";
done
[1] http://lightonphiri.org/blog/metadata-harvesting-via-oai-pmh-using-python
[2] http://stackoverflow.com/a/855455/664424