Simulating Wound Data is Easier Than You Think
As a modeler, I often simulate wound care data to test models before fitting those same models to real data. Why would one do this? With simulated data one knows exactly what the ground truth (the true parameter values) is, and can test how close the model gets to estimating those values. With real data, the ground truth can never be known with certainty. It can only be estimated. Hence, simulated data is a playground for models before they make their way to the complexity of Real World Evidence.
I mentioned simulating data to an acquaintance the other day, and they said it must be pretty difficult. I assured them it was very straightforward, and that one could in fact generate very realistic longitudinal wound care data with a couple of lines of code and some very simple distributional assumptions.
First let’s load the Python libraries we will need.
import altair as alt # our visualization library
import polars as pl # our dataframe library
#pl.Config.set_tbl_rows(24) # allow the dataframe to print 24 rows without clipping
import pymc as pm # the library that we will use to simulate data
Next let’s define the number of wounds we want, the starting area of the wounds (drawn from a Gamma distribution), and the innovations of the wounds, or how the wounds change week to week (drawn from a StudentT distribution). We will generate 12 weeks of data for 20 wounds.
number_of_weeks = 11 # this is actually 12 weeks as we have our starting area + 11 more weeks
number_of_wounds = 20
random_seed = 1234 # this allows the simulations to be replicated, otherwise each run will give different draws
starting_areas = pm.Gamma.dist(
    mu=7.5, # the average starting area
    sigma=3.5 # the standard deviation of the starting area
)
wound_trajectories_definition = pm.RandomWalk.dist(
    init_dist=starting_areas,
    innovation_dist=pm.StudentT.dist(nu=3, mu=-.3, sigma=1), # the innovations, with a negative mean so wounds drift toward healing
    steps=number_of_weeks
)
wound_trajectories_draws = pm.draw(
    vars=wound_trajectories_definition,
    draws=number_of_wounds,
    random_seed=random_seed
)
Let’s take a glance at what is returned by the simulation. We print two arrays of data, which represent the trajectories over 12 weeks of two of the 20 wounds.
print(wound_trajectories_draws[:2])
[[ 8.31573535 10.08365913 8.60467865 7.61161127 7.803105 4.94285792
5.55018078 4.13138653 4.03536057 2.57860449 2.83068657 0.77546271]
[ 7.35366688 8.09203598 6.65627211 7.15055862 7.69705732 8.88698897
7.66514015 7.5456083 6.85247688 3.65505568 2.75086306 2.54986529]]
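To make the mechanics concrete, and to illustrate the ground-truth point from the introduction, here is a minimal NumPy-only sketch of the same generative process. It is not what PyMC does internally, and it will not reproduce the numbers printed above because it uses a different random number generator. The function name `simulate_wound_areas` and its arguments are made up for illustration. The Gamma mean and standard deviation are converted to NumPy's shape and scale, the Student-t innovations are shifted by the drift, and a cumulative sum turns the innovations into trajectories. Because the drift is known (-0.3), we can check that the average week-to-week change in a large simulated sample lands close to it.

```python
import numpy as np


def simulate_wound_areas(n_wounds, n_steps, drift=-0.3, seed=1234):
    """Gamma starting areas plus a Student-t random walk (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    mu, sigma = 7.5, 3.5  # mean and standard deviation of the starting area
    # Convert (mean, sd) to the (shape, scale) parameters NumPy expects.
    shape, scale = mu**2 / sigma**2, sigma**2 / mu
    start = rng.gamma(shape, scale, size=(n_wounds, 1))
    # Weekly innovations: Student-t with 3 degrees of freedom and unit scale,
    # shifted by the drift.
    innovations = drift + rng.standard_t(df=3, size=(n_wounds, n_steps))
    # A trajectory is the starting area followed by the running sum of innovations.
    return np.hstack([start, start + np.cumsum(innovations, axis=1)])


areas = simulate_wound_areas(n_wounds=20, n_steps=11)
print(areas.shape)  # (20, 12): 20 wounds, starting area + 11 more weeks

# Ground-truth check: with many wounds, the mean weekly change should be near the drift.
many_wounds = simulate_wound_areas(n_wounds=5000, n_steps=11)
estimated_drift = np.diff(many_wounds, axis=1).mean()
print(round(float(estimated_drift), 2))
```

With only 20 wounds the estimate is noticeably noisier, which is exactly the kind of behavior simulation lets you explore before touching real data.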
Looking at array data is not terribly pleasant, so let’s convert this to a tabular format.
df = (
    pl.DataFrame(wound_trajectories_draws, orient='row') # load the array data into a dataframe, one row per wound
    .with_row_index(name='wound_id') # generate the wound_id
    .with_columns(wound_id = pl.concat_str([pl.lit('wound_'), pl.col.wound_id.cast(pl.String).str.zfill(3)])) # clean up the wound_id column
    .melt(id_vars=['wound_id'], variable_name='time', value_name='area') # make the data long (melt is renamed unpivot in Polars 1.0+)
    .with_columns(pl.col.time.str.split('_').list.last().cast(pl.Int16).add(1)) # clean up the time column so weeks run from 1 to 12
    .with_columns(pl.when(pl.col.area.le(0)).then(0).otherwise(pl.col.area).alias('area')) # set the area to 0 when the wound heals
    .sort(by=['wound_id', 'time']) # some sorting to make sense of things
)
df.to_pandas().head(24) # show the same 2 wounds as in the array above
| | wound_id | time | area |
|---|---|---|---|
| 0 | wound_000 | 1 | 8.315735 |
| 1 | wound_000 | 2 | 10.083659 |
| 2 | wound_000 | 3 | 8.604679 |
| 3 | wound_000 | 4 | 7.611611 |
| 4 | wound_000 | 5 | 7.803105 |
| 5 | wound_000 | 6 | 4.942858 |
| 6 | wound_000 | 7 | 5.550181 |
| 7 | wound_000 | 8 | 4.131387 |
| 8 | wound_000 | 9 | 4.035361 |
| 9 | wound_000 | 10 | 2.578604 |
| 10 | wound_000 | 11 | 2.830687 |
| 11 | wound_000 | 12 | 0.775463 |
| 12 | wound_001 | 1 | 7.353667 |
| 13 | wound_001 | 2 | 8.092036 |
| 14 | wound_001 | 3 | 6.656272 |
| 15 | wound_001 | 4 | 7.150559 |
| 16 | wound_001 | 5 | 7.697057 |
| 17 | wound_001 | 6 | 8.886989 |
| 18 | wound_001 | 7 | 7.665140 |
| 19 | wound_001 | 8 | 7.545608 |
| 20 | wound_001 | 9 | 6.852477 |
| 21 | wound_001 | 10 | 3.655056 |
| 22 | wound_001 | 11 | 2.750863 |
| 23 | wound_001 | 12 | 2.549865 |
Great. This makes the wound data easier to read. Let’s go one step further and visualize the trajectories for all 20 wounds.
The simulation of authentic-looking data was accomplished in fewer than 15 lines of code. Further complexity could be added, such as censoring (interval, left, or right), truncation, or lower and upper bounds on the starting area.
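As one example of such an extension, here is a minimal sketch of a simple form of right censoring: follow-up for each wound ends at a randomly chosen week, and every area after that week is treated as unobserved. The function name `right_censor`, the uniform dropout rule, and the toy input array are all assumptions made for illustration, not part of the simulation above.

```python
import numpy as np


def right_censor(areas, max_followup_weeks, seed=1234):
    """Blank out observations after a randomly drawn last visit for each wound."""
    rng = np.random.default_rng(seed)
    n_wounds, n_weeks = areas.shape
    # Hypothetical dropout rule: the last observed week is uniform on
    # 1..max_followup_weeks (inclusive), drawn independently for each wound.
    last_week = rng.integers(1, max_followup_weeks + 1, size=n_wounds)
    weeks = np.arange(1, n_weeks + 1)
    # True where a week falls on or before that wound's last visit.
    observed = weeks[None, :] <= last_week[:, None]
    # Unobserved areas become NaN; observed areas are left unchanged.
    return np.where(observed, areas, np.nan), last_week


# Toy input: 3 wounds shrinking linearly over 12 weeks (not simulated data).
toy_areas = np.tile(np.linspace(8.0, 2.0, 12), (3, 1))
censored, last_week = right_censor(toy_areas, max_followup_weeks=12)
print(last_week)
print(censored)
```

The same function can be applied to the array returned by the simulation, since it only assumes one row per wound and one column per week.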
Simulation is an absolute must in the toolkit of any wound-care researcher.