Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.
How can we create good reproducible examples for pandas questions?
Simple dataframes can be put together, e.g.:
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
But many example datasets need more complicated structure, e.g.:
- datetime indices or data
- Multiple categorical variables (is there an equivalent to R’s expand.grid() function, which produces all possible combinations of some given variables?)
- MultiIndex data
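For reference, each of the three structures listed above can be mocked up in a few lines. This is only a sketch: the variable names, dates, and category labels are invented for illustration, and itertools.product is used here as one rough stand-in for R's expand.grid(), not as an official equivalent.

```python
import itertools

import numpy as np
import pandas as pd

# 1. A datetime index: five consecutive days from an arbitrary start date.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
ts = pd.DataFrame({"value": np.arange(5)}, index=dates)

# 2. Every combination of several categorical variables
#    (roughly what R's expand.grid() produces). Labels are made up.
sex = ["F", "M"]
group = ["control", "treated"]
grid = pd.DataFrame(list(itertools.product(sex, group)), columns=["sex", "group"])

# 3. MultiIndex data built from the same combinations.
mi = pd.MultiIndex.from_product([sex, group], names=["sex", "group"])
multi = pd.DataFrame({"score": range(len(mi))}, index=mi)
```

pd.MultiIndex.from_product gives the same combinations as the itertools version, but as an index rather than as ordinary columns.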
For datasets that are hard to mock up using a few lines of code, is there an equivalent to R’s dput() that allows you to generate copy-pasteable code to regenerate your data structure?
The Good:
Include a small example DataFrame, either as runnable code or in a form that is “copy and pasteable” using pd.read_clipboard(sep=r'\s\s+'). Test it yourself to make sure it works and reproduces the issue.
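To see why a separator of two or more whitespace characters is handy, here is a minimal sketch. pd.read_clipboard hands the clipboard text to pd.read_csv, so the example feeds a string through pd.read_csv directly, which lets it run without a clipboard. The table contents and column names are hypothetical.

```python
import io

import pandas as pd

# Text as it might be pasted into a question: columns are separated by
# two or more spaces, while single spaces occur inside values.
text = """\
full name    income
Bob Smith    40000
Jane Doe     50000
"""

# A regex separator needs the Python parsing engine.
df = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")
```

With a plain whitespace separator, "Bob Smith" would be split into two fields; requiring at least two whitespace characters keeps it together.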
Does the issue still occur with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
Every rule has an exception, though. The obvious one is performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate the data in code. Consider using np.random.seed so we have the exact same frame. Having said that, “make this code fast for me” is not strictly on topic for the site.
df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the data and columns from df.to_dict('split').
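A rough sketch of how the two ideas above fit together: seed the random generator so everyone gets the same frame, then dump the frame with the 'split' orient and rebuild it from the pasted dict. The frame size and column names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so everyone generates the same frame
df = pd.DataFrame(np.random.randn(3, 2), columns=["a", "b"])

# The 'split' orient returns a dict with 'index', 'columns' and 'data' keys.
spec = df.to_dict("split")
print(spec)  # this printed dict is what would be pasted into a question

# Anyone can rebuild the frame from the pasted dict.
rebuilt = pd.DataFrame(spec["data"], index=spec["index"], columns=spec["columns"])
```

For a performance question the frame would be much larger, in which case posting the seeded generation code is more practical than posting the dict.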
Explain where the numbers come from, and say what is incorrect about the result you get.
Aside: the answer here is to use df.groupby('A', as_index=False).sum().
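To illustrate what as_index=False changes, here is a small sketch on a made-up frame. The columns A and B and their values are assumptions for illustration, not the data from the original question.

```python
import pandas as pd

# Hypothetical frame; the values are invented for illustration.
df = pd.DataFrame({"A": [1, 1, 2], "B": [2, 3, 4]})

# Default behaviour: the grouping key A becomes the index of the result.
by_index = df.groupby("A").sum()

# With as_index=False, A stays an ordinary column.
as_column = df.groupby("A", as_index=False).sum()
```

In this frame the 5 in the result is the sum of column B over the rows where A is 1, which is the kind of explanation that helps readers check the desired output.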
If datetime columns are involved, apply pd.to_datetime to them for good measure. Sometimes this is the issue itself: they were strings.
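A minimal sketch of the string-versus-datetime pitfall, using a hypothetical frame whose when column starts out as plain strings:

```python
import pandas as pd

# Hypothetical frame; 'when' holds date strings, not datetimes.
df = pd.DataFrame({"when": ["2024-01-01", "2024-01-02"], "value": [1, 2]})

# Check the column type before and after the conversion.
before = pd.api.types.is_datetime64_any_dtype(df["when"])
df["when"] = pd.to_datetime(df["when"])
after = pd.api.types.is_datetime64_any_dtype(df["when"])

# Datetime accessors such as .dt only work after the conversion.
days = df["when"].dt.day.tolist()
```

Including the conversion in the example means answerers see the same column types you do.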
The Bad:
The correct way is to include an ordinary DataFrame with a set_index call.
Be specific about how you got the numbers (what are they)… double check they’re correct.
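The set_index approach mentioned above might look like the following sketch. The column names and values are invented; the point is only that a flat, copy-pasteable frame plus one set_index call reproduces the MultiIndex.

```python
import pandas as pd

# Hypothetical data; build a flat frame first, then promote two
# columns to a MultiIndex with a single set_index call.
df = pd.DataFrame(
    {
        "year": [2023, 2023, 2024, 2024],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "sales": [10, 12, 11, 15],
    }
).set_index(["year", "quarter"])
```

Readers can run this as-is, whereas the printed form of a MultiIndex frame is hard to paste back into pandas.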
On that note, you might also want to include the version of Python, your OS, and any other libraries. You could use pd.show_versions() or the session_info package (which shows loaded libraries and Jupyter/IPython environment).
The Ugly:
Most data is proprietary; we get that. Make up similar data and see if you can reproduce the problem (something small).
Essays are bad; it’s easier with small examples.
Please, we see enough of this in our day jobs. We want to help, but not like this… Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.