Downloading public datasets

methylprep can load public data in any of the following formats:

  • idat
  • processed tab delimited (txt)
  • processed csv
  • processed xlsx
  • pickled dataframes (pkl) created using methylprep process or run_pipeline
    • The data frame should have probe names along one axis (rows or columns) and samples along the other, so that each cell holds one sample's value for one probe.
    • The metadata data frame can store any sample characteristics, as long as one of them, the sample name, matches the Sentrix_Position sample name that Illumina arrays produce by default.
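
The two requirements above can be checked before loading a pickled data frame. The following is a minimal sketch using pandas only, not part of methylprep: the helper names, the `Sample_ID` column, the probe-name prefixes used as a heuristic, and the sample names are all assumptions made for illustration.

```python
import pandas as pd


def probes_as_rows(df, probe_prefixes=("cg", "ch", "rs")):
    """Return df with probe names in the index and samples in the columns.

    Assumption: probe names can be recognised by a prefix such as 'cg'.
    """
    def looks_like_probes(labels):
        labels = [str(x) for x in labels]
        return bool(labels) and all(x.startswith(probe_prefixes) for x in labels)

    if looks_like_probes(df.index):
        return df          # already probes x samples
    if looks_like_probes(df.columns):
        return df.T        # samples x probes, so transpose
    raise ValueError("neither axis looks like probe names")


def unmatched_samples(df, meta, name_column="Sample_ID"):
    """List sample columns of df that have no matching name in the metadata."""
    known = set(meta[name_column].astype(str))
    return [s for s in df.columns if str(s) not in known]


# Made-up values: two samples (rows) by three probes (columns).
betas = pd.DataFrame(
    {
        "cg00000029": [0.81, 0.77],
        "cg00000108": [0.12, 0.15],
        "cg00000165": [0.45, 0.52],
    },
    index=["200000000001_R01C01", "200000000001_R02C01"],
)
# Hypothetical metadata: the 'Sample_ID' column holds the sample names.
meta = pd.DataFrame(
    {
        "Sample_ID": ["200000000001_R01C01", "200000000001_R02C01"],
        "tissue": ["blood", "saliva"],
    }
)

oriented = probes_as_rows(betas)
print(oriented.shape)
print(unmatched_samples(oriented, meta))
```

An empty list from `unmatched_samples` means every sample column has a metadata row; any names it returns are samples whose metadata would need to be fixed or added.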

download from GEO

(base) $ python -m methylprep download -i GSE122126 -d GEO/GSE122126
GSE122126_family.xml
GSE122126:   3%|█▉                                                            | 12.3M/407M [00:07<05:57, 1.10Mb/s]
GSE122126_family.xml
GSE122126_family.xml
GSE122126:   7%|████▎                                                          | 121M/1.77G [01:24<42:48, 644kb/s]

If you choose a dataset that lacks raw idat files, methylprep warns you and skips the download.

(base) $ python -m methylprep download -i GSE123211 -d GEO/GSE123211
[!] Geo data set GSE123211 probably does NOT contain usable raw data (in .idat format). Not downloading.
failed to download successfully.

If you want to use the author’s processed data instead of reprocessing it yourself, download the .gz file with a web browser, gunzip it to produce a txt, pkl, xlsx, or csv file, and then load that file with methylprep.read_geo.
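
If gunzip is not available on the command line, the decompression step can be done from Python instead. This is a minimal sketch using only the standard library; the `gunzip` helper and its `keep_original` parameter are hypothetical names, not part of methylprep.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path, keep_original=True):
    """Decompress 'name.ext.gz' next to itself as 'name.ext'; return the new path."""
    gz_path = Path(gz_path).expanduser()
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # drops only the trailing '.gz'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)    # stream in chunks, not all at once
    if not keep_original:
        gz_path.unlink()
    return out_path
```

The returned path keeps the original extension (for example `.txt` or `.csv`), so it can be passed on to methylprep.read_geo.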

loading processed GEO data

import methylprep
import methylcheck
from pathlib import Path

# Path does not expand '~' on its own, so call expanduser() first
df = methylprep.read_geo(Path('~/Downloads', 'GSE115278_Matrix_processed.txt').expanduser())
# or
df = methylprep.read_geo(Path('~/Downloads', 'GSE111165_data_processed_detection_p_val_EPIC.csv').expanduser())