Downloading public datasets¶
methylprep provides methods to use public data in a variety of formats.
- processed tab-delimited files
- pickled dataframes (pkl) created using methylprep process or run_pipeline
- the dataframe should have probe names as columns or rows, with the sample probe values along the other dimension.
- the meta data dataframe can store any values for the samples, as long as one of those characteristics, the sample name, matches the Sentrix_Position sample name that Illumina arrays output by default.
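The sample-name requirement above can be checked before any analysis. The following is a minimal sketch in plain Python, not part of methylprep: the helper `check_sample_names` is hypothetical, and the sample names are made up for illustration. With dataframes, you would pass the sample axis of the probe data and the sample-name column of the meta data.

```python
def check_sample_names(data_sample_names, meta_sample_names):
    """Compare sample names in the probe data against those in the meta data.

    Returns (missing_meta, unused_meta) as sorted lists.
    """
    data = set(data_sample_names)
    meta = set(meta_sample_names)
    missing_meta = sorted(data - meta)  # samples with probe values but no meta data row
    unused_meta = sorted(meta - data)   # meta data rows with no matching probe values
    return missing_meta, unused_meta


# Made-up sample names, for illustration only (hypothetical values).
data_columns = ["200000000001_R01C01", "200000000001_R02C01", "200000000001_R03C01"]
meta_names = ["200000000001_R01C01", "200000000001_R02C01", "200000000001_R04C01"]

missing, unused = check_sample_names(data_columns, meta_names)
print("no meta data for:", missing)
print("no probe data for:", unused)
```

Both lists being empty means every sample in the probe data has a meta data row and vice versa.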
download from GEO¶
(base) $ python -m methylprep download -i GSE122126 -d GEO/GSE122126
INFO:methylprep.download.geo:Downloading GSE122126_family.xml
GSE122126: 3%|█▉ | 12.3M/407M [00:07<05:57, 1.10Mb/s]
INFO:methylprep.download.geo:Downloaded GSE122126_family.xml
INFO:methylprep.download.geo:Unpacking GSE122126_family.xml
GSE122126: 7%|████▎ | 121M/1.77G [01:24<42:48, 644kb/s]
If the dataset you choose lacks raw idat files, methylprep logs an error and does not download it.
(base) $ python -m methylprep download -i GSE123211 -d GEO/GSE123211
ERROR:methylprep.download.process_data:[!] Geo data set GSE123211 probably does NOT contain usable raw data (in .idat format). Not downloading.
ERROR:methylprep.download.process_data:Series failed to download successfully.
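For scripted use, the download command shown above can be assembled and launched from Python. This is a minimal sketch that only wraps the command line with the `-i` and `-d` options that appear in the examples. The helpers `build_download_command` and `download_series` are hypothetical names, not part of methylprep, and whether the process exit status reflects a failed download is an assumption you should verify against the log output.

```python
import subprocess
import sys


def build_download_command(geo_id, dest_dir):
    """Return the argument list for the methylprep download command."""
    # sys.executable keeps the call inside the currently active Python environment
    return [sys.executable, "-m", "methylprep", "download", "-i", geo_id, "-d", dest_dir]


def download_series(geo_id, dest_dir):
    """Run the download and return the process exit status.

    Assumption: a non-zero status signals failure. Check the log messages
    as well, since this behavior is not confirmed here.
    """
    result = subprocess.run(build_download_command(geo_id, dest_dir))
    return result.returncode


# Show the command for the first example without starting a download.
cmd = build_download_command("GSE122126", "GEO/GSE122126")
print(" ".join(cmd))

# download_series("GSE122126", "GEO/GSE122126") would start the actual download.
```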
If you want to use the author’s processed data instead of reprocessing it yourself, download the .gz file using a web browser, gunzip it to create a txt | pkl | xlsx | csv file, and then load that file using methylprep.read_geo.
loading processed GEO data¶
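import methylprep
import methylcheck
from pathlib import Path

# Path does not expand '~' on its own, so call expanduser() before loading
df = methylprep.read_geo(Path('~/Downloads', 'GSE115278_Matrix_processed.txt').expanduser())
# or
df = methylprep.read_geo(Path('~/Downloads', 'GSE111165_data_processed_detection_p_val_EPIC.csv').expanduser())
# plot the distribution of beta values for the loaded samples
methylcheck.beta_density_plot(df)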
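The gunzip step that precedes loading can also be done from Python with the standard library. This is a minimal sketch, assuming the downloaded file name ends in .gz; `gunzip_file` is a hypothetical helper, not a methylprep function.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress `gz_path` next to itself and return the path of the new file."""
    gz_path = Path(gz_path).expanduser()
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # 'data.txt.gz' -> 'data.txt'
    # Stream the contents so large matrices are not read into memory at once
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Unlike the gunzip command, this sketch leaves the original .gz file in place. The returned path can be passed to the loading call shown above.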