Here we will provide a few examples of how to use various
methylprep functions. We’ll focus on examples using the CLI, as that is the recommended interface for
methylprep, but there’s also a section near the end of the tutorial that demonstrates how to run
methylprep from an IDE.
Note: sample sheets are recommended but not necessary to run
methylprep. We will cover a few ways to work with a data set without a sample sheet.
If you haven’t done so already, run this command from your terminal to install methylprep:
>>> pip install methylprep
Downloading from GEO
The first step in this tutorial will be using
methylprep to acquire a dataset from GEO.
methylize works best when users include a sample sheet along with the IDAT files. The metadata contained in sample sheets is useful when running QC or analyses.
When downloading from a GEO dataset,
download will attempt to find and download the associated sample sheet. If there is none,
methylprep will automatically generate one. Users may make one of their own as well. The Illumina sample sheet is the standard format.
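For users who want to make a sample sheet by hand, the following is a minimal sketch of how one might be assembled from IDAT file names. It assumes the files are named `<Sentrix_ID>_<Sentrix_Position>_Grn.idat` (optionally prefixed with a GSM accession, optionally gzipped) and writes only the three columns `Sample_Name`, `Sentrix_ID`, and `Sentrix_Position`. The function names and the output file name are hypothetical, and this is not the routine methylprep itself uses to generate sample sheets.

```python
import csv
import re
from pathlib import Path

# Assumed naming scheme: [GSM123_]<Sentrix_ID>_<Sentrix_Position>_<Grn|Red>.idat[.gz]
IDAT_PATTERN = re.compile(
    r"^(?:(?P<gsm>GSM\d+)_)?(?P<sentrix_id>\d+)_(?P<position>R\d{2}C\d{2})"
    r"_(?:Grn|Red)\.idat(?:\.gz)?$"
)


def build_sample_rows(data_dir):
    """Return one row per sample found among the IDAT files under data_dir."""
    samples = {}
    for path in sorted(Path(data_dir).rglob("*.idat*")):
        match = IDAT_PATTERN.match(path.name)
        if match is None:
            continue  # file name does not follow the assumed scheme
        key = (match["sentrix_id"], match["position"])
        # Use the GSM accession as the sample name when the file name has one.
        name = match["gsm"] or f"{key[0]}_{key[1]}"
        samples[key] = {
            "Sample_Name": name,
            "Sentrix_ID": key[0],
            "Sentrix_Position": key[1],
        }
    return [samples[key] for key in sorted(samples)]


def write_sample_sheet(data_dir, filename="samplesheet.csv"):
    """Write the rows to a CSV file (hypothetical name) inside data_dir."""
    out_path = Path(data_dir) / filename
    fieldnames = ["Sample_Name", "Sentrix_ID", "Sentrix_Position"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(build_sample_rows(data_dir))
    return out_path
```

Full Illumina sample sheets usually carry extra header sections and metadata columns, so a file produced this way is best treated as a starting point to fill in by hand.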
Note: if you already know you’d like to work with pre-processed GEO data (like beta values), check the beta_bake section of the special cases tutorial.
Our GEO dataset
For our tutorial, we will download GEO data from this experiment, where researchers examined differential methylation among high and low grade gliomas (an aggressive form of brain tumor). This is a relatively small dataset (16 samples) of surgical resections processed using the Infinium MethylationEPIC assay.
The authors analyzed methylation data from 6 II-grade, 2 III-grade and 8 IV-grade gliomas from individual surgical resections. Stage I and II gliomas are defined as low grade, while stage III and IV are defined as high grade.
We will run the
download command with the
-v option to get more information about what processes are running. This should be run from the command line.
>>> python -m methylprep -v download -i GSE147391 -d <filepath>
Here, <filepath> is the directory where you would like the processed data and output files to be stored.
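If you would rather launch the same download from a script, one option is to call the CLI through Python's `subprocess` module. The sketch below only assembles and runs the command shown above; the helper names `build_download_command` and `download_geo_dataset` are hypothetical and not part of methylprep.

```python
import subprocess
import sys


def build_download_command(geo_id, data_dir, verbose=True):
    """Assemble the argument list for the download command shown above."""
    command = [sys.executable, "-m", "methylprep"]
    if verbose:
        command.append("-v")
    command += ["download", "-i", geo_id, "-d", str(data_dir)]
    return command


def download_geo_dataset(geo_id, data_dir):
    """Run the download; requires methylprep to be installed."""
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(build_download_command(geo_id, data_dir), check=True)
```

Using `sys.executable` keeps the call inside the same Python environment that is running the script, which matters when methylprep is installed in a virtual environment.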
methylprep will search the GEO database and find the dataset we indicated. It will also find the file with associated metadata and parse through it, indicating any extra fields in the sample sheet. In our case, there are 5 additional fields, which will come in handy in later analysis:
INFO:methylprep.processing.pipeline:Found 5 additional fields in sample_sheet: source | histological diagnosis --> histological_diagnosis | gender | description | Sample_ID
methylprep will begin downloading, unpacking, and processing the IDAT files after downloading the file manifest and parsing the sample sheet. The automatic processing can be turned off with the
Processing time depends on the type of array (450k arrays process faster than EPIC arrays) and on the size of the dataset. Datasets with more than 100 samples will, by default, be chunked into batches of 100 for more efficient processing. Batch size is adjustable with the
After the files are processed, we’re ready to load the files into
methylcheck for QC. See
methylcheck documentation for instructions and more details.
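As a rough picture of the batching described above, the sketch below splits a list of sample IDs into consecutive batches of at most 100. It illustrates the idea only; `chunk_samples` is a hypothetical helper, not methylprep's internal implementation.

```python
def chunk_samples(sample_ids, batch_size=100):
    """Split sample_ids into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sample_ids[start:start + batch_size]
        for start in range(0, len(sample_ids), batch_size)
    ]


# Example: 250 made-up sample IDs fall into batches of 100, 100, and 50.
batches = chunk_samples([f"sample_{i}" for i in range(250)])
print([len(batch) for batch in batches])
```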
Processing Your Own Data
Users often have their own IDAT files that they would like to process. Instead of using the
download command, we will use the
process command. This is the main workhorse of the
methylprep package, and includes extra options for customization that the
download command lacks.
Users should note that the sample sheet is optional, but the **manifest file** is not. Make sure there is a manifest file included with the IDAT data, especially if you are analyzing data from a custom array, as
methylprep will use the manifest file to determine what probes are included in the custom array, which control probes are present, etc.
Note: If users are interested in processing their own files, the folder containing the IDATs, manifest, and sample sheet needs to be unzipped before being processed.
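Before processing, it can help to check that the folder really is unpacked and that each sample has both of its IDAT files. The following is a minimal sketch of such a check, assuming the usual `_Grn.idat` / `_Red.idat` file-name suffixes; `check_data_dir` is a hypothetical helper and is not part of methylprep.

```python
from pathlib import Path


def check_data_dir(data_dir):
    """Summarise what a directory holds before running `process` on it."""
    names = [p.name for p in Path(data_dir).rglob("*") if p.is_file()]
    # The text before the channel suffix identifies the sample.
    green = {n.split("_Grn.idat")[0] for n in names if "_Grn.idat" in n}
    red = {n.split("_Red.idat")[0] for n in names if "_Red.idat" in n}
    return {
        "paired_samples": sorted(green & red),    # both channels present
        "unpaired_samples": sorted(green ^ red),  # one channel missing
        "zip_archives": sorted(n for n in names if n.endswith(".zip")),
        "csv_files": sorted(n for n in names if n.lower().endswith(".csv")),
    }
```

Any entries under `zip_archives` or `unpaired_samples` are worth resolving before running the command; the `csv_files` list is a quick way to confirm that a manifest or sample sheet is present.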
Once again, we’ll run this from the command line:
>>> python -m methylprep process -d <filepath> --all
<filepath> specifies where the manifest and IDAT files (and sample sheet, if any) are stored.
The --all option at the end tells
methylprep to save output for ALL of the associated processing steps.
By default, the output is usually:
These files will be saved in the same directory as the data.
methylprep will also create two folders to store processed .csv files. If users are interested in the probes that failed poobah, they are included in the .csv file outputs.
The default output is all that is needed to run QC reports from
methylcheck. However, other useful outputs like the poobah_values.pkl can optionally be included in the QC functions for greater detail on sample quality.
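To take a quick look at the saved output from Python, the pickled files can be read back with pandas. The sketch below loads every `.pkl` file it finds in the data directory rather than relying on particular file names; `load_pickled_outputs` is a hypothetical helper, and it assumes the files were written in a format that `pandas.read_pickle` can read.

```python
from pathlib import Path

import pandas as pd


def load_pickled_outputs(data_dir):
    """Load every .pkl file in data_dir into a dict keyed by file stem."""
    data_dir = Path(data_dir).expanduser()
    return {
        path.stem: pd.read_pickle(path)
        for path in sorted(data_dir.glob("*.pkl"))
    }
```

A typical use is to loop over the result, for example `for name, obj in load_pickled_outputs(folder).items(): print(name, getattr(obj, "shape", type(obj)))`. Only unpickle files that come from a source you trust.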
Using methylprep from a Jupyter Notebook
methylprep also offers a scikit-learn style interface for users to run within Jupyter notebooks or a similar IDE. We recommend using the CLI for
methylprep (IDEs tend to process more slowly, especially for large batches), but users are able to get most of the package’s functionality from within an IDE.
The run_pipeline function is the IDE equivalent of
methylprep process. The output, if left to default, is a set of data containers. Alternatively, users may specify
m_values=True to get a dataframe of their chosen values. We also recommend setting
export=True to get output files saved as well. That way you can easily load the data in the future instead of running this command every time you open the notebook. The only required argument for this function is the directory where the raw data is stored.
run_pipeline will take almost all of the same arguments that
process will take from the CLI.
from methylprep import run_pipeline
from pathlib import Path

# expanduser() resolves "~" to the home directory; Path does not do this on its own
filepath = Path('~/tutorial/GPL21145').expanduser()
data_containers = run_pipeline(filepath, export=True)
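To avoid re-running the pipeline every time the notebook is opened, one possible pattern is to keep a copy of the returned dataframe next to the raw data and reload it when present. The sketch below assumes that `run_pipeline(..., m_values=True)` returns a pandas dataframe, as described above. The cache file name and the `load_or_process` helper are hypothetical and are not part of methylprep.

```python
from pathlib import Path

import pandas as pd

CACHE_NAME = "m_values_cache.pkl"  # hypothetical file name chosen for this sketch


def load_or_process(data_dir):
    """Return cached M-values if present; otherwise run the pipeline once."""
    data_dir = Path(data_dir).expanduser()
    cache_path = data_dir / CACHE_NAME
    if cache_path.exists():
        return pd.read_pickle(cache_path)

    # Imported here so that the cached path works even without methylprep.
    from methylprep import run_pipeline

    m_values = run_pipeline(data_dir, m_values=True, export=True)
    m_values.to_pickle(cache_path)  # assumes a DataFrame was returned
    return m_values
```

Delete the cache file whenever the raw data or the processing options change, otherwise stale values will be loaded.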
For more information on optional arguments, exports, etc, run
Users may specify [‘all’] for both the ‘steps’ and ‘exports’ arguments to get the same functionality as the CLI code
methylprep process -d <filepath> --all. Alternatively, if users are interested in customizing their pipeline, they may list the steps they’d like to include and leave out any they would like to skip.
import methylprep
from pathlib import Path

# expanduser() resolves "~" to the home directory
filepath = Path('~/tutorial').expanduser()

# choose any combination of:
# ['infer_channel_switch', 'poobah', 'quality_mask', 'noob', 'dye_bias']
mysteps = ['all']

# choose any combination of:
# ['csv', 'poobah', 'meth', 'unmeth', 'noob_meth', 'noob_unmeth', 'sample_sheet_meta_data', 'mouse', 'control']
myexports = ['all']

# make_pipeline will auto-run the pipeline
# make_pipeline will also take the same kwargs as run_pipeline
methylprep.make_pipeline(data_dir=filepath, steps=mysteps, exports=myexports, estimator=None)
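When listing steps and exports by hand, a misspelled name is easy to miss. The sketch below checks a selection against the option names listed in the comments above before the pipeline is started. `validate_choices` is a hypothetical helper, the option lists are copied from this tutorial rather than read from methylprep, and make_pipeline may perform its own validation.

```python
# Option names copied from the comments in the example above.
KNOWN_STEPS = ["infer_channel_switch", "poobah", "quality_mask", "noob", "dye_bias"]
KNOWN_EXPORTS = [
    "csv", "poobah", "meth", "unmeth", "noob_meth", "noob_unmeth",
    "sample_sheet_meta_data", "mouse", "control",
]


def validate_choices(chosen, known, label):
    """Return a copy of chosen, or raise ValueError on unrecognised names."""
    if list(chosen) == ["all"]:
        return ["all"]
    unknown = [name for name in chosen if name not in known]
    if unknown:
        raise ValueError(
            f"Unknown {label}: {unknown}. Choose from {known} or use ['all']."
        )
    return list(chosen)


# Example: keep background and dye-bias correction, export only CSV files.
mysteps = validate_choices(["noob", "dye_bias"], KNOWN_STEPS, "steps")
myexports = validate_choices(["csv"], KNOWN_EXPORTS, "exports")
```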
For more information on this function, run