Module contents

methylprep.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)

Generates a Manifest instance for a given collection of raw datasets, downloading the manifest file if necessary.

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary. (default: {None})
Returns:
[Manifest] – A Manifest instance.
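
Example (a minimal sketch; the IDAT directory is hypothetical, and the sample sheet and raw datasets come from the other helpers documented on this page):

    import methylprep

    sample_sheet = methylprep.get_sample_sheet('/path/to/idats')  # hypothetical directory
    raw_datasets = methylprep.get_raw_datasets(sample_sheet)
    manifest = methylprep.get_manifest(raw_datasets)  # array_type is inferred from probe counts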
methylprep.get_raw_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False)

Generates a collection of RawDataset instances for the samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – pass in an S3ZipReader object to extract IDAT files from a zip file hosted on S3. (default: {None})
meta_only {True/False} – if True, does not read IDAT files; only parses their metadata. (RawMetaDataset is the same as RawDataset but stores no IDAT probe values, because they are not needed in the pipeline.) (default: {False})
Raises:
ValueError: If the number of probes differs between raw datasets.
Returns:
[list(RawDataset)] – A collection of RawDataset instances, one per sample.
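
Example (a sketch; the directory and sample name are hypothetical):

    import methylprep

    sample_sheet = methylprep.get_sample_sheet('/path/to/idats')
    raw_datasets = methylprep.get_raw_datasets(sample_sheet)  # all samples in the sheet
    one_dataset = methylprep.get_raw_datasets(sample_sheet, sample_name='Sample_1')
    meta_datasets = methylprep.get_raw_datasets(sample_sheet, meta_only=True)  # no probe values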
methylprep.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, meta_data_frame=True, bit='float64')

The main CLI processing pipeline. This runs every processing step and returns a data set.

Arguments:
data_dir [required]
path where the IDAT files and samplesheet CSV can be found.
array_type [default: autodetect]
27k, 450k, EPIC, or EPIC+. If omitted, the array type will be autodetected.
export [default: False]
if True, exports a CSV of the processed data for each sample.
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples.
m_value
if True, saves a pickle (m_values.pkl) of m-values for all samples.
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect the sample sheet if omitted.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_names are specified, batch_size will be ignored (large batches must process all samples).
make_sample_sheet [optional]
if True, generates a sample sheet from the IDAT files, called ‘samplesheet.csv’, so that processing will work. From the CLI, pass in “--no_sample_sheet” to trigger sample sheet auto-generation.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
bit [optional]
Change the processed beta or m-value data type from float64 to float16 or float32. This will make files smaller, often with no loss in precision. float16 files can be about 25% smaller.
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned.

betas
if True, will return a single dataframe of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single dataframe of m-values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than 200 samples, nothing is returned, but all the files are saved. You can recreate the output by loading the saved files (see methylprep.load below).

Processing note:
The sample_sheet parser will ensure every sample has a unique name, assigning one (e.g. Sample1) if missing, or appending a number (e.g. _1) if not unique. This may cause sample sheets and processed dataframes to not match up; this will be fixed in a future version.
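
Example (a minimal sketch; the data directory is hypothetical):

    import methylprep

    # return one wide matrix of beta values (rows: samples, columns: probes)
    betas = methylprep.run_pipeline('/path/to/idats', betas=True)

    # or process in batches; each batch is saved as beta_values_N.pkl,
    # which the load helper (documented below) can recombine
    methylprep.run_pipeline('/path/to/idats', betas=True, batch_size=100)
    betas = methylprep.load('/path/to/idats')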
methylprep.get_sample_sheet(dir_path, filepath=None)

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – path of the sample sheet file; if not provided,
the directory will be searched for one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
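
Example (a sketch; both paths are hypothetical):

    import methylprep

    sheet = methylprep.get_sample_sheet('/path/to/idats')  # searches the directory for a sheet
    sheet = methylprep.get_sample_sheet('/path/to/idats', filepath='/path/to/idats/samplesheet.csv')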
methylprep.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float64')

Given data_containers (a list of processed SampleDataContainer objects), this will transform the results into a single dataframe containing the chosen function’s values, with probe names in rows and one column per sample.

Input:
data_containers – the output of run_pipeline(): a list of SampleDataContainer objects.
Arguments for postprocess_func_colname:
calculate_beta_value --> ‘beta_value’
calculate_m_value --> ‘m_value’
calculate_copy_number --> ‘cm_value’

note: these functions are hard-coded in pipeline.py as part of the process_all() step.

Options:
bit (float16, float32, float64) – change the default data type from float64
to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions or with outside packages, so float64 is the default. This can be specified from the methylprep process command line.
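
Example (a sketch; assumes run_pipeline was already run, with defaults, on a hypothetical directory):

    import methylprep

    containers = methylprep.run_pipeline('/path/to/idats')  # list of SampleDataContainer objects
    betas = methylprep.consolidate_values_for_sheet(
        containers, postprocess_func_colname='beta_value', bit='float32')  # probes in rows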
methylprep.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True)

Downloads the IDATs and metadata for a series, then generates one metadata dictionary and one beta value matrix for each platform in the series.

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries directory and a beta values directory have been created for each platform (if not, one of each will be created).
dict_only
if True, only downloads IDAT files and metadata and creates data dictionaries for each platform.
batch_size
the batch size to use when processing samples (the number of samples run at a time). Defaults to 100.
clean
if True, removes intermediate processing files.
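
Example (a sketch; the series ID and output path are hypothetical):

    import methylprep

    # download the IDATs and metadata for one GEO series,
    # then process the samples 100 at a time
    methylprep.run_series('GSE000000', '/path/to/output', batch_size=100)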
methylprep.run_series_list(list_file, path, dict_only=False, batch_size=100)

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta values.

Arguments:
list_file [required]
the name of the file containing a list of GEO and/or ArrayExpress series IDs to download and process. This file must be located in the directory the data is downloaded to. Each line of the file should contain one series ID.
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries directory and a beta values directory have been created for each platform (if not, one of each will be created).
dict_only
if True, only downloads data and creates dictionaries for each platform.
batch_size
the batch size to use when processing samples (the number of samples run at a time). Defaults to 100.
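
Example (a sketch; the file name and path are hypothetical; the list file sits inside the download directory and holds one series ID per line):

    import methylprep

    methylprep.run_series_list('series_list.txt', '/path/to/output', dict_only=True)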
methylprep.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False)
This scans data_dir for an XML file with the geo_id in its name, parses it, and saves the useful metadata to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in its name in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful when you have IDATs and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets before passing them into the processor. If specified, this keyword needs to appear somewhere in the values of the samplesheet.
sync_idats:
If flagged, this will search data_dir for IDATs and remove any that are not listed in the filtered samplesheet. Requires you to run the download function first to get all IDATs, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
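
Example (a sketch; the GEO ID and directory are hypothetical):

    import methylprep

    # parse the MINiML file (downloading it first if absent) into
    # sample_sheet_meta_data.pkl, and prune IDATs missing from the samplesheet
    methylprep.convert_miniml('GSE000000', data_dir='/path/to/output', sync_idats=True)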

methylprep.load(filepath='.', format='beta_values', file_stem='', verbose=False, silent=False)
When methylprep processes large datasets, you use the ‘batch_size’ option to keep memory and file sizes more manageable. Use the load helper function to quickly load and combine all of those parts into a single dataframe of beta values or m-values.

Doing this with pandas is about 8 times slower than using numpy in the intermediate step.

If no arguments are supplied, it will load all files in the current directory that match the ‘beta_values_X.pkl’ pattern.

Arguments:
filepath:
Where to look for all the pickle files of processed data.
format:
‘beta_values’, ‘m_value’, or some other custom file pattern.
file_stem (string):
By default, methylprep process with batch_size creates a series of generically named files, such as ‘beta_values_1.pkl’, ‘beta_values_2.pkl’, ‘beta_values_3.pkl’, and so on. If you rename these or provide a custom name during processing, provide that name here. (e.g. if your pickle file is called ‘GSE150999_beta_values_X.pkl’, then your file_stem is ‘GSE150999_’)
verbose:
outputs more processing messages.
silent:
suppresses all processing messages, even warnings.
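
Example (a sketch; the directory is hypothetical, and the GSE150999_ stem is borrowed from the description above):

    import methylprep

    betas = methylprep.load('/path/to/output')                          # beta_values_*.pkl parts
    betas = methylprep.load('/path/to/output', file_stem='GSE150999_')  # GSE150999_beta_values_*.pkl
    m_vals = methylprep.load('/path/to/output', format='m_value')       # m-value parts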
methylprep.load_both(filepath='.', format='beta_values', file_stem='', verbose=False, silent=False)
Loads any pickled files in the given filepath that match the specified format, plus the associated metadata dataframe. Returns TWO objects (data, meta) as dataframes for analysis.

If meta_data files are found in multiple folders, it will read them all and try to match them to the samples in the beta_values pickles by sample ID.

Arguments:
filepath:
Where to look for all the pickle files of processed data.
format:
‘beta_values’, ‘m_value’, or some other custom file pattern.
file_stem (string):
By default, methylprep process with batch_size creates a series of generically named files, such as ‘beta_values_1.pkl’, ‘beta_values_2.pkl’, ‘beta_values_3.pkl’, and so on. If you rename these or provide a custom name during processing, provide that name here. (e.g. if your pickle file is called ‘GSE150999_beta_values_X.pkl’, then your file_stem is ‘GSE150999_’)
verbose:
outputs more processing messages.
silent:
suppresses all processing messages, even warnings.
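
Example (a sketch; the directory is hypothetical):

    import methylprep

    data, meta = methylprep.load_both('/path/to/output')  # (sample values, metadata) dataframes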
methylprep.read_geo(filepath, verbose=False, debug=False)
Use this to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M-values in the CSV/TXT/XLSX file and turn them into a clean dataframe, with probe IDs in the index/rows.

  • reads a downloaded file, in csv, xlsx, pickle, or txt format
  • looks for \d_RxxCxx patterned headings and a probe index
  • sets the dataframe index to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if the filename has ‘intensit’ or ‘signal’ in it, converts the values to betas and saves them; even if the filename doesn’t match, if columns contain ‘Methylated’, it will convert and save
  • detects multi-line headers and adjusts the dataframe columns accordingly
  • returns the usable dataframe
TODO:
  • handle files with .Signal_A and .Signal_B instead of Meth/Unmeth

if debug=True: does nothing.
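
Example (a sketch; the filename is hypothetical):

    import methylprep

    df = methylprep.read_geo('GSE000000_beta_values.csv', verbose=True)
    # df has probe IDs in the index and sample names as columns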