Module contents

methylprep.processing
methylprep.run_pipeline(data_dir[, …]) The main CLI processing pipeline.
methylprep.files.create_sample_sheet(dir_path) Creates a samplesheet.csv file from the .IDAT files of a GEO series directory
methylprep.download
methylprep.run_series(id, path[, dict_only, …]) Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series
methylprep.read_geo(filepath[, verbose, …]) Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values
methylprep.build_composite_dataset(…[, …]) A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria.
methylprep.models
methylprep.files
methylprep.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)

Generates a Manifest instance for the array type of a collection of RawDataset instances.

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary (default: {None})
Returns:
[Manifest] – A Manifest instance.
methylprep.get_raw_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False)

Generates a collection of RawDataset instances for the samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – Pass in an S3ZipReader object to extract IDAT files from a zipfile hosted on S3. (default: {None})
meta_only {True/False} – If True, does not read IDAT files; only parses the metadata about them. (A RawMetaDataset is the same as a RawDataset but stores no IDAT probe values, because they are not needed in the pipeline.) (default: {False})
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[RawDataset] – A RawDataset instance.
methylprep.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=False, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False)

The main CLI processing pipeline. This does every processing step and returns a data set.

Arguments:
data_dir [required]
path where idat files can be found, and samplesheet csv.
array_type [default: autodetect]
27k, 450k, EPIC, EPIC+ If omitted, this will autodetect it.
export [default: False]
if True, exports a CSV of the processed data for each idat file in sample.
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M-values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, representing raw, uncorrected meth probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_names are specified, batching by batch_size is not applied (large batches must process all samples).
make_sample_sheet [optional]
if True, generates a sample sheet from idat files called ‘samplesheet.csv’, so that processing will work. From CLI pass in “--no_sample_sheet” to trigger sample sheet auto-generation.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
save_uncorrected [optional]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth). does not apply noob correction to these values.
save_control [optional]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
bit [optional]
Change the processed beta or m_value data_type from float64 to float16 or float32. This will make files smaller, often with no loss in precision, if it works. sometimes using float16 will cause an overflow error and files will have “inf” instead of numbers. Use float32 instead.
poobah [False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
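
The overflow described for the bit option follows from the float16 format itself: its largest finite value is 65504, so larger numbers cannot be stored and end up as “inf”. A minimal sketch using only the standard library; fits_float16 is a hypothetical helper, not part of methylprep, and the intensity values are made up for illustration:

```python
import math

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_float16(values):
    """Return True if no non-NaN value exceeds the finite float16 range."""
    return all(abs(v) <= FLOAT16_MAX for v in values if not math.isnan(v))

betas = [0.02, 0.51, 0.98, float("nan")]   # beta values lie between 0 and 1
intensities = [1200.0, 48000.0, 70000.0]   # illustrative raw intensities

print(fits_float16(betas))        # True
print(fits_float16(intensities))  # False
```

Beta values always fit; it is unbounded quantities such as raw intensities where float32 is the safer choice.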
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned.

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M-values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than 200 samples, nothing is returned but all the files are saved. You can recreate the output by loading the files.

Processing note:
The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.
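
The renaming behaviour in the processing note can be pictured roughly as follows. This is a sketch under assumptions: make_unique_names is a hypothetical helper, and the exact placeholder and suffix scheme methylprep applies may differ. It also does not guard against a generated name colliding with an existing one.

```python
def make_unique_names(names):
    """Assign placeholder names to missing entries and suffix duplicates."""
    seen = {}
    result = []
    for position, name in enumerate(names, start=1):
        if not name:
            name = f"Sample{position}"   # missing name -> placeholder
        count = seen.get(name, 0)
        seen[name] = count + 1
        # the first occurrence keeps its name; repeats get _1, _2, ...
        result.append(name if count == 0 else f"{name}_{count}")
    return result

print(make_unique_names(["ctrl", "ctrl", None, "tumor", "ctrl"]))
# ['ctrl', 'ctrl_1', 'Sample3', 'tumor', 'ctrl_2']
```

Because the names in the output can differ from those in the original sheet, matching processed data back to the sheet is safer by Sentrix ID and position than by name.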
methylprep.get_sample_sheet(dir_path, filepath=None)

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – Path of the sample sheet file, if provided; otherwise the function will try to find one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
methylprep.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float32', poobah=False)

Given data_containers (a list of processed SampleDataContainer objects), this transforms the results into a single dataframe of the selected function’s values, with probe names in rows and one column of values per sample.

Input:
data_containers – the output of run_pipeline() is this, a list of data_containers.
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

note: these functions are hard-coded in pipeline.py as part of process_all() step.

Options:
bit (float16, float32, float64) – change the default data type from float32 to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions, or with outside packages, so float32 is default. This is specified from methylprep process command line.
poobah
If true, filters by the poobah_pval column. (beta m_val pass True in for this.)
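
The reshaping described above (probes in rows, one column per sample) can be sketched with plain dictionaries. This is an illustration only: consolidate is a hypothetical helper, the probe names are placeholders, and the real function works on SampleDataContainer objects and returns a pandas dataframe.

```python
def consolidate(per_sample_values):
    """Combine {sample_name: {probe: value}} into rows keyed by probe name.

    Returns (sample_names, table), where table[probe] lists the values in
    sample order; probes missing from a sample are filled with None.
    """
    sample_names = list(per_sample_values)
    probes = sorted({p for values in per_sample_values.values() for p in values})
    table = {
        probe: [per_sample_values[s].get(probe) for s in sample_names]
        for probe in probes
    }
    return sample_names, table

samples = {
    "S1": {"cg0001": 0.91, "cg0002": 0.10},
    "S2": {"cg0001": 0.88, "cg0002": 0.15},
}
columns, table = consolidate(samples)
print(columns)          # ['S1', 'S2']
print(table["cg0002"])  # [0.1, 0.15]
```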
methylprep.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True)

Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries directory and a beta values directory have been created for each platform (they will be created if not).
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
clean
if True, removes intermediate processing files
methylprep.run_series_list(list_file, path, dict_only=False, batch_size=100)

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing a list of GEO_IDS and/or Array Express IDs to download and process. This file must be located in the directory data is downloaded to. Each line of the file should contain the name of one data series ID.
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries directory and a beta values directory have been created for each platform (they will be created if not).
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch_size to use when processing samples (number of samples run at a time). By default is set to the constant 100.
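
The list_file format described above (one series ID per line) is simple to read. A minimal sketch, assuming blank lines should be skipped; read_series_ids is a hypothetical helper and the IDs and file name are placeholders:

```python
from pathlib import Path
import tempfile

def read_series_ids(list_path):
    """Return the non-empty, stripped lines of a series list file."""
    lines = Path(list_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "series_list.txt"            # hypothetical file name
    list_file.write_text("GSE000001\n\nE-MTAB-0000\n")   # placeholder IDs
    ids = read_series_ids(list_file)

print(ids)  # ['GSE000001', 'E-MTAB-0000']
```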
methylprep.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False)
This scans the datadir for an xml file with the geo_id in it.
Then it parses it and saves the useful stuff to a dataframe called “sample_sheet_meta_data.pkl”. DOES NOT REQUIRE idats.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if miniml file not in data_dir path, it will download it from web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
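
The effect of require_keyword can be illustrated on a samplesheet held as a list of row dictionaries. This is a sketch, not the library’s code: filter_rows_by_keyword is hypothetical, the case-insensitive match is an assumption, and the GSM IDs and column names are placeholders.

```python
def filter_rows_by_keyword(rows, keyword):
    """Keep rows in which the keyword appears in any value (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]

rows = [
    {"GSM_ID": "GSM0000001", "source": "whole blood", "group": "control"},
    {"GSM_ID": "GSM0000002", "source": "saliva", "group": "case"},
    {"GSM_ID": "GSM0000003", "source": "Blood", "group": "case"},
]
kept = filter_rows_by_keyword(rows, "blood")
print([row["GSM_ID"] for row in kept])  # ['GSM0000001', 'GSM0000003']
```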

methylprep.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values in the CSV/TXT/XLSX file and turn it into a clean dataframe, with probe ids in the index/rows.

Version 3 (introduced June 2020)

  • reads a downloaded file, either in csv, xlsx, pickle, txt
  • looks for /d_RxxCxx patterned headings and a probe index
  • sets index in df to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if filename has ‘intensit’ or ‘signal’ in it, this converts to betas and saves; even if the filename doesn’t match, it will convert and save if columns have ‘Methylated’ in them
  • detects multi-line headers and adjusts dataframe columns accordingly
  • returns the usable dataframe

as_beta == True – converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21) – some string of characters that precedes the number part of each sample in the columns of the file to be ingested.
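
To make the column_pattern idea concrete, the sketch below matches the three naming styles listed (Sample21, Sample_21, Sample 21) with one regular expression. sample_columns is a hypothetical helper and only handles exact column names; real GEO headers often carry extra suffixes that read_geo has to deal with separately.

```python
import re

def sample_columns(columns, prefix="Sample"):
    """Return {column_name: sample_number} for columns like Sample21, Sample_21, 'Sample 21'."""
    pattern = re.compile(rf"^{re.escape(prefix)}[ _]?(\d+)$", re.IGNORECASE)
    matches = {}
    for column in columns:
        found = pattern.match(column.strip())
        if found:
            matches[column] = int(found.group(1))
    return matches

header = ["ID_REF", "Sample21", "Sample_22", "Sample 23", "Detection Pval"]
print(sample_columns(header))
# {'Sample21': 21, 'Sample_22': 22, 'Sample 23': 23}
```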

FIXED:

[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
[x] BUG: can’t parse matrix_… files if uses underscores instead of spaces around sample numbers, or where sampleXXX has no separator.
[x] handle processed files with sample_XX
[x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
[x] pass in sample_column names from header parser so that logic is in one place
    (makes the output much larger, so add kwarg to exclude this)
[x] decimals (default 3) – round all probe beta/intensity/p values returned to this number of decimal places.
[x] bug: can only recognize beta samples if ‘sample’ in column name, or sentrix_id pattern matches columns.
    need to expand this to handle arbitrary sample naming styles (limited to one column per sample patterns)

TODO:
[-] BUG: meth_unmeth_pval works as_beta but not returning full data yet
[-] multiline header not working with all files yet.
notes:
this makes inferences based on strings in the filename, and based on the column names.
methylprep.detect_header_pattern(test, filename, return_sample_column_names=False)

test is a dataframe with first 100 rows of the data set, and all columns. makes all the assumptions easier to read in one place.

betas non-normalized matrix_processed matrix_signal series_matrix methylated_signal_intensities and unmethylated_signal_intensities _family

TODO: GSM12345-tbl-1.txt type files (in _family.tar.gz packages) are possible, but needs more work.

  • numbered samples handled differently from sample_ids in columns
  • won’t detect columns with no separators in strings
methylprep.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically: grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylcheck.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if miniml file not in data_dir path, it will download it from web.
extract_controls (False):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download. So it won’t try again on resume.

processing

class methylprep.processing.RawDataset(sample, green_idat, red_idat)

Wrapper for a sample and its pair of raw IdatDataset values.

Arguments:
sample {Sample} – A Sample parsed from the sample sheet.
green_idat {IdatDataset} – The sample’s GREEN channel IdatDataset.
red_idat {IdatDataset} – The sample’s RED channel IdatDataset.
Raises:
ValueError: If the IDAT file pair have differing number of probes.
TypeError: If an invalid Channel is provided when parsing an IDAT file.
filter_oob_probes(channel, manifest, idat_dataset)

this is the step where it appears that illumina_id (internal probe numbers) are matched to the AddressA_ID / B_IDs from manifest, which allows for ‘cgXXXXXXX’ probe names to be used later.

get_oob_controls(manifest)

Out-of-band controls are the mean intensity values for the channel in the opposite channel’s probes

class methylprep.processing.SampleDataContainer(raw_dataset, manifest, retain_uncorrected_probe_intensities=False, bit='float32', pval=False)

Wrapper that provides easy access to slices of data for a Sample, its RawDataset, and the pre-configured MethylationDataset subsets of probes.

Arguments:

raw_dataset {RawDataset} – A sample’s RawDataset for a single well on the processed array.
manifest {Manifest} – The Manifest for the correlated RawDataset’s array type.
bit (default: float32) – option to store data as float16 or float32 to save space.
pval (default: False) – whether to apply the p-value detection algorithm to remove unreliable probes (based on the signal/noise ratio of fluorescence). Uses the sesame method (pOOBah), based on out-of-band background levels.

Jan 2020: added .snp_(un)methylated property, used in postprocess.consolidate_control_snp()
Mar 2020: added p-value detection option
Mar 2020: added mouse probe post-processing separation

preprocess()

combines the methylated and unmethylated columns from the SampleDataContainer.

process_all()

Runs all pre and post-processing calculations for the dataset.

process_beta_value(input_dataframe)

Calculate Beta value from methylation data

process_copy_number(input_dataframe)

Calculate copy number value from methylation data

process_m_value(input_dataframe)

Calculate M value from methylation data
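
For orientation, the three quantities computed by these methods are conventionally defined from the methylated (M) and unmethylated (U) intensities as shown below. This is a sketch of the commonly used formulas, not a copy of methylprep’s code: the function names are hypothetical and the offsets (100 for beta, 1 for the M-value) are assumed conventions that this section does not state.

```python
import math

def beta_value(meth, unmeth, offset=100):
    """Beta = M / (M + U + offset); ranges from 0 to just under 1."""
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth, offset=1):
    """M-value = log2((M + offset) / (U + offset)); unbounded, 0 means M == U."""
    return math.log2((meth + offset) / (unmeth + offset))

def copy_number(meth, unmeth):
    """Copy-number value = log2(M + U), the total signal on a log scale."""
    return math.log2(meth + unmeth)

meth, unmeth = 3000.0, 1000.0
print(round(beta_value(meth, unmeth), 4))  # 0.7317
print(m_value(7, 1))                       # 2.0
print(copy_number(512, 512))               # 10.0
```

The offset in the beta formula keeps the ratio stable when both intensities are close to zero.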

methylprep.processing.get_manifest(raw_datasets, array_type=None, manifest_filepath=None)

Generates a Manifest instance for the array type of a collection of RawDataset instances.

Arguments:
raw_datasets {list(RawDataset)} – Collection of RawDataset instances that
require a manifest file for the related array_type.
Keyword Arguments:
array_type {ArrayType} – The type of array to process. If not provided, it
will be inferred from the number of probes in the IDAT file. (default: {None})
manifest_filepath {path-like} – Path to the manifest file. If not provided,
it will be inferred from the array_type and downloaded if necessary (default: {None})
Returns:
[Manifest] – A Manifest instance.
methylprep.processing.get_raw_datasets(sample_sheet, sample_name=None, from_s3=None, meta_only=False)

Generates a collection of RawDataset instances for the samples in a sample sheet.

Arguments:
sample_sheet {SampleSheet} – The SampleSheet from which the data originates.
Keyword Arguments:
sample_name {string} – Optional: one sample to process from the sample_sheet. (default: {None})
from_s3 {zip_reader} – Pass in an S3ZipReader object to extract IDAT files from a zipfile hosted on S3. (default: {None})
meta_only {True/False} – If True, does not read IDAT files; only parses the metadata about them. (A RawMetaDataset is the same as a RawDataset but stores no IDAT probe values, because they are not needed in the pipeline.) (default: {False})
Raises:
ValueError: If the number of probes between raw datasets differ.
Returns:
[RawDataset] – A RawDataset instance.
methylprep.processing.preprocess_noob(data_container)

the main preprocessing function. Applies background-subtraction and NOOB. Sets data_container.methylated and unmethylated values for sample.

methylprep.processing.run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_name=None, betas=False, m_value=False, make_sample_sheet=False, batch_size=None, save_uncorrected=False, save_control=False, meta_data_frame=True, bit='float32', poobah=False, export_poobah=False)

The main CLI processing pipeline. This does every processing step and returns a data set.

Arguments:
data_dir [required]
path where idat files can be found, and samplesheet csv.
array_type [default: autodetect]
27k, 450k, EPIC, EPIC+ If omitted, this will autodetect it.
export [default: False]
if True, exports a CSV of the processed data for each idat file in sample.
betas
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of M-values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files: ‘meth_values.pkl’ and ‘unmeth_values.pkl’ with the same dataframe structure, representing raw, uncorrected meth probe intensities for all samples. These are useful in some methylcheck functions and load/produce results 100X faster than loading from processed CSV output.
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download the appropriate one for you.
sample_sheet_filepath [optional]
it will autodetect if omitted.
sample_name [optional, list]
if you don’t want to process all samples, you can specify individual samples as a list. If sample_names are specified, batching by batch_size is not applied (large batches must process all samples).
make_sample_sheet [optional]
if True, generates a sample sheet from idat files called ‘samplesheet.csv’, so that processing will work. From CLI pass in “--no_sample_sheet” to trigger sample sheet auto-generation.
batch_size [optional]
if set to any integer, samples will be processed and saved in batches no greater than the specified batch size. This will yield multiple output files in the format of “beta_values_1.pkl … beta_values_N.pkl”.
save_uncorrected [optional]
if True, adds two additional columns to the processed.csv per sample (meth and unmeth). does not apply noob correction to these values.
save_control [optional]
if True, adds all Control and SnpI type probe values to a separate pickled dataframe, with probes in rows and sample_name in the first column. These non-CpG probe names are excluded from processed data and must be stored separately.
bit [optional]
Change the processed beta or m_value data_type from float64 to float16 or float32. This will make files smaller, often with no loss in precision, if it works. sometimes using float16 will cause an overflow error and files will have “inf” instead of numbers. Use float32 instead.
poobah [False]
If specified as True, the pipeline will run Sesame’s p-value probe detection method (poobah) on samples to remove probes that fail the signal/noise ratio on their fluorescence channels. These will appear as NaNs in the resulting dataframes (beta_values.pkl or m_values.pkl). All probes, regardless of p-value cutoff, will be retained in CSVs, but there will be a ‘poobah_pval’ column in CSV files that methylcheck.load uses to exclude failed probes upon import at a later step.
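
The masking step described for poobah can be pictured as follows: probes whose detection p-value exceeds a cutoff are replaced by NaN. This is only a sketch. mask_failed_probes is a hypothetical helper, the 0.05 cutoff is an assumed default, and the probe names and values are made up; computing the p-values themselves is not shown.

```python
import math

def mask_failed_probes(values, pvalues, cutoff=0.05):
    """Replace values whose detection p-value exceeds the cutoff with NaN."""
    return {
        probe: (value if pvalues[probe] <= cutoff else math.nan)
        for probe, value in values.items()
    }

betas = {"cg0001": 0.91, "cg0002": 0.10, "cg0003": 0.47}
pvals = {"cg0001": 0.001, "cg0002": 0.20, "cg0003": 0.04}
masked = mask_failed_probes(betas, pvals)
print(masked)  # cg0002 becomes nan; the other two values are unchanged
```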
Returns:

By default, if called as a function, a list of SampleDataContainer objects is returned.

betas
if True, will return a single data frame of beta values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.
m_value
if True, will return a single data frame of M-values instead of a list of SampleDataContainer objects. Format is a “wide matrix”: columns contain probes and rows contain samples.

if batch_size is set to more than 200 samples, nothing is returned but all the files are saved. You can recreate the output by loading the files.
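
How batch_size maps samples onto numbered output files can be sketched in a few lines. split_into_batches is a hypothetical helper; only the file-name pattern is taken from the batch_size description, and the sample names are placeholders.

```python
def split_into_batches(samples, batch_size):
    """Yield (output_filename, batch) pairs with at most batch_size samples each."""
    for start in range(0, len(samples), batch_size):
        batch_number = start // batch_size + 1
        yield f"beta_values_{batch_number}.pkl", samples[start:start + batch_size]

samples = [f"sample_{i}" for i in range(1, 8)]   # 7 placeholder sample names
for filename, batch in split_into_batches(samples, batch_size=3):
    print(filename, len(batch))
# beta_values_1.pkl 3
# beta_values_2.pkl 3
# beta_values_3.pkl 1
```

Recreating the full output then amounts to loading each numbered file and concatenating the results.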

Processing note:
The sample_sheet parser will ensure every sample has a unique name and assign one (e.g. Sample1) if missing, or append a number (e.g. _1) if not unique. This may cause sample_sheets and processed data in dataframes to not match up. Will fix in future version.
methylprep.processing.consolidate_values_for_sheet(data_containers, postprocess_func_colname='beta_value', bit='float32', poobah=False)

Given data_containers (a list of processed SampleDataContainer objects), this transforms the results into a single dataframe of the selected function’s values, with probe names in rows and one column of values per sample.

Input:
data_containers – the output of run_pipeline() is this, a list of data_containers.
Arguments for postprocess_func_colname:
calculate_beta_value –> ‘beta_value’
calculate_m_value –> ‘m_value’
calculate_copy_number –> ‘cm_value’

note: these functions are hard-coded in pipeline.py as part of process_all() step.

Options:
bit (float16, float32, float64) – change the default data type from float32 to another type to save disk space. float16 works fine, but might not be compatible with all numpy/pandas functions, or with outside packages, so float32 is default. This is specified from methylprep process command line.
poobah
If true, filters by the poobah_pval column. (beta m_val pass True in for this.)
methylprep.processing.get_array_type(raw_datasets)

provide a list of raw_datasets and it will return the array type by counting probes
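
Detecting the array type by counting probes amounts to a range lookup. The sketch below is illustrative only: guess_array_type is hypothetical, the ranges are approximate assumptions rather than values taken from the library, and EPIC+ is left out because its count sits too close to EPIC for a coarse range.

```python
# Approximate probe-count ranges per array type (illustrative assumptions).
PROBE_COUNT_RANGES = {
    "27k": (54_000, 56_000),
    "450k": (622_000, 623_000),
    "EPIC": (1_050_000, 1_053_000),
}

def guess_array_type(probe_count):
    """Return the array type whose range contains probe_count."""
    for name, (low, high) in PROBE_COUNT_RANGES.items():
        if low <= probe_count <= high:
            return name
    raise ValueError(f"Unknown array type for {probe_count} probes")

print(guess_array_type(622_399))  # 450k
```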

methylprep.processing.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values in the CSV/TXT/XLSX file and turn it into a clean dataframe, with probe ids in the index/rows.

Version 3 (introduced June 2020)

  • reads a downloaded file, either in csv, xlsx, pickle, txt
  • looks for /d_RxxCxx patterned headings and a probe index
  • sets index in df to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if filename has ‘intensit’ or ‘signal’ in it, this converts to betas and saves; even if the filename doesn’t match, it will convert and save if columns have ‘Methylated’ in them
  • detects multi-line headers and adjusts dataframe columns accordingly
  • returns the usable dataframe

as_beta == True – converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21) – some string of characters that precedes the number part of each sample in the columns of the file to be ingested.

FIXED:

[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
[x] BUG: can’t parse matrix_… files if uses underscores instead of spaces around sample numbers, or where sampleXXX has no separator.
[x] handle processed files with sample_XX
[x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
[x] pass in sample_column names from header parser so that logic is in one place
    (makes the output much larger, so add kwarg to exclude this)
[x] decimals (default 3) – round all probe beta/intensity/p values returned to this number of decimal places.
[x] bug: can only recognize beta samples if ‘sample’ in column name, or sentrix_id pattern matches columns.
    need to expand this to handle arbitrary sample naming styles (limited to one column per sample patterns)

TODO:
[-] BUG: meth_unmeth_pval works as_beta but not returning full data yet
[-] multiline header not working with all files yet.
notes:
this makes inferences based on strings in the filename, and based on the column names.
methylprep.processing.detect_header_pattern(test, filename, return_sample_column_names=False)

test is a dataframe with first 100 rows of the data set, and all columns. makes all the assumptions easier to read in one place.

betas non-normalized matrix_processed matrix_signal series_matrix methylated_signal_intensities and unmethylated_signal_intensities _family

TODO: GSM12345-tbl-1.txt type files (in _family.tar.gz packages) are possible, but needs more work.

  • numbered samples handled differently from sample_ids in columns
  • won’t detect columns with no separators in strings

models

class methylprep.models.ArrayType

This class stores meta data about array types, such as numbers of probes of each type, and how to guess the array from probes in idat files.

num_probes

used to load normal cg+ch probes from start of manifest until this point.

class methylprep.models.Channel

idat probes measure either a red or green fluorescence. This specifies which to return within idat.py: red_idat or green_idat.

class methylprep.models.ControlType

An enumeration.

class methylprep.models.Probe(address, illumina_id, probe_type)

this doesn’t appear to be instantiated anywhere in methylprep

class methylprep.models.ProbeAddress

AddressA_ID and AddressB_ID are columns in the manifest csv that contain internal Illumina probe identifiers.

Type II probes use AddressA_ID; Type I uses both, because there are two probes, two colors.

probe intensities in .idat files are keyed to one of these ids, but processed data is always keyed to the IlmnID probe “names” – so this is used in converting between IDs. It is used to define probe sets below in this probes.py
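
The conversion from address-keyed intensities to IlmnID-keyed values can be sketched as below. This follows the usual Illumina convention (Type II: one address, green = methylated, red = unmethylated; Type I: address B = methylated, address A = unmethylated, both read in the probe’s own colour) and is not methylprep’s implementation. The Color_Channel column name, the helper name, and all IDs and intensities are assumptions for illustration.

```python
def intensities_by_probe_name(manifest_rows, green, red):
    """Convert address-keyed intensities to {IlmnID: (meth, unmeth)}."""
    result = {}
    for row in manifest_rows:
        a, b = row["AddressA_ID"], row["AddressB_ID"]
        if b is None:
            # Type II: single address; green = methylated, red = unmethylated
            result[row["IlmnID"]] = (green[a], red[a])
        else:
            # Type I: B = methylated, A = unmethylated, same colour channel
            channel = green if row["Color_Channel"] == "Grn" else red
            result[row["IlmnID"]] = (channel[b], channel[a])
    return result

manifest_rows = [
    {"IlmnID": "cg0001", "AddressA_ID": 101, "AddressB_ID": None, "Color_Channel": None},
    {"IlmnID": "cg0002", "AddressA_ID": 201, "AddressB_ID": 202, "Color_Channel": "Red"},
]
green = {101: 5000, 201: 300, 202: 250}
red = {101: 1200, 201: 900, 202: 4100}
print(intensities_by_probe_name(manifest_rows, green, red))
# {'cg0001': (5000, 1200), 'cg0002': (4100, 900)}
```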

class methylprep.models.ProbeSubset(data_channel, probe_address, probe_channel, probe_type)

used below in probes.py to define sub-sets of probes: foreground-(red|green|all), or (un)methylated probes

class methylprep.models.ProbeType

probes can either be type I or type II for CpG or Snp sequences. Control probes are used for background correction in different fluorescence ranges and staining efficiency. Type I probes record EITHER a red or a green value. Type II probes record both values together. NOOB uses the red fluorescence on a green probe and vice versa to calculate background fluorescence.

class methylprep.models.Sample(data_dir, sentrix_id, sentrix_position, **addl_fields)

Object representing a row in a SampleSheet file

Arguments:
data_dir {string or path-like} – Base directory of the sample sheet and associated IDAT files.
sentrix_id {string} – The slide number of the processed array.
sentrix_position {string} – The position on the processed slide.
Keyword Arguments:

addl_fields {} – Additional metadata describing the sample, including:

experiment subject meta data: name (sample name, unique id), Sample_Type, Control, GSM_ID (same as sample name if using GEO public data)

array meta data: group, plate, pool, well
alternate_base_filename

GEO data sets using this file name convention.

get_export_filepath()

Called by run_pipeline to find the folder/filename to export data as CSV, but CSV file doesn’t exist yet.

get_file_s3(zip_reader, extension, suffix=None)

replaces get_filepath, but for s3 context. Since these files are compressed within a single zipfile in the bucket, they don’t resolve to PurePaths.

get_filepath(extension, suffix=None, verify=True)

builds the filepath based on custom file extensions and suffixes during processing.

Params (verify):
tests whether file exists, either in data_dir or somewhere in recursive search path of data_dir.
Export:
uses this later to fetch the place where a file ought to be created – but doesn’t exist yet, so use verify=False.
Notes:
_suffix – used to create the <file>_processed files.
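
As an illustration of how a file path can be assembled from a Sample’s fields, the sketch below builds the conventional Illumina IDAT name, <sentrix_id>_<sentrix_position>_<Grn|Red>.idat. idat_filepath is a hypothetical stand-in, not get_filepath itself; it skips the verify step and the recursive search, and the Sentrix ID is a placeholder.

```python
from pathlib import PurePosixPath

def idat_filepath(data_dir, sentrix_id, sentrix_position, channel):
    """Build <data_dir>/<sentrix_id>_<sentrix_position>_<channel>.idat."""
    if channel not in ("Grn", "Red"):
        raise ValueError("channel must be 'Grn' or 'Red'")
    return PurePosixPath(data_dir) / f"{sentrix_id}_{sentrix_position}_{channel}.idat"

print(idat_filepath("data", "200000000001", "R01C01", "Grn"))
# data/200000000001_R01C01_Grn.idat
```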
class methylprep.models.MethylationDataset(raw_dataset, manifest, probe_subsets)

Wrapper for a collection of methylated or unmethylated probes and their mean intensity values, providing common functionality for the subset of probes.

Arguments:
raw_dataset {RawDataset} – A sample’s RawDataset for a single well on the processed array.
manifest {Manifest} – The Manifest for the correlated RawDataset’s array type.
probe_subsets {list(ProbeSubset)} – Collection of ProbeSubsets that correspond to the probe type (methylated or unmethylated).
classmethod methylated(raw_dataset, manifest)

convenience method that feeds in a pre-defined list of methylated CpG loci probes

classmethod snp_methylated(raw_dataset, manifest)

convenience method that feeds in a pre-defined list of methylated Snp loci probes

classmethod snp_unmethylated(raw_dataset, manifest)

convenience method that feeds in a pre-defined list of UNmethylated Snp loci probes

classmethod unmethylated(raw_dataset, manifest)

convenience method that feeds in a pre-defined list of UNmethylated CpG loci probes

files

class methylprep.files.IdatDataset(filepath_or_buffer, channel, idat_id='IDAT', idat_version=3)

Validates and parses an Illumina IDAT file.

Arguments:
filepath_or_buffer {file-like} – the IDAT file to parse.
channel {Channel} – the fluorescent channel (Channel.RED or Channel.GREEN) that produced the IDAT dataset.
Keyword Arguments:
idat_id {string} – expected IDAT file identifier (default: {DEFAULT_IDAT_FILE_ID})
idat_version {integer} – expected IDAT version (default: {DEFAULT_IDAT_VERSION})
Raises:
ValueError: The IDAT file has an incorrect identifier or version specifier.
read(idat_file)

Reads the IDAT file and parses the appropriate sections. Joins the mean probe intensity values with their Illumina probe ID.

Arguments:
idat_file {file-like} – the IDAT file to process.
Returns:
DataFrame – mean probe intensity values indexed by Illumina ID.
class methylprep.files.Manifest(array_type, filepath_or_buffer=None, on_lambda=False)

Provides an object interface to an Illumina array manifest file.

Arguments:
array_type {ArrayType} – The type of array to process. Values are styled like ArrayType.ILLUMINA_27K or ArrayType.ILLUMINA_EPIC.
Keyword Arguments:
filepath_or_buffer {file-like} – a pre-existing manifest filepath (default: {None})
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
static download_default(array_type, on_lambda=False)

Downloads the appropriate manifest file if one does not already exist.

Arguments:
array_type {ArrayType} – The type of array to process.
Returns:
[PurePath] – Path to the manifest file.
get_loci_count()

Returns the number of unique loci/identifiers in the manifest

get_loci_names()

Returns the list of unique loci/identifiers in the manifest

get_probe_details(probe_type, channel=None)

Given a probe type (I, II, SnpI, SnpII, Control) and a channel (Channel.RED | Channel.GREEN), returns the information needed to map probes to their names (e.g. cg0031313 or rs00542420), which are NOT in the IDAT files.

read_control_probes(manifest_file)

Unlike other probes, control probes have no IlmnID because they’re not locus-specific. They also use arbitrary columns, ignoring the header at the start of the manifest file.

read_mouse_probes(manifest_file)

ILLUMINA_MOUSE contains unique probes whose names begin with ‘mu’ and ‘rs’ for ‘murine’ and ‘repeat-sequences’, respectively. This creates a dataframe of these probes, which are not processed like normal cg/ch probes.

read_snp_probes(manifest_file)

Unlike cpg and control probes, these rs probes are NOT sequential in all arrays.

static seek_to_start(manifest_file)

Finds the start of the data part of the manifest. The first (left-most) column must be “IlmnID” for the start to be found.
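
As a rough illustration of what seeking to the data section involves, the sketch below scans a file-like object for the row whose first column is “IlmnID” and leaves the read position there. It is an assumption-laden sketch rather than the library’s code: the helper name seek_to_data_start and the abbreviated manifest text are made up for the example.

```python
import io


def seek_to_data_start(manifest_file, header_name="IlmnID"):
    """Hypothetical helper: position the file at the row starting with header_name."""
    manifest_file.seek(0)
    while True:
        position = manifest_file.tell()
        line = manifest_file.readline()
        if not line:
            raise ValueError(f"No row starting with {header_name!r} was found")
        if line.split(",")[0].strip() == header_name:
            # Rewind so the next read begins at the header row itself.
            manifest_file.seek(position)
            return position


# Made-up, abbreviated manifest: preamble lines, then the real column header.
example = io.StringIO(
    "Illumina, Inc.\n[Heading]\n[Assay]\n"
    "IlmnID,Name,AddressA_ID\n"
    "cg00000029,cg00000029,14782418\n"
)
seek_to_data_start(example)
first_row = example.readline().strip()
```

After the call, a CSV reader handed the same file object would see the “IlmnID” row as its header.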

class methylprep.files.SampleSheet(filepath_or_buffer, data_dir)

Validates and parses an Illumina sample sheet file.

Arguments:
filepath_or_buffer {file-like} – the sample sheet file to parse.
data_dir {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Raises:
ValueError: The sample sheet is not formatted properly or a sample cannot be found.
build_samples()

Builds Sample objects from the processed sample sheet rows.

Added to Sample as a class method: if the IDAT file is not in the same folder (after checking whether it exists), it looks recursively for that filename and updates the data_dir for that Sample.

contains_column(column_name)

Helper function to determine whether the sample sheet contains a specific column, such as GSM_ID. The SampleSheet must already have __data_frame in it.

get_sample(sample_name)

Scans all samples for one matching sample_name, if provided. If no sample_name is given, it returns all samples.

get_samples()

Retrieves Sample objects from the processed sample sheet rows, building them if necessary.

methylprep.files.get_sample_sheet(dir_path, filepath=None)

Generates a SampleSheet instance for a given directory of processed data.

Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Keyword Arguments:
filepath {string or path-like} – Path of the sample sheet file, if provided; otherwise the function attempts to find one. (default: {None})
Returns:
[SampleSheet] – A SampleSheet instance.
methylprep.files.get_sample_sheet_s3(zip_reader)

Reads a zipfile and considers all filenames containing ‘sample_sheet’, but will test all CSV files. The zip_reader is an Amazon S3ZipReader object capable of reading the zipfile header.

methylprep.files.create_sample_sheet(dir_path, matrix_file=False, output_file='samplesheet.csv', sample_type='', sample_sub_type='')

Creates a samplesheet.csv file from the .IDAT files of a GEO series directory

Arguments:

dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
matrix_file {boolean} – Whether or not a Series Matrix File should be searched for names. (default: {False})

=============== | ======== | ==================== | ======
parameter       | required | type                 | effect
=============== | ======== | ==================== | ======
sample_type     | optional | string               | label all samples in created sheet as this type (i.e. blood, saliva, tumor cells)
sample_sub_type | optional | string               | further detail sample type for batch
controls        | optional | list of sample_names | assign all samples in controls list to be “control samples”, not treatment samples.
=============== | ======== | ==================== | ======

Note:
Because sample_names are only generated from Matrix files, this method won’t let you assign controls to samples from the CLI. Doing so would require all sample names to be passed in from the CLI as well, a pretty messy endeavor.
Raises:
FileNotFoundError: The directory could not be found.
methylprep.files.find_sample_sheet(dir_path)

Find sample sheet file for Illumina methylation array.

Notes:
Looks for CSV files in {dir_path}. If more than one CSV file is found, returns the one that has ‘sample_sheet’ or ‘samplesheet’ in its name. Otherwise, raises an error.
Arguments:
dir_path {string or path-like} – Base directory of the sample sheet and associated IDAT files.
Raises:
FileNotFoundError: [description] Exception: [description]
Returns:
[string] – Path to sample sheet in base directory
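
The selection rule in the Notes can be restated as a short sketch. The helper name pick_sample_sheet is hypothetical, and details such as returning a lone CSV file without checking its name, or matching names case-insensitively, are assumptions rather than documented behaviour.

```python
from pathlib import Path


def pick_sample_sheet(dir_path):
    """Hypothetical restatement of the sample sheet selection rule."""
    directory = Path(dir_path)
    if not directory.is_dir():
        raise FileNotFoundError(f"{dir_path} is not a directory")
    csv_files = sorted(p for p in directory.iterdir() if p.suffix.lower() == ".csv")
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {dir_path}")
    if len(csv_files) == 1:
        # Assumption: a single CSV file is accepted whatever its name.
        return str(csv_files[0])
    named = [
        p for p in csv_files
        if "sample_sheet" in p.name.lower() or "samplesheet" in p.name.lower()
    ]
    if len(named) == 1:
        return str(named[0])
    raise Exception(f"Could not pick one sample sheet among {len(csv_files)} CSV files")
```

Raising when several candidates match keeps the choice explicit instead of silently picking one file.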

geo download

methylprep.download.run_series(id, path, dict_only=False, batch_size=100, clean=True, abort_if_no_idats=True)

Downloads the IDATs and metadata for a series then generates one metadata dictionary and one beta value matrix for each platform in the series

Arguments:
id [required]
the series ID (can be a GEO or ArrayExpress ID)
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries and beta values directory has been created for each platform (one will be created for each if not)
dict_only
if True, downloads idat files and meta data and creates data dictionaries for each platform
batch_size
the batch size to use when processing samples (number of samples run at a time). Defaults to 100.
clean
if True, removes intermediate processing files
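
To make the effect of batch_size concrete, here is a minimal sketch of splitting a sample list into consecutive batches. It only illustrates the idea of “number of samples run at a time”; the helper name iter_batches and the sample names are invented, and the library’s own batching may differ.

```python
def iter_batches(sample_names, batch_size=100):
    """Yield consecutive slices holding at most batch_size samples each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sample_names), batch_size):
        yield sample_names[start:start + batch_size]


# 250 invented sample names split with the default batch size.
samples = [f"sample_{i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(samples)]
```

A smaller batch_size lowers the number of samples held in memory at once, at the cost of more batches.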
methylprep.download.run_series_list(list_file, path, dict_only=False, batch_size=100)

Downloads the IDATs and metadata for a list of series, creating metadata dictionaries and dataframes of sample beta_values

Arguments:
list_file [required]
the name of the file containing a list of GEO_IDS and/or Array Express IDs to download and process. This file must be located in the directory data is downloaded to. Each line of the file should contain the name of one data series ID.
path [required]
the path to the directory to download the data to. It is assumed that a dictionaries and beta values directory has been created for each platform (one will be created for each if not)
dict_only
if True, only downloads data and creates dictionaries for each platform
batch_size
the batch size to use when processing samples (number of samples run at a time). Defaults to 100.
methylprep.download.convert_miniml(geo_id, data_dir='.', merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=False)
Scans data_dir for an XML file with the geo_id in it, then parses it and saves the useful metadata to a dataframe called “sample_sheet_meta_data.pkl”. Does NOT require IDATs.
CLI version:
python -m meta_data -i GSExxxxx -d <my_folder>
Arguments:
merge:
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it:
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls [experimental]:
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword [experimental]:
another way to eliminate samples from samplesheets, before passing into the processor. if specified, this keyword needs to appear somewhere in the values of a samplesheet.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.

Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
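
The require_keyword filter can be pictured as keeping only those samplesheet rows in which the keyword appears somewhere among the values. The sketch below works on plain dictionaries; the helper name filter_by_keyword, the case-insensitive match, and the example rows are assumptions for illustration, not the library’s code or real GEO records.

```python
def filter_by_keyword(rows, require_keyword=None):
    """Keep rows (dicts) whose values mention the keyword; keep all rows if none is given."""
    if not require_keyword:
        return list(rows)
    needle = require_keyword.lower()  # assumption: matching ignores case
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Invented rows standing in for samplesheet records.
rows = [
    {"GSM_ID": "GSM0000001", "source": "whole blood", "group": "control"},
    {"GSM_ID": "GSM0000002", "source": "saliva", "group": "case"},
]
kept = filter_by_keyword(rows, require_keyword="blood")
```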

methylprep.download.build_composite_dataset(geo_id_list, data_dir, merge=True, download_it=True, extract_controls=False, require_keyword=None, sync_idats=True, betas=False, m_value=False, export=False)

A wrapper function for convert_miniml() to download a list of GEO datasets and process only those samples that meet criteria. Specifically - grab the “control” or “normal” samples from a bunch of experiments for one tissue type (e.g. “blood”), process them, and put all the resulting beta_values and/or m_values pkl files in one place, so that you can run methylize.load_both() to create a combined reference dataset for QC, analysis, or meta-analysis.

Arguments:
geo_id_list (required):
A list of GEO “GSEnnn” ids. From command line, pass these in as separate values
data_dir:
folder to save data
merge (True):
If merge==True and there is a file with ‘samplesheet’ in the folder, and that sheet has GSM_IDs, merge that data into this samplesheet. Useful for when you have idats and want one combined samplesheet for the dataset.
download_it (True):
if the MINiML file is not in the data_dir path, it will be downloaded from the web.
extract_controls (False):
if you only want to retain samples from the whole set that have certain keywords, such as “control” or “blood”, this experimental flag will rewrite the samplesheet with only the parts you want, then feed that into run_pipeline with named samples.
require_keyword (None):
another way to eliminate samples from samplesheets, before passing into the processor. if specified, the “keyword” string passed in must appear somewhere in the values of a samplesheet for sample to be downloaded, processed, retained.
sync_idats:
If flagged, this will search data_dir for idats and remove any of those that are not found in the filtered samplesheet. Requires you to run the download function first to get all idats, before you run this meta_data function.
betas:
process beta_values
m_value:
process m_values
  • Attempts to also read idat filenames, if they exist, but won’t fail if they don’t.
  • removes unneeded files as it goes, but leaves the xml MINiML file and folder there as a marker if a geo dataset fails to download, so it won’t try again on resume.
methylprep.download.search(keyword)
CLI/cron function to check for new datasets.
Set up as a weekly cron job. Uses a local storage file to compare with old datasets in <pattern>_meta.csv. Saves the dates of each dataset from GEO, adds any new ones as new rows, and updates the CSV.
options:
pass in -k keyword
verbose (True|False): reports to page; saves csv too
returns:
saves a CSV to disk and returns a dataframe of results
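
The “compare with old datasets, add new ones as rows” step might look roughly like the sketch below. The helper name merge_new_datasets and the column names gse and date are assumptions; the actual layout of <pattern>_meta.csv is not described here.

```python
def merge_new_datasets(stored_rows, fetched_rows, key="gse"):
    """Return (all rows, newly seen rows) by comparing dataset ids in the two lists."""
    known = {row[key] for row in stored_rows}
    new_rows = [row for row in fetched_rows if row[key] not in known]
    return stored_rows + new_rows, new_rows


# Invented records: one already stored, two returned by a fresh search.
stored = [{"gse": "GSE1", "date": "2020-01-01"}]
fetched = [
    {"gse": "GSE1", "date": "2020-01-01"},
    {"gse": "GSE2", "date": "2020-02-01"},
]
combined, added = merge_new_datasets(stored, fetched)
```

Only the rows in added are new since the last run; combined is what would be written back to the CSV.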