dnarecords.writer
DNARecords available writers.
Module Contents
Classes
DNARecordsWriter | Genomics data (vcf, bgen, etc.) to tfrecords or parquet. Sample and variant wise.
- class dnarecords.writer.DNARecordsWriter(expr: Expression, block_size: int = int(1e6), staging: str = '/tmp/dnarecords/staging')[source]
Genomics data (vcf, bgen, etc.) to tfrecords or parquet. Sample and variant wise.
Core class to go from genomics data to tfrecords or parquet files ready to use with Deep Learning frameworks like Tensorflow or Pytorch.
Able to generate DNA records variant-wise or sample-wise (i.e. transposing the matrix).
Takes advantage of sparsity (very convenient to save space and computation, especially with Deep Learning).
Scales automatically to datasets of any size. Tested on UKBB.
Example
import dnarecords as dr

hl = dr.helper.DNARecordsUtils.init_hail()
hl.utils.get_1kg('/tmp/1kg')
mt = hl.read_matrix_table('/tmp/1kg/1kg.mt')
mt = mt.annotate_entries(dosage=hl.pl_dosage(mt.PL))

output = '/tmp/dnarecords/output'
writer = dr.writer.DNARecordsWriter(mt.dosage)
writer.write(output, sparse=True, sample_wise=True, variant_wise=True,
             tfrecord_format=True, parquet_format=True,
             write_mode='overwrite', gzip=True)

reader = dr.reader.DNASparkReader(output)
reader.sample_wise_dnarecords().show(2)
reader.variant_wise_dnarecords().show(2)
+---+--------------------+--------------------+----------------+
|key|        chr1_indices|         chr1_values|chr1_dense_shape| ...
+---+--------------------+--------------------+----------------+
| 26|[0, 2, 4, 5, 6, 7...|[0.33607214002352...|             909| ...
| 29|[0, 1, 2, 3, 4, 5...|[0.20076008098505...|             909| ...
+---+--------------------+--------------------+----------------+
only showing top 2 rows

+--------------------+--------------------+----+-----------+
|             indices|              values| key|dense_shape|
+--------------------+--------------------+----+-----------+
|[0, 1, 2, 3, 4, 5...|[0.9984177, 0.007...|3506|      10880|
|[0, 1, 2, 3, 4, 5...|[0.11181577, 0.01...|3764|      10880|
+--------------------+--------------------+----+-----------+
only showing top 2 rows
...
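The sparse rows above can be turned back into dense vectors with a few lines of NumPy. The `densify` helper below is illustrative only (it is not part of the dnarecords API); it assumes the `indices`/`values`/`dense_shape` columns shown in the variant-wise output, where positions absent from `indices` are zeros that were filtered out.

```python
import numpy as np

def densify(indices, values, dense_shape):
    # Rebuild a dense vector from one sparse record: allocate zeros,
    # then scatter the stored values back to their original positions.
    dense = np.zeros(dense_shape, dtype=np.float32)
    dense[np.asarray(indices, dtype=int)] = values
    return dense

# Toy row shaped like a variant-wise record
vec = densify([0, 3, 5], [0.99, 0.11, 0.42], 8)
```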
- Parameters
expr – a Hail expression. Currently, only expressions coercible to numeric are supported.
block_size – entries per block in the internal operations.
staging – path to staging directory to use for intermediate data. Default: /tmp/dnarecords/staging.
- write(self, output: str, sparse: bool = True, sample_wise: bool = True, variant_wise: bool = False, tfrecord_format: bool = True, parquet_format: bool = False, write_mode: str = 'error', gzip: bool = True) → Dict[str, DataFrame][source]
DNARecords Spark writer.
Writes a DNARecords dataset based on the Hail expr provided in the class constructor.
- Return type
Dict[str, DataFrame]
- Parameters
output – path to the output location of the DNARecords.
sparse – generate sparse data (filtering out any zero values). Default: True.
sample_wise – generate DNARecords in a sample-wise fashion (i.e. transposing the matrix, one column -> one record). Default: True.
variant_wise – generate DNARecords in a variant-wise fashion (i.e. one row -> one record). Default: False.
tfrecord_format – generate tfrecords output files. Default: True.
parquet_format – generate parquet output files. Default: False.
write_mode – spark write mode parameter (‘error’, ‘overwrite’, etc.). Default: ‘error’.
gzip – gzip the output files. Default: True.
- Returns
A dictionary with DataFrames for each generated output.
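With parquet_format=True, the sparse records can also be densified in bulk, e.g. after loading a parquet output with pandas. A minimal sketch, assuming rows carry the `indices`/`values`/`dense_shape` columns from the example above; `to_dense_batch` is a hypothetical helper, not part of the library.

```python
import numpy as np
import pandas as pd

def to_dense_batch(df: pd.DataFrame) -> np.ndarray:
    # Hypothetical helper: expand each sparse row into a dense float32
    # vector and stack them into a (num_rows, dense_shape) array.
    batch = []
    for _, row in df.iterrows():
        dense = np.zeros(int(row["dense_shape"]), dtype=np.float32)
        dense[np.asarray(row["indices"], dtype=int)] = row["values"]
        batch.append(dense)
    return np.stack(batch)

# Toy frame mimicking two variant-wise records
toy = pd.DataFrame({
    "indices": [[0, 2], [1, 3]],
    "values": [[1.0, 2.0], [3.0, 4.0]],
    "dense_shape": [4, 4],
})
batch = to_dense_batch(toy)
```

In practice the frame would come from `pd.read_parquet` on one of the generated output files, or directly from the DataFrames this method returns.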