R - Break region into smaller regions based on cutoff


I assume this is a simple programming issue, but I've been struggling with it, perhaps because I don't know the right words to use.

Given a set of "ranges" (in the form of 1) a set of numbers as below, 2) IRanges, or 3) GenomicRanges), I'd like to split it into a set of smaller ranges.

Example beginning:

    chr    start    end
    1      1        10000
    2      1        5000

Example size of breaks: 2000

New dataset:

    chr    start    end
    1      1        2000
    1      2001     4000
    1      4001     6000
    1      6001     8000
    1      8001     10000
    2      1        2000
    2      2001     4000
    2      4001     5000

I'm doing this in R. I know how to generate these with seq, but I'd like to be able to do it based on a list/df of regions instead of having to do it manually every time I have a new list of regions.

Here's an example I've made using seq:

Given 22 chromosomes, loop through them and break each into pieces:

    # initialize the output data frame
    regions <- data.frame(chromosome = c(), start = c(), end = c())
    # for each row (assumes one row per chromosome, numbered 1..n)
    for (i in seq_len(nrow(chromosomes))) {
        chr_start <- min(chromosomes$start[chromosomes$chromosome == i])
        chr_end   <- max(chromosomes$end[chromosomes$chromosome == i])
        # create a sequence from the minimum start to the maximum end value
        breks <- seq(chr_start, chr_end, by = 2000000)
        # put this into a data frame; breks[-1] also handles a single break,
        # where breks[2:length(breks)] would produce NA
        database <- data.frame(chromosome = i, start = breks,
                               end = c(breks[-1] - 1, chr_end))
        # bind to what we already have
        regions <- rbind(regions, database)
        rm(database)
    }

This works fine, but I'm wondering if there is something built into a package that does this as a one-liner, or something more flexible, since this approach has its limitations.

Using the R / Bioconductor package GenomicRanges, here are your initial ranges

    library(GenomicRanges)
    rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))

and then create a sliding window across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted to the format you have in the question

    > windows = slidingWindows(rngs, width=2000, step=2000)
    > unlist(windows)
    GRanges object with 8 ranges and 0 metadata columns:
          seqnames        ranges strand
             <Rle>     <IRanges>  <Rle>
      [1]        1 [   1,  2000]      *
      [2]        1 [2001,  4000]      *
      [3]        1 [4001,  6000]      *
      [4]        1 [6001,  8000]      *
      [5]        1 [8001, 10000]      *
      [6]        2 [   1,  2000]      *
      [7]        2 [2001,  4000]      *
      [8]        2 [4001,  5000]      *
      -------
      seqinfo: 2 sequences from an unspecified genome; no seqlengths

Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(windows), "data.frame").

Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend: ?"slidingW<tab>).

Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile is similar but rounds the width of the ranges to be approximately equal while spanning the width of the GRanges. Here is a poor-man's version

    windows <- function(gr, width, withMcols=FALSE) {
        # window starts for each range, stepping by `width`
        # (use the argument `gr`, not the global `rngs`)
        starts <- Map(seq, start(gr), end(gr), by=width)
        # each window ends just before the next start; the last ends at the range end
        ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
                    starts, end(gr))
        seq <- rep(seqnames(gr), lengths(starts))
        strand <- rep(strand(gr), lengths(starts))
        result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
        seqinfo(result) <- seqinfo(gr)
        if (withMcols) {
            # a GRanges has length(), not nrow()
            idx <- rep(seq_len(length(gr)), lengths(starts))
            mcols(result) = mcols(gr)[idx,,drop=FALSE]
        }
        result
    }

Invoked as

    > windows(rngs, 2000)
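If the regions are kept in a plain data.frame (the first input format mentioned in the question), the same fixed-width splitting can be sketched in base R without GenomicRanges. This is only an illustrative sketch: the helper name `split_regions` and the column names `chr`, `start` and `end` are assumptions, not part of any package.

```r
# Hypothetical helper (not from any package): split every row of `regions`
# into consecutive pieces of at most `size` positions.
# Assumes columns named chr, start and end, with start <= end.
split_regions <- function(regions, size) {
  pieces <- lapply(seq_len(nrow(regions)), function(i) {
    # piece starts: start, start + size, ... up to the region end
    starts <- seq(regions$start[i], regions$end[i], by = size)
    data.frame(
      chr   = regions$chr[i],
      start = starts,
      # the last piece is truncated at the region end
      end   = pmin(starts + size - 1, regions$end[i])
    )
  })
  do.call(rbind, pieces)
}

# The example from the question
regions <- data.frame(chr = c(1, 2), start = c(1, 1), end = c(10000, 5000))
result <- split_regions(regions, 2000)
print(result)
```

Unlike the loop in the question, this works row by row, so it does not require chromosomes to be numbered 1..n or to appear only once.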

If this approach is useful, consider asking follow-up questions on the Bioconductor support site.

