Lazy Loading a Bioinformatic SAM record
$begingroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each piece of information separated by a tab is its own field and corresponds to some type of data.
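For orientation, here is a minimal sketch of how the tab-separated columns line up with the eleven mandatory column names from the SAM specification. The class name `SamFieldLayout` and the `TAGS` key are made up for illustration and are not part of the API under review; the sketch uses `String.split` purely to show the layout, not as a suggestion for the hot path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class SamFieldLayout {
    // The 11 mandatory SAM columns, in file order.
    static final String[] MANDATORY = {
        "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
        "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"
    };

    private SamFieldLayout() {}

    /** Labels each mandatory column; any optional tags are kept as one raw chunk under "TAGS". */
    static Map<String, String> label(String line) {
        // Limit 12: eleven mandatory columns plus one trailing chunk holding all optional tags.
        String[] parts = line.split("\t", 12);
        Map<String, String> labelled = new LinkedHashMap<>();
        for (int i = 0; i < MANDATORY.length && i < parts.length; i++) {
            labelled.put(MANDATORY[i], parts[i]);
        }
        if (parts.length == 12) {
            labelled.put("TAGS", parts[11]);
        }
        return labelled;
    }
}
```

With the record above, `label(line).get("FLAG")` would be `"97"` and `label(line).get("CIGAR")` would be `"30M"`.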
Now, it's important to note that these files get BIG (tens of GB), so splitting each record into some kind of POJO as soon as it's instantiated would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
import java.util.Objects;

/**
 * Class for storing and working with a SAM-formatted DNA sequence.
 *
 * Upon construction, only the String record is stored.
 * All querying of fields is done on demand, to save time.
 */
public class SamRecord implements Record {
    private final String read;
    private String id = null;
    private int flag = -1;
    private String referenceName = null;
    private int pos = -1;
    private int mappingQuality = -1;
    private String cigar = null;
    private String mateReferenceName = null;
    private int matePosition = -1;
    private int templateLength = -1;
    private String sequence = null;
    private String quality = null;
    private String variableTerms = null;
    private final static String REPEAT_TERM = "ZS:Z:R";
    private final static String MATCH_TERM = "ZS:Z:NM";
    private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";

    /**
     * Simple constructor for the SAM record.
     * @param read full read
     */
    public SamRecord(String read) {
        this.read = read;
    }

    public String getRead() {
        return read;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getId() {
        if (id == null) {
            id = XsamReadQueries.findID(read);
        }
        return id;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getFlag() throws NumberFormatException {
        if (flag == -1) {
            flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
        }
        return flag;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getReferenceName() {
        if (referenceName == null) {
            referenceName = XsamReadQueries.findReferneceName(read);
        }
        return referenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getPos() throws NumberFormatException {
        if (pos == -1) {
            pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
        }
        return pos;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMappingQuality() throws NumberFormatException {
        if (mappingQuality == -1) {
            mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
        }
        return mappingQuality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getCigar() {
        if (cigar == null) {
            cigar = XsamReadQueries.findCigar(read);
        }
        return cigar;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getMateReferenceName() {
        if (mateReferenceName == null) {
            mateReferenceName = XsamReadQueries.findElement(read, 6);
        }
        return mateReferenceName;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getMatePosition() throws NumberFormatException {
        if (matePosition == -1) {
            matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
        }
        return matePosition;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public int getTemplateLength() throws NumberFormatException {
        if (templateLength == -1) {
            templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
        }
        return templateLength;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getSequence() {
        if (sequence == null) {
            sequence = XsamReadQueries.findBaseSequence(read);
        }
        return sequence;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getQuality() {
        if (quality == null) {
            quality = XsamReadQueries.findElement(read, 10);
        }
        return quality;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isRepeat() {
        return read.contains(REPEAT_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isMapped() {
        return !read.contains(MATCH_TERM);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public String getVariableTerms() {
        if (variableTerms == null) {
            variableTerms = XsamReadQueries.findVariableRegionSequence(read);
        }
        return variableTerms;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public boolean isQualityFailed() {
        return read.contains(QUALITY_CHECK_TERM);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SamRecord samRecord = (SamRecord) o;
        return Objects.equals(read, samRecord.read);
    }

    @Override
    public int hashCode() {
        return Objects.hash(read);
    }

    @Override
    public String toString() {
        return read;
    }
}
The fields are returned by static methods in a helper class, which retrieve them by locating the tab characters, e.g. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
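To make the idea of finding a field by its tab positions concrete, here is a minimal sketch built on `String.indexOf(int, int)`. It is not the helper class from this post: the class `TabFields` and the method `nthField` are hypothetical names, and returning an empty string for a missing field is an assumption made for the sketch.

```java
// Hypothetical helper, for illustration only.
final class TabFields {
    private TabFields() {}

    /** Returns the n-th (0-indexed) tab-delimited field of line, or "" if the line has fewer fields. */
    static String nthField(String line, int n) {
        int start = 0;
        // Skip past n tabs; start then points at the first character of the wanted field.
        for (int skipped = 0; skipped < n; skipped++) {
            int tab = line.indexOf('\t', start);
            if (tab < 0) {
                return ""; // fewer than n + 1 fields
            }
            start = tab + 1;
        }
        // The field ends at the next tab, or at the end of the line for the last field.
        int end = line.indexOf('\t', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }
}
```

For example, `TabFields.nthField("one\ttwo", 1)` yields `"two"` even though no tab follows the last field, and no substring is allocated for the fields that are skipped.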
Below is the XsamReadQueries class:
import java.util.Optional;

/**
 * Non-instantiable utility class for working with Xsam reads
 */
public final class XsamReadQueries {
    // Suppress instantiation
    private XsamReadQueries() {
        throw new AssertionError();
    }

    /** finds the position of the tab directly before the start of the variable region
     * @param read whole sam or Xsam read to search
     * @return position of the tab in the String
     */
    public static int findVariableRegionStart(String read) {
        int found = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == '\t') {
                found++;
                if (found >= 11 && i + 1 < read.length()
                        && (read.charAt(i + 1) != 'x' && read.charAt(i + 1) != '\t')) { //guard against double-tabs
                    return i + 1;
                }
            }
        }
        return -1;
    }

    /** Attempts to find the library name from SBL reads
     * where SBL reads have the id SBL_LibraryName_ID:XXXXX
     * if LibraryName ends with a lower case letter, the letter will be removed.
     * if SBL_LibID is not valid, return the full ID.
     * @param ID or String to search.
     * @return Library name with lower case endings removed
     */
    public static String findLibraryName(String ID) {
        if (!ID.startsWith("SBL")) return "";
        try {
            int firstPos = XsamReadQueries.findPosAfter(ID, "_");
            int i = firstPos;
            while (ID.charAt(i) != '_' && ID.charAt(i) != '\t') {
                i++;
            }
            String library = ID.substring(firstPos, i);
            char lastChar = library.charAt(library.length() - 1);
            // strip a single trailing ASCII lower case letter
            if (lastChar >= 'a' && lastChar <= 'z') {
                library = library.substring(0, library.length() - 1);
            }
            return library;
        } catch (Exception e) {
            int i = 0;
            while (ID.charAt(i) != '\t') {
                i++;
                if (i == ID.length()) {
                    break;
                }
            }
            return ID.substring(0, i);
        }
    }

    /** Returns the ID from the sample
     * @param sample Xsam read
     * @return ID
     */
    public static String findID(String sample) {
        return findElement(sample, 0);
    }

    /** Returns the phred score from the sample
     * @param sample Xsam read
     * @return phred string
     */
    public static String findPhred(String sample) {
        return findElement(sample, 10);
    }

    /**
     * Returns the cigar from the xsam read
     *
     * @param sample read
     * @return cigar string
     */
    public static String findCigar(String sample) {
        return findElement(sample, 5);
    }

    /**
     * Returns the bases from the xsam read
     *
     * @param sample read
     * @return base string
     */
    public static String findBaseSequence(String sample) {
        return findElement(sample, 9);
    }

    /**
     * finds the n'th element in the tab delimited sample
     * i.e. findElement("one\ttwo", 0) returns "one"
     * 0 indexed.
     *
     * @param sample String to search
     * @param element element to find
     * @return found element or "" if not found
     */
    public static String findElement(String sample, int element) {
        boolean tabsFound = false;
        int i = 0;
        int firstTab = 0;
        int secondTab = 0;
        int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
        int skippedTabs = 0;
        if (element == 0) {
            while (sample.charAt(i) != '\t') {
                i++;
            }
            return sample.substring(0, i);
        } else {
            while (!tabsFound) {
                if (sample.charAt(i) != '\t') {
                    i++;
                } else {
                    if (skippedTabs == tabsToSkip) {
                        if (firstTab == 0) {
                            firstTab = i;
                        } else {
                            secondTab = i;
                            tabsFound = true;
                        }
                    } else {
                        skippedTabs++;
                    }
                    i++;
                }
            }
        }
        return sample.substring(firstTab + 1, secondTab);
    }

    /** finds the variable region past the quality
     * @param sample sam or Xsam record string
     * @return variable sequence or empty string
     */
    public static String findVariableRegionSequence(String sample) {
        int start = findVariableRegionStart(sample);
        if (start == -1) return "";
        return sample.substring(start);
    }

    /** finds the xL field
     * @param sample String to search
     * @return position if found, -1 if not.
     */
    public static int findxLField(String sample) {
        int chartStart = findPosAfter(sample, "\txL:i:");
        if (chartStart == -1) {
            return -1; //return -1 if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));
    }

    /** finds the xR field
     * @param sample String to search
     * @return position if found, '\0' (null) value if not.
     */
    public static int findxRField(String sample) {
        int chartStart = findPosAfter(sample, "\txR:i:");
        if (chartStart == -1) {
            return '\0'; //return NULL if not found.
        }
        int i = chartStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Integer.parseInt(sample.substring(chartStart, i));
    }

    /** finds the xLSeq field
     * @param sample String to search
     * @return Optional holding the String if found, empty Optional if not.
     */
    public static Optional<String> findxLSeqField(String sample) {
        int charStart = findPosAfter(sample, "\txLseq:i:");
        if (charStart == -1) {
            return Optional.empty(); //return an empty Optional if not found.
        }
        int i = charStart;
        while (sample.charAt(i) != '\t') {
            i++;
        }
        return Optional.of(sample.substring(charStart, i));
    }

    /** finds the reference name field
     * @param sample String to search
     * @return String if found, empty string if not.
     */
    public static String findReferneceName(String sample) {
        //should always appear between the second and third tabs
        boolean tabsFound = false;
        int i = 0;
        int secondTab = 0;
        int thirdTab = 0;
        boolean skippedFirstTab = false;
        while (!tabsFound) {
            if (sample.charAt(i) != '\t') {
                i++;
            } else {
                if (skippedFirstTab) {
                    if (secondTab == 0) {
                        secondTab = i;
                    } else {
                        thirdTab = i;
                        tabsFound = true;
                    }
                }
                skippedFirstTab = true;
                i++;
            }
        }
        if (sample.substring(secondTab + 1, thirdTab).contains("/")) {
            String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
            return split[split.length - 1];
        }
        return sample.substring(secondTab + 1, thirdTab);
    }

    /**
     * Finds the needle in the haystack, and returns the position just past the end of the needle.
     *
     * @param haystack The string to search
     * @param needle String field to search on.
     * @return position of the end of the needle, or -1 if not found
     */
    private static int findPosAfter(String haystack, String needle) {
        int hLen = haystack.length();
        int nLen = needle.length();
        int maxSearch = hLen - nLen;
        outer:
        // <= so that a needle sitting at the very end of the haystack is still found
        for (int i = 0; i <= maxSearch; i++) {
            for (int j = 0; j < nLen; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    continue outer;
                }
            }
            // If it reaches here, match has been found:
            return i + nLen;
        }
        return -1; // Not found
    }
}
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
java bioinformatics lazy
$endgroup$
Is this class, or could this class be, used in a multithreaded scenario? – IEatBagels, 13 hours ago
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers. – Sam, 13 hours ago
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint? – Eric Stein, 13 hours ago
Also, are you able/willing to share the code that instantiates SamRecords? – Eric Stein, 12 hours ago
$begingroup$
I'm currently writing an API to work with Bioinformatic SAM records. Here's an example of one:
SBL_XSBF463_ID:3230017:BCR1:GCATAA:BCR2:CATATA/1:vpe 97 hs07 38253395 3 30M = 38330420 77055 TTGTTCCACTGCCAAAGAGTTTCTTATAAT EEEEEEEEEEEEAEEEEEEEEEEEEEEEEE PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:30 ZS:Z:R NH:i:2 HI:i:1 IH:i:1
Each piece of information separated by a tab is it's own field and corresponds to some type of data.
Now, it's important to note that these files get BIG (10's of GB) and so splitting each one as soon as it's instantiated in some kind of POJO would be inefficient.
Hence, I've decided to create an object with a lazy loading mechanism. Only the original string is stored until one of the fields is requested by some calling code. This should minimise the amount of work done when the object is created, as well as minimise the amount of memory taken by the objects.
Here's my attempt:
/** Class for storing and working with sam formatted DNA sequence.
*
* Upon construction, only the String record is stored.
* All querying of fields is done on demand, to save time.
*
*/
public class SamRecord implements Record {
private final String read;
private String id = null;
private int flag = -1;
private String referenceName = null;
private int pos = -1;
private int mappingQuality = -1;
private String cigar = null;
private String mateReferenceName = null;
private int matePosition = -1;
private int templateLength = -1;
private String sequence = null;
private String quality = null;
private String variableTerms = null;
private final static String REPEAT_TERM = "ZS:Z:R";
private final static String MATCH_TERM = "ZS:Z:NM";
private final static String QUALITY_CHECK_TERM = "ZS:Z:QC";
/** Simple constructor for the sam record
* @param read full read
*/
public SamRecord(String read) {
this.read = read;
}
public String getRead() {
return read;
}
/**
* {@inheritDoc}
*/
@Override
public String getId() {
if(id == null){
id = XsamReadQueries.findID(read);
}
return id;
}
/**
* {@inheritDoc}
*/
@Override
public int getFlag() throws NumberFormatException {
if(flag == -1) {
flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
}
return flag;
}
/**
* {@inheritDoc}
*/
@Override
public String getReferenceName() {
if(referenceName == null){
referenceName = XsamReadQueries.findReferneceName(read);
}
return referenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getPos() throws NumberFormatException{
if(pos == -1){
pos = Integer.parseInt(XsamReadQueries.findElement(read, 3));
}
return pos;
}
/**
* {@inheritDoc}
*/
@Override
public int getMappingQuality() throws NumberFormatException {
if(mappingQuality == -1){
mappingQuality = Integer.parseInt(XsamReadQueries.findElement(read, 4));
}
return mappingQuality;
}
/**
* {@inheritDoc}
*/
@Override
public String getCigar() {
if(cigar == null){
cigar = XsamReadQueries.findCigar(read);
}
return cigar;
}
/**
* {@inheritDoc}
*/
@Override
public String getMateReferenceName() {
if(mateReferenceName == null){
mateReferenceName = XsamReadQueries.findElement(read, 6);
}
return mateReferenceName;
}
/**
* {@inheritDoc}
*/
@Override
public int getMatePosition() throws NumberFormatException {
if(matePosition == -1){
matePosition = Integer.parseInt(XsamReadQueries.findElement(read, 7));
}
return matePosition;
}
/**
* {@inheritDoc}
*/
@Override
public int getTemplateLength() throws NumberFormatException {
if(templateLength == -1){
templateLength = Integer.parseInt(XsamReadQueries.findElement(read, 8));
}
return templateLength;
}
/**
* {@inheritDoc}
*/
@Override
public String getSequence() {
if(sequence == null){
sequence = XsamReadQueries.findBaseSequence(read);
}
return sequence;
}
/**
* {@inheritDoc}
*/
@Override
public String getQuality() {
if(quality == null){
quality = XsamReadQueries.findElement(read, 10);
}
return quality;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isRepeat() {
return read.contains(REPEAT_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public boolean isMapped() {
return !read.contains(MATCH_TERM);
}
/**
* {@inheritDoc}
*/
@Override
public String getVariableTerms() {
if(variableTerms == null){
variableTerms = XsamReadQueries.findVariableRegionSequence(read);
}
return variableTerms;
}
/**
* {@inheritDoc}
*/
@Override
public boolean isQualityFailed() {
return read.contains(QUALITY_CHECK_TERM);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
SamRecord samRecord = (SamRecord) o;
return Objects.equals(read, samRecord.read);
}
@Override
public int hashCode() {
return Objects.hash(read);
}
@Override
public String toString() {
return read;
}
}
The fields are returned by static methods in a helper class which retrieve them by looking at where the tab characters are. i.e. flag = Integer.parseInt(XsamReadQueries.findElement(read, 1));
Below is the XsamReadQuery
class
/**
* Non-instantiable utility class for working with Xsam reads
*/
public final class XsamReadQueries {
// Suppress instantiation
private XsamReadQueries() {
throw new AssertionError();
}
/** finds the position of the tab directly before the start of the variable region
* @param read whole sam or Xsam read to search
* @return position of the tab in the String
*/
public static int findVariableRegionStart(String read){
int found = 0;
for(int i = 0; i < read.length(); i++){
if(read.charAt(i) == 't'){
found++;
if(found >= 11 && i+1 < read.length() && (read.charAt(i+1) != 'x' && read.charAt(i+1) != 't')){ //guard against double-tabs
return i + 1;
}
}
}
return -1;
}
/** Attempts to find the library name from SBL reads
* where SBL reads have the id SBL_LibraryName_ID:XXXXX
* if LibraryName end's with a lower case letter, the letter will be removed.
* if SBL_LibID is not valid, return the full ID.
* @param ID or String to search.
* @return Library name with lower case endings removed
*/
public static String findLibraryName(String ID){
if(!ID.startsWith("SBL")) return "";
try {
int firstPos = XsamReadQueries.findPosAfter(ID, "_");
int i = firstPos;
while (ID.charAt(i) != '_' && ID.charAt(i) != 't') {
i++;
}
String library = ID.substring(firstPos, i);
char lastChar = library.charAt(library.length()-1);
if(lastChar >= 97 && lastChar <= 122){
library = library.substring(0, library.length()-1);
}
return library;
}catch (Exception e){
int i = 0;
while(ID.charAt(i) != 't'){
i++;
if(i == ID.length()){
break;
}
}
return ID.substring(0, i);
}
}
/** Returns the ID from the sample
* @param sample Xsam read
* @return ID
*/
public static String findID(String sample){
return findElement(sample, 0);
}
/** Returns the phred score from the sample
* @param sample Xsam read
* @return phred string
*/
public static String findPhred(String sample){
return findElement(sample, 10);
}
/**
* Returns the cigar from the xsam read
*
* @param sample read
* @return cigar string
*/
public static String findCigar(String sample) {
return findElement(sample, 5);
}
/**
* Returns the bases from the xsam read
*
* @param sample read
* @return base string
*/
public static String findBaseSequence(String sample) {
return findElement(sample, 9);
}
/**
* finds the n'th element in the tab delimited sample
* i.e findElement(0) returns one from "onettwo"
* 0 indexed.
*
* @param sample String to search
* @param element element to find
* @return found element or "" if not found
*/
public static String findElement(String sample, int element) {
boolean tabsFound = false;
int i = 0;
int firstTab = 0;
int secondTab = 0;
int tabsToSkip = element - 1 >= 0 ? element - 1 : 0;
int skippedTabs = 0;
if (element == 0) {
while (i < sample.length() && sample.charAt(i) != '\t') {
i++;
}
return sample.substring(0, i);
} else {
while (!tabsFound) {
if (sample.charAt(i) != '\t') {
i++;
} else {
if (skippedTabs == tabsToSkip) {
if (firstTab == 0) {
firstTab = i;
} else {
secondTab = i;
tabsFound = true;
}
} else {
skippedTabs++;
}
i++;
}
}
}
return sample.substring(firstTab + 1, secondTab);
}
/** finds the variable region past the quality
* @param sample sam or Xsam record string
* @return variable sequence or empty string
*/
public static String findVariableRegionSequence(String sample){
int start = findVariableRegionStart(sample);
if(start == -1) return "";
return sample.substring(start); // reuse the position found above instead of scanning again
}
/** finds the xL field
* @param sample String to search
* @return xL value if found, -1 if not.
*/
public static int findxLField(String sample) {
int chartStart = findPosAfter(sample, "\txL:i:");
if (chartStart == -1) {
return -1; //return -1 if not found.
}
int i = chartStart;
while (i < sample.length() && sample.charAt(i) != '\t') { // the field may be the last one on the line
i++;
}
return Integer.parseInt(sample.substring(chartStart, i));
}
/** finds the xR field
* @param sample String to search
* @return xR value if found, -1 if not.
*/
public static int findxRField(String sample) {
int chartStart = findPosAfter(sample, "\txR:i:");
if (chartStart == -1) {
return -1; //return -1 if not found.
}
int i = chartStart;
while (i < sample.length() && sample.charAt(i) != '\t') { // the field may be the last one on the line
i++;
}
return Integer.parseInt(sample.substring(chartStart, i));
}
/** finds the xLSeq field
* @param sample String to search
* @return the field value if found, an empty Optional if not.
*/
public static Optional<String> findxLSeqField(String sample) {
int charStart = findPosAfter(sample, "\txLseq:i:");
if (charStart == -1) {
return Optional.empty(); //return an empty Optional if not found.
}
int i = charStart;
while (i < sample.length() && sample.charAt(i) != '\t') { // the field may be the last one on the line
i++;
}
return Optional.of(sample.substring(charStart, i));
}
/** finds the reference name field
* @param sample String to search
* @return String if found, empty string if not.
*/
public static String findReferneceName(String sample) {
//should always appear between the second and third tabs
boolean tabsFound = false;
int i = 0;
int secondTab = 0;
int thirdTab = 0;
boolean skippedFirstTab = false;
while (!tabsFound) {
if (sample.charAt(i) != '\t') {
i++;
} else {
if (skippedFirstTab) {
if (secondTab == 0) {
secondTab = i;
} else {
thirdTab = i;
tabsFound = true;
}
}
skippedFirstTab = true;
i++;
}
}
if(sample.substring(secondTab + 1, thirdTab).contains("/")){
String[] split = sample.substring(secondTab + 1, thirdTab).split("/");
return split[split.length-1];
}
return sample.substring(secondTab + 1, thirdTab);
}
/**
* Finds the needle in the haystack, and returns the position of the character directly after it.
*
* @param haystack The string to search
* @param needle String field to search on.
* @return position of the end of the needle
*/
private static int findPosAfter(String haystack, String needle) {
int hLen = haystack.length();
int nLen = needle.length();
int maxSearch = hLen - nLen;
outer:
for (int i = 0; i <= maxSearch; i++) { // <= so a needle at the very end of the haystack is still found
for (int j = 0; j < nLen; j++) {
if (haystack.charAt(i + j) != needle.charAt(j)) {
continue outer;
}
}
// If it reaches here, match has been found:
return i + nLen;
}
return -1; // Not found
}
}
My question is: are there any drawbacks to this approach? Or is there an alternative that might be more effective?
Thanks in advance,
Sam
java bioinformatics lazy
$endgroup$
asked 14 hours ago by Sam (edited 14 hours ago)
Is this class, or could this class, be used in a multithreaded scenario?
– IEatBagels 13 hours ago
It will be, yes. However, it's unlikely to be shared between threads. It's more likely that collections of these records will be handed off to separate workers.
– Sam 13 hours ago
0. Do you have an actual, tested performance issue, or are you guessing at a potential problem? 1. What are you trying to minimize, execution time or memory footprint?
– Eric Stein 13 hours ago
Also, are you able/willing to share the code that instantiates SamRecords?
– Eric Stein 12 hours ago
1 Answer
Performance
There is one thing that I believe could improve the performance of your application.
You often call findElement, which scans the SAM record from the start every time.
If you load a record, you can be fairly certain that you will access it at least once.
At some point, maybe when creating the object or when a field is accessed for the first time, you should "index" your SAM record.
Go through the whole record once and keep an array of where the tabs are. This way, if your code ends up calling:
XsamReadQueries.findElement(read, 1)
XsamReadQueries.findElement(read, 2)
XsamReadQueries.findElement(read, 3)
the second and third calls would be much faster than they are now.
To do this, you could add a method to XsamReadQueries, named something like IndexTabs, that returns an array of ints.
If you want more insight as to how to do this, you can write a comment and I'll add more information, but I'm pretty sure this would help you.
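The indexing idea above could be sketched roughly as follows. This is only an illustration under assumptions: the class name TabIndex and the methods indexTabs and field are hypothetical and are not part of the original XsamReadQueries. The record is scanned once for tab offsets, after which any field can be cut out without rescanning.

```java
import java.util.Arrays;

/** Hypothetical helper (not in the original code): one-pass tab index for a SAM record. */
final class TabIndex {
    private TabIndex() { }

    /** Scans the record once and returns the offsets of all tab characters. */
    static int[] indexTabs(String read) {
        int[] tabs = new int[16];
        int count = 0;
        for (int i = read.indexOf('\t'); i >= 0; i = read.indexOf('\t', i + 1)) {
            if (count == tabs.length) {
                tabs = Arrays.copyOf(tabs, count * 2); // grow when full
            }
            tabs[count++] = i;
        }
        return Arrays.copyOf(tabs, count); // trim to the number of tabs found
    }

    /** Returns the n'th (0-indexed) tab-delimited field, or "" if there is no such field. */
    static String field(String read, int[] tabs, int n) {
        if (n < 0 || n > tabs.length) {
            return "";
        }
        int start = (n == 0) ? 0 : tabs[n - 1] + 1;
        int end = (n == tabs.length) ? read.length() : tabs[n];
        return read.substring(start, end);
    }
}
```

SamRecord could compute the int[] lazily on the first getter call and reuse it afterwards. The trade-off is a few extra bytes per tab for every record that has been indexed, so whether it pays off depends on how many fields are typically read per record.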
Code style
There are one or two things that bother me in your code with regard to clarity and future maintenance.
You have methods such as findPhred, which call findElement, but in your SamRecord you sometimes call findElement directly and sometimes a specific find* method, which is basically the same code. You should decide on one way to do things: either have a specific method for each field in XsamReadQueries, or keep only the findElement method.
Finally, you could consider using an enum for the element parameter of the findElement method.
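As a rough sketch of the enum suggestion: the constants below use the column names of the eleven mandatory SAM fields, in the same positions the question's code passes to findElement (flag is 1, cigar is 5, quality is 10). The names SamField, SamFields and get are hypothetical and only illustrate the idea; they do not exist in the original code.

```java
/** Hypothetical enum naming the mandatory SAM columns by their zero-based position. */
enum SamField {
    QNAME(0), FLAG(1), RNAME(2), POS(3), MAPQ(4), CIGAR(5),
    RNEXT(6), PNEXT(7), TLEN(8), SEQ(9), QUAL(10);

    private final int column;

    SamField(int column) {
        this.column = column;
    }

    /** Zero-based column index of this field in a tab-delimited record. */
    int column() {
        return column;
    }
}

/** Hypothetical lookup that takes the enum instead of a bare int. */
final class SamFields {
    private SamFields() { }

    /** Returns the requested field, or "" if the record has too few columns. */
    static String get(String read, SamField field) {
        int start = 0;
        // Skip one tab per column that precedes the requested field.
        for (int skipped = 0; skipped < field.column(); skipped++) {
            int tab = read.indexOf('\t', start);
            if (tab < 0) {
                return "";
            }
            start = tab + 1;
        }
        int end = read.indexOf('\t', start);
        return read.substring(start, end < 0 ? read.length() : end);
    }
}
```

A call such as SamFields.get(read, SamField.CIGAR) then documents itself at the call site, and a typo in the column number becomes a compile error rather than a wrong field.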
Hi @IEatBagels, thanks for the answer. Indexing is a really good idea, I'll definitely look to implement that. I agree about the find* notation; that's partly because I'm halfway through coding the API and wanted some feedback before committing to one way or the other. The enum's a good idea too, it'll definitely make it more readable!
– Sam 11 hours ago
answered 13 hours ago by IEatBagels (edited 13 hours ago)