The new track in BioDalliance (BD) consists of two parts: a genetic map, and a genotype file. The genetic map contains a set of names of genetic markers along with their corresponding chromosomes and positions, while the genotype file contains a set of individuals, where each individual has a measured allele at each marker.
The R/QTL2 format ties these files together with a so-called control file: a YAML or JSON file that lists the filenames of the different types of data in the data set, along with metadata such as which alleles are present. Right now the track in BD is configured by a JSON object in the code, rather than by reading the control data from a file.
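To make the idea of a control object concrete, here is a minimal sketch of what one could look like. The keys `geno`, `gmap`, `alleles` and `genotypes` are modelled on the R/QTL2 control file as I understand it; the file names, the allele codes, and the `resolveUris` helper are made up for illustration and are not taken from the actual BD track configuration.

```typescript
// Hypothetical control object, loosely modelled on an R/QTL2 control file.
// File names and allele codes are invented for this example.
interface Qtl2Control {
  geno: string;                      // genotype CSV, relative to a base URI
  gmap: string;                      // genetic map CSV, relative to a base URI
  alleles: string[];                 // allele labels present in the data set
  genotypes: Record<string, number>; // genotype code -> integer encoding
}

const control: Qtl2Control = {
  geno: "geno.csv",
  gmap: "gmap.csv",
  alleles: ["A", "B"],
  genotypes: { A: 1, H: 2, B: 3 },
};

// Turn the relative file names into full URIs that a data source could fetch.
function resolveUris(base: string, c: Qtl2Control): { geno: string; gmap: string } {
  const prefix = base.endsWith("/") ? base : base + "/";
  return { geno: prefix + c.geno, gmap: prefix + c.gmap };
}

const uris = resolveUris("https://example.org/data", control);
```

Keeping the object in the same shape as a control file should make it straightforward to switch to reading the real file later.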
The first part was easy once I’d picked a library to do the parsing: I chose Papa Parse, which looked fast and easy to use. There weren’t many other candidates. Node CSV looked good as well, but Papa Parse has the advantage of being designed with the browser in mind, rather than Node.js.
The second module, CsvSource, communicates with both the BioDalliance browser and the Csv module. It is given a set of URIs when created - defined by the control object given to the browser - and uses the Csv module to read CSV files from these URIs. Specifically, it reads the marker locations stored in the “gmap” file, and the individual genotypes from the “geno” file.
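As a rough sketch of how the two files relate once they have been parsed, the snippet below joins marker locations with per-individual genotypes. It assumes rows in the shape a header-aware CSV parser returns (one object per row, keyed by column name), and it assumes the columns are called `marker`, `chr`, `pos` and `id`. The `joinCalls` function and all the data are hypothetical; this is not the actual CsvSource code.

```typescript
// Assumed row shapes: one gmap row per marker, and one geno row per
// individual with an "id" column followed by one column per marker.
type GmapRow = { marker: string; chr: string; pos: string };
type GenoRow = Record<string, string>;

interface MarkerCall {
  individual: string;
  marker: string;
  chr: string;
  pos: number;
  genotype: string;
}

// Pair every individual's genotype with the location of its marker.
function joinCalls(gmap: GmapRow[], geno: GenoRow[]): MarkerCall[] {
  const calls: MarkerCall[] = [];
  for (const row of geno) {
    const individual = row["id"] ?? "";
    for (const m of gmap) {
      const genotype = row[m.marker];
      // Skip markers this individual has no column for.
      if (genotype === undefined) continue;
      calls.push({ individual, marker: m.marker, chr: m.chr, pos: Number(m.pos), genotype });
    }
  }
  return calls;
}

const gmapRows: GmapRow[] = [
  { marker: "m1", chr: "1", pos: "100" },
  { marker: "m2", chr: "1", pos: "250" },
];
const genoRows: GenoRow[] = [
  { id: "ind1", m1: "A", m2: "B" },
  { id: "ind2", m1: "H" },
];
const calls = joinCalls(gmapRows, genoRows);
```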
As the genotype file is parsed, the CsvSource module creates features to be shown in the browser. The features have various properties, the most important being “min” and “max”, which are the coordinates of the left and right sides of the feature, in units of basepairs. I set “min” to be equal to the marker’s location according to the “gmap” file, and “max” to be a few basepairs to the left of the next feature’s start.
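The “min”/“max” rule can be sketched as below. Only `min` and `max` come from the description above; the other field names, the `GAP` and `LAST_WIDTH` constants, and the handling of the last marker on a chromosome (which has no next feature to stop at) are assumptions made for the example.

```typescript
interface Marker { name: string; chr: string; pos: number }
interface Feature { label: string; segment: string; min: number; max: number }

// Arbitrary illustrative values, in basepairs.
const GAP = 5;          // space left before the next feature starts
const LAST_WIDTH = 100; // width given to the last marker on a chromosome

function markersToFeatures(markers: Marker[]): Feature[] {
  // Sort by chromosome, then by position, so "next" means the next marker along.
  const sorted = [...markers].sort((a, b) => a.chr.localeCompare(b.chr) || a.pos - b.pos);
  return sorted.map((m, i) => {
    const next = sorted[i + 1];
    const max = next !== undefined && next.chr === m.chr
      ? Math.max(m.pos, next.pos - GAP) // stop a few basepairs before the next marker
      : m.pos + LAST_WIDTH;             // no next marker on this chromosome
    return { label: m.name, segment: m.chr, min: m.pos, max };
  });
}

const features = markersToFeatures([
  { name: "m2", chr: "1", pos: 250 },
  { name: "m1", chr: "1", pos: 100 },
  { name: "m3", chr: "1", pos: 400 },
]);
// m1 spans 100..245, m2 spans 250..395, m3 spans 400..500
```

The `Math.max` guard keeps `max` from dropping below `min` when two markers sit closer together than `GAP`.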
The CsvSource provides one more thing to the browser - a DAS stylesheet. This stylesheet is what the browser uses to construct its visual representations of the features it is given. At the moment, this is hardcoded, with the different alleles being represented by boxes of different colors.
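The idea of one box colour per allele can be illustrated with a plain lookup table. Note that this is not the DAS stylesheet structure that the browser actually consumes; the `BoxStyle` shape, the colours, and the `styleForGenotype` helper are all hypothetical.

```typescript
// Simplified stand-in for a style entry: a box glyph with a fill colour.
interface BoxStyle { glyph: "BOX"; bgColor: string; height: number }

// Hardcoded genotype -> colour table, with invented codes and colours.
const alleleColors: Record<string, string> = {
  A: "#1f77b4",
  H: "#999999",
  B: "#d62728",
};
const FALLBACK_COLOR = "#dddddd"; // used for unknown or missing genotypes

function styleForGenotype(genotype: string): BoxStyle {
  return { glyph: "BOX", bgColor: alleleColors[genotype] ?? FALLBACK_COLOR, height: 10 };
}

const styleA = styleForGenotype("A");
const styleUnknown = styleForGenotype("-");
```

Building the table from the alleles listed in the control object, rather than hardcoding it, would be a natural next step.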
The codebase is well-written, but quite complex, and lacking in documentation, with only a few comments, making it difficult to understand how the data flows through the system. It has taken a fair bit of digging and thinking for me to start to understand how it works.
Each day of coding starts with an IRC meeting, where we (me, my mentors, some other students, and other people working on related projects) hold a sort of light version of a Scrum standup. This has been very good: it forces you to know, concretely, what you are going to work on during the day, and it lets you see what the others have been up to, which is also interesting. Another thing that’s been encouraged is the writing of tests, and it’s been good to spend some effort doing that.