Summer of Code - Update 1

Posted on June 8, 2016 by Christian Fischer
Tags: gsoc2016

The first milestone in my Summer of Code project was to read and plot R/QTL2 genotype data in BioDalliance. A demo can be found here, and a screenshot is shown below.

BioDalliance with R/QTL genotype track

The new track in BioDalliance (BD) consists of two parts: a genetic map, and a genotype file. The genetic map contains a set of names of genetic markers along with their corresponding chromosomes and positions, while the genotype file contains a set of individuals, where each individual has a measured allele at each marker.

The R/QTL2 format combines these files using a so-called control file, which is a YAML or JSON-formatted file containing the filenames of the different types of data in the data set, along with metadata such as which alleles are present. Right now the track in BD is configured by a JSON object in the code, rather than by reading the control data from a file.
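
As a rough illustration, a control object for a small data set might look like the one below. The `geno`, `gmap`, `alleles` and `genotypes` keys are modelled on the R/QTL2 control file layout; the file names, the values, the base URL and the `resolveUris` helper are all invented for this sketch and are not the configuration actually used in BD.

```javascript
// Hypothetical control object, loosely modelled on an R/QTL2 control file.
// File names and values are invented for illustration.
const control = {
  crosstype: "riself",
  geno: "geno.csv",
  gmap: "gmap.csv",
  alleles: ["B", "D"],
  genotypes: { B: 1, D: 2 },
  "na.strings": ["-", "NA"]
};

// Hypothetical helper: resolve the file names in the control object
// against a base URL, giving the URIs a feature source would fetch.
function resolveUris(control, baseUrl) {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return {
    geno: base + control.geno,
    gmap: base + control.gmap
  };
}

const uris = resolveUris(control, "http://example.org/data");
console.log(uris.geno); // http://example.org/data/geno.csv
console.log(uris.gmap); // http://example.org/data/gmap.csv
```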

The programming part, then, was first to parse CSV files in JavaScript, followed by creating a so-called feature source that reads from a parsed CSV file and implements the interface expected by the BD browser, creating features that can be displayed in the browser.

The first part was easy, as soon as I’d picked a library to do the parsing: I chose Papa Parse, which looked fast and easy to use. There weren’t many other candidates. Node CSV looked good as well, but Papa has the advantage of being designed with the browser in mind, rather than Node.js.

Then I wrote a module, named Csv, which takes care of all the fetching and parsing of CSV files into JavaScript objects and arrays. The fetch function takes a set of filtering parameters, which currently do not do much, as well as two callbacks. The first callback is called once per parsed line, with the parsed line as a parameter. The second is called once parsing is complete, with the array of all parsed lines as a parameter.
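
The shape of that two-callback interface can be sketched without any library. The function below is a simplified stand-in, not the Csv module itself: `parseCsvText` is a hypothetical name, it works on text that is already in memory instead of fetching a URI, it ignores the filtering parameters, and it splits naively on commas where the real module relies on Papa Parse.

```javascript
// Simplified stand-in for the Csv module's interface. No fetching and no
// handling of quoted fields; the first line is treated as the header.
function parseCsvText(text, onLine, onComplete) {
  const lines = text.split(/\r?\n/).filter(line => line.trim() !== "");
  if (lines.length === 0) {
    onComplete([]);
    return;
  }
  const header = lines[0].split(",");
  const rows = [];
  for (const line of lines.slice(1)) {
    const cells = line.split(",");
    const row = {};
    header.forEach((name, i) => { row[name] = cells[i]; });
    rows.push(row);
    onLine(row);      // first callback: once per parsed line
  }
  onComplete(rows);   // second callback: once, with all parsed lines
}

// Invented sample data in the spirit of a "gmap" file.
const gmapText = "marker,chr,pos\nm1,1,1000\nm2,1,5000\n";
const seen = [];
let allRows = null;
parseCsvText(
  gmapText,
  row => seen.push(row.marker),
  rows => { allRows = rows; }
);
console.log(seen);           // [ 'm1', 'm2' ]
console.log(allRows.length); // 2
```

Note that every cell comes back as a string here, so positions would still need converting to numbers before being used as coordinates.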

The second module, CsvSource, communicates with both the BioDalliance browser and the Csv module. It is given a set of URIs when created - defined by the control object given to the browser - and uses the Csv module to read CSV files from these URIs. Specifically, it reads the marker locations as stored in the “gmap” file, and the individual genotypes from the “geno” file.

As the genotype file is parsed, the CsvSource module creates features to be shown in the browser. The features have various properties, the most important being “min” and “max”, which are the coordinates of the left and right sides of the feature, in units of basepairs. I set “min” to be equal to the marker’s location according to the “gmap” file, and “max” to be a few basepairs to the left of the next feature’s start.
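
A minimal sketch of that coordinate rule, assuming the markers all lie on one chromosome: `makeFeatures` is a hypothetical helper, the property names other than `min` and `max` are guesses at what a feature might carry, and the 10 bp gap and the 1000 bp width given to the last marker are arbitrary values chosen for the example.

```javascript
// Hypothetical helper: combine marker positions (from the "gmap" file) with
// one individual's genotypes (from the "geno" file) into feature-like objects.
function makeFeatures(markers, genotypes, gap = 10, lastWidth = 1000) {
  // Sort a copy by position so "next marker" is well defined.
  const sorted = markers.slice().sort((a, b) => a.pos - b.pos);
  return sorted.map((marker, i) => {
    const next = sorted[i + 1];
    return {
      segment: marker.chr,
      id: marker.marker,
      min: marker.pos,
      // End a few base pairs before the next marker; the last marker has
      // no neighbour, so it gets an arbitrary fixed width.
      max: next ? next.pos - gap : marker.pos + lastWidth,
      type: genotypes[marker.marker]
    };
  });
}

// Invented sample data.
const markers = [
  { marker: "m1", chr: "1", pos: 1000 },
  { marker: "m2", chr: "1", pos: 5000 },
  { marker: "m3", chr: "1", pos: 9000 }
];
const individual = { m1: "B", m2: "D", m3: "B" };

const features = makeFeatures(markers, individual);
console.log(features[0].min, features[0].max); // 1000 4990
console.log(features[2].min, features[2].max); // 9000 10000
```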

The CsvSource provides one more thing to the browser - a DAS stylesheet. This stylesheet is what the browser uses to construct its visual representations of the features it is given. At the moment, this is hardcoded, with the different alleles being represented by boxes of different colors.
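
The idea of one box style per allele can be sketched as plain data. This does not use BD's own stylesheet classes: `makeStyles` and the colors are hypothetical, and the attribute names (`BGCOLOR`, `FGCOLOR`, `HEIGHT`) are borrowed from the DAS stylesheet vocabulary for the `BOX` glyph on the assumption that the hardcoded stylesheet carries something similar.

```javascript
// Hypothetical allele-to-color mapping; the colors are arbitrary.
const alleleColors = { B: "blue", D: "red" };

// Build one BOX style per allele, plus a fallback for anything else
// (for example, missing genotypes).
function makeStyles(colors, fallback = "grey") {
  const boxStyle = (type, color) => ({
    type: type,        // matched against a feature's type
    glyph: "BOX",
    BGCOLOR: color,
    FGCOLOR: "black",
    HEIGHT: 10
  });
  const styles = Object.keys(colors).map(allele => boxStyle(allele, colors[allele]));
  styles.push(boxStyle("default", fallback));
  return styles;
}

const styles = makeStyles(alleleColors);
console.log(styles.map(s => s.type + ":" + s.BGCOLOR).join(", "));
// B:blue, D:red, default:grey
```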

There have been two problems: JavaScript and the BioDalliance codebase. JavaScript is a dynamically typed language, and I wasted a fair few hours chasing bugs caused by typos in a function parameter - twice an error in the place where I called Papa Parse’s parse function led to the browser reporting an error at some seemingly random place in the middle of Papa’s code.

The codebase is well-written, but quite complex, and lacking in documentation, with only a few comments, making it difficult to understand how the data flows through the system. It has taken a fair bit of digging and thinking for me to start to understand how it works.

Each day of coding starts with an IRC meeting, where we (me, my mentors, some other students, and other people working on related projects) hold a sort of light version of a Scrum standup. This has been very good: it forces you to know, concretely, what you are going to work on during the day, and it lets you see what the others have been up to, which is also interesting. Another thing that’s been encouraged is the writing of tests, and it’s been good to spend some effort doing that.