
InterMine mentors at Summer of Code 2018: Presentations on 16 August

last modified Jul 23, 2018 01:42 PM

Google Summer of Code (aka GSoC) is a program that is designed to give students experience working on open source code. Students are paid to work with an open source organisation over the months of May, June and July.

InterMine, an open source biological data warehouse based here at the Department of Genetics, has participated as a mentor organisation for the last two years, with some great results. Projects completed by students include a mixture of practical applied projects and exploratory research projects. In 2017, five students from across the globe created an iOS app that searches all InterMines for biological entities of interest (e.g. genes), ran a machine learning project to detect similarities between biological entities, extended an R package that performs biological queries in InterMine (now in Bioconductor), built a programmatically queryable registry of InterMines (and the organisms in them), and produced an exploratory port of InterMine to the graph database Neo4j.

Two months into GSoC 2018, InterMine’s students are producing another exciting set of projects. If you’d like to hear more, the students will be presenting their work during an InterMine community call on the 16th of August 2018.

Natural language biological queries

Have you ever wished you could just ask a question about your research, the same way you’d speak it, and get a result? Jake Macneal is investigating whether it’s possible to make that a reality for InterMine, figuring out how to translate questions such as “Which Drosophila genes are associated with Alzheimer’s?” into a genuine data query with meaningful results. Read a recent blog post of his on the topic.

Super-charging the InterMine Python library

The InterMine data warehouse has a Python client library that allows bioinformaticians to search, analyse and export data from any InterMine instance using Python scripts. Our GSoC student Nupur Gunwant has extended this library by adding several much-requested features. You can now easily tell which InterMine to use for a given organism, and find out what sort of data is in each mine. You can also manage your searches (if you sign up for an account), so that saved queries persist across sessions. Finally, she has used the Python graphing library matplotlib to enable data visualisations: you can plot your expression data as a heatmap or view enrichment data as a bar chart via the command line.

For tutorials and more information, see here:

Search tens of InterMines at once with the Cross-InterMine Search

Why search a single biological database when you could search dozens at once? Aman Dwivedi has been working on a site that searches all InterMines, covering a range of animals, plants, microbes, and more.

Better searches with the InterMine-Solr GSoC project

InterMine uses PostgreSQL to store data and the Apache Lucene library to index it. Arunan Sugunakumar is working on the InterMine-Solr project, which aims to replace the old version of Apache Lucene with the modern search system Solr (itself built on top of Apache Lucene) in order to improve the search functionality provided by the webapp and the REST API. This will allow InterMines to do more accurate fuzzy searches, e.g. a single search matching both “homologues” and “homologs”.
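To give a feel for what a fuzzy search does, here is a minimal Python sketch of edit-distance matching, the idea behind Lucene-style fuzzy queries (written as something like `homologs~2` in Solr's standard query syntax, where up to two single-character edits are tolerated). This is an illustration only: Solr uses far more efficient machinery internally, and the function names and the list of indexed terms below are hypothetical rather than taken from the InterMine-Solr project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, candidates, max_edits=2):
    """Return the candidates within max_edits edits of term (case-insensitive)."""
    return [c for c in candidates
            if edit_distance(term.lower(), c.lower()) <= max_edits]


# Hypothetical indexed terms, for illustration only.
INDEXED_TERMS = ["homologues", "homologs", "orthologues", "paralogs"]

matches = fuzzy_match("homologs", INDEXED_TERMS)
print(matches)  # ['homologues', 'homologs']
```

“homologues” is two deletions away from “homologs”, so it falls within the two-edit limit, while the other terms do not.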

InterMine Data Browser

A lot of the projects this year are related to searching across data, and this project’s no exception. Given the complex nature of biological data, InterMine’s data models are also somewhat complicated. We’d like to make it easier to know what types of data are present in an InterMine without forcing users to learn complex query languages. Adrian Bazaga has created a visual data browser you can try out today. Can you manage to quickly find a set of genes associated with your favourite disease, or all the proteins annotated with an interesting GO term?

Buzzbang Biological Search

Bioschemas is a broadly supported initiative to embed life sciences markup in the web, in order to make data for research more findable. The Buzzbang project is a framework for components that crawl and present this data, so that machines and humans can search it. Ankit Lohani has been working hard on creating a new Scrapy-based Python web crawler for Buzzbang, which will be more maintainable, reliable and scalable than the existing hand-crafted crawler. Now he’s working on improving the indexing of collected markup so that we can provide a more effective, Google-like, human-oriented search interface.
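As a rough illustration of the kind of markup such a crawler collects, the sketch below pulls JSON-LD blocks (one common way of embedding schema.org-style markup in a page) out of an HTML document using only the Python standard library. It is not the Buzzbang crawler, which is built on Scrapy; the class name, helper function and sample page are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than abort the crawl


def extract_jsonld(html):
    """Return a list of the JSON-LD objects embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "DataCatalog", "name": "Example Mine"}
</script>
</head><body><p>Hello</p></body></html>
"""

items = extract_jsonld(SAMPLE_PAGE)
print(items[0]["@type"], items[0]["name"])  # DataCatalog Example Mine
```

A real crawler would also need to fetch pages, follow links politely and handle other embedding formats such as microdata or RDFa, which is where a framework like Scrapy earns its keep.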

Interested in GSoC yourself now?

Members of the InterMine team would be happy to have a chat about GSoC, in person or via email, if you’d like to learn more. The only real requirement to participate as a mentor is having an open source codebase. Other GSoC mentor organisations of interest might include the Open Bioinformatics Foundation, which serves as an umbrella organisation for multiple bioinformatics-related software projects, and OpenAstronomy.