xBio:D Cyberinfrastructure
Biodiversity information not limited to Hymenoptera

This is +Joe Cora, Biodiversity Informatics Manager for xBio:D at Ohio State University. I recently took another position and will regretfully be leaving the university this Friday, August 7th. It has been a great pleasure to work in the insect collection and advance systematics research, and I would like to thank Norm for his mentorship and for giving me an opportunity to contribute to the cause of science.

If you have any questions related to data management or the xBio:D web applications, please direct them to +Norman Johnson <johnson.2@osu.edu>. I will still be available at my Ohio State email address, <cora.1@osu.edu>, although my new work duties will not allow me to contribute in a very substantive way.

Thanks to all the xBio:D users for your tireless contributions, and please continue to use this resource in the future.

+Joe Cora 

Metrics are fun.

xBio:D users who contribute occurrence records all want answers to the same questions:

How much time will it take to enter these records?
How many records has X entered?
How quickly can another user process Y records?

Ask any of these questions of an administrator of many other biodiversity systems and you will get some hocus-pocus followed by a result of suspect origin. With xBio:D, however, we can present an answer that is grounded in real data and does not rely upon cherry-picked examples.

Since the inception of DEA2 <http://xbiod.osu.edu/dea/>, every user action has been recorded, not only to keep track of progress within a file being processed but also to evaluate the amount of time required to complete a given task. After over a year of accumulating user actions, the time has come to set these data free, or at least to present some meaningful results from them on a page within DEA2.
 
DEA2 users can access the Usage Statistics page by clicking Help->Usage Statistics from the menu. By default, the page will display measurables for each processing act that occurred within the past month and provide an “mpk” average beside it. The minutes needed to process 1000 entries, or mpk, is a value that is useful for projecting expectations and evaluating efficiency. The mpk is calculated by taking the total amount of time expended when performing the processing act, dividing it by the total number of entries used to calculate the elapsed time, then converting it into the appropriate units. Before calculating the mpk, any outlying actions that may be characteristic of non-contiguous processing by the user, e.g., a bathroom-break delay, are omitted so as to present a measure that is reflective of effort.
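
For the curious, here is a minimal sketch of the mpk calculation in Python. The outlier cutoff and the data layout are my assumptions for illustration, not the actual DEA2 code:

    from datetime import timedelta

    # Assumed cutoff for non-contiguous work; DEA2's real threshold may differ.
    OUTLIER_GAP = timedelta(minutes=15)

    def mpk(actions):
        """actions: list of (elapsed_timedelta, entry_count) tuples."""
        total_seconds = 0.0
        total_entries = 0
        for elapsed, entries in actions:
            if elapsed > OUTLIER_GAP:  # drop breaks, e.g., a bathroom-break delay
                continue
            total_seconds += elapsed.total_seconds()
            total_entries += entries
        if total_entries == 0:
            return None
        # seconds per entry -> minutes per 1000 entries
        return total_seconds / total_entries / 60.0 * 1000.0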
 
Along with the metrics table, a couple of graphs are displayed as visual aids for interpreting these data. One, a donut graph, represents a breakdown of the number of entries performed by all of the users, or by a specific user, during the given time interval. The other is a scatter plot of the selected processing act showing all of the users. In the scatter plot, the x-axis is the total number of entries and the y-axis is the mpk average, with a lower average being indicative of a more efficient use of time. By selecting a processing act in the metrics table, the selected act will pop out of the donut graph and the underlying data will change for the scatter plot. Likewise, a user (I used Amy Dolan from Montana State as an example below; sorry, Amy!) may be selected within the user list to update the metrics table, update the donut graph, and highlight the user within the scatter plot. The default processing act when switching user results is occurrence records entered.
 
One item of note: as more and more records with verbatim label data are processed within DEA2, I expect to see a downward trend in the amount of time necessary to process collecting event-related acts (Localities, Dates, Collecting Methods, Collectors, Habitats and Field Codes Set), which will be reflected in the mpk average. So far, so good (see Excel graph below). But will the trend continue...?
 
+Joe Cora 
[Album: DEA2 User Metrics, 7 photos]

Last year, I created an updated version of the xBio:D managed data entry web application, titled the Data Entry Assistant 2.0, or DEA2. DEA2, http://xbiod.osu.edu/dea/, was rebuilt from scratch using the Python web framework Django and a few other Django packages; I will save you the details until a bit later. The important point of the upgrade is that it made the entire data entry procedure much faster, more reliable, and more scalable. Other collections can now easily process and enter their own specimen records following the established DEA2 data entry protocols.

The DEA2 user procedures have also been updated on the xBio:D wiki, http://xbiod.osu.edu/wiki/Data_Entry_Assistant_%28DEA%29_Help. The new user guide contains information on how to process specimen records throughout every stage of dealing with specimen data. The guide now includes some xBio:D controlled vocabulary terms like life statuses, preparation types and contents, and association types. Enforcing controlled vocabularies ensures that data elements conform to the same semantics, so data sets are homogeneous in structure.

Some other changes from the original DEA to the new DEA2 are:
* Recorded actions - each user action, from loading a file to entering a record, is recorded. This allows users to arbitrarily switch between files, process a file from multiple computers, and get detailed metrics of how long each processing stage takes and how efficient a user is relative to his/her peers.
* Background processing - by processing events in the background, the limitations levied by connection timeouts are no longer applicable. The previous record upload limit of ~500 has been effectively increased to roughly 8,000 specimen records in testing. The only remaining restriction is the size of the Excel file, which must be under 30MB.
* "new_comments" field collation - the verbatim label data and the comments field are combined into a single field called "new_comments", which should represent all of the known information about the condition and collection of a specimen. The "new_comments" field was previously required to be created using an Excel formula prior to upload into DEA. This requirement has been removed, since DEA2 now performs the collation within the application. Also, the limit on the number of potential labels has been removed, as long as the field header matches the expected format, i.e., "Label" followed by the numerical position of the label.
* Consistency checks - DEA2 now performs many new consistency checks on the incoming data to ensure that they are valid, e.g., date format verification, cuid uniqueness check, etc. (see the sketch after this list).
* and many more!
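
To make the consistency checks concrete, here is a rough sketch of that kind of validation in Python. The column names and the date pattern are assumptions on my part, not DEA2's actual rules:

    import re
    from collections import Counter

    # Assumed date format for illustration, e.g., 22-MAR-1998.
    DATE_RE = re.compile(r"^\d{1,2}-[A-Z]{3}-\d{4}$")

    def check_rows(rows):
        """rows: list of dicts parsed from the Excel data entry template."""
        errors = []
        cuid_counts = Counter(row["cuid"] for row in rows)
        for i, row in enumerate(rows, start=1):
            if row.get("date") and not DATE_RE.match(row["date"]):
                errors.append("row %d: bad date format %r" % (i, row["date"]))
            if cuid_counts[row["cuid"]] > 1:
                errors.append("row %d: duplicate cuid %r" % (i, row["cuid"]))
        return errors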

Here are some technical details on DEA2 and background processing with the application.

DEA2, being a Django-based web application, interacts with the local DEA2 database through a data model that maps to all of the contents of the data entry template Excel file. Rather than making individual SQL statements, Django handles database interaction through an object-oriented, model-based approach, which is incredibly convenient. Once entered, records from the DEA2 local database can be retrieved, updated, and deleted with ease. However, the Django data management approach does not necessarily scale well to Big Data problems, which is how I think large biodiversity challenges can be viewed. Because of this, DEA2 communicates with the xBio:D database through a series of APIs to retrieve existing information and enter new specimen records. The xBio:D database, which is an Oracle 11g enterprise database, has the speed and scalability to deal with these data dilemmas, but this is an issue to address in the future...
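
For readers unfamiliar with Django, here is a hypothetical fragment of what such a data model looks like; the model and field names are illustrative, not the actual DEA2 schema:

    from django.db import models

    class OccurrenceRecord(models.Model):
        cuid = models.CharField(max_length=32, unique=True)   # collecting unit ID
        verbatim_label = models.TextField(blank=True)
        date_collected = models.CharField(max_length=20, blank=True)

    # The ORM then stands in for hand-written SQL:
    #   OccurrenceRecord.objects.filter(date_collected__endswith="1893")
    #   record.save()    # update
    #   record.delete()  # delete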

When dealing with thousands of records and multiple concurrent users, a single, iterative DEA2 request system is not feasible. Celery, a distributed task queue for Python, works well with Django to queue multiple requests or tasks. This allows DEA2 to efficiently process actions from many users working on very large files at the same time while not placing an undue burden on the server. Celery uses RabbitMQ, which supports the Advanced Message Queuing Protocol, to broker messages sent from Django to be queued and processed in the background. After all of the records associated with the request have been completed, DEA2 will receive a message notifying it of completion; the rest of the processing of the file may then continue, and the application as a whole can purr along smoothly.
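
As a hypothetical sketch of what this looks like in code (the task and the helper functions are my inventions for illustration, not DEA2's actual code):

    from celery import shared_task

    @shared_task
    def enter_records(file_id):
        """Runs in a worker via RabbitMQ, outside the request/response
        cycle, so large files no longer hit connection timeouts."""
        for record in load_records(file_id):  # assumed helper
            enter_record(record)              # assumed helper
        notify_completion(file_id)            # assumed helper

    # From a Django view, the work is queued rather than run inline:
    #   enter_records.delay(file_id)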

If you are interested in using DEA2 and contributing occurrence records to xBio:D, please contact me and we can get started right away!

+Joe Cora 
[Album: 3 photos, 2014-11-24]

I have established a pretty consistent trend of making new posts every 15 months. Will this trend continue? We shall see... (suspense)

Let us begin with some housekeeping information. The name of the Google+ page has been changed from "Hymenoptera Online (HOL)" to "xBio:D Cyberinfrastructure" to better reflect the nature of the posts that will come. xBio:D is the name for the cyberinfrastructure housed in the Museum of Biological Diversity at the Ohio State University. More than just a set of servers and a database, xBio:D represents a cloud-based services model for biodiversity research in any taxonomic discipline. More on xBio:D and its services to come.

Now onto the meat of this post. One of the limitations of our occurrence records was that xBio:D did not support multiple specimens of varying sex, life stage, preparation, etc. associated with a single unique identifier. This created inconveniences for mite workers with different life stages of a single species on a slide, or fish workers where a single catalog number was shared between specimens in a jar of ethanol and specimens stored dried. This issue has now been resolved with the implementation of specimen groups. Using a custom format for separating the individual specimen group components upon data transcription (see the "spm_num" description on the xBio:D wiki - http://osuc.osu.edu/osucWiki/Data_Transcription_Procedures#Data_Entry_Template_Information), reflecting specimen groups in an Excel spreadsheet and processing/entering the records with DEA2 is straightforward and assured to conform to existing conventions. Below are some screenshots illustrating the use of specimen groups from data entry, to dissemination, to record management. Also, all specimen group information is available in the Excel and Darwin Core Archive exports from HOL framework portals. Upon the next harvesting, any biodiversity data aggregators using xBio:D data, such as iDigBio, +GBIF, and +VertNet Project, will be enriched with these extra occurrence features.

Another addition to occurrence records within xBio:D that was mentioned earlier is preparations. Preparations seem to be a catch-all for collections and are interpreted in a number of different ways. In essence, a preparation represents the manner in which a vouchered occurrence record is stored, and possibly some additional information about the composition of its contents. In xBio:D, a specimen or specimen group can have any number of preparations associated with it. For example, a single occurrence record could have five female adults in a vial and on two slides, a male adult on a slide, and 20 unspecified nymphs in a jar of 70% ethanol (5|F|adult[vial;2|slide|]; 1|M|adult[slide]; 20|U|nymph[1|jar|70% ethanol]). As long as the occurrence record is associated with a single collecting event and identified as a single taxonomic entity, the combination of specimen groups and preparations can be limitless. Scary thought!
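
To illustrate the notation, here is my reading of it as a small Python parser. This is a hypothetical sketch based only on the example above, not the actual xBio:D/DEA2 implementation; consult the wiki's "spm_num" description for the authoritative format:

    import re

    # count|sex|life stage, with an optional bracketed preparation list.
    GROUP_RE = re.compile(r"^(\d+)\|([MFU])\|([^\[]+?)(?:\[(.*)\])?$")

    def parse_specimen_groups(spm_num):
        groups = []
        for part in spm_num.split("; "):  # assumed top-level separator
            m = GROUP_RE.match(part.strip())
            if not m:
                raise ValueError("unrecognized specimen group: %r" % part)
            count, sex, stage, preps = m.groups()
            # each preparation is count|type|contents, with fields optional
            preparations = [p.split("|") for p in (preps or "").split(";") if p]
            groups.append({"count": int(count), "sex": sex,
                           "life_stage": stage.strip(),
                           "preparations": preparations})
        return groups

    # parse_specimen_groups("1|M|adult[slide]") ->
    #   [{'count': 1, 'sex': 'M', 'life_stage': 'adult',
    #     'preparations': [['slide']]}]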

+Joe Cora 
[Album: 3 photos, 2014-09-03]

This is a long overdue post.

After a lengthy wait and some administrative travails, the HOL app is now available for the iPhone and iPad. The app connects to the same data-providing API that the HOL web application uses, but with some extra features specific to the mobile platform. Using GPS or WiFi-based location finding, you can gather all of the occurrence records that were collected near your current location. Then, in a few taps, you can get the driving directions to an interesting collecting spot. The same applies to distribution maps for taxa. Additionally, you can save PDFs that are in the public domain to the HOL app for offline reading. Expect these new features and much of the existing functionality of the web application in this iPhone/iPad app.

If you have an iPhone or iPad, download the HOL app for free from the Apple App Store (https://itunes.apple.com/us/app/hol/id646020193?mt=8).

I am eager to get feedback on the HOL app, so please leave me comments here or through email (hol-help@osu.edu). Your suggestions will dictate what will be included in the upcoming version release!

+Joe Cora 
[Album: HOL iOS, 5 photos]

Feature Updates:

While undergrads transcribe specimen labels and process them for database entry, they often ask whether an incomplete date is a year and month or merely a month and a day (e.g., 'III-22'). My suggestion is generally to look up the collector in HOL and evaluate whether there is a consistent pattern for this collector. When the collector is not consistent, this leads to an extremely incomplete date like MAR. In order to provide another tool for evaluating dates, as well as some very useful collecting event/publication date checking, I added the years of birth and death to some well-represented collectors and well-known authors in HOL. In John Caldwell's case (http://hol.osu.edu/agent-full.html?id=2298), we had a few specimens that were entered as being collected in 1920 and 1922, when he was respectively 9 and 11. This turned out to be incorrect and was promptly fixed. When students evaluate his dates from now on, they will recognize that Caldwell could not have collected a specimen in 1910, regardless of what a little Arizona law states.
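
A minimal sketch of the kind of lifespan check this enables, with attribute names that are my assumptions rather than HOL's actual data model:

    def plausible_collection_year(collector, year):
        """Flag collecting years that fall outside a collector's lifetime."""
        if collector.birth_year is not None and year < collector.birth_year:
            return False  # collected before the collector was born
        if collector.death_year is not None and year > collector.death_year:
            return False  # collected after the collector died
        return True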

Another feature update is to the pagination of certain large data tables found on a collector's page, collection page, and such. With the old page-by-page system, scrolling through many, many pages of information to find specimens collected on a particular date was tedious to say the least. Now, upwards of 20 jumpable pages will be displayed at the top and bottom of large data tables. This addition should greatly improve the browsing experience within HOL and reduce the strain on the index finger from rapidly clicking 'next'.

One last minor update is to the way that HOL handles the sorting of incomplete dates. I was always kind of irked that specimens with incomplete collecting dates were placed at the end of the dataset, and that the oldest specimen listed in a collection was merely the oldest specimen with a complete date. So, I took care of it. If you look at the Cornell University Insect Collection (http://hol.osu.edu/inst-full.html?id=11), you will notice that the oldest specimen is properly listed as from 'SEP-1893' rather than what was previously shown, '2 August 1894'. Once again, not a huge deal, but it does make everything displayed more in line with reality - which is good.
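
A sketch of one way to sort incomplete dates among complete ones, in the spirit of the fix described above; the tuple layout is an assumption on my part:

    def date_sort_key(year, month=None, day=None):
        """Missing parts sort as early as possible within the known period,
        so 'SEP-1893' (1893, 9, None) precedes '2 August 1894' (1894, 8, 2)."""
        return (year, month or 0, day or 0)

    # sorted(dates, key=lambda d: date_sort_key(*d)) lists SEP-1893 first.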

+Joe Cora 
[Album: 2 photos]

Feature Update and Modification:

We would like to live in a collections-based universe where each specimen or lot is uniquely identified by a single identifier, but that is not always the case. In order to deal with this conundrum, we have added some structure to our database to support linking multiple alternate IDs to a master unique ID. Support for this feature is long overdue, but the demand seemed too great to ignore any further.
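
Conceptually, the added structure amounts to a small link table. Here is a hypothetical illustration in Django notation as shorthand only (xBio:D itself is an Oracle database, and these names are mine, not the actual schema):

    from django.db import models

    class AlternateID(models.Model):
        master_id = models.CharField(max_length=64)               # master unique ID
        alt_id = models.CharField(max_length=64, db_index=True)   # searchable alternate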

The alternate IDs for unique identifiers can be searched but are only displayed on the search and specimen information pages. Currently, the alternate IDs are omitted from map exports - for now at least. If you would like to see alternate IDs present in export formats or elsewhere on HOL, let me know.

One final note in the recurring drama of dealing with the ArcGIS mapping API. If you noticed the pub-ready maps not working recently, that was because of changes within the ArcGIS API that break previous API functionality. I can appreciate the need for keeping an API lean, but the lack of any sort of deprecation-to-extinction protocol is frustrating. In this most recent bout, ArcGIS once again changed the path to their map layers. This has happened a few times in the past - no big deal. However, the distribution points were all clustering at the origin of the map. After quite a bit of headache, I realized that a new translation had to be applied to the coordinates to allow them to be properly rendered on a web-based map (geographicToWebMercator). ArcGIS: when a new version of the API is released, don't change the functionality in prior API versions. Please.

+Joe Cora 
[Album: 2 photos]

Feature Modification:

There has been a large amount of demand for single-page access to literature through annotations. Previously, the HOL browsing functionality was limited to publications that were either in the public domain or open access. Now, single-page browsing is available even for publications that are presumably under copyright. Those publications, however, have a browse limit of a maximum of two pages beyond the initial landing page, enough to allow the reading of a full description. Check out the two browsing styles from within HOL below.

+Joe Cora 
[Album: 2 photos]