ekzhang / classes.wtf
A course catalog with extremely fast full-text search
Home Page: https://classes.wtf
License: MIT License
To get all courses offered in an academic year, we query the fall and spring terms. However, we currently encode the term specification using the incorrect key, Term, instead of STRM.
Our mistake is using the "localized" name rather than the "unlocalized" one. For example, each course has a Course Level that indicates for which group of students a course is intended, and this localized name is displayed in my.harvard's search input for readability. However, the key actually sent in the HTTP request is the unlocalized name, CRSE_ATTR_VALUE_HU_LEVL_ATTR.
my.harvard's frontend supports either version of the key, transparently translating from localized to unlocalized. We do not perform this translation, and the REST API only supports unlocalized keys; unrecognized search components are silently ignored. For example, the query anime (Course Level:"PRIMUGRD") (STRM:"2238" | Term:"2242") is the same as just anime (STRM:"2238").
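One way to avoid sending localized keys is to translate them before the query string is built. The following is a minimal sketch in Python, not the project's actual code. The `UNLOCALIZED` table and the `component` and `any_of` helpers are hypothetical names, and the table contains only the two key pairs named in this issue.

```python
# Localized (display) names mapped to the unlocalized keys that the REST API
# accepts. Only the two pairs mentioned in this issue are listed here.
UNLOCALIZED = {
    "Term": "STRM",
    "Course Level": "CRSE_ATTR_VALUE_HU_LEVL_ATTR",
}


def component(key: str, value: str) -> str:
    """Format one search component, translating the key if it is localized."""
    return f'{UNLOCALIZED.get(key, key)}:"{value}"'


def any_of(key: str, values: list[str]) -> str:
    """Build a parenthesized group matching any of the values for one key."""
    return "(" + " | ".join(component(key, v) for v in values) + ")"


query = " ".join([
    "anime",
    any_of("Course Level", ["PRIMUGRD"]),
    any_of("Term", ["2238", "2242"]),
])
print(query)
# anime (CRSE_ATTR_VALUE_HU_LEVL_ATTR:"PRIMUGRD") (STRM:"2238" | STRM:"2242")
```

Keys already in unlocalized form pass through unchanged, so callers can use either spelling.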
Queries of the form ("A":"1" | "B":"2") ("C":"3" | "D":"4") do not yield the same results as the union of the queries ("A":"1") ("C":"3" | "D":"4") and ("B":"2") ("C":"3" | "D":"4"). I'm hoping this issue can lay out systematically how they differ.
We send concurrent requests to my.harvard and aggregate the results, removing duplicates. (The key to uniquely identify courses is aptly named Key.)
However, I don't believe we should be seeing duplicates; when I manually search for allegedly duplicate courses, only one result shows up. Also, each time the course download script is run, a different number of duplicates is removed.
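To investigate, it may help to record which keys are duplicated rather than only how many. The sketch below is illustrative Python, not the download script itself. It assumes each course record is a dictionary with a "Key" field, and the sample records are made up.

```python
def dedupe_by_key(batches):
    """Merge batches of course records, keeping the first record seen per Key.

    Returns the unique records and the list of keys that appeared more than
    once, so that the duplicates can be compared between runs.
    """
    unique = {}
    duplicate_keys = []
    for batch in batches:
        for course in batch:
            key = course["Key"]
            if key in unique:
                duplicate_keys.append(key)
            else:
                unique[key] = course
    return list(unique.values()), duplicate_keys


# Made-up records standing in for the results of two concurrent requests.
fall = [{"Key": "k1", "title": "Course One"}, {"Key": "k2", "title": "Course Two"}]
spring = [{"Key": "k2", "title": "Course Two"}, {"Key": "k3", "title": "Course Three"}]

courses, duplicate_keys = dedupe_by_key([fall, spring])
print(len(courses), duplicate_keys)
```

Saving `duplicate_keys` from each run would show whether the same courses are duplicated every time or whether the set itself changes.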
For example, FRSEMR 60r, which meets from 12:00-2:45pm, is displayed as 24:00-14:45.
Courses which meet off the hour also have this issue. For example, PORTUG 220.
I am unsure if courses which end in the noon hour also have the end time affected.
Some course entries on my.harvard are themselves incorrect, such as ECON 2909, which meets from 10:30-11:45am but is listed as 10:30am-11:45pm. As such, it is correctly displayed as 10:30-23:45, since that matches the listed data.
This bug only applies to courses in AY 2022-2023.
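A plausible cause, though not confirmed here, is adding 12 to every "pm" hour without special-casing the noon hour. The sketch below shows a 12-hour to 24-hour conversion that handles 12pm and 12am. It is written in Python for illustration, and the input format (strings such as "12:00pm") and the `to_24h` name are assumptions, not the format my.harvard actually returns.

```python
import re

# Matches times such as "12:00pm" or "2:45 PM" (assumed input format).
_TIME = re.compile(r"^(\d{1,2}):(\d{2})\s*([ap]m)$", re.IGNORECASE)


def to_24h(text: str) -> str:
    """Convert a 12-hour time such as '2:45pm' to 24-hour 'HH:MM'."""
    match = _TIME.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    hour, minute, meridiem = int(match[1]), int(match[2]), match[3].lower()
    if not 1 <= hour <= 12 or minute > 59:
        raise ValueError(f"time out of range: {text!r}")
    # 12am is hour 0 and 12pm is hour 12, so reduce modulo 12 before
    # adding the afternoon offset.
    hour = hour % 12 + (12 if meridiem == "pm" else 0)
    return f"{hour:02d}:{minute:02d}"


print(to_24h("12:00pm"), to_24h("2:45pm"))  # 12:00 14:45
```

With this conversion, a listed end time of 11:45pm still becomes 23:45, so entries that are wrong on my.harvard remain wrong in the same way.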
my.harvard appears to use Oracle WebLogic Server, which in turn uses Elasticsearch to service course search queries. Elasticsearch's pagination API allows paging to the 10,000th course at most.
When we reach this limit, my.harvard returns an error message (albeit with a 200 code) instead of a JSON object.
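Because the status code is 200 even on failure, the body has to be checked before it is used. The following Python sketch shows one way to do that. The `SearchError` and `parse_search_response` names are hypothetical, and since the exact error text is not given here, the check only verifies that the body parses as a JSON object or array.

```python
import json


class SearchError(Exception):
    """Raised when a search response cannot be used as course data."""


def parse_search_response(status: int, body: str):
    """Return the decoded JSON payload, or raise SearchError.

    A 200 status is not trusted on its own, because an error message can
    arrive with a 200 code.
    """
    if status != 200:
        raise SearchError(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise SearchError(f"response is not JSON: {body[:80]!r}") from exc
    if not isinstance(payload, (dict, list)):
        # A bare JSON string or number is treated as an error message.
        raise SearchError(f"unexpected JSON payload: {payload!r}")
    return payload
```

A caller that sees `SearchError` near the pagination limit could then narrow the query (for example by term or course level) so that each result set stays under 10,000 entries.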
The title is quite self-explanatory: the website spams the browser history, adding a new entry every time the user types a character.
We incorrectly assume that every course has a Course Level, when in fact some don't. This is confirmed by exhausting all course levels, (Course Level:"UGRDGRAD" | Course Level:"GRADCOURSE" | Course Level:"INTRO" | Course Level:"NOLEVEL" | Course Level:"PRIMGRAD" | Course Level:"PRIMUGRD"), and then doing a wildcard search within FAS. The difference is about 50 courses.
The fix specific to our use case of excluding graduate-level courses is to query all courses that do not match (Course Level:"GRADCOURSE"). This can be done efficiently for my.harvard by setting the Exclude300 flag and in GraphQL by using a filter.
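The difference between the two approaches can be seen on already-downloaded records. This Python sketch is only an illustration: the "level" field name and the sample records are assumptions, and the level codes are the ones listed above.

```python
NON_GRADUATE_LEVELS = {"UGRDGRAD", "INTRO", "NOLEVEL", "PRIMGRAD", "PRIMUGRD"}


def include_known_levels(courses):
    """Allow-list approach: silently drops courses that have no level at all."""
    return [c for c in courses if c.get("level") in NON_GRADUATE_LEVELS]


def exclude_graduate(courses):
    """Deny-list approach: keeps every course except those marked GRADCOURSE."""
    return [c for c in courses if c.get("level") != "GRADCOURSE"]


# Made-up records; the last one has no level field.
sample = [
    {"id": "A", "level": "PRIMUGRD"},
    {"id": "B", "level": "GRADCOURSE"},
    {"id": "C"},
]

print([c["id"] for c in include_known_levels(sample)])  # ['A']
print([c["id"] for c in exclude_graduate(sample)])      # ['A', 'C']
```

The deny-list version keeps course C, which is the behavior wanted for courses that lack a Course Level.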