yalies / api

👥 The best directory of Yale personnel, with a clean API to match. Used by 70% of undergrads!

Home Page: https://yalies.io

HTML 11.62% Python 71.80% Shell 0.46% CSS 3.63% JavaScript 12.16% Mako 0.24% Procfile 0.02% Dockerfile 0.08%

api scraping yale

api's Issues

Allow copying email list with button

Allow passing include parameter listing fields to include in response

Consider scraping social media to find people's IG/Twitter profiles

Capitalized 'True' used in JSON on API docs page

Don't send query and filters props unless they're occupied

Some emeritus professors have longer, alphabetical-only netids

They usually get caught coincidentally right now, but theoretically if they started with a common prefix they could get missed.

Fix 'Law School' and 'School of Law' being separate schools/organizations

Manage API tokens and keep track of who is using them for what applications

In filters endpoint, support listing filters already applied and return results based on what other options would be supported

So, for example, you could pass {'school_code': ['YC']} and it would first filter down to just the options found on the resulting rows.

This will allow us to move towards building the filters call in JS from the front end rather than through jinja.
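A minimal sketch of that idea, assuming rows are plain dicts and that values within one filter are OR-ed together. The field list, the function names, and the sample rows are all hypothetical, not the project's actual schema:

```python
# Hypothetical filterable columns; the real list lives in the project.
FILTERABLE_FIELDS = ("school_code", "college", "year")


def matches(row, applied):
    """True if the row satisfies every applied filter (values per field are OR-ed)."""
    return all(row.get(field) in allowed for field, allowed in applied.items())


def available_options(rows, applied):
    """For each filterable field, list the distinct values on rows that survive `applied`."""
    remaining = [row for row in rows if matches(row, applied)]
    options = {}
    for field in FILTERABLE_FIELDS:
        values = {row[field] for row in remaining if row.get(field) is not None}
        options[field] = sorted(values)
    return options


# Made-up sample rows, for illustration only.
people = [
    {"school_code": "YC", "college": "Hopper", "year": 2024},
    {"school_code": "YC", "college": "Berkeley", "year": 2025},
    {"school_code": "LW", "college": None, "year": None},
]
yc_options = available_options(people, {"school_code": ["YC"]})
```

In a real implementation this would more likely be a `SELECT DISTINCT` per column on the filtered query rather than a pass over rows in Python.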

Use people endpoint from front end

Jesus, we should at least follow our own advice

Stop scraper if there aren't any students found to prevent a bad page load from emptying database

List properties that should be included in JSON serialization of object

Rather than excluding certain properties. This way we can make it ordered also.
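One way this could look, as a sketch only: an ordered allow-list on the model that drives serialization. The class body, the `_to_expose` name, and the fields are assumptions for illustration, not the existing Person model:

```python
import json


class Person:
    # Hypothetical ordered allow-list: only these attributes are serialized, in this order.
    _to_expose = ("netid", "first_name", "last_name", "email")

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def to_dict(self):
        """Serialize only allow-listed attributes; missing ones become None."""
        return {name: getattr(self, name, None) for name in self._to_expose}


# Made-up record; `internal_id` stands in for a column that should never leak.
person = Person(internal_id=7, last_name="Doe", first_name="Jane", netid="abc12")
payload = json.dumps(person.to_dict())
```

Because dicts preserve insertion order, the JSON keys come out in allow-list order, and any new column stays private until someone adds it to the list.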

Add 'repeat search without filters' button

Add information about YCS partnership on about and splash pages

Since this isn't solely my project anymore, it would be nice to update the About page to explain that this is a YCS project. Maybe add a logo too. And then consider adding similar content to the pre-login homepage (splash.html) as well.

Allow expanding room numbers for more explanation

Fix inconsistency of g.me and g.user

Tag Eli Whitney students

Require API users to apply for access to certain fields

Some fields like address, residence, etc. are somewhat private. It could be nice to support a review system where people can be approved for access to those fields, but don't get them by default.

Give clearer error when a user has been banned, isn't eligible to view website, etc.

Currently we just abort their request with a vague error message. We should give an explanation of why they can't access the information, so it doesn't just look like the website broke.

Some people have school but not school_code, or organization but not organization_id

Support page_size option API requests

To allow fetching a page of size other than the currently hardcoded 20
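A rough sketch of how the option might be handled, assuming 1-indexed pages and an upper bound on the page size. `MAX_PAGE_SIZE` and the `paginate` helper are hypothetical; only the default of 20 comes from the current behavior:

```python
DEFAULT_PAGE_SIZE = 20   # the currently hardcoded value
MAX_PAGE_SIZE = 100      # hypothetical cap so one request can't pull everything


def paginate(items, page=1, page_size=None):
    """Return one page of `items`; pages are assumed to be 1-indexed."""
    size = DEFAULT_PAGE_SIZE if page_size is None else page_size
    if isinstance(size, bool) or not isinstance(size, int) or size < 1:
        raise ValueError("page_size must be a positive integer")
    if isinstance(page, bool) or not isinstance(page, int) or page < 1:
        raise ValueError("page must be a positive integer")
    size = min(size, MAX_PAGE_SIZE)  # clamp rather than reject oversized requests
    start = (page - 1) * size
    return items[start:start + size]
```

Whether oversized values should be clamped or rejected with an error is a design choice; clamping is just the more forgiving option.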

Use Directory email if not included in facebook

Come up with a new way to tell if people are on leave

Currently we just check if the graduation year of each student has increased since the saved copy we have from last year. Once this semester ends, we'll no longer have a reliable way to tell whether people are still on leave or if they only took one semester off. We'll either need to find another way to get leave data, or change the labeling to signify that this student HAS taken a leave but may not necessarily still be on it.
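For reference, the relabeling option could be sketched like this: keep a sticky "has taken a leave" flag instead of claiming the student is currently on leave. The function name and the netid-to-year dict shape are assumptions made for the example:

```python
def taken_leave(previous_years, current_years, already_flagged=frozenset()):
    """Return netids that have ever taken a leave.

    A student is flagged when their graduation year is later than in the
    previous snapshot. Earlier flags are carried forward, because a later
    snapshot can no longer tell us whether the leave is still ongoing.
    """
    flagged = set(already_flagged)
    for netid, year in current_years.items():
        old_year = previous_years.get(netid)
        if old_year is not None and year is not None and year > old_year:
            flagged.add(netid)
    return flagged
```

This does not solve the underlying data problem; it only keeps the label honest about what we actually know.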

Block non-undergrads from viewing website

Automate CAS login in scraper to remove need for manually providing cookies

Automatically put current user's profile first

Oftentimes people just want to see what their own profile looks like, so why not sort it to the top automatically? Or at least provide the option?

Scrape more faculty information from department websites

Lots of academic departments at Yale have People pages that list (apparently in a somewhat consistent format) all the people (grad students, faculty, staff, etc.) in the department.

These websites have lots of extra information, such as:

Suffixes (M.D., Ph.D., etc.)
Links to personal or lab webpages
Full professorship titles (for example "Sterling Prof of Sociology, Director, Urban Ethnography Project; Prof African American Studies")
Pictures

Examples:
https://ling.yale.edu/people
https://cpsc.yale.edu/people
https://afamstudies.yale.edu/people
https://math.yale.edu/people
https://mcdb.yale.edu/people
https://medicine.yale.edu/anesthesiology/people/

Many more... full list here: https://www.yale.edu/academics/departments-programs

Executing a search twice in sequence may result in duplicate students showing up

Refactor scraper into multiple files

It's huge, and once we implement #44, it's only gonna get huger. We should split different components into multiple files. Some ideas for divisions:

face_book
directory
department_websites
util (for example clean_* functions)

Support filtering/searching by birthdays

Add admin and banned columns to users table

Currently, when checking that a user is permitted to perform certain privileged operations (e.g. running the scraper), we just check whether the user's CAS NetID is equal to my NetID (ekb33). We should add a boolean admin column to the users table that would allow users to be set as administrators, and then check whether the current user is an admin when attempting privileged operations, rather than checking against my hardcoded NetID. If you really want to be fancy, you could try to figure out how to add a decorator for this (like @admin_required, similar to how flask-cas and flask-login implement @login_required).

For banned, it would be good to be able to ban individual users who we don't want using the site. Just in case.
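A framework-free sketch of what such a decorator might look like. The `session` dict stands in for however the app exposes the current user (for example Flask's `g`), and the `User` class, `Forbidden` exception, and `run_scraper` function are all hypothetical:

```python
import functools


class Forbidden(Exception):
    """Stand-in for aborting the request with a 403."""


class User:
    def __init__(self, netid, admin=False, banned=False):
        self.netid = netid
        self.admin = admin    # would map to the proposed boolean admin column
        self.banned = banned  # would map to the proposed boolean banned column


# Stand-in for the per-request current user.
session = {"user": None}


def admin_required(view):
    """Only run `view` when the current user exists, is an admin, and is not banned."""
    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        user = session["user"]
        if user is None or user.banned or not user.admin:
            raise Forbidden("administrator access required")
        return view(*args, **kwargs)
    return wrapper


@admin_required
def run_scraper():
    return "scraper started"
```

In the real app the wrapper would probably call `abort(403)` rather than raise a custom exception, but the check itself would be the same.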

Re-show residence filters once room numbers return to Face Book

The Face Book has removed all room numbers this semester. This may be because of us. It also may be because of the irregularity of COVID. For this reason, I hid the filters that use room numbers (building code, entryway, floor, etc.) in app/templates/index.html. If/when room numbers are put back, we should show these filters again.

Fix database lockup

Write more complete API documentation

I think it would be really cool to have a Swagger docs system, where you can test the API in-browser and see what the responses are like. At minimum we should document the filters endpoint and add a list of the fields Person has.

Disable clear filters button if no filters are selected

Currently you can just click it repeatedly and it'll just refresh over and over. Seems like a clumsy behavior, would be better if it just did nothing.

Make sure we can properly scrape people with the same name

/auth generated tokens will be rejected because they aren't added as keys

Clean up major names

If no people were found on face book page and we abort scraping, delete the saved page file

This failure is caused when authentication has failed. Currently, if we change the passed token so that it's valid, and then immediately rerun the scraper, it'll use the existing page.html file from when the request failed, and the problem won't be fixed until we restart the Heroku dyno (which resets the ephemeral filesystem).

Add code to the failing case to delete the page.html file.
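Something along these lines might work, as a sketch: the `load_people` helper, the `ScrapeAborted` exception, and the parser stub are hypothetical, and only the page.html filename comes from the description above.

```python
import tempfile
from pathlib import Path


class ScrapeAborted(RuntimeError):
    """Raised when a saved page yields no people."""


def load_people(page_path, parse):
    """Parse the cached page; if nothing is found, delete the cache and abort."""
    people = parse(page_path.read_text())
    if not people:
        # Remove the bad page so the next run re-downloads it instead of reusing it.
        page_path.unlink(missing_ok=True)
        raise ScrapeAborted(f"no people found in {page_path.name}; cached page deleted")
    return people


# Usage with a throwaway directory and a parser stub that finds nobody.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html>Please log in</html>")
    try:
        load_people(page, parse=lambda html: [])
    except ScrapeAborted:
        pass
    page_was_deleted = not page.exists()
```

Raising before any database writes happen also covers the related case where a bad page load would otherwise empty the database.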

When 'Other' is selected, it sometimes causes SQL error

ElasticSearch can return different results each search, causing different ordering for each page of results

Pretty much the title. If you do a broad search that returns many results ("Hopper", for instance), you may notice that some people are duplicated across pages, or possibly omitted. This is only an issue for very large searches (which few people do, apparently preferring to use filters), but it's a very obvious problem once you notice it. One solution to this might be to use ES's scrolling features.
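An alternative to scrolling, sketched here under the assumption that the instability comes from ties in relevance score: add a unique tiebreaker to the sort so the ordering is the same on every request. The searched field names, the `netid` keyword field, and 0-indexed pages are assumptions, not the project's actual mapping:

```python
def build_search_body(query, page, page_size=20):
    """Build an Elasticsearch request body with a deterministic sort order."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                # Hypothetical searchable fields.
                "fields": ["first_name", "last_name", "college"],
            }
        },
        # Sort by relevance first, then break ties on a unique field so that
        # equally scored people always come back in the same order.
        "sort": [{"_score": "desc"}, {"netid": "asc"}],
        "from": page * page_size,  # assumes 0-indexed pages
        "size": page_size,
    }
```

This would not help if the scores themselves differ between requests (for example across shard copies), in which case scrolling or pinning the search preference may still be needed.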

Raise error when invalid filter passed

Separate web interface into a different repository

Persist query in URL parameters

One kind of nice (although very bugged) thing that the Yale Face Book does is that when you run a search, the search information is stored in the URL. That way, if you want to send someone the results of your search, you can just copy and paste the URL, which would be something like:

yalies.io/?query=Some+name&filters=...
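A sketch of the round trip, written in Python for consistency with the rest of the codebase even though the front end would do this in JavaScript. Encoding the filters as compact JSON is an assumption made for the example; the actual encoding is not specified above:

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base, query, filters):
    """Encode a search into URL parameters, omitting anything that is empty."""
    params = {}
    if query:
        params["query"] = query
    if filters:
        # Assumption: filters travel as compact JSON in a single parameter.
        params["filters"] = json.dumps(filters, separators=(",", ":"), sort_keys=True)
    return base + ("?" + urlencode(params) if params else "")


def parse_search_url(url):
    """Recover (query, filters) from a URL produced by build_search_url."""
    params = parse_qs(urlsplit(url).query)
    query = params.get("query", [""])[0]
    filters = json.loads(params["filters"][0]) if "filters" in params else {}
    return query, filters
```

Leaving out empty parameters keeps shared links short and means a bare URL still maps to the default, unfiltered view.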

Split office column into office_building and office_room

Make scraper asynchronous

If we multithread the process, we could probably finish a lot faster.
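As a rough illustration, a thread pool from the standard library would let network-bound requests overlap. `fetch_profile` is a stand-in for whatever request the scraper makes per person, and the worker count is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_profile(netid):
    """Stand-in for one network-bound request; the real scraper would hit the directory."""
    return {"netid": netid}


def scrape_all(netids, fetch=fetch_profile, max_workers=8):
    """Fetch every profile concurrently; results keep the order of `netids`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, netids))
```

Threads suit this kind of I/O-bound work, though the pool size should stay small enough to avoid hammering Yale's servers.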

Create automatic tool to extract keys for scraper, such as a Chrome extension

Use fetch API for token request

Support sorting in request to API

Use more secure filenames for S3 images

Right now, someone could theoretically iterate through every number from 1-100,000 and get all the user images. Rather than using Yale's naming scheme for the files, we should generate a securely random name for the image file based on some properties of the user that aren't likely to change. For example, we can append UPI, image ID, netid, etc. together and then hash that somehow and name the file thus.
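One possible approach, sketched with a keyed hash (HMAC-SHA256) rather than a plain hash, since a plain hash of guessable IDs could itself be enumerated. The secret key, the field order, and the `image_filename` helper are assumptions for illustration:

```python
import hashlib
import hmac


def image_filename(secret, upi, image_id, netid, ext="jpg"):
    """Derive a stable, hard-to-guess filename from properties unlikely to change.

    `secret` is a server-side key (bytes) that must never be published; without
    it, nobody can recompute filenames from known IDs.
    """
    message = "|".join(str(part) for part in (upi, image_id, netid)).encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{digest}.{ext}"
```

The same inputs always produce the same name, so re-scraping a person overwrites their existing image instead of creating a new file.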
