PMID to PMC API from Medline cannot convert all provided PMID (pubmed_parser, open issue)

titipata commented on August 18, 2024

The PMID to PMC API from MEDLINE cannot convert all of the provided PMIDs.

Comments (6)

chengkun-wu commented on August 18, 2024

@nick-hahner Yes! I used the ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz file for my local conversion.

titipata commented on August 18, 2024

I uploaded the PMID-PMC pairs (91 MB, not bad), which can be downloaded as follows:

wget https://s3-us-west-2.amazonaws.com/science-of-science-bucket/nih/pmid_pmc_pair.csv

With this file, you can convert PMID to PMC on your own. From here, we can modify the parse_citation_web function to take just a PMC ID, as below.

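import requests
from lxml import html

# extract_citations and extract_pmc are assumed to be the helper functions
# already defined next to parse_citation_web in pubmed_parser.

def parse_citation_web(pmc):
    """
    Parse citations for a given PMC ID.

    Parameters
    ----------
    pmc: str, PMC ID of the document, e.g. 'PMC1217341'

    Returns
    -------
    dict_out: dict, contains the following keys
        pmc: PubMed Central ID
        n_citations: number of citations for the given article
        pmc_cited: list of PMC IDs that cite the given PMC ID
    """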

    link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/" % str(pmc)
    page = requests.get(link)
    tree = html.fromstring(page.content)
    n_citations = extract_citations(tree)
    n_pages = int(n_citations/30) + 1

    pmc_cited_all = list() # all PMC cited
    citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
    pmc_cited = list(map(extract_pmc, citations))
    pmc_cited_all.extend(pmc_cited)
    if n_pages >= 2:
        for i in range(2, n_pages+1):
            link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/?page=%s" % (pmc, str(i))
            page = requests.get(link)
            tree = html.fromstring(page.content)
            citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
            pmc_cited = list(map(extract_pmc, citations))
            pmc_cited_all.extend(pmc_cited)
    pmc_cited_all = [p for p in pmc_cited_all if p != pmc]  # compare by value, not identity
    dict_out = {'n_citations': n_citations,
                'pmc': pmc,
                'pmc_cited': pmc_cited_all}
    return dict_out
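
A small note on the paging arithmetic in the function above: int(n_citations/30) + 1 requests one extra, empty page whenever the count is an exact multiple of 30 (30 citations gives 2 pages). Below is a minimal sketch of a ceiling-based page count, assuming 30 results per page as in the code above. The name n_result_pages is a hypothetical helper, not part of pubmed_parser.

```python
import math

def n_result_pages(n_citations, per_page=30):
    """Number of 'cited by' pages to fetch (hypothetical helper).

    Always returns at least 1 so the first page is still requested
    when there are no citations.
    """
    return max(1, math.ceil(n_citations / per_page))
```

With this helper, the loop over range(2, n_pages+1) would only visit pages that can actually contain results.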

titipata commented on August 18, 2024

Also, we want to add the Copyright Notice to the scraping function so that users don't scrape too much and get blocked: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC
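
Besides the notice, the scraping calls themselves could be spaced out. The following is only a sketch of one possible approach, not something pubmed_parser provides: the Throttle class and the one-second interval in the usage comment are assumptions for illustration, and the appropriate request rate should be taken from the NCBI policy linked above.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls (hypothetical helper)."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between calls
        self._last = None                 # time of the previous call

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Usage sketch (the interval is an assumed value, not an NCBI figure):
# throttle = Throttle(1.0)
# for pmc in pmc_list:
#     throttle.wait()
#     result = parse_citation_web(pmc)
```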

nick-hahner commented on August 18, 2024

What about ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv ?
Not every PMCID has a corresponding PMID though according to this list.
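
To see how many entries lack a PMID, the list can be split on that column. This is a rough sketch that assumes the CSV has a header row with columns named "Accession ID" (the PMCID) and "PMID"; check the header of the downloaded file before relying on these names. The function name split_by_pmid is made up for this example.

```python
import csv

def split_by_pmid(lines, pmcid_col="Accession ID", pmid_col="PMID"):
    """Return (with_pmid, without_pmid) lists of PMCIDs.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    The column names are assumptions about the file layout.
    """
    with_pmid, without_pmid = [], []
    for row in csv.DictReader(lines):
        pmid = (row.get(pmid_col) or "").strip()
        (with_pmid if pmid else without_pmid).append(row[pmcid_col])
    return with_pmid, without_pmid

# Usage sketch:
# with open("oa_file_list.csv", newline="") as fh:
#     have, missing = split_by_pmid(fh)
#     print(len(have), len(missing))
```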

titipata commented on August 18, 2024

@nick-hahner, nice! It contains ~1.8M rows of PMID/PMC pairs from the Open Access Subset. I'm still thinking about how to update the list regularly without hurting the repository. I mean, I could upload the PMC-PMID pairs from MEDLINE somewhere, as I mentioned. Do you have any preference or suggestions on how to make it available in the repository?

nick-hahner commented on August 18, 2024

Actually this file is probably better with 4,892,265 rows:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

First, rsync or wget -c -N ... the file to some directory like ~/.pp_data.
Then you can use an sqlite3 db:

# Create an indexed sqlite db
import pandas as pd
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///pmid_to_pmcid.db')  # but some better location
df = pd.read_csv('~/.pp_data/PMC-ids.csv.gz', dtype=str)
df[['PMCID', 'PMID']].to_sql('pmc_pmid', engine, index=False, if_exists='replace')
# Engine.execute() was removed in SQLAlchemy 2.0, so run the DDL on a connection
with engine.begin() as conn:
    conn.execute(text('create index pmc_idx on pmc_pmid(PMCID)'))
    conn.execute(text('create index pmid_idx on pmc_pmid(PMID)'))

# then later you can fetch like so:
from sqlalchemy import create_engine, text as sqa_text
def get_pmcid_from_pmid(pmid):
    engine = create_engine('sqlite:///pmid_to_pmcid.db')
    with engine.connect() as conn:
        # bind the :pmid parameter; IDs were stored as strings (dtype=str)
        ret = conn.execute(sqa_text('select pmcid from pmc_pmid where pmid = :pmid;'), {'pmid': str(pmid)}).fetchone()
    return ret[0] if ret else None

How's that sound?
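
For comparison, the same idea can be sketched with only the standard library, which avoids the pandas and SQLAlchemy dependencies. This is an illustrative variant rather than a proposal from the thread: build_db and lookup_pmcid are hypothetical names, and the only assumption about the file is that it has PMCID and PMID columns, as used in the snippet above.

```python
import csv
import gzip
import sqlite3

def build_db(csv_gz_path, db_path):
    """Load PMCID/PMID pairs from a gzipped CSV into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS pmc_pmid")
        conn.execute("CREATE TABLE pmc_pmid (PMCID TEXT, PMID TEXT)")
        with gzip.open(csv_gz_path, "rt", newline="") as fh:
            # skip rows that have no PMID, since they cannot be looked up
            rows = ((r["PMCID"], r["PMID"])
                    for r in csv.DictReader(fh) if r["PMID"])
            conn.executemany("INSERT INTO pmc_pmid VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX pmid_idx ON pmc_pmid(PMID)")
    return conn

def lookup_pmcid(conn, pmid):
    """Return the PMCID for a PMID, or None if there is no match."""
    row = conn.execute(
        "SELECT PMCID FROM pmc_pmid WHERE PMID = ?", (str(pmid),)
    ).fetchone()
    return row[0] if row else None
```

Passing the open connection to lookup_pmcid avoids reconnecting on every lookup, which matters when converting many IDs in a loop.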
