neo220: NCBI PubMed information retrieval via PHP

6.11.2008

NCBI PubMed information retrieval via PHP

A while back I had to use NCBI's Entrez system to collect and process publication and genome records. While the system is designed to support Perl and Java APIs with ease, very little if any documentation is available for a PHP API.

Below is the code that I've created to connect and collect data via Entrez Utilities. Generally, this bit takes a user's query, retrieves a set of matching PIDs (PubMed record IDs), and then uses Entrez Fetch to pull the actual records in XML format. The code is easily modified to suit any searching/retrieval needs. Utility and parameter documentation can be found here: Entrez Utilities. While it is written for Perl programmers, the parameter specifications are identical for PHP.

PHP5 code:

$query = ''; // your query term
$dnum = 100; // maximum number of documents to retrieve (here, 100)
$pids = ''; // raw ESearch response (holds the PubMed record IDs); starts as an empty string
$term = 360; // how far back to search, in days (360 is roughly one year)

// retrieve PIDs of all articles published within the past year that contain the query term
$esearch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=" . urlencode($query) . "&reldate=$term&datetype=edat&retmax=$dnum&usehistory=y";
$handle = fopen($esearch, "r");
$rettype = "abstract"; // retrieve the abstract of the record, rather than the full record
$retmode = "xml";
$utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
if (!$handle) { die(); }

// collect the returned PubMed PIDs
while (!feof($handle))
    $pids .= fgets($handle, 4096);
fclose($handle);

// get the query key from ESearch
preg_match("/<QueryKey>(\w+)<\/QueryKey>/i", $pids, $match);
$queryKey = $match[1];

// get the WebEnv value
preg_match("/<WebEnv>(\S+)<\/WebEnv>/i", $pids, $match);
$webEnv = $match[1];
$retstart = 0;

// fetch XML docs from PubMed for the returned PIDs
$efetch = "$utils/efetch.fcgi?rettype=$rettype&retmode=$retmode&retstart=$retstart&retmax=$dnum&db=pubmed&query_key=$queryKey&WebEnv=$webEnv&email=abc@xyz.com";
$pids = ''; // reused below to hold the fetched XML records
$handle = fopen($efetch, "r");
if (!$handle) { die(); }
while (!feof($handle))
    $pids .= fgets($handle, 4096);
fclose($handle);
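A query term with spaces or punctuation has to be URL-encoded before it goes into the ESearch address. Here is a minimal sketch of one way to do that with PHP's http_build_query(). The helper name build_esearch_url() and the sample term are made up for illustration; the parameters are the same ones used in the script above.

```php
<?php
// Hypothetical helper (not part of the original script): builds the ESearch URL
// and lets http_build_query() URL-encode every parameter value.
function build_esearch_url($term, $relDays = 360, $retMax = 100)
{
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi';
    $params = array(
        'db'         => 'pubmed',
        'term'       => $term,     // the user's query
        'reldate'    => $relDays,  // look back this many days
        'datetype'   => 'edat',
        'retmax'     => $retMax,   // maximum number of IDs to return
        'usehistory' => 'y',       // ask for QueryKey and WebEnv
    );
    // Pass '&' explicitly so the separator does not depend on php.ini settings.
    return $base . '?' . http_build_query($params, '', '&');
}

// Sample term is an assumption for the demo; spaces come out as '+'.
echo build_esearch_url('giant panda genome'), "\n";
```

The returned string can be passed to fopen() exactly like $esearch in the script.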

11 comments:

nico said...: Great! This is just what I was looking for!!! Thank you a lot, you saved me loooots of time :); 11/06/2008
skyMyrka said...: yay, great to know it helped out.

sorry about br tags in text box (:; 11/06/2008
Unknown said...: Hi. I have tried to format this for PHP but with no success. Do you have another reference for the script?

Thanks

Kurt; 2/08/2011
skyMyrka said...: this post is pretty old, but script still works great. eventually i'll modify the html to take out br tags, as old old old blogger editor for some reason entered those automatically, don't ask, i can't remember why or how.

anyway, this is script with br tags removed. if it doesn't work, you can google for something else

$query = ''; // your query term
$dnum = 100; // maximum number of documents to retrieve (here, 100)
$pids = ''; // raw ESearch response (holds the PubMed record IDs); starts as an empty string
$term = 360; // how far back to search, in days (360 is roughly one year)

// retrieve PIDs of all articles published within the past year that contain the query term
$esearch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=" . urlencode($query) . "&reldate=$term&datetype=edat&retmax=$dnum&usehistory=y";
$handle = fopen($esearch, "r");
$rettype = "abstract"; // retrieve the abstract of the record, rather than the full record
$retmode = "xml";
$utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
if (!$handle) { die(); }

// collect the returned PubMed PIDs
while (!feof($handle))
    $pids .= fgets($handle, 4096);
fclose($handle);

// get the query key from ESearch
preg_match("/<QueryKey>(\w+)<\/QueryKey>/i", $pids, $match);
$queryKey = $match[1];

// get the WebEnv value
preg_match("/<WebEnv>(\S+)<\/WebEnv>/i", $pids, $match);
$webEnv = $match[1];
$retstart = 0;

// fetch XML docs from PubMed for the returned PIDs
$efetch = "$utils/efetch.fcgi?rettype=$rettype&retmode=$retmode&retstart=$retstart&retmax=$dnum&db=pubmed&query_key=$queryKey&WebEnv=$webEnv&email=abc@xyz.com";
$pids = ''; // reused below to hold the fetched XML records
$handle = fopen($efetch, "r");
if (!$handle) { die(); }
while (!feof($handle))
    $pids .= fgets($handle, 4096);
fclose($handle);; 2/08/2011
Anonymous said...: Hi I tried using your comment and got the following error!

Warning: fopen(http://...@xyz.com) [function.fopen]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in C:\wamp\www\Zend\newEmptyPHP1.php on line 32

I am a newbie. Can you suggest what do I need to do?; 4/06/2011
skyMyrka said...: i ran a quick pull from the address on pubmed, and it's not a permissions issue. even a simple query will return all results (e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=panda)

i would double check your syntax. use the code i pasted in the previous comment, it's cleaned up from extra formatting in textbox from the original post.; 4/06/2011
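
One possible cause of a 400 on the second fopen, offered as a guess rather than a diagnosis: if the ESearch step came back without a QueryKey or WebEnv (for example because $query was left empty), the EFetch address ends up with blank parameters. A minimal sketch of checking for that before fetching is below. The helper name parse_esearch_history() and the sample response are made up for illustration; the sample only imitates the element names the script looks for.

```php
<?php
// Hypothetical helper: pulls QueryKey and WebEnv out of an ESearch XML response.
// Returns null when either one is missing, so the caller can stop before EFetch.
function parse_esearch_history($xml)
{
    $previous = libxml_use_internal_errors(true); // keep XML parse warnings quiet
    $doc = simplexml_load_string($xml);
    libxml_use_internal_errors($previous);

    if ($doc === false || !isset($doc->QueryKey, $doc->WebEnv)) {
        return null;
    }
    return array(
        'query_key' => (string) $doc->QueryKey,
        'webenv'    => (string) $doc->WebEnv,
    );
}

// Made-up sample response, only for the demo.
$sample = '<eSearchResult><Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>'
        . '<QueryKey>1</QueryKey><WebEnv>NCID_EXAMPLE_TOKEN</WebEnv>'
        . '<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>';

$history = parse_esearch_history($sample);
if ($history === null) {
    die("ESearch returned no QueryKey/WebEnv; check the query term\n");
}
echo $history['query_key'], ' ', $history['webenv'], "\n";
```

In the script, the string collected in $pids after the first loop would be passed in place of $sample.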
Anonymous said...: This is really nice to find. I'm something of a PHP newbie but I built my own Pubmed search from scratch. My issue with it is that it's horribly slow. Here's the basic code:

$dtd = 'pubmed_110101.dtd'; //sets the local DTD for verification
$searchterm = 'Search term here'; // placeholder: put your search term here
//Creates an XML file from the Pubmed query of at most 100,000 items with the keyword search specified by the user.
$searching = file_get_contents('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&reldate=7300&retmax=100000&retmode=xml&term='.$searchterm);
$pubmedsearchxml = simplexml_load_string($searching);

foreach($pubmedsearchxml->IdList->children() as $child)
{ //validates the XML against the DTD

$doc = new DomDocument;
$doc->Load('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id='.$child.'&retmode=xml');
$dtd = 'pubmed_110101.dtd';
if ($doc->validate()) $authorlistxml=simplexml_import_dom($doc);
}

You can see it's pretty basic - it creates an XML of the original search query, then sequentially goes through each pubmed ID from that query and retrieves the record from eutils using efetch, then validates it against the pubmed DTD (I've had some issues with pulling XML files that ended up being invalid and causing fatal errors as a result).

Here's the problem - it's amazingly slow, I've timed it and it works out to about one search every 4 seconds. Since some search terms can generate ten to twenty thousand hits you can imagine how long this takes to run.

I noticed that your search uses the WebEnv parameter. I've looked around the net for more info on using this (the eutils documentation is horrendous) as I thought it might be a way to speed up retrieval of search terms, but I can't quite figure out the code that you're using. Is that what it does? Do you know about how long it takes to pull out each record using that method? Anything to speed mine up would be greatly appreciated.

Thanks!; 4/09/2011
skyMyrka said...: see the code i posed
it contains comments that will tell you exactly what part of the script does what.

generally, it's a two step process.
first you must find matching PIDs (record IDs). Then, based on the set that you retrieved, you go back to pull the actual records. Operation is quick and response time really depends on how large the data set that you're retrieving is.

I suspect you're getting slow because you keep comparing each record against the file set.

It would make sense to only pull records that you need in initial query, instead of getting everything, and then seeing if it fits your search parameters via local resource.

G'luck.; 4/09/2011
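
For anyone wondering what the two-step approach looks like for a large result set, here is a minimal sketch, not a tested recipe. With usehistory=y the matching set stays on NCBI's side, so the records can be pulled in pages through retstart and retmax instead of one request per ID. The helper name efetch_batch_urls() and the sample values are made up for illustration; the total would come from the Count element of the ESearch response.

```php
<?php
// Hypothetical helper: lists the EFetch URLs needed to page through a stored
// result set, $batchSize records per request.
function efetch_batch_urls($queryKey, $webEnv, $total, $batchSize = 100)
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be at least 1');
    }
    $base = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    $urls = array();
    for ($start = 0; $start < $total; $start += $batchSize) {
        $urls[] = $base . '?' . http_build_query(array(
            'db'        => 'pubmed',
            'query_key' => $queryKey,
            'WebEnv'    => $webEnv,
            'retstart'  => $start,      // index of the first record in this page
            'retmax'    => $batchSize,  // number of records in this page
            'rettype'   => 'abstract',
            'retmode'   => 'xml',
        ), '', '&');
    }
    return $urls;
}

// Made-up values: 250 matching records fetched 100 at a time gives 3 requests.
foreach (efetch_batch_urls('1', 'NCID_EXAMPLE_TOKEN', 250, 100) as $url) {
    echo $url, "\n";
}
```

Each URL returns up to a full page of records in one response, so a 10,000-hit search needs about 100 requests at this batch size rather than 10,000.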
Jack said...: Great!!! Thanks "Deoris", you saved me a lot of time :); 8/03/2011
zen77 said...: I tried using the reposted code with the carriage returns. I am getting an error for fopen. Here is the error:

Warning: fopen(http://...@xyz.com) [function.fopen]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in [my path and URL] on line 70

Any idea on how I can fix this? Also, will this set of code print the results in HTML?; 9/16/2011
skyMyrka said...: 1. make sure that $efetch params are set correctly
$efetch = "$utils/efetch.fcgi?rettype=$rettype&retmode=$retmode&retstart=$retstart&retmax=$dnum&db=pubmed&query_key=$queryKey&WebEnv=$webEnv&email=abc@xyz.com";

2. script fetches records in xml format. if you want to print 'em out, you have to write your own parser as per your unique needs; 9/16/2011
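
As a starting point for such a parser, here is a minimal sketch using SimpleXML. The helper name parse_pubmed_articles() and the sample XML are made up for illustration; the sample only imitates the PubmedArticleSet layout, and a real record carries many more fields.

```php
<?php
// Hypothetical helper: extracts PMID, title and abstract from EFetch XML.
// Only the first AbstractText element is read, and inline markup inside a
// title is not handled; treat this as a starting point.
function parse_pubmed_articles($xml)
{
    $set = simplexml_load_string($xml);
    if ($set === false) {
        return array();
    }
    $records = array();
    foreach ($set->PubmedArticle as $article) {
        $citation = $article->MedlineCitation;
        $hasAbstract = isset($citation->Article->Abstract->AbstractText);
        $records[] = array(
            'pmid'     => (string) $citation->PMID,
            'title'    => (string) $citation->Article->ArticleTitle,
            'abstract' => $hasAbstract
                ? (string) $citation->Article->Abstract->AbstractText
                : '',
        );
    }
    return $records;
}

// Made-up sample, only for the demo.
$sampleXml = '<PubmedArticleSet>'
    . '<PubmedArticle><MedlineCitation><PMID>111</PMID><Article>'
    . '<ArticleTitle>First sample title</ArticleTitle>'
    . '<Abstract><AbstractText>Sample abstract text.</AbstractText></Abstract>'
    . '</Article></MedlineCitation></PubmedArticle>'
    . '<PubmedArticle><MedlineCitation><PMID>222</PMID><Article>'
    . '<ArticleTitle>Second sample title</ArticleTitle>'
    . '</Article></MedlineCitation></PubmedArticle>'
    . '</PubmedArticleSet>';

// Print as HTML, escaping the text so it is safe to place in a page.
foreach (parse_pubmed_articles($sampleXml) as $record) {
    echo '<h3>', htmlspecialchars($record['title']), ' (', htmlspecialchars($record['pmid']), ')</h3>', "\n";
    echo '<p>', htmlspecialchars($record['abstract']), '</p>', "\n";
}
```

In the script, the XML collected in $pids after the second loop would be passed in place of $sampleXml.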
