mondego / sourcerer
Tools, services and applications for source code analysis and search
Home Page: http://sourcerer.ics.uci.edu/
Which version of Java are people using to compile this? I am asking since I got 100+ errors when compiling the build using the directions on the website's tutorial.
Thanks!
Build every project with no jars included. For every project, list the imported FQNs that cannot be resolved. Aggregate these together into a prefix tree and cache to disk.
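A minimal sketch of the aggregation step, assuming the unresolved imports are already available as dotted FQN strings. The class name `MissingImportTrie` and its methods are hypothetical and not part of Sourcerer; the tab-separated line format is only one possible on-disk form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical prefix tree keyed on the package segments of unresolved imports. */
final class MissingImportTrie {
    private static final class Node {
        final Map<String, Node> children = new TreeMap<>();
        int count; // number of recorded imports that pass through this prefix
    }

    private final Node root = new Node();

    /** Records one unresolved import, e.g. "org.apache.commons.io.FileUtils". */
    void add(String fqn) {
        Node node = root;
        for (String segment : fqn.split("\\.")) {
            node = node.children.computeIfAbsent(segment, s -> new Node());
            node.count++;
        }
    }

    /** Returns how many recorded imports start with the given dotted prefix. */
    int countWithPrefix(String prefix) {
        Node node = root;
        for (String segment : prefix.split("\\.")) {
            node = node.children.get(segment);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    /** Flattens the tree to "prefix<TAB>count" lines that could be written to a cache file. */
    List<String> toLines() {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            out.add(path + "\t" + e.getValue().count);
            collect(e.getValue(), path, out);
        }
    }
}
```

Counting per prefix makes it easy to see which missing libraries (for example everything under `org.apache.commons`) account for the most unresolved imports across projects.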
Add the utilization metrics to SourcererDB
I've been cleaning up our ant buildfile a bit, and need to verify with Sushil that the changes won't impact his contributions.
There are currently no internal=1 relations in the db; you can test this in sourcerer_eclipse_redux.
There are no relations with internal=1 in project id = 228 (azureus). I am not sure if you know about this or if you tried to fix it.
mysql> select count(*) from relations where project_id=228;
+----------+
| count(*) |
+----------+
|  1291613 |
+----------+
1 row in set (6.33 sec)
mysql> select count(*) from relations where project_id=228 and internal is null;
+----------+
| count(*) |
+----------+
|   246623 |
+----------+
1 row in set (20.76 sec)
mysql> select count(*) from relations where project_id=228 and internal=0;
+----------+
| count(*) |
+----------+
|  1044990 |
+----------+
1 row in set (22.04 sec)
Add a downloader for SourceForge using the Notre Dame Dataset.
Add a way to populate the base repo with the java library code. Then move all access methods to the Repository rather than the ExtractedRepository.
A lot of people when embedding dependencies append a prefix to the entire package structure. We need to handle this.
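One possible way to handle this, sketched under the assumption that a set of known library FQNs (for example from the artifact db) is available in memory: drop leading package segments until the remainder matches a known FQN. `PrefixedPackageMatcher` and `stripEmbeddingPrefix` are hypothetical names, not existing Sourcerer code.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper that maps a re-prefixed FQN back to a known library FQN. */
final class PrefixedPackageMatcher {
    /**
     * Drops leading package segments one at a time until the remainder is a
     * known FQN. Returns empty if no suffix of the name is known.
     */
    static Optional<String> stripEmbeddingPrefix(String fqn, Set<String> knownFqns) {
        String candidate = fqn;
        while (true) {
            if (knownFqns.contains(candidate)) {
                return Optional.of(candidate);
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return Optional.empty();
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

Matching on whole FQNs rather than simple class names keeps false positives low, at the cost of missing embedded copies whose inner package structure was also changed.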
I wanted to add more to this effect of jars being on classpath.
The artifact db now contains only Eclipse entities, and I wonder whether that is actually better. I thought it would help the similarity model, since I would restrict the relations to the Eclipse entities. But if the projects come with their own libraries, the calls to those libraries will eventually end up in the relations table (without the need to look up the artifact db), and when I build the similarity model those non-Eclipse relations would be included. So by limiting the artifact db to Eclipse-only jars, we introduce the risk of failures due to libraries not being found, while still not fully avoiding the capture of non-Eclipse relations. I think the artifact db should have all the jar entities, so that failures due to unresolved entities remain minimal.
Also related to this was my request to introduce the following values for the column 'internal' in the relations table.
JDK (or JSL)
ECLIPSE
How would you know that a relation ends at an Eclipse entity? What if the relation ends at a non-JDK, non-Eclipse entity, for example an entity from some FOO.jar? Please let me know how you were thinking of determining the value for internal (JDK vs. Eclipse).
Perhaps the relations should actually be
JDK (or JSL)
LIB (or non JDK)
It occurred to me that there were a couple of projects using Eclipse parts that are not part of the standard Eclipse distribution (for example GEF, EMF, etc.). Is it possible to fetch these updates and then add them as candidate Eclipse jars in the new DB? I am not talking about external plugins, but standard Eclipse extensions such as those available here: http://www.eclipse.org/gef/downloads/?hlbuild=R201002241200
change urls from:
svn checkout__http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only
to
svn checkout_http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only
(_ representing a space)
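A small sketch of the normalization, assuming the stored checkout commands are plain strings and that collapsing every run of whitespace to a single space is acceptable. `CheckoutCommandNormalizer` is a hypothetical name.

```java
/** Hypothetical helper that collapses repeated whitespace in a stored checkout command. */
final class CheckoutCommandNormalizer {
    static String normalize(String command) {
        // Trim the ends, then replace every run of whitespace with one space.
        return command.trim().replaceAll("\\s+", " ");
    }
}
```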
In this new db (sourcerer_eclipse_redux), how do I select all the projects that were picked from the Eclipse installation? There is no placeholder 'eclipse' project as in the older db.
Debug this inconsistency:
Also, in sourcerer_eclipse_redux I do not see an 'Eclipse' project, so I cannot do such hash matching. And in sourcerer_eclipse there are 147 jars from Eclipse that have source, unlike 142 in sourcerer_eclipse_redux (calculated using the following query: select * from projects where name like 'org.eclipse%' and has_source=1;)
more info:
here is the list of jars that were used for FSE evaluation: http://github.com/sourcerer/Sourcerer/raw/master/research/api-location/eclipse_jars.txt
Querying the DB (sourcerer_t2) that was used at that time shows me there are 346/156 (total jars/jars with source).
I was comparing the numbers in the current sourcerer_eclipse and sourcerer_eclipse_redux with those numbers.
I am not sure why the numbers in the three databases are inconsistent.
In sourcerer_t2 the schema was quite different, but the jars had path 2/0.
In sourcerer_eclipse and sourcerer_eclipse_redux, the jars and files belonging to the eclipse project have null as path.
So I am wondering: how do you get a path for an Eclipse jar in the new DB?
However when I go to the folder in the repo that has the eclipse jars, I do see 345 jars. So it seems not all jars from Eclipse were imported.
Please check.
Sometimes there is no CVS module in SourceForge with the same name as the project; in such cases, download everything.
Example:
cvs -z3 -d:pserver:[email protected]:/cvsroot/jwaste co -P ./
Storing content in the Description field of the Hit is inconsistent. This field is persisted as contentDescription in the project.properties during repository folder creation.
This field should store other project metadata not stored in the existing fields in project.properties
downloader.log in each repository folder contains logs from downloads in other folders too. Either create a single downloader.log or make downloader.log contain download log info for the containing project only
../sf.2/42/579/content/
`-- svnco.folder.0
    `-- syscheck
        |-- .project
        |-- related-enabled
        |   `-- .svn
        |       |-- entries
        |       |-- format
        |       |-- lock
        |       |-- log
        |       |-- log.1
        |       |-- log.2
        |       `-- tmp
        |           `-- 923-rsync-to-remote-machine.sh.tmp.tmp -> ../related-available/923-rsync-to-remote-machine.sh
        |-- resources.sh
        `-- syscheck.sh
The tokenizer-muse contains various bug fixes (including a serious control flow bug), and improved logging and fault tolerance. These have to be applied to the generic tokenizer.
Looking at the various problems I have discovered, having libraries on the classpath seems to play a big role in them. I am wondering what happens when a project is extracted with some libraries found via the artifact db and the included jars, while others are missing.
For example, project A uses swt.jar and say some ftp.jar. What happens in the extraction of a class C in A that uses entities from both swt.jar and ftp.jar under the following scenarios:
CASE 1 - swt.jar, ftp.jar found in library distributed with A
CASE 2 - swt.jar (say a different version than that was used in A originally) found via artifact db, ftp.jar found as a library distributed with A
CASE 3 - Special case of CASE 2, where a file using swt and ftp, is using some entity that was not even found in swt.jar that was found via the artifact db (consider that swt.jar is a different version than A was using)
CASE 4 - swt.jar found through artifact db, ftp.jar is missing (not included as library, not found via artifact db)
Can you please verify/confirm/test the above scenarios to see how many entities/relations go missing or unresolved under these circumstances, and see whether things could be improved or are working as you had expected?
Can you also confirm how the extractor handles errors? When it encounters problems in class C in the above scenarios, does it stop extraction once a problem is found, skipping the other entities and relations? Or does it still extract all the entities and relations fully, just failing the resolution and producing 1UNKNOWN entities?
This should be simple, filter out mercurial links for now
Add a tool to go through the repository and delete all non-java non-jar files.
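A minimal sketch of such a tool using `java.nio.file`, assuming the repository is a plain directory tree. `RepoCleaner` and its methods are hypothetical names. Note that, as written, it would also remove repository metadata such as project.properties and downloader.log, so a real tool would probably need an explicit keep-list; the dry-run flag is there to inspect the result first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical cleaner that removes every regular file that is not .java or .jar. */
final class RepoCleaner {
    /** Keeps only Java sources and jars, matched case-insensitively by extension. */
    static boolean shouldKeep(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".java") || name.endsWith(".jar");
    }

    /** Lists the regular files under root that would be deleted. */
    static List<Path> findDeletable(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> !shouldKeep(p))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Deletes the files unless dryRun is set; returns how many files matched. */
    static int clean(Path root, boolean dryRun) throws IOException {
        List<Path> doomed = findDeletable(root);
        if (!dryRun) {
            for (Path p : doomed) {
                Files.delete(p);
            }
        }
        return doomed.size();
    }
}
```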
On Tue, Jun 15, 2010 at 7:35 PM, Joel Ossher [email protected] wrote:
ok, now these ones were a bit trickier, and I don't have a good answer
for them all yet. I'm going to have to copy the project to my local
machine so that I can debug the extraction of the project in order to
find out what's going on with the unknown JDK entities.
when you load the projects and debug, also load the project overture (project_id=229, db = sourcerer_eclipse_redux); and see why it has about 1000 relations to unresolved eclipse entities
for example:
mysql> select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.fqn like '(1UKNOWN)%' and r.project_id=229 ;
+---------+---------------+----------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| rhs_eid | relation_type | internal | rproj | fqn | entity_type |
+---------+---------------+----------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| 1836402 | CALLS | NULL | 2 | (1UKNOWN).AddAssociation((1UKNOWN)) | UNKNOWN |
| 1836424 | CALLS | NULL | 2 | (1UKNOWN).AddAssociationMp((1UKNOWN))
...
.. more ..
Another example of problematic resolution, taken from project_id 229.
In this case it seems jdt core was not resolved, but some swt classes were
-- 11382 calls and instantiates with entity_type of used entity <> UNKNOWN
select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.entity_type <> 'UNKNOWN' and relation_type in ('CALLS','INSTANTIATES') and r.project_id=229 ;
-- 1070 calls and instantiates with entity_type of used entity = UNKNOWN
select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.entity_type='UNKNOWN' and relation_type in ('CALLS','INSTANTIATES') and r.project_id=229 ;
public void setURL(final String url) {
    urlToUse = url;
    if (browser == null) {
        return;
    }
    Utils.execSWTThread(new AERunnable() {
        public void runSupport() {
            if (url == null) {
                browser.setText("");
            } else {
                String urlToUse = url;
                if (UrlFilter.getInstance().urlCanRPC(url)){
                    urlToUse = context.getContentNetwork().appendURLSuffix(urlToUse,
                            false, true);
                }
                if (browser != null) {
                    browser.setUrl(urlToUse);
                    if(browser.isVisible()) {
                        browser.setFocus();
                    }
                }
            }
            if (sStartURL == null) {
                sStartURL = url;
                if (browser != null) {
                    browser.setData("StartURL", url);
                }
            }
            //System.out.println(SystemTime.getCurrentTime() + "] Set URL: " + url);
        }
    });
}
the above source comes from the file id 45917 (sourcerer_eclipse_redux):
mysql> select * from files where file_id=45917;
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+
| file_id | file_type | name | path | hash | project_id |
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+
| 45917 | SOURCE | SWTSkinObjectBrowser.java | /package.2/com/aelitis/azureus/ui/swt/skin/SWTSkinObjectBrowser.java | NULL | 228 |
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+
Find and fix why there are still relations with targets to (1UKNOWN).xxx entities.
| (1UKNOWN).appendURLSuffix(java.lang.String,boolean,boolean) | UNKNOWN | CALLS | NULL |
| (1UKNOWN).getContentNetwork() | UNKNOWN | CALLS | NULL |
looking at source it seems these are unresolved because of nested calls for example:
urlToUse = context.getContentNetwork().appendURLSuffix(urlToUse,
false, true);
So, are the FQNs unresolved because of this? Can you check why the extractor (or was it Eclipse) failed to properly resolve (1UKNOWN).getContentNetwork() as com.aelitis.azureus.ui.swt.browser.BrowserContext.getContentNetwork(), which seems to be a local entity?
Also, for the above code, one of the methods is incorrectly linked.
runSupport() (entity_id 1699135) inside the AERunnable should have those relations I was talking about earlier. From the db I see that runSupport() calls org.eclipse.swt.browser.Browser.setText(java.lang.String) and org.eclipse.swt.browser.Browser.setUrl(java.lang.String) , but instead of calling org.eclipse.swt.browser.Browser.setData(java.lang.String,java.lang.Object); it calls org.eclipse.swt.widgets.Widget.setData(java.lang.String,java.lang.Object)
perhaps this is due to inheritance? But see if that was an error.
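If inheritance is the cause, the stored target would be the type that declares the method rather than the receiver's static type. The sketch below illustrates that general Java behaviour with JDK classes only, using reflection as a stand-in for a resolved binding. Whether `Browser` really inherits `setData(String, Object)` from `Widget` without overriding it is an assumption that should be checked against the SWT version in use. `DeclaringTypeDemo` is a hypothetical name.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;

/** Hypothetical demo: a method looked up on a subclass reports the class that declares it. */
final class DeclaringTypeDemo {
    /** Returns the name of the class that declares the public method found on receiver. */
    static String declaringTypeOf(Class<?> receiver, String method, Class<?>... params) {
        try {
            Method m = receiver.getMethod(method, params);
            return m.getDeclaringClass().getName();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // ArrayList inherits toString() from AbstractCollection, but declares size() itself.
        System.out.println(declaringTypeOf(ArrayList.class, "toString"));
        System.out.println(declaringTypeOf(ArrayList.class, "size"));
    }
}
```

By analogy, a call written as `browser.setData(...)` could legitimately be recorded against the declaring supertype, in which case the relation in the db would not be an error.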
Please refer to email for more background on this.
Look into odd results for parametrized types for java.lang.Object%
JDK (meaning using JDK entities externally)
LIB (meaning using non-JDK entities externally)
INTERNAL (internal == 1 previously, relation terminates to a local entity)
-- and introduce proper enums for other conditions EG: --
MISSING_TYPE (or whatever error condition caused it to be null previously)
ANYTHING_ELSE (anything else you need)
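The proposed values could be sketched as a Java enum like the one below. The `classify` method and its three boolean inputs are hypothetical; how the extractor would actually determine them is exactly the open question in this issue.

```java
/** Hypothetical replacement for the nullable 'internal' flag on a relation. */
enum RelationClass {
    JDK,          // target is a JDK entity outside the project
    LIB,          // target is a non-JDK entity outside the project
    INTERNAL,     // target is an entity local to the project (previously internal = 1)
    MISSING_TYPE; // target could not be resolved (previously internal = NULL)

    /** Classifies a relation from three facts assumed to be known about its target. */
    static RelationClass classify(boolean targetResolved,
                                  boolean targetInSameProject,
                                  boolean targetInJdk) {
        if (!targetResolved) {
            return MISSING_TYPE;
        }
        if (targetInSameProject) {
            return INTERNAL;
        }
        return targetInJdk ? JDK : LIB;
    }
}
```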