mondego / sourcerer

Tools, services and applications for source code analysis and search

Home Page: http://sourcerer.ics.uci.edu/

Java 41.72% Shell 0.17% Python 0.70% HTML 53.70% CSS 0.69% XSLT 2.62% JavaScript 0.40%

sourcerer's Issues

Java version

Which version of Java are people using to compile this? I am asking since I got 100+ errors when compiling the build using the directions on the website's tutorial.

Thanks!

Track external import statements within the repository

Build every project with no jars included. For every project, list the imported FQNs that cannot be resolved. Aggregate these together into a prefix tree and cache to disk.
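A minimal sketch of the aggregation step, assuming the unresolved imports arrive as plain dot-separated FQN strings. The class name UnresolvedImportTrie, its methods, and the tab-separated cache format are all hypothetical and not taken from the Sourcerer code base.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical prefix tree keyed on the dot-separated segments of an FQN. */
final class UnresolvedImportTrie {
    private final Map<String, UnresolvedImportTrie> children = new TreeMap<>();
    private int count; // how often this exact FQN was reported as unresolved

    /** Walks (and creates) the path of nodes for the given FQN. */
    private UnresolvedImportTrie nodeFor(String fqn) {
        UnresolvedImportTrie node = this;
        for (String segment : fqn.split("\\.")) {
            node = node.children.computeIfAbsent(segment, k -> new UnresolvedImportTrie());
        }
        return node;
    }

    /** Records one unresolved import. */
    void add(String fqn) {
        nodeFor(fqn).count++;
    }

    /** Number of unresolved imports recorded at or below the given prefix. */
    int countUnder(String prefix) {
        UnresolvedImportTrie node = this;
        for (String segment : prefix.split("\\.")) {
            node = node.children.get(segment);
            if (node == null) return 0;
        }
        return node.subtreeCount();
    }

    private int subtreeCount() {
        int total = count;
        for (UnresolvedImportTrie child : children.values()) total += child.subtreeCount();
        return total;
    }

    /** Flattens the tree to "fqn<TAB>count" lines, depth first, segments sorted. */
    List<String> toLines() {
        List<String> out = new ArrayList<>();
        collect("", out);
        return out;
    }

    private void collect(String path, List<String> out) {
        if (count > 0) out.add(path + "\t" + count);
        for (Map.Entry<String, UnresolvedImportTrie> e : children.entrySet()) {
            String childPath = path.isEmpty() ? e.getKey() : path + "." + e.getKey();
            e.getValue().collect(childPath, out);
        }
    }

    /** Caches the aggregated tree to disk as plain text (assumed format). */
    void save(Path file) throws IOException {
        Files.write(file, toLines(), StandardCharsets.UTF_8);
    }

    /** Rebuilds a tree from a file written by save(). */
    static UnresolvedImportTrie load(Path file) throws IOException {
        UnresolvedImportTrie root = new UnresolvedImportTrie();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t");
            root.nodeFor(parts[0]).count += Integer.parseInt(parts[1]);
        }
        return root;
    }

    public static void main(String[] args) {
        UnresolvedImportTrie trie = new UnresolvedImportTrie();
        trie.add("org.apache.commons.io.IOUtils");
        trie.add("org.apache.log4j.Logger");
        System.out.println(trie.countUnder("org.apache"));
    }
}
```

Querying by prefix (for example countUnder("org.apache")) gives a quick view of which external libraries are missing most often across projects.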

Utilization Metrics

Add the utilization metrics to SourcererDB

Altering build scripts

I've been cleaning up our ant buildfile a bit, and need to verify with Sushil that the changes won't impact his contributions.

check if internal relations are indeed tagged as internals

Based on the fact that there are currently no internal=1 relations in the db; you can test this in sourcerer_eclipse_redux.

There are no relations with internal=1 in project id = 228 (azureus); not sure if you know about this or if you tried to fix it.

mysql> select count(*) from relations where project_id=228;
+----------+
| count(*) |
+----------+
|  1291613 |
+----------+
1 row in set (6.33 sec)

mysql> select count(*) from relations where project_id=228 and internal is null;
+----------+
| count(*) |
+----------+
|   246623 |
+----------+
1 row in set (20.76 sec)

mysql> select count(*) from relations where project_id=228 and internal=0;
+----------+
| count(*) |
+----------+
|  1044990 |
+----------+
1 row in set (22.04 sec)

The NULL and 0 counts add up to the total (246623 + 1044990 = 1291613), so no row in this project has internal=1.

Download SourceForge

Add a downloader for SourceForge using the Notre Dame Dataset.

Add libraries to base repo, rather than extracted repo

Add a way to populate the base repo with the java library code. Then move all access methods to the Repository rather than the ExtractedRepository.

Deal with package renames

A lot of people when embedding dependencies append a prefix to the entire package structure. We need to handle this.
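One possible way to detect such renames, sketched under the assumption that a set of known library FQNs is available (for instance from the artifact db). The class PackageRenameResolver and its method are hypothetical names. The idea is to drop leading package segments until the remainder matches a known FQN.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper that maps a prefixed (renamed) FQN back to a known one. */
final class PackageRenameResolver {
    private final Set<String> knownFqns;

    PackageRenameResolver(Set<String> knownFqns) {
        this.knownFqns = knownFqns;
    }

    /**
     * Tries the full name first, then strips one leading segment at a time,
     * so the longest matching suffix wins. Returns empty if nothing matches.
     */
    Optional<String> originalName(String fqn) {
        String candidate = fqn;
        while (true) {
            if (knownFqns.contains(candidate)) return Optional.of(candidate);
            int dot = candidate.indexOf('.');
            if (dot < 0) return Optional.empty();
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("org.apache.commons.io.IOUtils"));
        PackageRenameResolver resolver = new PackageRenameResolver(known);
        System.out.println(resolver.originalName("com.example.shaded.org.apache.commons.io.IOUtils"));
    }
}
```

This only covers renames that add a prefix. Very short suffixes can match by accident, so a real implementation would probably require a minimum number of matching segments.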

include all entities from jars in the artifact db

I wanted to add more to this effect of jars being on classpath.

The artifact db now contains only eclipse entities. I wonder if that is actually better. I thought it would be better for the similarity model, since I would restrict the relations to the eclipse entities. But if the projects come with their own libraries, the calls to those libraries will eventually go into the relations table (without the need to look up the artifact db). And when I build the similarity model, those (non-eclipse) relations would be included. Therefore it seems that by limiting the artifact db to eclipse-only jars, we introduce the risk of failures due to libraries not being found, while still not fully avoiding the capture of non-eclipse relations. I think the artifact db should have all the jar entities, so that failures due to unresolved entities remain minimal.

Also related to this was my request to introduce the following values for the column 'internal' in the relations table:

JDK (or JSL)
ECLIPSE

How would you know that a relation ends at an eclipse entity? What if the relation ends at a non-JDK, non-eclipse entity, for example an entity from some FOO.jar? Please let me know how you were thinking of determining the value for internal (JDK vs. Eclipse).

Perhaps the relations should actually be

JDK (or JSL)
LIB (or non JDK)

include extended eclipse plugins as eclipse candidate projects and to the artifact db

It occurred to me that there were a couple of projects using eclipse parts that are not part of the standard eclipse distribution (for example GEF, EMF, etc.). Is it possible to fetch these updates and then add them as candidate eclipse jars in the new DB? I am not talking about external plugins, but standard eclipse extensions such as those available here: http://www.eclipse.org/gef/downloads/?hlbuild=R201002241200

svn url in google code project properties should only have one space after each part

change urls from:
svn checkout__http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only
to
svn checkout_http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only

(_ representing a space)
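A small sketch of the normalization, assuming the property value is handled as a single string. SvnUrlNormalizer is a hypothetical name, not an existing class in the crawler.

```java
/** Hypothetical helper that collapses runs of whitespace in a checkout string. */
final class SvnUrlNormalizer {
    /** Trims the value and replaces every run of whitespace with one space. */
    static String normalize(String checkoutCommand) {
        return checkoutCommand.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize(
                "svn checkout  http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only"));
    }
}
```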

verify the number of eclipse projects, ensure proper mechanism in the db to select eclipse projects

In this new db (sourcerer_eclipse_redux), how do I select all the projects that were picked from the eclipse installation? There is no placeholder 'eclipse' project as in the older db.

Debug this inconsistency:

there are 379 jar files under project 20671 in sourcerer_eclipse
there are 197 projects that are of 'jar' type whose hash match the files belonging to project 20671
so there are clearly more projects in the sourcerer_eclipse than in sourcerer_eclipse_redux

Also, in sourcerer_eclipse_redux I do not see an 'Eclipse' project, so I cannot do such hash matching. And in sourcerer_eclipse there are 147 jars from Eclipse that have source, unlike 142 in sourcerer_eclipse_redux (calculated using the following query: select * from projects where name like 'org.eclipse%' and has_source=1;)

more info:

here is the list of jars that were used for FSE evaluation: http://github.com/sourcerer/Sourcerer/raw/master/research/api-location/eclipse_jars.txt
querying the DB (sourcerer_t2) that was used at that time shows me there are 346/156 (total jars/jars with source)

I was comparing the numbers in the current sourcerer_eclipse and sourcerer_eclipse_redux with those numbers.

I am not sure why the numbers in the three databases are inconsistent.

In sourcerer_t2 the schema was quite different, but the jars had path 2/0
In sourcerer_eclipse and sourcerer_eclipse_redux, the jars and files belonging to the eclipse project have null as path.
So I am wondering how you get a path for an eclipse jar in the new DB?

However, when I go to the folder in the repo that has the eclipse jars, I do see 345 jars. So it seems not all jars from Eclipse were imported.
Please check.

fix CVS source retriever to work when module name does not equal project name

Sometimes SourceForge has no CVS module with the same name as the project; in that case, download everything.

Example:

cvs -z3 -d:pserver:[email protected]:/cvsroot/jwaste co -P ./

fix project metadata content in hit.Description in codecrawler

Storing content in the Description field of the Hit is inconsistent. This field is persisted as contentDescription in the project.properties during repository folder creation.

This field should store other project metadata not stored in the existing fields in project.properties

Fix logging of download - downloader.log

downloader.log in each repository folder contains logs from downloads in other folders too. Either create a single downloader.log or make downloader.log contain download log info for the containing project only
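A sketch of the second option (one downloader.log per project), written with java.util.logging rather than log4j purely for illustration. ProjectDownloadLog and its open method are hypothetical; the point is that each folder gets its own logger and handler, and records are not forwarded to shared parent handlers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

/** Hypothetical factory for a logger scoped to one repository folder. */
final class ProjectDownloadLog {
    /**
     * Returns a logger that writes only to downloader.log inside projectDir.
     * The caller should keep the returned reference while logging, call this
     * once per folder, and close the handlers when the download finishes.
     */
    static Logger open(Path projectDir) throws IOException {
        // A uniquely named logger per folder, so handlers are never shared.
        Logger logger = Logger.getLogger("downloader." + projectDir.toAbsolutePath());
        // Do not pass records on to the root logger or any shared handler.
        logger.setUseParentHandlers(false);
        FileHandler handler =
                new FileHandler(projectDir.resolve("downloader.log").toString(), true);
        handler.setFormatter(new SimpleFormatter());
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("project");
        Logger log = open(dir);
        log.info("download started");
        for (Handler h : log.getHandlers()) h.close();
    }
}
```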

Quality Metrics

Bytecode Metrics
- Cyclomatic number (done)
- Number of statements (done)
- Number of instructions (done)
- Vocabulary size (done)
Computed bytecode metrics
- Average size of statements (number of instructions / number of statements) (done)
Source only metrics
- Lines of code variants (done)
- Number of unconditional jumps (done)
- Number of nested levels (done)
Computed source code metrics
- Comments frequency (done)
- Class comments frequency (done)
DB Metrics
- Number of base classes (done)
- Number of base interfaces (done)
- Number of derived classes (done)
- Number of derived interfaces (done)
- Ratio of derived classes to base classes (done)
- Ratio of derived classes to base interfaces (done)
- Weighted Methods per Class (sum of cyclomatic complexities of every method) (done)
- Average cyclomatic complexity per method (done)
- Vocabulary frequency (number of instructions / vocabulary size) (done)
- Afferent Coupling (done)
- Efferent Coupling (done)
- Lack of cohesion (LCOM) (done)
- Depth of inheritance tree (DIT) (done)
- Number of class children (NOC) (done)
- Number of interface children (NOC) (done)
- Number of interface parents (NOC) (done)
- Response for a class (RFC)
Skipped
- Number of entry nodes
- Number of exit nodes
- Number of exits of conditional structs
- Null dereferences
- Undefined values
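The computed metrics in the list are simple ratios and sums over the base metrics. A minimal sketch follows, with hypothetical names (ComputedMetrics and its methods) and one assumption that is not in the list: a ratio with a zero denominator is reported as 0.0.

```java
/** Hypothetical helpers for the computed metrics named in the list above. */
final class ComputedMetrics {
    /** Average size of statements: number of instructions / number of statements. */
    static double averageStatementSize(int instructions, int statements) {
        return ratio(instructions, statements);
    }

    /** Vocabulary frequency: number of instructions / vocabulary size. */
    static double vocabularyFrequency(int instructions, int vocabularySize) {
        return ratio(instructions, vocabularySize);
    }

    /** Weighted Methods per Class: sum of the cyclomatic complexities of every method. */
    static int weightedMethodsPerClass(int[] methodComplexities) {
        int sum = 0;
        for (int c : methodComplexities) sum += c;
        return sum;
    }

    /** Average cyclomatic complexity per method. */
    static double averageCyclomaticComplexity(int[] methodComplexities) {
        return ratio(weightedMethodsPerClass(methodComplexities), methodComplexities.length);
    }

    /** Ratio of derived classes to base classes (or to base interfaces). */
    static double derivedToBaseRatio(int derived, int base) {
        return ratio(derived, base);
    }

    // Assumption: an undefined ratio is reported as 0.0 instead of NaN.
    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        int[] complexities = {1, 3, 5};
        System.out.println(weightedMethodsPerClass(complexities));
        System.out.println(averageCyclomaticComplexity(complexities));
    }
}
```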

Migrate from Integer -> Long for id columns

svn checkout fails on repositories with symlinks

tokenizer-muse features -> tokenizer

The tokenizer-muse contains various bug fixes (including a serious control flow bug), and improved logging and fault tolerance. These have to be applied to the generic tokenizer.

ant task to create solr server configurations

verify effect of jars being present/absent on classpaths

Looking at the various problems I have discovered, having libraries on the classpath seems to play a big role in them. I am wondering what happens when a project is extracted with some libraries found via the artifact db and the included jars, while others are missing.

For example, project A uses swt.jar and say some ftp.jar. What happens in the extraction of a class C in A that uses entities from both swt.jar and ftp.jar under the following scenarios:

CASE 1 - swt.jar, ftp.jar found in library distributed with A

CASE 2 - swt.jar (say a different version than that was used in A originally) found via artifact db, ftp.jar found as a library distributed with A

CASE 3 - Special case of CASE 2, where a file using swt and ftp, is using some entity that was not even found in swt.jar that was found via the artifact db (consider that swt.jar is a different version than A was using)

CASE 4 - swt.jar found through artifact db, ftp.jar is missing (not included as library, not found via artifact db)

Can you please verify/confirm/test the above scenarios to see how many entities/relations go missing or unresolved under these circumstances, and see if things could be improved or are working as you had expected?

Can you also confirm how the extractor handles errors? When it encounters problems in class C in the above scenarios, does it stop extraction once a problem is found, skipping other entities and relations? Or does it still extract all the entities and relations fully, just failing the resolution, resulting in 1UNKNOWN entities?

Implementation of crawler entry filter for Google Code

This should be simple, filter out mercurial links for now
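A sketch of such a filter, assuming each crawled entry is a checkout string and that Mercurial entries either start with an hg command or point at a googlecode.com/hg path. Both patterns and the class name GoogleCodeEntryFilter are assumptions for illustration, not taken from the crawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical entry filter that drops Mercurial checkout strings. */
final class GoogleCodeEntryFilter {
    /** Assumed patterns: an "hg ..." command or a ".googlecode.com/hg" URL. */
    static boolean isMercurial(String checkoutCommand) {
        String s = checkoutCommand.trim().toLowerCase(Locale.ROOT);
        return s.startsWith("hg ") || s.contains(".googlecode.com/hg");
    }

    /** Returns the entries that are not Mercurial links, in their original order. */
    static List<String> withoutMercurial(List<String> entries) {
        List<String> kept = new ArrayList<>();
        for (String entry : entries) {
            if (!isMercurial(entry)) kept.add(entry);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(withoutMercurial(Arrays.asList(
                "svn checkout http://nethelp.googlecode.com/svn/trunk/ nethelp-read-only",
                "hg clone https://example.googlecode.com/hg/ example")));
    }
}
```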

Reduce repository size

Add a tool to go through the repository and delete all non-java non-jar files.
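A minimal sketch of such a tool using java.nio.file. RepositoryPruner is a hypothetical name. Note that, taken literally, this also deletes metadata files such as project.properties and downloader.log, so a real tool would likely need an allow-list for those.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical tool that deletes every regular file that is not .java or .jar. */
final class RepositoryPruner {
    /** True for files the tool should keep (case-insensitive extension check). */
    static boolean isKept(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".java") || name.endsWith(".jar");
    }

    /** Deletes all other regular files under root and returns how many were removed. */
    static int prune(Path root) throws IOException {
        List<Path> doomed;
        // Collect first, then delete, so the directory walk is not disturbed.
        try (Stream<Path> paths = Files.walk(root)) {
            doomed = paths.filter(Files::isRegularFile)
                          .filter(p -> !isKept(p))
                          .collect(Collectors.toList());
        }
        for (Path p : doomed) Files.delete(p);
        return doomed.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(prune(Paths.get(args[0])) + " files deleted");
        }
    }
}
```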

Refactor codecrawler to use java logging instead of log4j

debug extraction locally to find the reason for unresolved entities

--- previous message ---

On Tue, Jun 15, 2010 at 7:35 PM, Joel Ossher [email protected] wrote:

ok, now these ones were a bit trickier, and I don't have a good answer
for them all yet. I'm going to have to copy the project to my local
machine so that I can debug the extraction of the project in order to
find out what's going on with the unknown JDK entities.

---- Issue Request ----

when you load the projects and debug, also load the project overture (project_id=229, db = sourcerer_eclipse_redux); and see why it has about 1000 relations to unresolved eclipse entities

for example:

mysql> select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.fqn like '(1UKNOWN)%' and r.project_id=229 ;
+---------+---------------+----------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| rhs_eid | relation_type | internal | rproj | fqn | entity_type |
+---------+---------------+----------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| 1836402 | CALLS | NULL | 2 | (1UKNOWN).AddAssociation((1UKNOWN)) | UNKNOWN |
| 1836424 | CALLS | NULL | 2 | (1UKNOWN).AddAssociationMp((1UKNOWN))
...
.. more ..

Another project that needs to be debugged, as relations to JDT were not properly resolved

Another example of problematic resolution, taken from project_id 229.

In this case it seems jdt core was not resolved, but some swt classes were

-- 11382 calls and instantiates with entity_type of used entity <> UNKNOWN
select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.entity_type <> 'UNKNOWN' and relation_type in ('CALLS','INSTANTIATES') and r.project_id=229 ;

-- 1070 calls and instantiates with entity_type of used entity = UNKNOWN
select distinct r.rhs_eid, r.relation_type, r.internal, e.project_id as rproj, e.fqn, e.entity_type from relations as r inner join entities as e on r.rhs_eid=e.entity_id inner join entities as e2 on r.lhs_eid=e2.entity_id where e2.entity_type in ('CLASS', 'METHOD', 'CONSTRUCTOR') and e.entity_type='UNKNOWN' and relation_type in ('CALLS','INSTANTIATES') and r.project_id=229 ;

target eclipse entity not resolved, incorrectly linked


public void setURL(final String url) {
    urlToUse = url;
    if (browser == null) {
        return;
    }
    Utils.execSWTThread(new AERunnable() {
        public void runSupport() {
            if (url == null) {
                browser.setText("");
            } else {
                String urlToUse = url;
                if (UrlFilter.getInstance().urlCanRPC(url)){
                    urlToUse = context.getContentNetwork().appendURLSuffix(urlToUse,
                            false, true);
                }
                if (browser != null) {
                    browser.setUrl(urlToUse);
                    if(browser.isVisible()) {
                        browser.setFocus();

                    }
                }
            }
            if (sStartURL == null) {
                sStartURL = url;
                if (browser != null) {

                    browser.setData("StartURL", url);
                }
            }
            //System.out.println(SystemTime.getCurrentTime() + "] Set URL: " + url);
        }
    });
}


the above source comes from file id 45917 (sourcerer_eclipse_redux):

mysql> select * from files where file_id=45917;
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+
| file_id | file_type | name                      | path                                                                 | hash | project_id |
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+
|   45917 | SOURCE    | SWTSkinObjectBrowser.java | /package.2/com/aelitis/azureus/ui/swt/skin/SWTSkinObjectBrowser.java | NULL |        228 |
+---------+-----------+---------------------------+----------------------------------------------------------------------+------+------------+

Find and fix why there are still relations whose targets are (1UKNOWN).xxx entities:

| (1UKNOWN).appendURLSuffix(java.lang.String,boolean,boolean) | UNKNOWN | CALLS | NULL |
| (1UKNOWN).getContentNetwork()                               | UNKNOWN | CALLS | NULL |

Looking at the source, it seems these are unresolved because of nested calls, for example:

urlToUse = context.getContentNetwork().appendURLSuffix(urlToUse, false, true);

So, are the FQNs unresolved because of this? Can you check why the extractor (or was it eclipse) failed to properly resolve (1UKNOWN).getContentNetwork() as com.aelitis.azureus.ui.swt.browser.BrowserContext.getContentNetwork(), which seems to be a local entity?

Also, for the above code, one of the methods is incorrectly linked.
runSupport() (entity_id 1699135) inside the AERunnable should have the relations I was talking about earlier. From the db I see that runSupport() calls org.eclipse.swt.browser.Browser.setText(java.lang.String) and org.eclipse.swt.browser.Browser.setUrl(java.lang.String), but instead of calling org.eclipse.swt.browser.Browser.setData(java.lang.String,java.lang.Object) it calls org.eclipse.swt.widgets.Widget.setData(java.lang.String,java.lang.Object).
Perhaps this is due to inheritance? But see if that was an error.
Please refer to email for more background on this.
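On the inheritance question, a small self-contained illustration of how an inherited method is attributed to the class that declares it. The nested Widget and Browser classes below are hypothetical stand-ins, not the SWT classes; whether the extractor should record the receiver type or the declaring type is the open question here.

```java
import java.lang.reflect.Method;

/** Hypothetical demo: an inherited method reports its declaring class. */
final class DeclaringClassDemo {
    // Stand-ins for a base class and a subclass that does not override setData.
    static class Widget {
        public void setData(String key, Object value) { }
    }

    static class Browser extends Widget {
        public void setUrl(String url) { }
    }

    /** Looks up a public method on the receiver type and names the class that declares it. */
    static String declaringClassOf(Class<?> receiver, String name, Class<?>... params) {
        try {
            Method m = receiver.getMethod(name, params);
            return m.getDeclaringClass().getSimpleName();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(declaringClassOf(Browser.class, "setData", String.class, Object.class));
        System.out.println(declaringClassOf(Browser.class, "setUrl", String.class));
    }
}
```

If Browser does not override setData, a resolver that records the declaring type would report Widget.setData for a call made on a Browser reference, which would match what the db shows.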

Parametrized Types

Look into odd results for parametrized types for java.lang.Object%

add enumeration to 'internal' column in the relations table

JDK (meaning using JDK entities externally)
LIB (meaning using non JDK entities externally)
INTERNAL (internal == 1 previously, relation terminates to a local entity)

-- and introduce proper enums for other conditions EG: --

MISSING_TYPE (or whatever error condition caused it to be null previously)
ANYTHING_ELSE (anything else you need)
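A sketch of what the enumeration could look like on the Java side. The constant names come from the list above; the classify method, its parameters, and the rule that treats java.* and javax.* targets as JDK are assumptions for illustration only.

```java
/** Hypothetical enum for the proposed values of the 'internal' column. */
enum RelationClass {
    JDK,          // relation ends at a JDK entity
    LIB,          // relation ends at a non-JDK external entity
    INTERNAL,     // relation ends at a local entity (internal == 1 previously)
    MISSING_TYPE; // target could not be resolved (NULL previously)

    /**
     * Assumed classification rule: unresolved targets first, then local ones,
     * then a package-prefix check to separate JDK from other libraries.
     */
    static RelationClass classify(boolean targetResolved, boolean targetIsLocal, String targetFqn) {
        if (!targetResolved) return MISSING_TYPE;
        if (targetIsLocal) return INTERNAL;
        if (targetFqn.startsWith("java.") || targetFqn.startsWith("javax.")) return JDK;
        return LIB;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false, "java.lang.String"));
        System.out.println(classify(true, false, "org.eclipse.swt.browser.Browser"));
    }
}
```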

make svn retriever work on URLs ending with a trunk that does not exist on the server (google code)

Example: http://turanar.googlecode.com/svn/trunk
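One possible fallback, sketched as a list of URLs to try in order: the original URL first, then the same URL with a trailing trunk segment removed. SvnTrunkFallback is a hypothetical name, and falling back to the parent path is an assumed strategy rather than the retriever's current behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that proposes checkout URLs to try in order. */
final class SvnTrunkFallback {
    /** Returns the URL itself, followed by its parent if it ends with /trunk. */
    static List<String> candidateUrls(String url) {
        List<String> candidates = new ArrayList<>();
        candidates.add(url);
        String trimmed = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        if (trimmed.endsWith("/trunk")) {
            candidates.add(trimmed.substring(0, trimmed.length() - "/trunk".length()));
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(candidateUrls("http://turanar.googlecode.com/svn/trunk"));
    }
}
```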
