
shapeworld's People

Contributors

alexkuhnle, dschaehi, gstoica27, hughperkins, jayelm, tomsherborne


shapeworld's Issues

Optimizing util.Point to speed up dataset creation

Creating examples with multiple worlds_per_instance is really slow. When I profile this code, it takes around 54 seconds:

dataset = Dataset.create(dtype='agreement', name='spatial',
                         worlds_per_instance=20, correct_ratio=0.5)
generated = dataset.generate(n=1, mode='train', include_model=True,
                             alternatives=True)
         177197561 function calls (173005269 primitive calls) in 53.820 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 15903371   13.730    0.000   24.102    0.000 util.py:231(__new__)
 99570143    7.172    0.000    7.172    0.000 {built-in method builtins.isinstance}
  5006861    5.544    0.000   14.541    0.000 util.py:325(__sub__)
  2339708    2.704    0.000    7.356    0.000 util.py:439(positive)
    25196    2.702    0.000   48.189    0.002 entity.py:192(not_collides)
 15903676    2.278    0.000    2.278    0.000 {built-in method __new__ of type object at 0xa3ee00}
  2093463    1.896    0.000    5.374    0.000 util.py:415(__abs__)
  1282992    1.359    0.000   12.936    0.000 shape.py:83(distance)
  2657742    1.352    0.000    1.554    0.000 util.py:262(length)
  1351178    1.320    0.000    5.996    0.000 util.py:464(range)
  6920101    1.257    0.000    1.257    0.000 {built-in method builtins.max}
  1551979    1.158    0.000    3.490    0.000 util.py:461(rotate)
6600261/2413335    0.901    0.000    6.067    0.000 {built-in method builtins.abs}
  1359072    0.860    0.000   19.195    0.000 entity.py:59(distance)
   190662    0.569    0.000    1.958    0.000 shape.py:244(distance)
  1359072    0.501    0.000    3.546    0.000 entity.py:53(rotate)
   163036    0.479    0.000    2.103    0.000 shape.py:177(distance)
        4    0.456    0.114    0.456    0.114 {method 'poll' of 'select.poll' objects}
  1282992    0.447    0.000   13.382    0.000 world.py:54(distance)
   326850    0.357    0.000    3.505    0.000 shape.py:105(distance)

It seems like a lot of time (~24s) is spent creating points:

class Point(PointTuple):

    def __new__(cls, x, y):
        assert isinstance(x, float) or isinstance(x, int) or isinstance(x, bool) or isinstance(x, str)
        assert isinstance(y, float) or isinstance(y, int) or isinstance(y, bool) or isinstance(y, str)
        if isinstance(x, str):
            x = float(x)
        if isinstance(y, str):
            y = float(y)
        return super(Point, cls).__new__(cls, x, y)

Can we speed this up by removing all the isinstance calls (which take 7s)? Specifically, perhaps by creating different functions to handle the different possible arguments Point takes? Same with Point.__sub__, Point.positive, Point.__abs__, etc.
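As a sketch of one possible direction (hypothetical, not the project's actual fix; names mirror the snippet above): `float()` already accepts int, bool, and numeric str, so the per-call isinstance checks could collapse into a single conversion. Note that, unlike the original, this always stores floats instead of preserving int inputs:

```python
from collections import namedtuple

_PointBase = namedtuple('PointBase', ('x', 'y'))

class Point(_PointBase):
    __slots__ = ()

    def __new__(cls, x, y):
        # float() accepts int, bool, and numeric str directly, so the
        # per-call isinstance checks can be dropped; this always stores
        # floats, whereas the original preserved int inputs.
        return _PointBase.__new__(cls, float(x), float(y))
```

Whether the conversion-to-float behaviour change is acceptable would need checking against the rest of util.py.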

That should hopefully speed things up considerably, but generating a single example will likely still be pretty slow. It seems like not_collides can probably be optimized as well, but it's hard to tell how slow it is until the Point class above is sped up.

I'm happy to work on a fix if you agree with the points above; perhaps you could point me to where Point tends to be called.

Error when `worlds_per_instance > 1`

I was toying around with worlds_per_instance:

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='multishape',
                         worlds_per_instance=5)
generated = dataset.generate(n=3, mode='train', include_model=True)

But I get the following error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    generated = dataset.generate(n=3, mode='train', include_model=True)
  File "/local/scratch/jlm95/ShapeWorld/shapeworld/dataset.py", line 901, in generate
    batch['agreement'][i].append(float(correct))
AttributeError: 'numpy.float32' object has no attribute 'append'

This seems to be because batch['agreement'] is initialized to a single float for each instance, e.g. array([1., 1., 1.], dtype=float32), and can't be appended to.

batch['agreement'] is initialized in zero_batch based on the datatypes provided by self.values, does the datatype for agreement need to be changed to a vector(float) when worlds_per_instance > 1?
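To illustrate the mismatch (a minimal stand-alone sketch, not ShapeWorld's actual zero_batch code): a scalar per instance cannot be appended to, while one list per instance can:

```python
# one scalar agreement value per instance, as a 'float' datatype produces
batch = {'agreement': [0.0, 0.0, 0.0]}
try:
    batch['agreement'][0].append(1.0)
except AttributeError:
    pass  # plain floats, like numpy.float32 scalars, have no append()

# one list per instance, as a vector(float) datatype would imply
batch = {'agreement': [[] for _ in range(3)]}
batch['agreement'][0].append(float(True))  # appending now works
```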

Adding new shapes

Hi @AlexKuhnle, I wanted to add additional shapes to the existing ones (such as letters and numbers as shapes), so I added some Shape subclasses to shape.py and corresponding entries in english.json, but it seems that the shape names have to be in the english.dat file as well so that they are covered by the grammar. Is this correct? If so, I am wondering whether there is an easy way to add the shapes to english.dat. I spent quite a bit of time trying to figure this out, but failed. Any help is appreciated. Thanks!

How to customize our own dataset?

Hi~
ShapeWorld is really an inspirational work, but I have some suggestions on the dataset building. Since it's published as a Python library, maybe some annotations or demonstrations for the captioners and realizers, showing how to use them to produce custom datasets, would help?
For example, I need to generate simple captions like "There are 3 more triangles than circles". I didn't find this style in the given agreement datasets, and I can't figure out which captioners I should use, or how to configure the given dataset.py files. Even when I change the given vocabulary in datasets/agreement/**.py, I find I should also change the vocabulary in the realizer and world builder. Other issues like this come from the interaction between the captioner/realizer/world generator.
I think other users would also appreciate a demo with simple annotations on how to use the submodules, if it's not too much bother.
Sincerely
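As an illustration of the desired caption style above, counting shapes and filling a template is straightforward once entity records are available; the entity layout below is a made-up assumption, not ShapeWorld's documented world-model format:

```python
from collections import Counter

# hypothetical entity records; the real world model may be structured differently
entities = ([{'shape': {'name': 'triangle'}}] * 5
            + [{'shape': {'name': 'circle'}}] * 2)
counts = Counter(e['shape']['name'] for e in entities)
diff = counts['triangle'] - counts['circle']
caption = 'There are {} more triangles than circles.'.format(diff)
```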

module object is not callable

When I follow the instructions in the README,

from shapeworld import dataset

dataset = dataset(dtype='agreement', name='multishape')

I get the following error: TypeError: 'module' object is not callable. Are the instructions correct?

As a workaround, I instead tried:

ds = shapeworld.dataset.Dataset.create(dtype='agreement', name='multishape')

But then I get the error:

ModuleNotFoundError: No module named 'shapeworld.datasets.agreement.multishape'

The parsing of sentences to ShapeWorld semantic space

I will just copy and paste what I wrote on a previous issue, just to keep it as an anchor:

I downloaded the latest commit of the analyzer, and here is some weird behaviour I am getting; maybe you can help me with it. For example, when I run analyze = analyzer.Dmrs_analyzer(); analyze(['A pentagon is not smaller than a triangle.']), I get [None].

I will try to find some pattern in the sentences that cause this. Let's see if I can find it.

Controlling the amount of correct captions

Hi Alex,

Great work! I was just curious whether you can choose the number of correct captions to produce. For example, I would like to have all correct captions for certain images but all incorrect captions for others.

I hope I didn't miss something obvious.

Clarification of dtype and name?

The README provides examples for several values of dtype and name, but I can't find documentation anywhere about (1) what the complete set of options is and (2) what each of them means.

How do I generate all possible images uniformly at random?

Add wget to setup.py

Hi Alex,

I wanted to suggest that the wget package is added to the requirements in setup.py as I've found new clones of the repo can't generate new data with generate.py without it.

I would do this as a PR, but I'm only allowed one fork of your repo, and mine is not up to date / contains other (minor) changes that you don't need.

Thanks :)

Sentence structure seems broken in Selection Dataset

Hi Alex,

Me again :) Hope everything is going well. I was trying to use the Selection dataset to generate examples; however, there seem to be some discrepancies. For example, for the "x-two" relationship, I sometimes get sentences such as "The left shape is a magenta circle.", or for "x-max", I get "The leftmost shape is a circle." This behaviour is repeated for the "proximity" and "y" relationships as well.

From the examples I have seen, the correct behaviour should be "the left (insert color or shape) shape is (insert correct attribute)".

What do you think? Am I missing something? Is this expected behaviour?

Thanks as usual :)

EDIT: Markup was used incorrectly, so the whole message wasn't understandable.

`No module named 'pydmrs.mapping.paraphrase'`

Hi Alex,

I've pip install -e'd the repository. Then I try doing:

from shapeworld.dataset import Dataset

dataset = Dataset.create(dtype='agreement', name='existential')

This said the module pydmrs was not found, so I pip installed pydmrs. Now I get:

~/git/ShapeWorld/shapeworld/realizers/dmrs/dmrs.py in <module>()
     13 from pydmrs.core import Link, ListDmrs
     14 from pydmrs.graphlang.graphlang import parse_graphlang
---> 15 from pydmrs.mapping.paraphrase import paraphrase
     16 
     17 

ModuleNotFoundError: No module named 'pydmrs.mapping.paraphrase'

This is on a Mac, using python 3.5. What am I missing here?

Controlling what predicates to include in the captions

First of all, thanks for this nice package!
I am currently actively using it as a test bed for visual reasoning models.
One question I have at the moment is how to control which predicates to include in the captions. For example, I'd like to turn off the color predicate when doing relational reasoning. A hacky solution I am using now is replacing

return self.sample_values(mode=mode, predication=LogicalPredication())

with

return self.sample_values(mode=mode, predication=LogicalPredication(blocked_preds=["color"]))

and then replacing in shapeworld/captioners/relation.py

ref_predication = predication.copy(reset=True)

with

ref_predication = predication.copy(reset=False)

But I assume there is a better way to solve this problem, e.g. by providing a child class of CaptionAgreementDataset. At the moment, though, I don't know how (I have already spent quite a bit of time trying to make sense of the code...).

Could you give me some tips for this problem?

Custom captions

Is there an easy way to leverage attributes/descriptors to create custom captions? I would like all my captions to have a specific format like "There is a {color} {object}".
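One low-tech approach (a sketch only; it assumes generating with include_model=True and an entity layout like the one below, which is hypothetical rather than ShapeWorld's documented API) is to template captions directly from the world model, bypassing the realizer:

```python
def caption_entity(entity, template='There is a {color} {object}.'):
    # entity layout assumed here: nested dicts with 'name' fields;
    # the real world-model structure may differ
    return template.format(color=entity['color']['name'],
                           object=entity['shape']['name'])

entity = {'shape': {'name': 'square'}, 'color': {'name': 'red'}}
print(caption_entity(entity))
```

The trade-off is that hand-templated captions skip the grammar, so agreement labels would have to be computed separately.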

[BUG] Reading dataset with alternatives infinite loop

Hello!

I believe I've found the following bug in the dataset.py file under "ShapeWorld/shapeworld/dataset.py":

When reading a command-line-generated dataset with alternatives (e.g. worlds_per_instance), the program encounters an infinite loop. Below I have placed the "offending" code, found on lines 563-583 of "dataset.py".

while True:
    if alts:
        i = 0
        v = list()
        while True:
            image_bytes = read_file('{}-{}-{}.{}'.format(value_name, n, i, image_format), binary=True)
            if image_bytes is None:
                break
            image_bytes = BytesIO(image_bytes)
            image = Image.open(image_bytes)
            v.append(World.from_image(image))
            i += 1
        value.append(v)
    else:
        image_bytes = read_file('{}-{}.{}'.format(value_name, n, image_format), binary=True)
        if image_bytes is None:
            break
        image_bytes = BytesIO(image_bytes)
        image = Image.open(image_bytes)
        value.append(World.from_image(image))
    n += 1

The problem is that while the "if image_bytes is None" check exits the inner while loop when a read is attempted on a non-existent world, it does not exit the top-level while loop. Consequently, the code attempts to read non-existent files forever.

However, I believe the fix is very simple. For instance, the following seems sufficient:

flag = True
while flag:
    if alts:
        i = 0
        v = list()
        while True:
            image_bytes = read_file('{}-{}-{}.{}'.format(value_name, n, i, image_format), binary=True)
            if image_bytes is None:
                flag = False
                break
            image_bytes = BytesIO(image_bytes)
            image = Image.open(image_bytes)
            v.append(World.from_image(image))
            i += 1
        value.append(v)
    else:
        image_bytes = read_file('{}-{}.{}'.format(value_name, n, image_format), binary=True)
        if image_bytes is None:
            break
        image_bytes = BytesIO(image_bytes)
        image = Image.open(image_bytes)
        value.append(World.from_image(image))
    n += 1

I.e., replacing the top-level "True" with a boolean flag variable and clearing the flag in the corresponding break conditional.
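An alternative refactor (a sketch with a hypothetical helper; read_file is stubbed as any callable returning bytes or None, and decoding to World objects is omitted) pulls the loop into a function so that `return` exits both levels at once. Unlike the flag variant, it also avoids appending one trailing empty list of alternatives before stopping:

```python
def read_worlds(read_file, value_name, image_format, alts):
    """Collect raw image bytes until a read fails, mirroring the dataset.py loop."""
    value = []
    n = 0
    while True:
        if alts:
            v = []
            i = 0
            while True:
                data = read_file('{}-{}-{}.{}'.format(value_name, n, i, image_format))
                if data is None:
                    break
                v.append(data)
                i += 1
            if not v:
                return value  # no alternatives exist for instance n: done
            value.append(v)
        else:
            data = read_file('{}-{}.{}'.format(value_name, n, image_format))
            if data is None:
                return value
            value.append(data)
        n += 1
```

The point here is only the control flow; the BytesIO/Image/World decoding from the original would slot back in where the raw bytes are appended.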

Reading pre-loaded datasets tries to open `world_features` unsuccessfully

[Not urgent]

Hi Alex,

I've noticed that if you open a dataset that has saved the ResNet features using generate() rather than tf_util.batch_records(), an exception is thrown because deserialize_value() tries to open world_features.txt (the offending line is 436 in dataset.py). This doesn't seem to be a useful file, as the full features aren't saved in non-TFRecord form, so a condition to ignore them could be included.

I know you require the generation of TFRecords for the ResNet features, but in cases such as designing a manual object detector (as I am doing now), where I need the world_model, I have to use the generate() call.

Hope this helps!

TypeError in shapeworld/realizers/dmrs/dmrs.py

Hello~
I ran the demo to generate a dataset:
python generate.py -d examples/agreement/multishape -U -t agreement -n multishape -i 10 -p 0.1 -M -H
then got error:

File "generate.py", line 142, in <module>
    generated = dataset.generate(n=args.instances, mode=mode, noise_range=args.pixel_noise, include_model=args.include_model, alternatives=True)
  File "/home/jysen/code/shapeworld/shapeworld/dataset.py", line 773, in generate
    captions = self.caption_realizer.realize(captions=captions)
  File "/home/jysen/code/shapeworld/shapeworld/realizers/dmrs/realizer.py", line 217, in realize
    mrs_list.append(dmrs.get_mrs() + '\n')
  File "/home/jysen/code/shapeworld/shapeworld/realizers/dmrs/dmrs.py", line 146, in get_mrs
    for nodeid in labels[lbl]:
TypeError: 'int' object is not iterable

I'm new to DMRS and can't figure this out. I'm using Python 2.7 with the latest ShapeWorld, and the pydmrs from https://github.com/delph-in/pydmrs/tree/python2.
BTW, there are lots of bugs in https://github.com/delph-in/pydmrs/tree/master, for example in mapping/paraphrase.py. And the v1.6 pydmrs from pip doesn't even have paraphrase.py, which is frequently called in the shapeworld/realizers.

Alternatives in Records breaks batch_records

Hi Alex,

I'm playing around with your tf_util interface for loading batches of data and I find that if I generate a small set as:

python3 generate.py -d some_dir -a tar:bzip2 -t agreement -n oneshape -s 5,1,1 -i 100 -M -T  --config-values --correct_ratio 1.0 --captions_per_instance 5

Then when running the example data loading:

dataset = Dataset.create(dtype='agreement', name='oneshape', config='some_dir')
generated = tf_util.batch_records(dataset=dataset, mode='train', batch_size=128)

I get the error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-58fc10a45e5c> in <module>()
      1 dataset = Dataset.create(dtype='agreement', name='oneshape', config='some_dir')
----> 2 generated = tf_util.batch_records(dataset=dataset, mode='train', batch_size=128)

~/code/acs/ShapeWorld/shapeworld/tf_util.py in batch_records(dataset, mode, batch_size)
     77                 batch[value_name] = tf.clip_by_value(t=(batch[value_name] + noise), clip_value_min=0.0, clip_value_max=1.0)
     78             elif value_type == 'int' or value_type == 'vector(int)' or value_type in dataset.vocabularies:
---> 79                 batch[value_name] = tf.cast(x=batch[value_name], dtype=tf.int32)
     80         return batch
     81 

KeyError: 'alternatives'

I find this is because the loop on line 75 iterates through the key, value pairs of the dataset.values dict, but the batch dict no longer contains alternatives due to the call to records.pop('alternatives'). I added a breaking condition to fix this:

for value_name, value_type in dataset.values.items():
    if value_name == 'alternatives':
        break
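A side note on this kind of fix (a stand-alone illustration with made-up value names, not ShapeWorld's actual dict): `break` stops the whole loop, so any values listed after 'alternatives' in dataset.values would be skipped as well, whereas `continue` only skips the missing key:

```python
values = {'world': 'world', 'alternatives': 'int', 'agreement': 'vector(float)'}
batch_keys = []
for value_name, value_type in values.items():
    if value_name == 'alternatives':  # this key was popped from the batch earlier
        continue                      # 'break' here would also drop 'agreement'
    batch_keys.append(value_name)
```

Whether `break` happens to be safe depends on 'alternatives' being the last key in the dict.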

I have this as a PR from my fork that I can submit, but I'm finding a larger problem with loading data this way, as the call to evaluate a batch:

with tf.Session() as sess:
    batch = sess.run(generated)

hangs for an unreasonably long time. I've not measured exactly, because it might never recover, but it appears to demand at least 10 minutes of setup time, whereas the data-loading modules I've written take almost no time to evaluate a batch. I'm not sure where the issue is, but I'm happy to investigate if you can point me in the right direction, or if you find the same issue when trying to evaluate a batch.

[Using Mac OSX 10.13.2, Python 3.5.4, Tensorflow 1.5.0]

Conflict between mode "relational" and negated relation / implication

With mode agreement and relational ('-t agreement -n relational'),

  1. with "negation" set to None/True in the config file,
    set_realizer for captioners/negation_relation.py fails at "assert -1 in realizer.relations['negation']": AssertionError
    ex: python generate.py -d examples/test1 -U -t agreement -n relational -i 1 -M -H

  2. with "negation" set to False in the config file (or if using python -O),
    correct for captioners/regular_type.py fails at "sub_predication.implies(predicate=caption)": AttributeError: 'generator' object has no attribute 'implies'
    ex1: python -O generate.py -d examples/test1 -U -t agreement -n relational -i 1 -M -H
    ex2: python generate.py -d examples/test2 -U -t agreement -n relational -c configs/agreement/relational/spatial_twoshapes.json -i 100 -M -H -G

NB1: obtained after correcting captions/pragmatical_predication.py (line 62: yield predication.get_sub_predications())
NB2: deleting line 62 in captions/pragmatical_predication.py seems to resolve this issue.

ShapeWorld_issue.txt

Size-rel in Selection Dataset

Hi Alex,

I have seen some examples for size relation in the Selection dataset, however, in the language json, there seems to be no key for these.

Was there a reason you left out the size-rel in the Selection dataset? If not, would it be possible for you to share it, if it is not too much trouble of course?

No module named mrs_load

Hi Alex,

Thank you for the analyzer part. However, whenever I try to run "mrs.convert_to(cls=Dmrs, copy_nodes=True)", it throws a "No module named mrs_load" error inside analyzer.mrs.

Also, I think in analyzer.analyzer at line 239 no dmrs_list is created. I fixed it simply by adding dmrs_list = list(). But I thought you would like to know.

How to generate multiple examples for one language description?

Hi, awesome project :)

Looking around the examples, e.g. https://rawgit.com/AlexKuhnle/ShapeWorld/master/examples/agreement/relational-full/data.html, it looks like the way it works is that one image is generated and then multiple descriptions are created for this image?

Is there any way to do the opposite, i.e. sample one description and then draw multiple examples that match that description? (And also, ideally, some examples that are guaranteed not to match the description.)

Edit: I tried using worlds_per_instance, but that didn't seem to be it?


"no such instance as `root_gen' available"

Hi,

Thanks for this great dataset!

I am not able to generate/load any data. Running the provided "Integration into Python code" code, I receive this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/georgestoica/Desktop/Research/ShapeWorld/shapeworld/dataset.py", line 1360, in generate
    captions = self.caption_realizer.realize(captions=captions)
  File "/Users/georgestoica/Desktop/Research/ShapeWorld/shapeworld/realizers/dmrs/realizer.py", line 317, in realize
    assert len(caption_strings) == len(captions), '\n'.join(stdout_data) + '\n' + '\n'.join(stderr_data)
AssertionError: 
roots: no such instance as `root_gen' available
NOTE: transfer did 0 successful unifies and 0 failed ones

Upon closer inspection, it appears that "root_gen" is an argument passed to this function call,

ace = subprocess.Popen([self.ace_path, '-g', self.erg_path, '-1e'] + self.ace_arguments, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

On line 260 of "ShapeWorld/shapeworld/realizers/dmrs/realizer.py".

I haven't been able to find any solutions online; I can't find anything related to "root_gen" and Python subprocesses.

I was wondering if by any chance there were any known solutions to this? I really wanted to use this dataset for our research.

Any help would be very greatly appreciated! Thanks very much!
