Comments (4)
I am also looking for an answer, as the dataset looks a bit messy. By any chance, could you provide the script you used to clean up the `title` and/or `description`?
from open_clip.
@Ascn89 A first suggestion for you would be something like this:
import urllib.parse
import pandas as pd

# df is assumed to be the dataset's metadata table, already loaded as a DataFrame.
# Replace missing values with empty strings so the string operations below work.
df["title"] = df["title"].fillna("")
df["description"] = df["description"].fillna("")

# Decode percent-encoding, then turn "+" into spaces.
df["title"] = df["title"].apply(lambda x: urllib.parse.unquote(x).replace("+", " "))
df["description"] = df["description"].apply(lambda x: urllib.parse.unquote(x).replace("+", " "))
However, it is unclear how much cleaning is needed to match the published results, and I hope the authors of this repo can provide some insight there.
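One detail worth checking in the suggestion above: decoding with `urllib.parse.unquote` and then replacing "+" also turns percent-encoded plus signs (`%2B`) into spaces, whereas the standard library's `urllib.parse.unquote_plus` converts "+" to a space before decoding and so keeps them. A minimal sketch of the difference; the helper names and the sample string are made up for illustration and do not come from the dataset:

```python
import urllib.parse


def decode_then_replace(s):
    # Percent-decode first, then replace every "+" (including decoded %2B) with a space.
    return urllib.parse.unquote(s).replace("+", " ")


def decode_plus(s):
    # Replace "+" with a space first, then percent-decode, so %2B survives as "+".
    return urllib.parse.unquote_plus(s)


sample = "C%2B%2B+primer"  # hypothetical URL-encoded title
print(repr(decode_then_replace(sample)))  # 'C   primer'
print(repr(decode_plus(sample)))          # 'C++ primer'
```

Which behaviour is right depends on how the titles were encoded in the first place, which the thread does not settle at this point.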
Hi @Lauler, in our experiments we used a concatenation of `title` and `description`, with a little bit of cleaning. Hope this helps!
Useful functions:
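import re
import urllib.parse

import ftfy

def remove_html(raw_html):
    # Strip anything that looks like an HTML tag (non-greedy match).
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def clean_text(x):
    # URL-decode (treating "+" as a space), repair broken Unicode, then drop tags.
    return remove_html(ftfy.fix_text(urllib.parse.unquote_plus(x)))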
What we use for the text side for each sample:
text = clean_text(table['title'][idx] + ' ' + table['description'][idx])[:500]
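For anyone who wants to try this end to end, here is a rough, self-contained sketch of the steps above. It is not the authors' script: `build_caption` and `max_chars` are hypothetical names, the handling of missing values is an assumption (the reply does not say how empty fields were treated), and `ftfy` is made optional only so the sketch runs without it installed.

```python
import re
import urllib.parse

try:
    import ftfy  # third-party; used by the authors, optional in this sketch
except ImportError:
    ftfy = None

TAG_RE = re.compile(r"<.*?>")


def remove_html(raw_html):
    # Strip anything that looks like an HTML tag.
    return TAG_RE.sub("", raw_html)


def clean_text(x):
    # Same order as in the reply above: URL-decode, fix Unicode, remove tags.
    decoded = urllib.parse.unquote_plus(x)
    if ftfy is not None:
        decoded = ftfy.fix_text(decoded)
    return remove_html(decoded)


def build_caption(title, description, max_chars=500):
    # Assumption: missing values (None or NaN floats) are treated as empty strings.
    title = title if isinstance(title, str) else ""
    description = description if isinstance(description, str) else ""
    return clean_text(title + " " + description)[:max_chars]


print(build_caption("Red+car", "%3Cb%3Efast%3C%2Fb%3E+ride"))
```

Note that the 500-character truncation is applied after cleaning, as in the line above, so encoded sequences and tags do not count toward the limit.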
Thank you, that is very helpful!