
Comments (4)

rodneykinney commented on August 12, 2024

RedPajama's code produces raw LaTeX: some cleaning is applied, but the source is left mostly un-parsed, and the bibliography is discarded.

unarXive uses tralics, a third-party C++ tool that translates LaTeX into XML. The unarXive code then parses that XML into an S2ORC-like format. The bibliography is included, and math gets converted into a mixture of MathML and TeX expressions:

<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D433;</mi><mo>&#x02208;</mo><mi>&#x1D4B5;</mi></mrow></math><texmath>{\mathbf {z}}\in \mathcal {Z}</texmath></formula>

It looks like the math expressions are given in both MathML and TeX formats, so you can choose either one.
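For instance, a minimal Python sketch (using the exact element above; no error handling) that pulls out whichever representation you want:

import xml.etree.ElementTree as ET

# The tralics <formula> element quoted above, verbatim.
formula_xml = (
    "<formula type='inline'>"
    "<math xmlns='http://www.w3.org/1998/Math/MathML'>"
    "<mrow><mi>&#x1D433;</mi><mo>&#x02208;</mo><mi>&#x1D4B5;</mi></mrow></math>"
    "<texmath>{\\mathbf {z}}\\in \\mathcal {Z}</texmath>"
    "</formula>"
)

formula = ET.fromstring(formula_xml)
# The MathML child is namespaced; <texmath> is not.
mathml = formula.find("{http://www.w3.org/1998/Math/MathML}math")
texmath = formula.findtext("texmath")

print(texmath)  # {\mathbf {z}}\in \mathcal {Z}
print(ET.tostring(mathml, encoding="unicode"))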

rodneykinney commented on August 12, 2024

The math processing feels like a wash to me, but the XML format seems more useful if you want to produce natural language. You also get control over what to do with figures, etc.; see the flattening sketch after the unarXive example below.

RedPajama example:

Finally, in the \emph{Multi-Task Aggregation} stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.

\begin{figure}
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/overview_v2.png}
    \caption{The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.}
    \vspace{-0.4cm}
    \label{fig:overview}
\end{figure}

\section{Skill Embedding}
\label{sec:skill-embedding}

In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP \citep{MotionClipTevet2022}, where a transformer autoencoder is trained to encode motion sequences into a latent representation that ``aligns'' with the language embedding from a pre-trained CLIP text encoder \citep{ClipRadford2021}. Given a motion clip $\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)$ and its caption $c$, a motion encoder ${\mathbf{z}} = \mathrm{Enc}_m(\hat{{\mathbf{m}}})$ maps the motion to an embedding ${\mathbf{z}}$. The embedding is normalized to lie on a unit sphere $||{\mathbf{z}}|| = 1$. Following~\citet{MotionClipTevet2022}, $\mathrm{Enc}_m\left({\mathbf{m}} \right)$ is modeled by a bidirectional transformer \citep{bert2018}. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence ${\mathbf{m}} = ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)$ to recover $\hat{{\mathbf{m}}}$ from ${\mathbf{z}}$. The decoder is also modelled as a birectional transformer ${\mathbf{m}} = \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})$, which decodes all frames of in parallel using a learned constant query sequence ${\mathbf{U}} = ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)$, similar to the final layer of \citet{detr}. The autoencoder is trained with the loss:


\begin{align}
\mc{L}_{\text{auto}} = \mc{L}_{\text{recon}} + 0.1\mc{L}_{\text{align}} .
\end{align}

Equivalent unarXive example:

Finally, in the <hi rend='it'>Multi-Task Aggregation</hi> stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.</p>
<figure width='384.2974pt' file='figures/overview_v2' extension='png' id-text='1' id='uid6'><head>The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.</head>
</figure>
</div0>
<div0 id-text='5' id='cid5'><head>Skill Embedding</head>
<p>In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP MotionClipTevet2022, where a transformer autoencoder is trained to encode motion sequences into a latent representation that “aligns” with the language embedding from a pre-trained CLIP text encoder ClipRadford2021. Given a motion clip <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover><mo>=</mo><mrow><mo>(</mo><msub><mover accent='true'><mi>&#x1D42A;</mi> <mo>&#x5E;</mo></mover> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mover accent='true'><mi>&#x1D42A;</mi> <mo>&#x5E;</mo></mover> <mi>n</mi> </msub><mo>)</mo></mrow></mrow></math><texmath>\hat{{\mathbf {m}}} = (\hat{{\mathbf {q}}}_1, ..., \hat{{\mathbf {q}}}_n)</texmath></formula> and its caption <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>c</mi></math><texmath>c</texmath></formula>, a motion encoder <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D433;</mi><mo>=</mo><msub><mi> Enc </mi> <mi>m</mi> </msub><mrow><mo>(</mo><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover><mo>)</mo></mrow></mrow></math><texmath>{\mathbf {z}}= \mathrm {Enc}_m(\hat{{\mathbf {m}}})</texmath></formula> maps the motion to an embedding <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>&#x1D433;</mi></math><texmath>{\mathbf {z}}</texmath></formula>. The embedding is normalized to lie on a unit sphere <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mo>|</mo><mo>|</mo><mi>&#x1D433;</mi><mo>|</mo><mo>|</mo><mo>=</mo><mn>1</mn></mrow></math><texmath>||{\mathbf {z}}|| = 1</texmath></formula>. Following MotionClipTevet2022, <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><msub><mi> Enc </mi> <mi>m</mi> </msub><mfenced open='(' close=')'><mi>&#x1D426;</mi></mfenced></mrow></math><texmath>\mathrm {Enc}_m\left({\mathbf {m}}\right)</texmath></formula> is modeled by a bidirectional transformer bert2018. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D426;</mi><mo>=</mo><mo>(</mo><msub><mi>&#x1D42A;</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>&#x1D42A;</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= ({\mathbf {q}}_1, ..., {\mathbf {q}}_n)</texmath></formula> to recover <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover></math><texmath>\hat{{\mathbf {m}}}</texmath></formula> from <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>&#x1D433;</mi></math><texmath>{\mathbf {z}}</texmath></formula>. 
The decoder is also modelled as a birectional transformer <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D426;</mi><mo>=</mo><mi> Dec </mi><mo>(</mo><mi>&#x1D433;</mi><mo>,</mo><mi>&#x1D414;</mi><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= \mathrm {Dec}({\mathbf {z}}, {\mathbf {U}})</texmath></formula>, which decodes all frames of in parallel using a learned constant query sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D414;</mi><mo>=</mo><mo>(</mo><msub><mi>&#x1D42E;</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>&#x1D42E;</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {U}}= ({\mathbf {u}}_1, ..., {\mathbf {u}}_n)</texmath></formula>, similar to the final layer of detr. The autoencoder is trained with the loss:</p>
<formula id-text='2' id='uid7' textype='align' type='display'><math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'><mtable displaystyle='true'><mtr><mtd columnalign='right'><mrow><msub><mi>&#x2112;</mi> <mtext>auto</mtext> </msub><mo>=</mo><msub><mi>&#x2112;</mi> <mtext>recon</mtext> </msub><mo>+</mo><mn>0</mn><mo>.</mo><mn>1</mn><msub><mi>&#x2112;</mi> <mtext>align</mtext> </msub><mo>.</mo></mrow></mtd></mtr></mtable></math><texmath>
\mathcal {L}_{\text{auto}} = \mathcal {L}_{\text{recon}} + 0.1\mathcal {L}_{\text{align}} .
</texmath></formula>
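
As promised above, here's a hedged sketch (not unarXive's actual converter) of the kind of control the XML gives you: flatten each tralics-style <p> to plain text, substitute each formula with its <texmath> source, and drop figure subtrees. The input filename is a placeholder:

import xml.etree.ElementTree as ET

def flatten(node):
    """Recursively flatten a tralics-style element into plain text."""
    parts = []
    if node.tag == "figure":
        pass  # drop figures entirely; emit node.findtext("head") here to keep captions
    elif node.tag == "formula":
        parts.append(node.findtext("texmath", default=""))  # keep the TeX source
    else:
        if node.text:
            parts.append(node.text)
        for child in node:
            parts.append(flatten(child))
    if node.tail:
        parts.append(node.tail)  # text following the element inside its parent
    return "".join(parts)

root = ET.parse("paper.xml").getroot()  # placeholder path for tralics output
for p in root.iter("p"):
    print(flatten(p).strip())

Keeping captions instead of dropping figures is a one-line change in the figure branch, which is exactly the control you don't get from the raw LaTeX.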

rodneykinney commented on August 12, 2024

Another third-party tool, pandoc, gives similar results:

  Finally, in the <italic>Multi-Task Aggregation</italic> stage, the
  different policies are integrated into a multi-task controller that
  can be directed using language commands to perform a specific task
  using a desired skill.</p>
  <fig id="fig:overview">
    <caption><p>The PADL framework consists of three stages. 1) In the
    Skill Embedding stage, a dataset of motion clips and corresponding
    text captions are used to learn a joint embedding of motions and
    captions. 2) In the Policy Training stage, the learned skill
    embedding is used to train a collection of policies to perform
    various tasks, while imitating behaviors in the dataset. 3) Finally,
    in the Multi-Task Aggregation stage, policies trained for different
    tasks are combined into a multi-task controller that can be directed
    to perform different tasks and skills via language
    commands.</p></caption>
    <graphic mimetype="image" mime-subtype="png" xlink:href="figures/overview_v2.png" xlink:title="" />
  </fig>
  <p><milestone-start id="fig:overview" />[fig:overview]<milestone-end /></p>
</sec>
<sec id="sec:skill-embedding">
  <title>Skill Embedding</title>
  <p>In the Skill Embedding stage, our objective is to construct an
  embedding space that aligns motions with their corresponding natural
  language descriptions. To do this, we follow a similar procedure as
  MotionCLIP , where a transformer autoencoder is trained to encode
  motion sequences into a latent representation that “aligns” with the
  language embedding from a pre-trained CLIP text encoder . Given a
  motion clip <inline-formula><alternatives>
  <tex-math><![CDATA[\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  and its caption <inline-formula><alternatives>
  <tex-math><![CDATA[c]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>c</mml:mi></mml:math></alternatives></inline-formula>,
  a motion encoder <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}= \mathrm{Enc}_m(\hat{{\mathbf{m}}})]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  maps the motion to an embedding <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
  The embedding is normalized to lie on a unit sphere
  <inline-formula><alternatives>
  <tex-math><![CDATA[||{\mathbf{z}}|| = 1]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></alternatives></inline-formula>.
  Following , <inline-formula><alternatives>
  <tex-math><![CDATA[\mathrm{Enc}_m\left({\mathbf{m}}\right)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="true" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo stretchy="true" form="postfix">)</mml:mo></mml:mrow></mml:mrow></mml:math></alternatives></inline-formula>
  is modeled by a bidirectional transformer . A motion decoder is
  jointly trained with the encoder to produce a reconstruction sequence
  <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{m}}= ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  to recover <inline-formula><alternatives>
  <tex-math><![CDATA[\hat{{\mathbf{m}}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover></mml:math></alternatives></inline-formula>
  from <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
  The decoder is also modelled as a birectional transformer
  <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{m}}= \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>D</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐔</mml:mi></mml:mstyle><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
  which decodes all frames of in parallel using a learned constant query
  sequence <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{U}}= ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐔</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐮</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐮</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
  similar to the final layer of . The autoencoder is trained with the
  loss:</p>
  <p><disp-formula><alternatives>
  <tex-math><![CDATA[\begin{aligned}
  \mathcal{L}_{\text{auto}} = \mathcal{L}_{\text{recon}} + 0.1\mathcal{L}_{\text{align}} .\end{aligned}]]></tex-math>
  <mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mtable><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">auto</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">recon</mml:mtext></mml:msub><mml:mo>+</mml:mo><mml:mn>0.1</mml:mn><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">align</mml:mtext></mml:msub><mml:mi>.</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></alternatives></disp-formula>

The main difference I see between pandoc and tralics is the handling of inline citations (\cite{abc}, etc.). tralics inserts the reference ID ("modeled by a bidirectional transformer bert2018"), while pandoc drops it entirely ("modeled by a bidirectional transformer ."). pandoc produces XML in the JATS schema, while tralics emits what appears to be a custom format.
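
For reference, the pandoc output above comes from its JATS writer; something like this (pandoc on PATH, filenames are placeholders) should reproduce it:

import subprocess

# Convert a LaTeX source file to JATS XML with pandoc's built-in writer.
subprocess.run(
    ["pandoc", "--from=latex", "--to=jats", "paper.tex", "--output=paper.xml"],
    check=True,
)

The dropped citations might be partly recoverable by also passing --citeproc with the paper's .bib file, though I haven't verified that on arXiv sources.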

dumitrac commented on August 12, 2024

Marking the items prior to Feb 29th as "closed".
