<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Superset</title>
    <link>https://superset.be</link>
    <description>Serendipitous research and artisanal products.</description>
    <language>en-GB</language>
    <atom:link href="https://superset.be/rss.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Instant concept annotation with ConNER</title>
      <link>https://superset.be/news/instant-concept-annotation-with-conner/</link>
      <guid isPermaLink="true">https://superset.be/news/instant-concept-annotation-with-conner/</guid>
      <pubDate>Thu, 28 Nov 2024</pubDate>
      <description><![CDATA[<h2>Introduction</h2>
<p>We're announcing the release of ConNER (Concept Named Entity Recognition), a lightweight and efficient model designed to extract concepts from text. The model is optimized for edge devices: at 18.6 MB, it can run directly on laptops, tablets, or phones without requiring significant computational resources.</p>
<h2>Motivation</h2>
<p>Extracting concepts from text often relies on large language models that demand substantial computational power and can be slow. ConNER offers a lightweight alternative, specifically trained for concept annotation. It achieves over 90% accuracy on our validation set and is compact enough to run on edge devices, providing real-time predictions.</p>
<h2>Use Cases</h2>
<p>ConNER can be used as a building block for various applications:</p>
<ul>
<li>Analyzing lecture notes and textbooks</li>
<li>Building concept maps from educational content</li>
<li>Creating searchable concept indices</li>
<li>Supporting educational software</li>
<li>Enhancing educational content management systems</li>
</ul>
<h2>Examples</h2>
<ol>
<li>
<p>Input: <code>Microeconomics focuses on individual markets and consumer behavior.</code></p>
<p>Output: <code>[&quot;Microeconomics&quot;]</code></p>
</li>
<li>
<p>Input: <code>Understanding mental health and brain chemistry requires studying psychology.</code></p>
<p>Output: <code>[&quot;mental health&quot;, &quot;brain chemistry&quot;]</code></p>
</li>
<li>
<p>Input: <code>Machine learning is a subset of artificial intelligence that enables systems to learn from data.</code></p>
<p>Output: <code>[&quot;Machine learning&quot;]</code></p>
</li>
<li>
<p>Input: <code>The human brain is the most complex organ in the body.</code></p>
<p>Output: <code>[&quot;human brain&quot;]</code></p>
</li>
</ol>
<h2>Technical Details</h2>
<h3>Architecture</h3>
<p>ConNER is built on <code>prajjwal1/bert-tiny</code> with a classification layer for BIO (Beginning, Inside, Outside) tagging. The model:</p>
<ul>
<li>Processes sequences up to 128 tokens</li>
<li>Uses WordPiece tokenization</li>
<li>Outputs three classes: O (Outside), B-CONCEPT (Beginning), I-CONCEPT (Inside)</li>
<li>Includes dropout (0.1) for regularization</li>
</ul>
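<p>To illustrate how BIO tags become a list of concepts, here is a minimal sketch of a decoding step. It is not taken from the ConNER repository: the function name <code>decode_concepts</code>, the hand-written tokens and tags, and the rule that WordPiece continuation pieces (prefixed with <code>##</code>) are glued onto the preceding piece are all assumptions made for illustration.</p>

```python
from typing import List


def decode_concepts(tokens: List[str], tags: List[str]) -> List[str]:
    """Group B-CONCEPT / I-CONCEPT tags into concept strings (illustrative only)."""
    concepts: List[str] = []
    current: List[str] = []  # words of the concept currently being built
    for token, tag in zip(tokens, tags):
        if token.startswith("##"):
            # WordPiece continuation: attach to the previous piece if we are
            # inside a concept, otherwise ignore it.
            if current:
                current[-1] += token[2:]
            continue
        if tag == "B-CONCEPT":
            # A new concept starts; close any concept that was still open.
            if current:
                concepts.append(" ".join(current))
            current = [token]
        elif tag == "I-CONCEPT" and current:
            current.append(token)
        else:
            # "O", or a stray I-CONCEPT with no open concept: close and reset.
            if current:
                concepts.append(" ".join(current))
            current = []
    if current:
        concepts.append(" ".join(current))
    return concepts


# Hand-written tokens and tags, not real model output.
tokens = ["micro", "##economics", "focuses", "on", "individual", "markets"]
tags = ["B-CONCEPT", "I-CONCEPT", "O", "O", "O", "O"]
print(decode_concepts(tokens, tags))  # ['microeconomics']
```

<p>The sketch works on lower-cased tokens, so it does not recover the original casing shown in the examples above. Mapping predictions back to character offsets in the input text would be one way to do that.</p>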
<h3>Training Data</h3>
<p>The model was trained on a proprietary dataset of academic content and notes, including:</p>
<ul>
<li>Student notebooks with highlighted concept annotations</li>
<li>OCR-processed handwritten notes</li>
<li>Synthetic data generated by open-source LLMs</li>
</ul>
<h3>Training Configuration</h3>
<ul>
<li>Optimizer: Adam (learning rate: 2e-4, weight decay: 0.01)</li>
<li>Loss: <code>SparseCategoricalCrossentropy</code></li>
<li>Batch size: 16</li>
<li>Epochs: 300</li>
</ul>
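<p>For readers unfamiliar with the loss, the following pure-Python sketch shows what sparse categorical cross-entropy computes for a three-class tagger. It is an illustration, not the training code: it takes raw logits (comparable to <code>from_logits=True</code> in Keras), averages over tokens, and assumes the label ids 0 = O, 1 = B-CONCEPT, 2 = I-CONCEPT, which the announcement does not specify.</p>

```python
import math
from typing import List


def sparse_categorical_crossentropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood of the true class, computed from raw logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Log-sum-exp with the row maximum subtracted for numerical stability.
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(value - peak) for value in row))
        # -log softmax(row)[label] == log_sum - row[label]
        total += log_sum - row[label]
    return total / len(labels)


# Assumed label ids: 0 = O, 1 = B-CONCEPT, 2 = I-CONCEPT.
uniform = sparse_categorical_crossentropy([[0.0, 0.0, 0.0]], [1])
print(uniform)  # approximately 1.0986, i.e. ln(3) for a uniform guess over 3 classes
```

<p>The labels are integer class ids rather than one-hot vectors, which is what makes the loss sparse.</p>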
<h2>Model Card</h2>
<h3>Key Information</h3>
<ul>
<li>Size: 18.6 MB</li>
<li>Accuracy: over 90% on validation set</li>
<li>Input: Text sequences up to 128 tokens</li>
<li>Output: List of extracted academic concepts</li>
<li>Platform: Runs on CPU, optimized for edge devices</li>
</ul>
<h3>Limitations</h3>
<ul>
<li>English text only</li>
<li>Best performance on clear, direct academic or educational writing</li>
<li>Maximum sequence length of 128 tokens</li>
<li>The model sometimes misses complex multi-word concepts, especially in definitional contexts</li>
<li>Occasional false positives on pronouns and general terms</li>
<li>Performance varies with sentence structure and context</li>
<li>Some domain-specific concepts may be missed, depending on training data coverage</li>
</ul>
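<p>One common way to work around a fixed sequence length is to split longer inputs into overlapping windows and run the model on each. The sketch below is an assumption about how a caller might do this, not part of ConNER: <code>split_into_windows</code> and the overlap of 16 tokens are hypothetical, and whether the 128-token limit includes special tokens such as <code>[CLS]</code> and <code>[SEP]</code> is not stated, so a smaller <code>max_len</code> may be needed.</p>

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int = 128, overlap: int = 16) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens (illustrative only)."""
    if overlap not in range(max_len):
        raise ValueError("overlap must be between 0 and max_len - 1")
    if not tokens:
        return []
    step = max_len - overlap  # how far each new window advances
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# 300 placeholder tokens produce three windows starting at 0, 112 and 224.
tokens = [f"tok{i}" for i in range(300)]
print([len(window) for window in split_into_windows(tokens)])  # [128, 128, 76]
```

<p>Because consecutive windows overlap, a concept found in the shared region may be reported twice, so results would need de-duplicating after decoding.</p>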
<h4>Example Limitations</h4>
<ul>
<li>
<p>Input: <code>In psychology, cognitive dissonance describes the mental stress from holding contradictory beliefs.</code></p>
<p>Output (misses &quot;cognitive dissonance&quot; as a concept): <code>[]</code></p>
</li>
<li>
<p>Input: <code>It can be found by adding horizontally the individual supply curves.</code></p>
<p>Output (incorrectly labels a pronoun as a concept): <code>[&quot;It&quot;]</code></p>
</li>
</ul>
<h2>Resources</h2>
<ul>
<li><a href="https://github.com/superseted/conner">GitHub repository</a></li>
<li><a href="https://github.com/superseted/conner/blob/main/inference.py">Example inference code</a></li>
<li><a href="https://github.com/superseted/conner/blob/main/train.ipynb">Example training code</a></li>
<li><a href="https://github.com/superseted/conner/issues/">Issue tracker</a></li>
</ul>
]]></description>
    </item>
  </channel>
</rss>
