Smart DRUGan

Task

Extend DRUGan so that it can provide a personalized Data Science Roadmap

What we've done so far:

  • Research on existing solutions and resources
  • Define a source list of information
  • Define the structure the source should be mapped to
  • Define an NLP architecture for data processing
  • Create data scrapers
  • Join modules
  • Create an interaction module with a Slack interface

Example of a mapping structure

In [5]:
ex1={"someblog.com":
    {   
        "source":7,
        "title":"New tyoe of robotics", 
        "time_to_read":"approximately 40 minutes",
        "short_summary":"Some lines",
        "our_tags": ["ML", "NLP"],
        "given_by_blog_tags":["Disussion","Project","Reserch"],
        "date": "19 Sep 2018",
        "author_name": "name", 
        "number_of_comments":34, 
        "preview_picture":"Here will be a png" ,
        "github_link":"https://github.com/",
        "arxiv_link":"https://arxiv.org/abs/1807.02033v2",
        "reddit_link":"https://www.reddit.com/r/MachineLearning/comments/",
        "amount_of_facebooks":450,
        "amount_of_twits":450
    }}
In [6]:
ex2={"arxiv" : {
        "id" : "http://arxiv.org/abs/1806.01660v4",
        "date" : "19 Sep 2018",
        "category" :  "cs.[CV|CL|LG|AI|NE]/stat.ML",
        "category1" : ["nlp", "CV"],
        "title" : "Mask R-CNN",
        "authors" : [{"name" : "Andrew Ng", "id" : 123}],
        "trackbacks" : "www.kaggle.com",
        "arxiv_summary" : "We present an auxiliary task to Mask R-CNN, an instance segmentation network, which......",
        "our_summary":"We present an auxiliary task to Mask R-CNN, an instance segmentation network, which......",
        "pages_count" : 50,
        "Bibliographic":{"data":{ 
            "references" : "another arxiv article",
            "citations" : "direct quote",
            "similar_abstract" : [],
            "also_read" : []}
        },
        "num_of_submission" : 4,
        "Comments" : {
            "availability" : "yes",
            "text" : "arXiv admin note: text overlap with" ,
            "link" : "https://arxiv.org/abs/1802.05155"
        },
        "figures" : {                     
            "caption_boundary": {
                "x1": 152.66566806369357,
                "x2": 693.7513987223307,
                "y1": 273.42425452338324,
                "y2": 284.6669514973958
            },
            "caption_text": "Table I. Objects carrying charges m\u2032\u00b5 and n\u00b5 in each theory related by dualities.",
            "dpi": 100,
            "figure_boundary": {
                "x1": 262.0,
                "x2": 584.0,
                "y1": 285.0,
                "y2": 433.0
            },
            "figure_type": "Table",
            "name": "name",
            "page": 2
        },
        "video_summary" : {
            "videos" :"https://www.youtube.com/user/keeroyz/videos",
            "text" : "http://www.shortscience.org/?s=cs"
        },

        "pdf" : {
            "id" : "https://arxiv.org/pdf/1806.01660v4",
            "results_сonclusion" : "",
            "bold_item" : "",
            "pages" : 13

        }
    }}

Defining the NLP architecture in our case consists of:

  • Making a good-enough tagger
  • Implementing article summarization and reading-time estimation

Tagger

What we have in the tagger

  • A list of all DS terms
  • Automatic Keyphrase Extraction technology
  • A function that checks whether a found keyphrase is a DS term

A list of all DS terms

In the beginning it wasn't a list but a 2350-line dictionary with a hierarchical structure.
A fragment of that dict (with a flattening sketch right after it):

In [2]:
 dict_ex={
     "structures used in natural language processing":{
     "anaphora":{},
     "context-free language":{},
     "controlled natural language ":{},
     "corpus":{
         "text corpus":{},
         "speech corpus":{}},
     "grammar":{
         "context-free grammar (cfg)":{},
         "constraint grammar (cg)":{},
         "definite clause grammar (dcg)":{},
         "functional unification grammar (fug)":{},
         "generalized phrase structure grammar (gpsg)":{},
         "head-driven phrase structure grammar (hpsg)":{},
         "lexical functional grammar (lfg)":{},
         "probabilistic context-free grammar (pcfg)":{},
         "stochastic context-free grammar (scfg)":{},
         "systemic functional grammar (sfg)":{},
         "tree-adjoining grammar (tag)":{}},
     "natural language":{},
     "n-gram ":{
         "bigram":{},
         "trigram":{}},
     "ontology":{
         "taxonomy":{
             "hyponymy and hypernymy":{},
             "taxonomy for search engines":{}}},
     "textual entailment":{},
     "triphone":{}}
 }
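
Since the tagger needs a flat list of terms rather than a nested dict, here is a minimal sketch of flattening such a hierarchy. The helper below is ours for illustration, not the project's actual code:

In [ ]:
def flatten_terms(d):
    """Collect every key of a nested term dict into one flat list."""
    terms = []
    for term, children in d.items():
        terms.append(term.strip())             # drop stray whitespace in keys
        terms.extend(flatten_terms(children))  # recurse into sub-terms
    return terms

ds_terms = flatten_terms(dict_ex)
print(len(ds_terms), ds_terms[:3])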

Automatic Keyphrase Extraction (terminology extraction)

  • It seems like a very hard task, and it is, but thankfully while trying to get it done we found a ready-to-use library named textacy.
  • However, that wasn't the end: we then faced the problem of choosing the right extraction method out of the 8 available (see the sketch after this list).
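
A minimal sketch of what the extraction step can look like with textacy; the module path has moved between textacy releases, so treat the import as an assumption:

In [ ]:
import spacy
from textacy.extract import keyterms  # older textacy versions expose this as textacy.ke

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Recurrent neural networks and transformers dominate modern "
          "natural language processing research.")

# TextRank is only one of the extraction methods textacy ships;
# sgrank, scake and yake are among the alternatives we had to choose from.
for phrase, score in keyterms.textrank(doc, normalize="lemma", topn=5):
    print(f"{phrase}: {score:.3f}")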

Finishing up our tagger

  • Measuring the Hamming distance between each found keyphrase and the terms from our list
  • If a close-enough term from the list is found, we declare it a TAG (close enough = within a preset threshold)
  • Then we measure the Hamming distance again, but now between our tags, to get rid of similar ones (a sketch follows this list)
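
A minimal sketch of those three steps. The threshold value, the space-padding of unequal-length strings, and the helper names are our assumptions, not the project's actual code:

In [ ]:
DS_TERMS = ["natural language processing", "neural network", "taxonomy"]
THRESHOLD = 3  # hypothetical preset threshold

def hamming_distance(a, b):
    """Character-level Hamming distance; the shorter string is space-padded."""
    length = max(len(a), len(b))
    return sum(c1 != c2 for c1, c2 in zip(a.ljust(length), b.ljust(length)))

def to_tags(keyphrases):
    tags = []
    for phrase in keyphrases:
        # declare the phrase a TAG if it is close enough to a known DS term
        best = min(DS_TERMS, key=lambda term: hamming_distance(phrase, term))
        if hamming_distance(phrase, best) <= THRESHOLD:
            tags.append(best)
    # second pass: drop tags that are too similar to an already kept tag
    deduped = []
    for tag in tags:
        if all(hamming_distance(tag, kept) > THRESHOLD for kept in deduped):
            deduped.append(tag)
    return deduped

print(to_tags(["natural language processin", "neural networks"]))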

Summarization and time estimation

Time estimation and readability

  • Reading time is estimated with a simple formula: divide the number of words by the average reading speed in WPM, then add 12 seconds per picture
  • Readability is computed with the more complex Flesch reading-ease formula (both are sketched after this list)
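
A sketch of both estimates. The 200 WPM reading speed and the vowel-run syllable heuristic are our assumptions; the Flesch constants are the standard ones:

In [ ]:
import re

def reading_time_minutes(num_words, num_pictures, wpm=200):
    # words / average reading speed, plus 12 seconds per picture
    return num_words / wpm + num_pictures * 12 / 60

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # crude syllable count: one per run of vowels, at least one per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

print(reading_time_minutes(1200, 3))  # 1200 words + 3 pictures -> 6.6 minutes
print(flesch_reading_ease("The cat sat on the mat. It was happy."))  # higher = easier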

Summarization

We've used a library called sumy, which ships a whole bunch of summarizers, and the challenge was to choose the right one.
We ended up choosing Luhn; a sketch follows below.
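
A minimal sketch of the Luhn pipeline with sumy (the English tokenizer may require nltk's punkt data to be downloaded first):

In [ ]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

text = ("Deep learning models provide deeper insight and greater accuracy. "
        "They are particularly effective at handling noise in data. "
        "Hiring experts while training your own team works best.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LuhnSummarizer()

# keep the two most significant sentences according to Luhn's heuristic
for sentence in summarizer(parser.document, sentences_count=2):
    print(sentence)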

Data Engineering

Joining modules

  • Combining all parts of the script
  • Connecting the script to the DB (a sketch follows below)
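
The DB itself isn't named here, so the sketch below assumes a MongoDB document store via pymongo; the connection string and db/collection names are hypothetical:

In [ ]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # hypothetical connection string
articles = client["drugan"]["articles"]             # hypothetical db/collection names

# each processed article (the mapped structure above) becomes one document
articles.insert_one({"source": "FastAI", "tags": ["deep learning"], "timeToRead": 5})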

Final example

{"source": "FastAI", "title": "Andrew Ng says Deep learning is the \"New Electricity\"; what this means to your organization fast.ai", "timeToRead": 5, "Readability":"easy to read article","summary": "Deep learning models provide deeper insight and greater accuracy, make existing products better, improve operations (e.g. Google used deep learning to reduce data center cooling requirements by 40%!)\nDeep learning is particularly effective at handling noise in data, and in handling unstructured data - so if your data infrastructure is not in a good state, it is even more important that you invest in deep learning.\nLooking externally for deep learning experts, rather than developing deep learning expertise within your existing staff, means that you will be creating a gap between your domain experts and your new data experts.\nThe best approach, of course, is to do both: hire existing deep learning experts if you can, whilst developing skills of your own team at the same time.\n", "tags": ["ai-in-society", "meta learning", "deep learning", "neural network"], "date": "2016-10-11T00:00:00", "picture": null, "githubLink": null, "arxivLink": null}

Thanks for your attention!