I now have three steps in the pipeline working, and I want to add a fourth: tokenisation with a BERT model.
Unfortunately, the tokeniser depends on PyTorch, which is an ~800 MB download.
It seems the RedisGears cluster becomes unstable after the installation:
161808:M 22 May 2020 15:01:13.516 * <module> GEARS: Successfully spellchecked sentence sentences:bafab6b3dd88dcdefe111698d02f81998c9accdb:236:{1x3}
161783:S 22 May 2020 15:03:42.420 * <module> Processing ./torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl
161783:S 22 May 2020 15:03:51.325 * <module> Installing collected packages: torch
161783:S 22 May 2020 15:04:02.674 * <module> Successfully installed torch-1.4.0
161783:S 22 May 2020 15:04:09.381 # <module> disconnected : 10.144.17.211:30006, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.402 # <module> disconnected : 10.144.17.211:30005, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.422 # <module> disconnected : 10.144.17.211:30003, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.443 # <module> disconnected : 10.144.17.211:30002, status : -1, will try to reconnect.
161783:S 22 May 2020 15:04:09.464 # <module> disconnected : 10.144.17.211:30004, status : -1, will try to reconnect.
The command I am trying to run is:

gears-cli --host 10.144.17.211 --port 30001 tokenizer_bert_run.py --requirements requirements_tokenizer.txt

where requirements_tokenizer.txt contains:

torch==1.4
transformers==2.9.1
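For what it's worth, if the download size itself turns out to be the trigger, a CPU-only wheel should be much smaller than the full ~800 MB build. If I understand pip's requirements syntax correctly, that would look something like this (the `+cpu` wheel and the stable-wheel index are assumptions based on PyTorch's published wheels):

```
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.4.0+cpu
transformers==2.9.1
```

I have not confirmed whether RedisGears' requirements handling passes the `-f`/`--find-links` line through to pip.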
and the code (note: an earlier version used an undefined `x` instead of `record`, fixed below):

tokenizer = None

def loadTokeniser():
    global tokenizer
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    return tokenizer

def tokenise_sentence(record):
    global tokenizer
    if not tokenizer:
        tokenizer = loadTokeniser()
    sentence_key = record['key']
    shard_id = hashtag()
    log(f"Tokeniser received {sentence_key} on shard {shard_id}")
    tokens = tokenizer.tokenize(record['value']['content'])
    key = "tokenized:bert:%s:{%s}" % (sentence_key, shard_id)
    for token in tokens:
        execute('LPUSH', key, token)
    execute('SADD', 'processed_docs_stage3_tokenized', sentence_key)

bg = GearsBuilder()
bg.foreach(tokenise_sentence)
bg.count()
bg.run('sentences:*')
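For reference, the record handling and key construction that the step performs can be exercised outside Gears with a stubbed tokenizer, which is how I convinced myself the logic itself is fine (the record shape — a 'key' plus a hash 'value' with a 'content' field — is my assumption from the earlier pipeline stages, and a fixed shard id stands in for hashtag()):

```python
def tokenise_record(record, tokenize, shard_id):
    # Pure version of the Gears step: compute the token-list key and the
    # tokens without touching Redis. `tokenize` stands in for
    # tokenizer.tokenize and `shard_id` for hashtag().
    sentence_key = record['key']
    tokens = tokenize(record['value']['content'])
    list_key = "tokenized:bert:%s:{%s}" % (sentence_key, shard_id)
    return list_key, tokens

# Stub tokenizer: whitespace split (the real one is Bio_ClinicalBERT's WordPiece).
rec = {'key': 'sentences:abc:236', 'value': {'content': 'chest pain noted'}}
print(tokenise_record(rec, str.split, '06S'))
# → ('tokenized:bert:sentences:abc:236:{06S}', ['chest', 'pain', 'noted'])
```

So the failure looks environmental rather than a bug in the step itself.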
I don't think it even reaches the point where the code runs; gears-cli times out with:
Results
-------
Errors
------
%d) %s (1, 'Execution max idle reached')